
Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

Use-Case

  • Tokenize the data on the basis of a delimiter.
  • The tokenized output will be an array of delimited tokens.
  • Use this plugin when a sentence should be broken into tokens of words.
  • Source field name: e.g. sentence (type: String)
  • Target field name: e.g. words (type: String[])

    User Stories

    • User should be able to specify the column name on which tokenization is to be done.
    • User should be able to specify the output column name.
    • User should be able to specify the delimiter which will be used by Tokenizer.

    Conditions

    • Source field

      The user wants to extract hashtags from Twitter feeds. The user tokenizes the sentence on spaces and can then identify the words that start with "#".

      Input source:

        topic | sentence
        ------+------------------------------------
        cask  | cask is #data application #platform

      Tokenizer:

        • The user wants to tokenize the sentence data using a space (" ") as the pattern.

      Output:

        topic | sentence                            | words
        ------+-------------------------------------+------------------------------------------
        cask  | cask is #data application #platform | [cask, is, #data, application, #platform]
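    The space-delimited case above can be sketched in plain Java. This is only an illustration of the expected behavior, not the plugin's actual code; the class and method names are hypothetical:

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SpaceTokenizerSketch {
        // Split a sentence into tokens using a single space as the delimiter.
        static List<String> tokenize(String sentence) {
            return Arrays.asList(sentence.split(" "));
        }

        public static void main(String[] args) {
            List<String> words = tokenize("cask is #data application #platform");
            System.out.println(words);
            // prints [cask, is, #data, application, #platform]

            // Downstream, hashtags can be identified by filtering the tokens.
            List<String> hashtags = words.stream()
                    .filter(w -> w.startsWith("#"))
                    .collect(Collectors.toList());
            System.out.println(hashtags);
            // prints [#data, #platform]
        }
    }
    ```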


    User Stories

    • As a Hydrator user, I want to tokenize the data in a column from the source schema and output the tokens into the output schema, which will have a single column holding the tokenized data.
    • As a Hydrator user, I want configuration for specifying the column name from the input schema on which tokenization has to be performed.
    • As a Hydrator user, I want configuration to specify the delimiter to be used for tokenization.
    • As a Hydrator user, I want configuration to specify the output column name into which tokenized data will be emitted.

    Conditions

    • The source field to be tokenized can only be of type string.
    • The user can tokenize only a single column from the source schema.
    • The output schema will have a single column of type string array.

    Example

    Input source:

      topic | sentence
      ------+---------------------------------------------
      Java  | Hello world / is the /basic application
      HDFS  | HDFS/ is a /file system
      Spark | Spark /is an engine for /bigdata processing

    Tokenizer:

      • The user wants to tokenize the sentence data using "/" as the delimiter.
      • Mandatory inputs from the user:
        • Column on which tokenization is to be done: "sentence"
        • Delimiter for tokenization: "/"
        • Output column name for tokenized data: "words"
      • The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data into "words" in the output.

    Output:

    topic | sentence                                    | words
    ------+---------------------------------------------+----------------------------------------------
    Java  | Hello world / is the /basic application     | [hello world, is the, basic application]
    HDFS  | HDFS/ is a /file system                     | [hdfs, is a, file system]
    Spark | Spark /is an engine for /bigdata processing | [spark, is an engine for, bigdata processing]
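    Judging from the example output, tokens are also trimmed and lower-cased. The per-row transformation might be sketched in plain Java as below; this is an assumption drawn from the example, not a statement of the plugin's implementation:

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class DelimiterTokenizerSketch {
        // Split on the delimiter, then trim surrounding whitespace and
        // lower-case each token, matching the example output above.
        // Pattern.quote treats the delimiter as a literal string.
        static List<String> tokenize(String sentence, String delimiter) {
            return Arrays.stream(sentence.split(Pattern.quote(delimiter)))
                    .map(t -> t.trim().toLowerCase())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Hello world / is the /basic application", "/"));
            // prints [hello world, is the, basic application]
        }
    }
    ```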

     


    Design

    This is a sparkcompute type of plugin and is meant to work only with Spark pipelines.

    Properties:

    • columnToBeTokenized: Column name on which tokenization is to be done
    • patternSeparator: Pattern (delimiter) used to separate tokens
    • outputColumn: Output column name for tokenized data
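    A hedged sketch of how these three properties could drive the per-record logic, in plain Java (this is not the actual plugin class; the config holder and method names are illustrative):

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class TokenizerConfigSketch {
        // Hypothetical config holder mirroring the three plugin properties.
        final String columnToBeTokenized;
        final String patternSeparator;
        final String outputColumn;

        TokenizerConfigSketch(String column, String separator, String output) {
            this.columnToBeTokenized = column;
            this.patternSeparator = separator;
            this.outputColumn = output;
        }

        // Apply the configured separator to one value of the source column.
        // Pattern.quote is used here on the assumption that the separator is
        // treated literally, so values like "|" or "." do not act as regex
        // metacharacters.
        List<String> apply(String value) {
            return Arrays.stream(value.split(Pattern.quote(patternSeparator)))
                         .map(String::trim)
                         .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            TokenizerConfigSketch cfg = new TokenizerConfigSketch("sentence", "/", "words");
            System.out.println(cfg.apply("HDFS/ is a /file system"));
            // prints [HDFS, is a, file system]
        }
    }
    ```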


    Input JSON:

    {
      "name": "Tokenizer",
      "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
          "columnToBeTokenized": "sentence",
          "patternSeparator": "/",
          "outputColumn": "words"
        }
      }
    }

     


    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature