Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
Use-Case
A user wants to extract the hashtags from Twitter feeds. The user tokenizes each tweet on spaces and then identifies the words that start with a hash sign (#).
Input source:
topic | sentence
cask | cask is #data application #platform
Tokenizer:
The user wants to tokenize the sentence data using a space (“ ”) as the delimiter.
Output:
words
[cask,is,#data,application,#platform]
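The use case above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's code: the function name `extract_hashtags` is hypothetical, and the delimiter is treated as a literal string rather than a pattern.

```python
from typing import List


def extract_hashtags(sentence: str, delimiter: str = " ") -> List[str]:
    """Split a sentence on the delimiter and keep the tokens that start with '#'."""
    # Drop empty tokens, which appear when the delimiter is repeated.
    tokens = [t for t in sentence.split(delimiter) if t]
    return [t for t in tokens if t.startswith("#")]


sentence = "cask is #data application #platform"
tokens = sentence.split(" ")
hashtags = extract_hashtags(sentence)
print(tokens)    # ['cask', 'is', '#data', 'application', '#platform']
print(hashtags)  # ['#data', '#platform']
```

Tokenizing first and filtering afterwards keeps the two concerns separate, which matches the use case: the plugin only emits tokens, and identifying hashtags is left to a later stage.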
User Stories
As a Hydrator user, I want to tokenize the data in a column from the source schema and emit the tokens into an output schema that has a single column holding the tokenized data.
As a Hydrator user, I want a configuration option to specify the name of the input schema column on which tokenization is to be performed.
As a Hydrator user, I want a configuration option to specify the delimiter to be used for tokenization.
As a Hydrator user, I want a configuration option to specify the name of the output column in which the tokenized data will be emitted.
Conditions
The source field to be tokenized can only be of type string.
The user can tokenize only a single column from the source schema.
The output schema will have a single column of type string array.
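A minimal sketch of how these conditions could be checked is shown here. It assumes a simplified, hypothetical schema representation (a dict mapping field names to type names such as "string"); the real plugin would work against the platform's schema classes, which are not described in this document.

```python
from typing import Dict


def output_schema_for(input_schema: Dict[str, str], column: str, output_column: str) -> Dict[str, str]:
    """Check the conditions above and return the single-column output schema."""
    if column not in input_schema:
        raise ValueError(f"Column '{column}' is not in the source schema")
    # Only string fields can be tokenized.
    if input_schema[column] != "string":
        raise ValueError(
            f"Column '{column}' must be of type string, found {input_schema[column]}"
        )
    # The output has exactly one column, an array of strings.
    return {output_column: "array<string>"}


schema = output_schema_for({"topic": "string", "sentence": "string"}, "sentence", "words")
print(schema)  # {'words': 'array<string>'}
```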
Example
Input source:
topic | sentence
Java | Hello world / is the /basic application
HDFS | HDFS/ is a /file system
Spark | Spark /is engine for /bigdata processing
Tokenizer:
The user wants to tokenize the sentence data using “/” as the delimiter.
Mandatory inputs from the user:
Column on which tokenization is to be done: “sentence”
Delimiter for tokenization: “/”
Output column name for tokenized data: “words”
The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data in the “words” column of the output.
Output:
words
[hello world, is the, basic application]
[hdfs, is a, file system]
[spark, is engine for, bigdata processing]
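The example can be reproduced with the following sketch. Two behaviors are assumptions read off the output above rather than stated requirements: tokens are lowercased, and surrounding whitespace is trimmed. The function name `tokenize_record` is hypothetical, and the delimiter is treated as a literal string.

```python
from typing import Dict, List


def tokenize_record(record: Dict[str, str], column: str, delimiter: str,
                    output_column: str) -> Dict[str, List[str]]:
    """Tokenize one column of a record and emit a single-column output record."""
    # Assumption: lowercase and trim each token, as the sample output suggests.
    tokens = [t.strip().lower() for t in record[column].split(delimiter)]
    # Drop tokens that are empty after trimming.
    return {output_column: [t for t in tokens if t]}


rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is engine for /bigdata processing"},
]
results = [tokenize_record(row, "sentence", "/", "words") for row in rows]
for result in results:
    print(result)
# {'words': ['hello world', 'is the', 'basic application']}
# {'words': ['hdfs', 'is a', 'file system']}
# {'words': ['spark', 'is engine for', 'bigdata processing']}
```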
Design
This is a sparkcompute type of plugin and is meant to work with Spark only.
Properties:
columnToBeTokenized: Column name on which tokenization is to be done
delimiter: Delimiter for tokenization
outputColumn: Output column name for tokenized data
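As an illustration only, the three properties could be modeled as a small configuration object that rejects missing values, since all three are listed as mandatory in the example. The class name `TokenizerConfig` and its `validate` method are hypothetical and are not the plugin's actual configuration class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    column_to_be_tokenized: str  # corresponds to columnToBeTokenized
    delimiter: str               # corresponds to delimiter
    output_column: str           # corresponds to outputColumn

    def validate(self) -> None:
        """Raise ValueError if any mandatory property is empty."""
        properties = (
            ("columnToBeTokenized", self.column_to_be_tokenized),
            ("delimiter", self.delimiter),
            ("outputColumn", self.output_column),
        )
        for name, value in properties:
            if not value:
                raise ValueError(f"{name} is mandatory")


config = TokenizerConfig(column_to_be_tokenized="sentence", delimiter="/", output_column="words")
config.validate()  # passes silently when all three properties are set
```

Note that a single space is a non-empty string, so a space delimiter passes this check.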