Tokenizer
- Shashank
- abhinavc
- priyanambiar
Owned by Shashank
Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
Use-Case
A user wants to extract hashtags from Twitter feeds. The user tokenizes the text on spaces and can then identify the tokens that start with a hashtag.
Input source:

topic | sentence
--- | ---
cask | cask is #data application #platform

Tokenizer:
- User wants to tokenize the sentence data using “ ” (a single space) as the delimiter

Output:

topic | sentence | words
--- | --- | ---
cask | cask is #data application #platform | [cask, is, #data, application, #platform]
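A minimal illustration of this use-case in plain Java, assuming a split on a single space followed by a filter on the “#” prefix; the class and method names here are only for illustration and are not part of the plugin:

```java
import java.util.ArrayList;
import java.util.List;

public class HashtagExample {

  // Tokenize on a single space, then keep only tokens starting with '#'.
  static List<String> extractHashtags(String sentence) {
    List<String> hashtags = new ArrayList<>();
    for (String token : sentence.split(" ")) {
      if (token.startsWith("#")) {
        hashtags.add(token);
      }
    }
    return hashtags;
  }

  public static void main(String[] args) {
    // Prints: [#data, #platform]
    System.out.println(extractHashtags("cask is #data application #platform"));
  }
}
```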
User Stories
- As a Hydrator user, I want to tokenize the data in a column from the source schema and emit the tokens to an output schema that has a single column containing the tokenized data.
- As a Hydrator user, I want a configuration to specify the column name from the input schema on which tokenization is to be performed.
- As a Hydrator user, I want a configuration to specify the delimiter to be used for tokenization.
- As a Hydrator user, I want a configuration to specify the output column name into which the tokenized data will be emitted.
Conditions
- The source field to be tokenized can only be of type string.
- Only a single column from the source schema can be tokenized.
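A sketch of how these conditions could be enforced at pipeline-configuration time, assuming the CDAP `Schema` API, an `inputSchema` already in scope, and the `config.columnToBeTokenized` property described in the Design section below; the exact validation hook is an implementation detail:

```java
import co.cask.cdap.api.data.schema.Schema;

// Hypothetical validation fragment: the chosen column must exist in the
// input schema and must be a (possibly nullable) string.
Schema.Field field = inputSchema.getField(config.columnToBeTokenized);
if (field == null) {
  throw new IllegalArgumentException(
      "Column '" + config.columnToBeTokenized + "' is not present in the input schema.");
}
Schema fieldSchema = field.getSchema();
Schema.Type type = fieldSchema.isNullable()
    ? fieldSchema.getNonNullable().getType()
    : fieldSchema.getType();
if (type != Schema.Type.STRING) {
  throw new IllegalArgumentException("Column to be tokenized must be of type string.");
}
```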
Example
Input source:

topic | sentence
--- | ---
Java | Hello world / is the /basic application
HDFS | HDFS/ is a /file system
Spark | Spark /is engine for /bigdata processing
Tokenizer:
- User wants to tokenize the sentence data using “/” as the delimiter
- Mandatory inputs from user:
  - Column on which tokenization is to be done: “sentence”
  - Delimiter for tokenization: “/”
  - Output column name for tokenized data: “words”
- Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data into “words” in the output.
Output:

topic | sentence | words
--- | --- | ---
Java | Hello world / is the /basic application | [hello world, is the, basic application]
HDFS | HDFS/ is a /file system | [hdfs, is a, file system]
Spark | Spark /is engine for /bigdata processing | [spark, is engine for, bigdata processing]
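Not prescribed by this spec, but one possible realization of the example above is Spark ML’s RegexTokenizer, which treats the pattern as a separator and lowercases tokens by default (matching the lowercased tokens in the output table). A minimal sketch, with column names taken from the example; the actual plugin wiring may differ:

```java
import java.util.Arrays;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class TokenizerExampleSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("tokenizer-example").master("local[*]").getOrCreate();

    StructType schema = new StructType()
        .add("topic", "string")
        .add("sentence", "string");
    Dataset<Row> input = spark.createDataFrame(Arrays.asList(
        RowFactory.create("Java", "Hello world / is the /basic application"),
        RowFactory.create("HDFS", "HDFS/ is a /file system"),
        RowFactory.create("Spark", "Spark /is engine for /bigdata processing")), schema);

    // Split "sentence" on the user-supplied delimiter into a "words" array.
    RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("words")
        .setPattern("/");

    tokenizer.transform(input).show(false);
    spark.stop();
  }
}
```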
Design
This is a plugin of type sparkcompute and is meant to run on Spark only.
Properties:
- columnToBeTokenized: Column name on which tokenization is to be performed
- patternSeparator: Pattern (delimiter) on which the column value is split into tokens
- outputColumn: Output column name for the tokenized data
Input JSON:

{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "patternSeparator": "/",
      "outputColumn": "words"
    }
  }
}
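For context, a hedged skeleton of how such a sparkcompute plugin could be structured against the CDAP ETL API; the class layout, annotations, and single-column output-schema handling below are assumptions for illustration, not the final design:

```java
import java.util.Arrays;

import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.api.plugin.PluginConfig;
import co.cask.cdap.etl.api.batch.SparkCompute;
import co.cask.cdap.etl.api.batch.SparkExecutionPluginContext;
import org.apache.spark.api.java.JavaRDD;

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Tokenizer")
public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {

  private final Config config;

  public Tokenizer(Config config) {
    this.config = config;
  }

  // Mirrors the three properties from the Input JSON above.
  public static class Config extends PluginConfig {
    private String columnToBeTokenized;
    private String patternSeparator;
    private String outputColumn;
  }

  @Override
  public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                             JavaRDD<StructuredRecord> input) throws Exception {
    // Per the user stories, the output schema has a single column holding the tokens.
    Schema outputSchema = Schema.recordOf(
        "output",
        Schema.Field.of(config.outputColumn, Schema.arrayOf(Schema.of(Schema.Type.STRING))));
    return input.map(record -> {
      // Null handling is omitted here for brevity.
      String value = record.get(config.columnToBeTokenized);
      return StructuredRecord.builder(outputSchema)
          .set(config.outputColumn, Arrays.asList(value.split(config.patternSeparator)))
          .build();
    });
  }
}
```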
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature