Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
Use-Case
A user wants to extract the hashtags from Twitter feeds. The user tokenizes each sentence on spaces and then identifies the words that start with "#".
Input source:
topic | sentence |
cask | cask is #data application #platform |
Tokenizer:
- User wants to tokenize the sentence data using a single space (“ ”) as the delimiter
Output:
words |
[cask, is, #data, application, #platform] |
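The use case above can be sketched in a few lines of Python. This is only an illustration of splitting on a space and keeping the tokens that start with "#", not the plugin's implementation; the function name `extract_hashtags` is hypothetical.

```python
def extract_hashtags(sentence, delimiter=" "):
    """Split a sentence on the delimiter and keep the tokens that start with '#'."""
    words = sentence.split(delimiter)
    return [word for word in words if word.startswith("#")]


sentence = "cask is #data application #platform"
print(sentence.split(" "))         # ['cask', 'is', '#data', 'application', '#platform']
print(extract_hashtags(sentence))  # ['#data', '#platform']
```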
User Stories
- As a Hydrator user, I want to tokenize the data in a column of the source schema and emit the tokens to an output schema that has a single column holding the tokenized data.
- As a Hydrator user, I want a configuration option to specify the input-schema column on which tokenization is performed.
- As a Hydrator user, I want a configuration option to specify the delimiter used for tokenization.
- As a Hydrator user, I want a configuration option to specify the output column name to which the tokenized data is emitted.
Conditions
- The source field to be tokenized must be of type string.
- Only a single column from the source schema can be tokenized.
- The output schema will have a single column of type string array.
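The conditions above could be checked along the following lines. This is a minimal sketch that represents a schema as a plain dict of field name to type name, which is a simplification made for illustration; the function names `validate_schema` and `build_output_schema` and the type label `array<string>` are hypothetical and not taken from the plugin.

```python
def validate_schema(input_schema, column_to_be_tokenized):
    """Check that the column exists in the source schema and is of type string.

    The dict-based schema ({field name: type name}) is an assumption of this sketch.
    """
    if column_to_be_tokenized not in input_schema:
        raise ValueError(f"Column '{column_to_be_tokenized}' is not in the source schema")
    field_type = input_schema[column_to_be_tokenized]
    if field_type != "string":
        raise ValueError(
            f"Column '{column_to_be_tokenized}' must be of type string, found {field_type}"
        )


def build_output_schema(output_column):
    """Build an output schema with a single column of type string array."""
    return {output_column: "array<string>"}


validate_schema({"topic": "string", "sentence": "string"}, "sentence")
print(build_output_schema("words"))
```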
Example
Input source:
topic | sentence |
Java | Hello world / is the /basic application |
HDFS | HDFS/ is a /file system |
Spark | Spark /is engine for /bigdata processing |
Tokenizer:
- User wants to tokenize the sentence data using “/” as a delimiter
- Mandatory inputs from user:
  - Column on which tokenization is to be done: “sentence”
  - Delimiter for tokenization: “/”
  - Output column name for tokenized data: “words”
- The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data in “words” in the output.
Output:
words |
[hello world, is the, basic application] |
[hdfs, is a, file system] |
[spark, is engine for, bigdata processing] |
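The example above can be reproduced with a short sketch. It is not the plugin's Spark code: the function name `tokenize_records` and the list-of-dicts record format are hypothetical. Lower-casing follows the example output shown above, while stripping whitespace around each token is an assumption of this sketch.

```python
def tokenize_records(records, column, delimiter, output_column):
    """Return one output record per input record, holding only the token list."""
    output = []
    for record in records:
        # Lower-casing matches the example output; stripping whitespace is an assumption.
        tokens = [token.strip().lower() for token in record[column].split(delimiter)]
        output.append({output_column: tokens})
    return output


records = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is engine for /bigdata processing"},
]

for row in tokenize_records(records, "sentence", "/", "words"):
    print(row["words"])
```

Each output record carries only the `words` column, in line with the condition that the output schema has a single column of type string array.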
Design
This is a plugin of type sparkcompute and is meant to work with Spark only.
Properties:
- columnToBeTokenized: Column name on which tokenization is to be done
- delimiter: Delimiter for tokenization
- outputColumn: Output column name for tokenized data
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
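As a rough illustration of how the three properties might be read from such a stage definition, the sketch below parses the JSON with Python's standard `json` module and checks that each mandatory property is present. The helper name `read_properties` is hypothetical, and the plugin itself would obtain its configuration through the pipeline framework rather than by parsing JSON directly. Property keys must match exactly, so a key with a stray leading space would be reported as missing, and strict JSON parsers reject a trailing comma after the last property.

```python
import json

REQUIRED_PROPERTIES = ("columnToBeTokenized", "delimiter", "outputColumn")

STAGE_JSON = """
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
"""


def read_properties(stage_json):
    """Parse the stage JSON and return the three mandatory properties."""
    properties = json.loads(stage_json)["plugin"]["properties"]
    missing = [name for name in REQUIRED_PROPERTIES if not properties.get(name)]
    if missing:
        raise ValueError(f"Missing mandatory properties: {', '.join(missing)}")
    return {name: properties[name] for name in REQUIRED_PROPERTIES}


print(read_properties(STAGE_JSON))
```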
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature