Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) based on a delimiter.
Use-Case
- Tokenize the data based on a delimiter.
- The tokenized output is an array of delimited tokens.
- Example: a sentence is to be broken into tokens of words.
- Source field name: e.g. sentence (Type: String)
- Target field name: e.g. words (Type: String[])
User Stories
- User should be able to specify the column name on which tokenization is to be done.
- User should be able to specify the output column name.
- User should be able to specify the delimiter used by the Tokenizer.
Conditions
- The source field must be of type string.
- The user can tokenize only a single column from the source schema.
- The output schema will have a single column of type string array.
- Tokenized data will be converted to lower case.
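The conditions above can be sketched as a small function. This is a minimal illustration in plain Python, not the plugin's implementation; the name `tokenize` is hypothetical, and trimming whitespace around tokens and dropping empty tokens are assumptions that this page does not specify.

```python
from typing import List


def tokenize(value: str, delimiter: str) -> List[str]:
    """Split a string field on a delimiter and lower-case every token."""
    # Condition: the source field must be a string.
    if not isinstance(value, str):
        raise TypeError("source field must be of type string")
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    # Assumption (not stated in this page): whitespace around each token
    # is trimmed and empty tokens are dropped.
    return [tok.strip().lower() for tok in value.split(delimiter) if tok.strip()]


print(tokenize("HDFS/ is a /file system", "/"))
```

The result is always a list of strings, which corresponds to the single string-array output column.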
Example
Input source:
topic | sentence |
Java | Hello world / is the /basic application |
HDFS | HDFS/ is a /file system |
Spark | Spark /is an engine for /bigdata processing |
Tokenizer:
- User wants to tokenize the sentence data using "/" as the delimiter.
- Mandatory inputs from the user:
- Column on which tokenization is to be done: "sentence"
- Delimiter for tokenization: "/"
- Output column name for tokenized data: "words"
- The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in "words" in the output.
Output:
words |
[hello world, is the, basic application] |
[hdfs, is a, file system] |
[spark, is an engine for, bigdata processing] |
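The example can be reproduced with a short sketch in plain Python. It is an illustration only, not the plugin code: `tokenize_rows` is a hypothetical helper, records are modelled as dictionaries, and trimming whitespace around tokens is an assumption.

```python
# Input source from the example, one dictionary per record.
rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is an engine for /bigdata processing"},
]


def tokenize_rows(records, column, delimiter, output_column):
    """Return records that contain only the tokenized output column."""
    result = []
    for record in records:
        # Assumption: whitespace around each token is trimmed.
        tokens = [t.strip().lower() for t in record[column].split(delimiter)]
        result.append({output_column: tokens})
    return result


output = tokenize_rows(rows, "sentence", "/", "words")
for record in output:
    print(record["words"])
```

Each output record carries only the "words" column, matching the condition that the output schema has a single string-array column.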
Design
Properties:
- columnToBeTokenized: Column name on which tokenization is to be done
- delimiter: Delimiter for tokenization
- outputColumn: Output column name for tokenized data
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
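As a rough illustration of how the three properties might be read and checked, the following sketch parses a configuration of this shape with Python's standard `json` module. The names `CONFIG_TEXT`, `REQUIRED_PROPERTIES`, and `load_properties` are hypothetical and are not part of the plugin.

```python
import json

# Configuration text in the shape shown under "Input JSON".
CONFIG_TEXT = """
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
"""

REQUIRED_PROPERTIES = ("columnToBeTokenized", "delimiter", "outputColumn")


def load_properties(config_text):
    """Parse the config and return the three mandatory properties."""
    config = json.loads(config_text)
    properties = config["plugin"]["properties"]
    # All three inputs are mandatory; report any that are absent or empty.
    missing = [key for key in REQUIRED_PROPERTIES if not properties.get(key)]
    if missing:
        raise ValueError("missing mandatory properties: " + ", ".join(missing))
    return {key: properties[key] for key in REQUIRED_PROPERTIES}


print(load_properties(CONFIG_TEXT))
```

Property names are matched exactly, so a misspelled or padded key is reported as missing. Strict JSON parsers such as this one also reject a trailing comma after the last property.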
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature