Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
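At its simplest, delimiter-based tokenization is a string split. A minimal illustrative sketch in Java (not the plugin code itself; the sample sentence is taken from the example later in this document):

    // Split a sentence into tokens on a delimiter and trim each token.
    String sentence = "Spark /is an engine for /bigdata processing";
    String[] words = sentence.split(java.util.regex.Pattern.quote("/"));
    for (int i = 0; i < words.length; i++) {
        words[i] = words[i].trim();
    }
    // words -> ["Spark", "is an engine for", "bigdata processing"]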
Use-Case
If you want a sentence to be broken into tokens of words:
- Source field name: e.g. sentence (Type: String)
- Target field name: e.g. words (Type: String[])
Conditions
- The source field can only be of type string
- Only a single column from the source can be tokenized
- The output schema will have a single column of type string array (see the schema sketch below)
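For illustration, the output schema described above could be built with the CDAP Schema API roughly as follows. This is a sketch only; the package is io.cdap.cdap in recent CDAP releases (co.cask.cdap in older ones), and "words" is the example output column name used later in this document:

    import io.cdap.cdap.api.data.schema.Schema;

    // Output schema: a single column holding an array of strings.
    Schema outputSchema = Schema.recordOf(
        "output",
        Schema.Field.of("words", Schema.arrayOf(Schema.of(Schema.Type.STRING))));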
Options
The following mandatory inputs are provided by the user to configure the plugin (a configuration sketch follows the list):
- Column name on which tokenization is to be performed
- Delimiter for tokenization
- Output column name for the tokenized data
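These options would naturally map to a plugin config class. A hedged sketch (property names match the Input JSON in the Design section; the annotations and PluginConfig base class are the standard CDAP plugin APIs):

    import io.cdap.cdap.api.annotation.Description;
    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.plugin.PluginConfig;

    // Sketch of the Tokenizer configuration; fields are left package-private
    // so the plugin class sketched in the Design section can read them.
    public class TokenizerConfig extends PluginConfig {
      @Name("columnToBeTokenized")
      @Description("Column name on which tokenization is to be performed")
      String columnToBeTokenized;

      @Name("delimiter")
      @Description("Delimiter for tokenization")
      String delimiter;

      @Name("outputColumn")
      @Description("Output column name for the tokenized data")
      String outputColumn;
    }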
Example
Input source:

topic | sentence
Java  | Hello world / is the /basic application
HDFS  | HDFS/ is a /file system
Spark | Spark /is an engine for /bigdata processing
Tokenizer:
The user wants to tokenize the sentence data using "/" as the delimiter.
Mandatory inputs from the user:
- Column on which tokenization is to be performed: "sentence"
- Delimiter for tokenization: "/"
- Output column name for the tokenized data: "words"
The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in "words" in the output.
Output:

words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
Design
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
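To make the design concrete, below is a hedged sketch of how the plugin class itself might look as a CDAP SparkCompute plugin, reusing the TokenizerConfig sketched in the Options section. This is illustrative only, assumes the standard CDAP SparkCompute and StructuredRecord APIs, and omits stage details such as schema validation in configurePipeline. Tokens are trimmed to match the example output above:

    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.annotation.Plugin;
    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.api.data.schema.Schema;
    import io.cdap.cdap.etl.api.batch.SparkCompute;
    import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
    import org.apache.spark.api.java.JavaRDD;

    import java.util.Arrays;
    import java.util.regex.Pattern;

    @Plugin(type = SparkCompute.PLUGIN_TYPE)
    @Name("Tokenizer")
    public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {
      private final TokenizerConfig config; // see the config sketch in the Options section

      public Tokenizer(TokenizerConfig config) {
        this.config = config;
      }

      @Override
      public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                                 JavaRDD<StructuredRecord> input) throws Exception {
        // Copy config values into locals so the Spark closure does not capture `this`.
        final String inputColumn = config.columnToBeTokenized;
        final String outputColumn = config.outputColumn;
        final String delimiterRegex = Pattern.quote(config.delimiter);
        final Schema outputSchema = Schema.recordOf(
            "output",
            Schema.Field.of(outputColumn, Schema.arrayOf(Schema.of(Schema.Type.STRING))));

        return input.map(record -> {
          String sentence = record.get(inputColumn);
          // Split on the delimiter and trim each token, per the example output.
          String[] tokens = sentence.split(delimiterRegex);
          for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].trim();
          }
          return StructuredRecord.builder(outputSchema)
              .set(outputColumn, Arrays.asList(tokens))
              .build();
        });
      }
    }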
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature