Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
Use-Case
- Tokenize the data on the basis of a delimiter.
- The tokenized output is an array of delimited tokens.
- Example: a sentence is to be broken into tokens of words.
- Source field name: e.g. sentence (Type: String)
- Target field name: e.g. words (Type: String[])
User Stories
- User should be able to specify the name of the column (mandatory) on which tokenization is to be done.
- User should be able to specify the output column name (mandatory).
- Output schema should have only one column, of type array.
- User should be able to specify the delimiter (mandatory) that the Tokenizer will use.
Conditions
- Source field can only be of type string.
- User can tokenize only a single column from the source schema.
- Output schema will have a single column of type string array.
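The conditions above can be checked before any data is processed. The following is a minimal sketch in Java, assuming the input schema is modeled as a simple map from field name to type name. The class `TokenizerConfigCheck`, the method `validate`, and the map representation are hypothetical and are not the plugin's actual API.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of the plugin's actual API.
class TokenizerConfigCheck {

    // Assumption: the schema is modeled as field name -> type name (e.g. "string").
    static void validate(Map<String, String> inputSchema,
                         String columnToBeTokenized,
                         String delimiter,
                         String outputColumn) {
        // All three properties are mandatory.
        if (columnToBeTokenized == null || columnToBeTokenized.isEmpty()) {
            throw new IllegalArgumentException("columnToBeTokenized is mandatory");
        }
        if (delimiter == null || delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter is mandatory");
        }
        if (outputColumn == null || outputColumn.isEmpty()) {
            throw new IllegalArgumentException("outputColumn is mandatory");
        }

        // The source field must exist and must be of type string.
        String type = inputSchema.get(columnToBeTokenized);
        if (type == null) {
            throw new IllegalArgumentException(
                "Column '" + columnToBeTokenized + "' is not in the input schema");
        }
        if (!"string".equals(type)) {
            throw new IllegalArgumentException(
                "Column '" + columnToBeTokenized + "' must be of type string, found " + type);
        }
    }
}
```

Failing early with a descriptive message keeps a misconfigured pipeline from starting at all, rather than failing on the first record.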
Example
Input source:
topic | sentence |
Java | Hello world / is the /basic application |
HDFS | HDFS/ is a /file system |
Spark | Spark /is an engine for /bigdata processing |
Tokenizer:
- User wants to tokenize the sentence data using “/” as a delimiter.
- Mandatory inputs from user:
- Column on which tokenization is to be done: “sentence”
- Delimiter for tokenization: “/”
- Output column name for tokenized data: “words”
- Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data in “words” in the output.
Output:
words |
{Hello world, is the, basic application} |
{HDFS, is a ,file system} |
{Spark ,is an engine for ,bigdata processing} |
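The per-record behaviour in this example can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the class `TokenizerSketch` and the method `tokenize` are hypothetical names, the delimiter is treated as a literal string rather than a regular expression, and whitespace around tokens is left untouched, since the specification does not say whether tokens are trimmed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of splitting one field value; not the plugin's actual code.
class TokenizerSketch {

    static List<String> tokenize(String sentence, String delimiter) {
        // Pattern.quote makes the delimiter literal, so characters such as "|" or "."
        // are not interpreted as regex syntax. A limit of -1 keeps trailing empty tokens.
        // Assumption: surrounding whitespace is preserved as-is.
        return Arrays.asList(sentence.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // Rows taken from the "sentence" column of the example input.
        System.out.println(tokenize("Hello world / is the /basic application", "/"));
        System.out.println(tokenize("HDFS/ is a /file system", "/"));
        System.out.println(tokenize("Spark /is an engine for /bigdata processing", "/"));
    }
}
```

Quoting the delimiter is a deliberate choice here: `String.split` takes a regular expression, so an unquoted delimiter such as `|` would otherwise split between every character.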
Design
Properties:
- columnToBeTokenized: Column on which tokenization is to be done
- delimiter: Delimiter for tokenization
- outputColumn: Output column name for tokenized data
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature