Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
To break a sentence into word tokens, the user must configure the plugin with the mandatory inputs listed below.
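As a quick illustration of delimiter-based splitting before the full plugin example, here is a plain-Java sketch; the class and variable names are illustrative only:

import java.util.Arrays;

public class SplitExample {
    public static void main(String[] args) {
        String sentence = "Hello world / is the /basic application";
        // Break the sentence into tokens wherever the "/" delimiter occurs.
        String[] words = sentence.split("/");
        System.out.println(Arrays.toString(words));
        // Prints: [Hello world ,  is the , basic application]
    }
}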
Example
Input source:
topic  | sentence
Java   | Hello world / is the /basic application
HDFS   | HDFS/ is a /file system
Spark  | Spark /is an engine for /bigdata processing
Tokenizer:
The user wants to tokenize the "sentence" data using "/" as the delimiter.
Mandatory inputs from user:
Column on which tokenization is to be done: "sentence"
Delimiter for tokenization: "/"
Output column name for tokenized data: "words"
The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in the "words" column of the output.
Output:
words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
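To make the mapping from input to output concrete, below is a minimal sketch of the equivalent transformation using Spark's DataFrame API. The class names (TokenizerSketch, SentenceRecord) are illustrative, not the plugin's actual source:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

public class TokenizerSketch {

    // Simple bean used only to build the example DataFrame.
    public static class SentenceRecord implements java.io.Serializable {
        private String topic;
        private String sentence;
        public SentenceRecord() { }
        public SentenceRecord(String topic, String sentence) {
            this.topic = topic;
            this.sentence = sentence;
        }
        public String getTopic() { return topic; }
        public void setTopic(String topic) { this.topic = topic; }
        public String getSentence() { return sentence; }
        public void setSentence(String sentence) { this.sentence = sentence; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TokenizerSketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> input = spark.createDataFrame(Arrays.asList(
                new SentenceRecord("Java", "Hello world / is the /basic application"),
                new SentenceRecord("HDFS", "HDFS/ is a /file system"),
                new SentenceRecord("Spark", "Spark /is an engine for /bigdata processing")
        ), SentenceRecord.class);

        // split() takes a regex; "/" is safe as-is, but delimiters such as
        // "." or "|" would need escaping, e.g. via Pattern.quote(delimiter).
        Dataset<Row> output = input.withColumn("words", split(col("sentence"), "/"));
        output.select("words").show(false);

        spark.stop();
    }
}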
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
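For reference, these properties would typically map onto a plugin configuration class. A rough sketch, assuming CDAP's PluginConfig API (the co.cask.cdap package names used here vary across CDAP versions, and the description strings are illustrative, not the plugin's actual source):

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

// Illustrative config class; field names mirror the JSON properties above.
public class TokenizerConfig extends PluginConfig {

    @Name("columnToBeTokenized")
    @Description("Column on which tokenization is to be done.")
    private final String columnToBeTokenized;

    @Name("delimiter")
    @Description("Delimiter used to split the column value into tokens.")
    private final String delimiter;

    @Name("outputColumn")
    @Description("Name of the output column holding the tokenized data.")
    private final String outputColumn;

    public TokenizerConfig(String columnToBeTokenized, String delimiter, String outputColumn) {
        this.columnToBeTokenized = columnToBeTokenized;
        this.delimiter = delimiter;
        this.outputColumn = outputColumn;
    }
}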