Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) based on a delimiter.
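The core operation can be sketched in plain Java. This is a simplified sketch, not the plugin's actual code; the lowercasing and whitespace trimming are assumptions inferred from the example output below.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TokenizerSketch {
    // Split a sentence on the given delimiter, then lowercase and trim
    // each piece (behavior assumed from the example output in this page).
    public static List<String> tokenize(String sentence, String delimiter) {
        return Arrays.stream(sentence.split(Pattern.quote(delimiter)))
                     .map(s -> s.toLowerCase().trim())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello world / is the /basic application", "/"));
        // prints: [hello world, is the, basic application]
    }
}
```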
Example
Input source:

topic | sentence
Java  | Hello world / is the /basic application
HDFS  | HDFS/ is a /file system
Spark | Spark /is engine for /bigdata processing
After the Tokenizer stage, the output is:

words
[hello world, is the, basic application]
[hdfs, is a, file system]
[spark, is engine for, bigdata processing]
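The whole example table can be reproduced row by row. This standalone sketch models each row as a plain Map rather than the plugin's actual record API, and assumes the lowercasing and trimming seen in the output above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenizeRows {
    // Copy the row and add an output column holding the delimiter-split,
    // lowercased, trimmed pieces of the input column.
    static Map<String, Object> transform(Map<String, Object> row,
                                         String inputCol, String delimiter,
                                         String outputCol) {
        Map<String, Object> out = new LinkedHashMap<>(row);
        String sentence = (String) row.get(inputCol);
        List<String> words = new ArrayList<>();
        for (String piece : sentence.split(Pattern.quote(delimiter))) {
            words.add(piece.toLowerCase().trim()); // assumed from example output
        }
        out.put(outputCol, words);
        return out;
    }

    public static void main(String[] args) {
        String[][] source = {
            {"Java",  "Hello world / is the /basic application"},
            {"HDFS",  "HDFS/ is a /file system"},
            {"Spark", "Spark /is engine for /bigdata processing"}};
        for (String[] r : source) {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("topic", r[0]);
            row.put("sentence", r[1]);
            System.out.println(transform(row, "sentence", "/", "words").get("words"));
        }
    }
}
```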
Properties:

columnToBeTokenized: the input column containing the text to tokenize.
delimiter: the delimiter on which the text is split.
outputColumn: the output column that holds the resulting tokens.
Input JSON:

{
  "name": "Tokenizer",
  "plugin": {
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
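The three properties drive the split. Here is a minimal sketch of how a stage might consume them; the property names come from the JSON above, but the record handling is simplified and not the plugin's real API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenizerConfig {
    // Look up the configured input column, split its value on the configured
    // delimiter, and store the tokens under the configured output column.
    static List<String> apply(Map<String, String> properties,
                              Map<String, Object> record) {
        String inputCol = properties.get("columnToBeTokenized");
        String delimiter = properties.get("delimiter");
        List<String> words = new ArrayList<>();
        String text = (String) record.get(inputCol);
        for (String piece : text.split(Pattern.quote(delimiter))) {
            words.add(piece.toLowerCase().trim()); // assumed from example output
        }
        record.put(properties.get("outputColumn"), words);
        return words;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("columnToBeTokenized", "sentence");
        props.put("delimiter", "/");
        props.put("outputColumn", "words");

        Map<String, Object> record = new HashMap<>();
        record.put("sentence", "Spark /is engine for /bigdata processing");
        System.out.println(apply(props, record));
        // prints: [spark, is engine for, bigdata processing]
    }
}
```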