Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
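For illustration, a minimal Java sketch of delimiter-based tokenization (the method name and the whitespace trimming are illustrative assumptions, not the plugin's actual implementation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class Tokenize {
    // Break a sentence into terms on a delimiter, trimming stray whitespace.
    static List<String> tokenize(String sentence, String delimiter) {
        return Arrays.stream(sentence.split(Pattern.quote(delimiter)))
                     .map(String::trim)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tokenize("HDFS/ is a /file system", "/"));
        // prints [HDFS, is a, file system]
    }
}
```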
Example
Input source:

topic    sentence
Java     Hello world / is the /basic application
HDFS     HDFS/ is a /file system
Spark    Spark /is an engine for /bigdata processing
Tokenizer:
Output:

words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
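The sentence-to-words mapping shown above can be sketched in plain Java (the helper name and per-row map layout are assumptions for illustration; the plugin performs the equivalent split on each row of the input):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenizerDemo {
    // Tokenize the "sentence" value of each row on the given delimiter,
    // mirroring the sentence -> words mapping from the example.
    static Map<String, List<String>> tokenizeRows(Map<String, String> rows, String delimiter) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> row : rows.entrySet()) {
            List<String> words = new ArrayList<>();
            for (String part : row.getValue().split(Pattern.quote(delimiter))) {
                words.add(part.trim());
            }
            out.put(row.getKey(), words);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> rows = new LinkedHashMap<>();
        rows.put("Java",  "Hello world / is the /basic application");
        rows.put("HDFS",  "HDFS/ is a /file system");
        rows.put("Spark", "Spark /is an engine for /bigdata processing");
        tokenizeRows(rows, "/").forEach((topic, words) ->
            System.out.println(topic + " -> " + words));
    }
}
```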
Properties:
columnToBeTokenized: the input column containing the text to be tokenized
delimiter: the delimiter on which the text is split into terms
outputColumn: the output column that holds the resulting terms
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}