Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
To break a sentence into word tokens, the user must configure the plugin with the mandatory inputs listed below.
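As a quick illustration of delimiter-based splitting before the full plugin example, here is a plain-Java sketch; the class and variable names are illustrative only:

import java.util.Arrays;

public class SplitExample {
    public static void main(String[] args) {
        String sentence = "Hello world / is the /basic application";
        // Break the sentence into tokens wherever the "/" delimiter occurs.
        String[] words = sentence.split("/");
        System.out.println(Arrays.toString(words));
        // Prints: [Hello world ,  is the , basic application]
    }
}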
Example
Input source:
topic  | sentence
Java   | Hello world / is the /basic application
HDFS   | HDFS/ is a /file system
Spark  | Spark /is an engine for /bigdata processing
Tokenizer:
The user wants to tokenize the "sentence" data using "/" as the delimiter.
Mandatory inputs from user:
Column on which tokenization is to be done: "sentence"
Delimiter for tokenization: "/"
Output column name for tokenized data: "words"
The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in the "words" column of the output.
Output:
words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
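To make the mapping from input to output concrete, below is a minimal sketch of the equivalent transformation using Spark's DataFrame API. The class names (TokenizerSketch, SentenceRecord) are illustrative, not the plugin's actual source:

import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.split;

public class TokenizerSketch {

    // Simple bean used only to build the example DataFrame.
    public static class SentenceRecord implements java.io.Serializable {
        private String topic;
        private String sentence;
        public SentenceRecord() { }
        public SentenceRecord(String topic, String sentence) {
            this.topic = topic;
            this.sentence = sentence;
        }
        public String getTopic() { return topic; }
        public void setTopic(String topic) { this.topic = topic; }
        public String getSentence() { return sentence; }
        public void setSentence(String sentence) { this.sentence = sentence; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TokenizerSketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> input = spark.createDataFrame(Arrays.asList(
                new SentenceRecord("Java", "Hello world / is the /basic application"),
                new SentenceRecord("HDFS", "HDFS/ is a /file system"),
                new SentenceRecord("Spark", "Spark /is an engine for /bigdata processing")
        ), SentenceRecord.class);

        // split() takes a regex; "/" is safe as-is, but delimiters such as
        // "." or "|" would need escaping, e.g. via Pattern.quote(delimiter).
        Dataset<Row> output = input.withColumn("words", split(col("sentence"), "/"));
        output.select("words").show(false);

        spark.stop();
    }
}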
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
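For reference, these properties would typically map onto a plugin configuration class. A rough sketch, assuming CDAP's PluginConfig API (the co.cask.cdap package names used here vary across CDAP versions, and the description strings are illustrative, not the plugin's actual source):

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

// Illustrative config class; field names mirror the JSON properties above.
public class TokenizerConfig extends PluginConfig {

    @Name("columnToBeTokenized")
    @Description("Column on which tokenization is to be done.")
    private final String columnToBeTokenized;

    @Name("delimiter")
    @Description("Delimiter used to split the column value into tokens.")
    private final String delimiter;

    @Name("outputColumn")
    @Description("Name of the output column holding the tokenized data.")
    private final String outputColumn;

    public TokenizerConfig(String columnToBeTokenized, String delimiter, String outputColumn) {
        this.columnToBeTokenized = columnToBeTokenized;
        this.delimiter = delimiter;
        this.outputColumn = outputColumn;
    }
}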