Introduction

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

Use-Case

Tokenize the data on the basis of a delimiter.
The tokenized output is an array of delimited tokens.
For example, to break a sentence into an array of words:
- Source field name: e.g. sentence (Type: String)
- Target field name: e.g. words (Type: String[])
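
As a minimal sketch of this use case, the Python snippet below maps a String source field to a String[] target field by splitting on a literal delimiter. The function name tokenize is hypothetical and only illustrates the mapping; it is not the plugin's implementation.

```python
def tokenize(sentence: str, delimiter: str) -> list[str]:
    """Split a sentence into tokens on a literal delimiter (hypothetical helper)."""
    if not delimiter:
        # str.split raises ValueError on an empty separator; fail with a clearer message.
        raise ValueError("delimiter is mandatory and must not be empty")
    return sentence.split(delimiter)


# Source field "sentence" (String) becomes target field "words" (String[]).
words = tokenize("HDFS/ is a /file system", "/")
print(words)
```

A plain split keeps any whitespace that surrounds the delimiter, so the tokens here still carry their leading and trailing spaces.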

User Stories

The user should be able to specify the name of the column to be tokenized (mandatory).
The user should be able to specify the output column name (mandatory).
The output schema should contain only the output column, which is of type array.
The user should be able to specify the delimiter (mandatory) that the Tokenizer will use.
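
The requirements above could be checked along the following lines. This is a sketch under stated assumptions: the class TokenizerConfig, its field names, and the transform function are hypothetical and are written in Python purely for illustration, not taken from the plugin's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerConfig:
    """Hypothetical holder for the three mandatory properties."""
    column_to_be_tokenized: str
    delimiter: str
    output_column: str

    def validate(self, input_columns: set[str]) -> None:
        # All three properties are mandatory, so reject empty values.
        for name, value in vars(self).items():
            if not value:
                raise ValueError(f"'{name}' is mandatory")
        # Assumption: the column to tokenize must exist in the input schema.
        if self.column_to_be_tokenized not in input_columns:
            raise ValueError(
                f"column '{self.column_to_be_tokenized}' is not in the input schema"
            )


def transform(record: dict, config: TokenizerConfig) -> dict:
    # The output record carries only the array column; other input fields are dropped.
    tokens = record[config.column_to_be_tokenized].split(config.delimiter)
    return {config.output_column: tokens}


config = TokenizerConfig("sentence", "/", "words")
config.validate({"topic", "sentence"})
output = transform({"topic": "Java", "sentence": "a/b/c"}, config)
print(output)
```

Validating before any record is processed means a missing property is reported once, up front, rather than failing on the first row.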

Example

Input source:

topic	sentence
Java	Hello world / is the /basic application
HDFS	HDFS/ is a /file system
Spark	Spark /is an engine for /bigdata processing

Tokenizer:

Output:

words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
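
The example can be reproduced with the short Python sketch below. The source does not state whether the plugin trims whitespace around tokens, so the trim flag is an assumption added for illustration, and tokenize_rows is a hypothetical helper rather than plugin code.

```python
rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is an engine for /bigdata processing"},
]


def tokenize_rows(rows, column, delimiter, output_column, trim=False):
    """Tokenize one column of each row; the output keeps only the array column."""
    out = []
    for row in rows:
        tokens = row[column].split(delimiter)
        if trim:
            # Assumption: surrounding whitespace is not part of a token.
            tokens = [token.strip() for token in tokens]
        out.append({output_column: tokens})
    return out


raw = tokenize_rows(rows, "sentence", "/", "words")
trimmed = tokenize_rows(rows, "sentence", "/", "words", trim=True)

for row in trimmed:
    print(row["words"])
```

Without trimming, the tokens keep the spaces that sat next to each delimiter in the input; with trimming, each row yields three clean tokens.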

Properties:

Input JSON:

    {
        "name": "Tokenizer",
        "plugin": {
            "name": "Tokenizer",
            "type": "sparkcompute",
            "label": "Tokenizer",
            "properties": {
                "columnToBeTokenized": "sentence",
                "delimiter": "/",
                "outputColumn": "words"
            }
        }
    }
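
To illustrate how these properties drive the tokenization, the Python sketch below parses the stage configuration and applies it to one record. It assumes the configuration is a complete JSON object with enclosing braces and no padding inside the property names; the parsing and checking logic is illustrative and not the plugin's own.

```python
import json

# Stage configuration written out as a complete JSON object (assumed shape).
stage_json = """
{
    "name": "Tokenizer",
    "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
            "columnToBeTokenized": "sentence",
            "delimiter": "/",
            "outputColumn": "words"
        }
    }
}
"""

stage = json.loads(stage_json)
props = stage["plugin"]["properties"]

# All three properties are mandatory; report any that are absent or empty.
required = ("columnToBeTokenized", "delimiter", "outputColumn")
missing = [key for key in required if not props.get(key)]
if missing:
    raise ValueError(f"missing mandatory properties: {missing}")

record = {"topic": "HDFS", "sentence": "HDFS/ is a /file system"}
tokens = record[props["columnToBeTokenized"]].split(props["delimiter"])
result = {props["outputColumn"]: tokens}
print(result)
```

Property names are matched exactly, so a key with stray spaces such as " delimiter " would be reported as missing.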

Checklist