Introduction
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
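At its simplest, delimiter-based tokenization is a string split. A minimal illustrative sketch in Java (not the plugin code itself; the sample sentence is taken from the example later in this document):

    // Split a sentence into tokens on a delimiter and trim each token.
    String sentence = "Spark /is an engine for /bigdata processing";
    String[] words = sentence.split(java.util.regex.Pattern.quote("/"));
    for (int i = 0; i < words.length; i++) {
        words[i] = words[i].trim();
    }
    // words -> ["Spark", "is an engine for", "bigdata processing"]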
Use-Case
If you want a sentence to be broken into tokens of words:
- Source field name: e.g. sentence (Type: String)
- Target field name: e.g. words (Type: String[])
Conditions
- The source field can only be of type string
- Only a single column from the source can be tokenized
- The output schema will have a single column of type string array (see the schema sketch below)
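For illustration, the output schema described above could be built with the CDAP Schema API roughly as follows. This is a sketch only; the package is io.cdap.cdap in recent CDAP releases (co.cask.cdap in older ones), and "words" is the example output column name used later in this document:

    import io.cdap.cdap.api.data.schema.Schema;

    // Output schema: a single column holding an array of strings.
    Schema outputSchema = Schema.recordOf(
        "output",
        Schema.Field.of("words", Schema.arrayOf(Schema.of(Schema.Type.STRING))));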
Options
The following mandatory inputs are provided by the user to configure the plugin (a configuration sketch follows the list):
- Column name on which tokenization is to be performed
- Delimiter for tokenization
- Output column name for the tokenized data
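These options would naturally map to a plugin config class. A hedged sketch (property names match the Input JSON in the Design section; the annotations and PluginConfig base class are the standard CDAP plugin APIs):

    import io.cdap.cdap.api.annotation.Description;
    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.plugin.PluginConfig;

    // Sketch of the Tokenizer configuration; fields are left package-private
    // so the plugin class sketched in the Design section can read them.
    public class TokenizerConfig extends PluginConfig {
      @Name("columnToBeTokenized")
      @Description("Column name on which tokenization is to be performed")
      String columnToBeTokenized;

      @Name("delimiter")
      @Description("Delimiter for tokenization")
      String delimiter;

      @Name("outputColumn")
      @Description("Output column name for the tokenized data")
      String outputColumn;
    }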
Example
Input source:

topic | sentence
Java  | Hello world / is the /basic application
HDFS  | HDFS/ is a /file system
Spark | Spark /is an engine for /bigdata processing
Tokenizer:
The user wants to tokenize the sentence data using "/" as the delimiter.
Mandatory inputs from the user:
- Column on which tokenization is to be performed: "sentence"
- Delimiter for tokenization: "/"
- Output column name for the tokenized data: "words"
The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in "words" in the output.
Output:

words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
Design
Input JSON:
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
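To make the design concrete, below is a hedged sketch of how the plugin class itself might look as a CDAP SparkCompute plugin, reusing the TokenizerConfig sketched in the Options section. This is illustrative only, assumes the standard CDAP SparkCompute and StructuredRecord APIs, and omits stage details such as schema validation in configurePipeline. Tokens are trimmed to match the example output above:

    import io.cdap.cdap.api.annotation.Name;
    import io.cdap.cdap.api.annotation.Plugin;
    import io.cdap.cdap.api.data.format.StructuredRecord;
    import io.cdap.cdap.api.data.schema.Schema;
    import io.cdap.cdap.etl.api.batch.SparkCompute;
    import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
    import org.apache.spark.api.java.JavaRDD;

    import java.util.Arrays;
    import java.util.regex.Pattern;

    @Plugin(type = SparkCompute.PLUGIN_TYPE)
    @Name("Tokenizer")
    public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {
      private final TokenizerConfig config; // see the config sketch in the Options section

      public Tokenizer(TokenizerConfig config) {
        this.config = config;
      }

      @Override
      public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                                 JavaRDD<StructuredRecord> input) throws Exception {
        // Copy config values into locals so the Spark closure does not capture `this`.
        final String inputColumn = config.columnToBeTokenized;
        final String outputColumn = config.outputColumn;
        final String delimiterRegex = Pattern.quote(config.delimiter);
        final Schema outputSchema = Schema.recordOf(
            "output",
            Schema.Field.of(outputColumn, Schema.arrayOf(Schema.of(Schema.Type.STRING))));

        return input.map(record -> {
          String sentence = record.get(inputColumn);
          // Split on the delimiter and trim each token, per the example output.
          String[] tokens = sentence.split(delimiterRegex);
          for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokens[i].trim();
          }
          return StructuredRecord.builder(outputSchema)
              .set(outputColumn, Arrays.asList(tokens))
              .build();
        });
      }
    }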
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature