Introduction

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of delimiter.

Use-Case

A data scientist wants to identify individual words and other single coherent constructs from the data.
e.g Hashtags in Twitter feeds are example of constructs consisting of special and alphanumeric characters that should be treated as one coherent token.

User Stories

As a Hydrator user,I want to tokenize the data in a column from source schema and output the tokens into output schema which will have a single column having tokenized data.
As a Hydrator user I want to have configuration for specifying the column name from input schema on which tokenization has to be performed.
As a Hydrator user I want to have configuration to specify the delimiter which could be used for tokenization.
As a Hydrator user I want to have configuration to specify output column name wherein tokenized data will be emitted.

Example

Input source:

topic	sentence
Java	Hello world / is the /basic application
HDFS	HDFS/ is a /file system
Spark	Spark /is engine for /bigdata processing

Tokenizer:

Output:

words

[hello world, is the, basic application]

[hdfs, is a ,file system]

[spark ,is engine for ,bigdata processing]

This is a sparkcompute type of plugin and is meant to work with Spark only.

Properties:

Input JSON:

        "name": "Tokenizer",

        "plugin": {

        "name": "Tokenizer",

        "type": "sparkcompute",

        "label": "Tokenizer",

        "properties": {

           " columnToBeTokenized": "sentence",

           " delimiter": "/",

           " outputColumn": "words",

Table of Contents

Checklist