Introduction

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

Use-Case

A user's Hydrator pipeline receives data from a source. The user wants to tokenize one of the source columns using a specified delimiter and emit the tokenized data to the output.

User Stories

As a Hydrator user, I want to tokenize the data in a column of the source schema and emit the tokens into an output schema that has a single column holding the tokenized data.
As a Hydrator user, I want a configuration property for specifying the name of the input schema column on which tokenization is performed.
As a Hydrator user, I want a configuration property for specifying the delimiter used for tokenization.
As a Hydrator user, I want a configuration property for specifying the name of the output column to which the tokenized data is emitted.

Example

Input source:

topic	sentence
Java	Hello world / is the /basic application
HDFS	HDFS/ is a /file system
Spark	Spark /is engine for /bigdata processing

Tokenizer:

Output:

words

[hello world, is the, basic application]

[hdfs, is a ,file system]

[spark ,is engine for ,bigdata processing]
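
The example above can be approximated with a minimal sketch in plain Python. This is not the Spark-based plugin itself, and the function name tokenize_column is hypothetical. Lowercasing is inferred from the example output. Leaving whitespace around tokens untouched and dropping empty tokens are both assumptions.

```python
def tokenize_column(rows, column, delimiter, output_column):
    """Split rows[column] on delimiter; return rows holding only output_column."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    result = []
    for row in rows:
        # Lowercasing mirrors the example output (assumption about the plugin).
        # Empty tokens, e.g. from a trailing delimiter, are dropped (assumption).
        tokens = [t for t in row[column].lower().split(delimiter) if t]
        result.append({output_column: tokens})
    return result


rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is engine for /bigdata processing"},
]

for out in tokenize_column(rows, "sentence", "/", "words"):
    print(out["words"])
```

Each output record carries only the words column, matching the single-column output schema described in the user stories.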

This is a sparkcompute type of plugin and is meant to work with Spark only.

Properties:

Input JSON:

        {
          "name": "Tokenizer",
          "plugin": {
            "name": "Tokenizer",
            "type": "sparkcompute",
            "label": "Tokenizer",
            "properties": {
              "columnToBeTokenized": "sentence",
              "delimiter": "/",
              "outputColumn": "words"
            }
          }
        }

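A minimal sketch of how the three properties might be read and checked against the input schema, written in plain Python rather than as plugin code. The helper read_properties and the constant names are hypothetical, and the stage JSON embedded below assumes the fragment is closed with the matching braces.

```python
import json

# Stage definition as a complete JSON object (closing braces are assumed).
STAGE_JSON = """
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
"""

REQUIRED = ("columnToBeTokenized", "delimiter", "outputColumn")


def read_properties(stage_json, input_columns):
    """Return the plugin properties after basic checks (hypothetical helper)."""
    props = json.loads(stage_json)["plugin"]["properties"]
    # Every property must be present and non-empty.
    missing = [key for key in REQUIRED if not props.get(key)]
    if missing:
        raise ValueError("missing properties: " + ", ".join(missing))
    # The column to tokenize must exist in the input schema.
    if props["columnToBeTokenized"] not in input_columns:
        raise ValueError("column not in input schema: " + props["columnToBeTokenized"])
    return props


props = read_properties(STAGE_JSON, ["topic", "sentence"])
print(props["columnToBeTokenized"], props["delimiter"], props["outputColumn"])
```

Failing early on a missing property or an unknown column keeps configuration mistakes from surfacing later as runtime errors in the pipeline.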