Introduction

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) based on a delimiter.

Use-Case

User Stories

As a Hydrator user, I want to tokenize the data in a column of the source schema and emit the tokens to an output schema that has a single column holding the tokenized data.
As a Hydrator user, I want a configuration property that specifies the input schema column on which tokenization is performed.
As a Hydrator user, I want a configuration property that specifies the delimiter used for tokenization.
As a Hydrator user, I want a configuration property that specifies the name of the output column to which the tokenized data is emitted.

Example

Input source:

topic	sentence
Java	Hello world / is the /basic application
HDFS	HDFS/ is a /file system
Spark	Spark /is engine for /bigdata processing

Tokenizer:

Output:

topic	sentence	words
Java	Hello world / is the /basic application	[hello world, is the, basic application]
HDFS	HDFS/ is a /file system	[hdfs, is a ,file system]
Spark	Spark /is engine for /bigdata processing	[spark ,is engine for ,bigdata processing]
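
The split-and-lowercase behavior shown in the example can be sketched in plain Python. This is only an illustration under stated assumptions, not the plugin's implementation: `tokenize` is a hypothetical helper, the delimiter is treated as a regular expression, and the lowercasing is inferred from the example output. Whitespace around the delimiter is kept, as in the HDFS and Spark rows.

```python
import re


def tokenize(text, pattern="/"):
    """Split text on a regex delimiter and lowercase the resulting tokens.

    Hypothetical helper for illustration; lowercasing is an assumption
    inferred from the example output, not a documented plugin behavior.
    """
    # Lowercase first, then split; whitespace around the delimiter is kept.
    return re.split(pattern, text.lower())


sentences = [
    "HDFS/ is a /file system",
    "Spark /is engine for /bigdata processing",
]
for sentence in sentences:
    print(tokenize(sentence))
```

Calling `.strip()` on each token would instead produce trimmed tokens, such as those shown in the Java row.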

Design

This is a sparkcompute plugin and is meant to work with Spark only.

Properties:

Input JSON:

        {
            "name": "Tokenizer",
            "plugin": {
                "name": "Tokenizer",
                "type": "sparkcompute",
                "label": "Tokenizer",
                "properties": {
                    "columnToBeTokenized": "sentence",
                    "patternSeparator": "/",
                    "outputColumn": "words"
                }
            }
        }

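To show how the three properties might drive the transformation, here is a minimal Python sketch. The property names come from the configuration above; `apply_tokenizer` and the dictionary-based record are hypothetical stand-ins and do not reflect the plugin's actual code. The sketch follows the Example section in carrying the input columns through to the output and in lowercasing the tokens.

```python
import re


def apply_tokenizer(record, properties):
    """Add a tokenized column to a record, driven by the plugin properties.

    Hypothetical stand-in for the plugin; records are modeled as plain dicts.
    """
    column = properties["columnToBeTokenized"]
    pattern = properties["patternSeparator"]
    output = properties["outputColumn"]

    # Fail early if the configured input column is missing from the record.
    if column not in record:
        raise KeyError(f"input column {column!r} not found in record")

    # Lowercasing is an assumption inferred from the example output.
    tokens = re.split(pattern, record[column].lower())

    # Copy the record so the input fields are carried through unchanged.
    return {**record, output: tokens}


properties = {
    "columnToBeTokenized": "sentence",
    "patternSeparator": "/",
    "outputColumn": "words",
}
record = {"topic": "Spark", "sentence": "Spark /is engine for /bigdata processing"}
result = apply_tokenizer(record, properties)
print(result["words"])
```

Checking for the input column before splitting mirrors the kind of validation a plugin would typically do at configuration time, though where that check lives in the real plugin is not stated here.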

Checklist

topic	sentence
cask	cask is #data application #platform