Tokenizer

Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) based on a delimiter.
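
For illustration, tokenizing a string on a delimiter in plain Java (this is only a sketch of the concept, not the plugin's implementation) looks like this:

    import java.util.Arrays;
    import java.util.List;

    public class TokenizeExample {
        public static void main(String[] args) {
            // Break the sentence into tokens using a single space as the delimiter.
            String sentence = "cask is #data application #platform";
            List<String> tokens = Arrays.asList(sentence.split(" "));
            System.out.println(tokens); // [cask, is, #data, application, #platform]
        }
    }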

Use-Case

  • User wants to extract hashtags from Twitter feeds. The user would tokenize the words based on space and then identify the words that start with a hashtag.

    Input source:

    topic | sentence
    cask  | cask is #data application #platform

    Tokenizer:

      • User wants to tokenize the sentence data using a space (“ ”) as the delimiter pattern

    Output:

    topic | sentence                            | words
    cask  | cask is #data application #platform | [cask, is, #data, application, #platform]
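
For illustration, the hashtags can then be picked out of the tokens in plain Java (again, outside the plugin itself; the class name here is hypothetical):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class HashtagExample {
        public static void main(String[] args) {
            String sentence = "cask is #data application #platform";
            // Tokenize on space, then keep only the tokens that start with '#'.
            List<String> hashtags = Arrays.stream(sentence.split(" "))
                .filter(token -> token.startsWith("#"))
                .collect(Collectors.toList());
            System.out.println(hashtags); // [#data, #platform]
        }
    }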


User Stories

  • As a Hydrator user, I want to tokenize the data in a column from the source schema and output the tokens into an output schema that has a single column holding the tokenized data.
  • As a Hydrator user, I want a configuration for specifying the column name from the input schema on which tokenization has to be performed.
  • As a Hydrator user, I want a configuration to specify the delimiter to be used for tokenization.
  • As a Hydrator user, I want a configuration to specify the output column name into which the tokenized data will be emitted.

Conditions

  • The source field to be tokenized can only be of string type.
  • Only a single column from the source schema can be tokenized.

Example

Input source:

topic | sentence
Java  | Hello world / is the /basic application
HDFS  | HDFS/ is a /file system
Spark | Spark /is engine for /bigdata processing

Tokenizer:

    • User wants to tokenize the sentence data using “/” as the delimiter
    • Mandatory inputs from the user:
        • Column on which tokenization is to be done: “sentence”
        • Delimiter for tokenization: “/”
        • Output column name for the tokenized data: “words”
    • The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data into “words” in the output.

Output:

topic | sentence                                  | words
Java  | Hello world / is the /basic application   | [hello world, is the, basic application]
HDFS  | HDFS/ is a /file system                   | [hdfs, is a, file system]
Spark | Spark /is engine for /bigdata processing  | [spark, is engine for, bigdata processing]
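
Since the plugin targets Spark, this example can be reproduced with Spark ML's RegexTokenizer. The sketch below is illustrative only, not necessarily how the plugin will be implemented; note that RegexTokenizer lowercases tokens by default, which matches the lowercased values in the table above, though exact whitespace around tokens may differ slightly.

    import java.io.Serializable;
    import java.util.Arrays;

    import org.apache.spark.ml.feature.RegexTokenizer;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkTokenizerExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("SparkTokenizerExample")
                .master("local[*]")
                .getOrCreate();

            // The example input rows: (topic, sentence).
            Dataset<Row> input = spark.createDataFrame(Arrays.asList(
                new TopicSentence("Java", "Hello world / is the /basic application"),
                new TopicSentence("HDFS", "HDFS/ is a /file system"),
                new TopicSentence("Spark", "Spark /is engine for /bigdata processing")
            ), TopicSentence.class);

            // Split "sentence" on "/" into the "words" column.
            RegexTokenizer tokenizer = new RegexTokenizer()
                .setInputCol("sentence")
                .setOutputCol("words")
                .setPattern("/");

            tokenizer.transform(input).select("topic", "sentence", "words").show(false);
            spark.stop();
        }

        // Simple bean so Spark can infer the schema of the input rows.
        public static class TopicSentence implements Serializable {
            private String topic;
            private String sentence;

            public TopicSentence(String topic, String sentence) {
                this.topic = topic;
                this.sentence = sentence;
            }

            public String getTopic() { return topic; }
            public void setTopic(String topic) { this.topic = topic; }
            public String getSentence() { return sentence; }
            public void setSentence(String sentence) { this.sentence = sentence; }
        }
    }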

 


Design

This is a sparkcompute type of plugin and is meant to work only with pipelines that run on Spark.

Properties:

  • columnToBeTokenized: Column name on which tokenization is to be performed
  • patternSeparator: Delimiter pattern used to separate the tokens
  • outputColumn: Output column name for the tokenized data
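
For reference, a rough sketch of how these properties could drive the plugin class. Class and package names follow the public CDAP (Cask-era) SparkCompute plugin examples and should be treated as an assumption-laden outline, not the final implementation; per the user stories above, the output schema here contains only the tokenized column.

    import java.util.Arrays;

    import co.cask.cdap.api.annotation.Description;
    import co.cask.cdap.api.annotation.Name;
    import co.cask.cdap.api.annotation.Plugin;
    import co.cask.cdap.api.data.format.StructuredRecord;
    import co.cask.cdap.api.data.schema.Schema;
    import co.cask.cdap.api.plugin.PluginConfig;
    import co.cask.cdap.etl.api.batch.SparkCompute;
    import co.cask.cdap.etl.api.batch.SparkExecutionPluginContext;
    import org.apache.spark.api.java.JavaRDD;

    @Plugin(type = SparkCompute.PLUGIN_TYPE)
    @Name("Tokenizer")
    @Description("Tokenizes a string column using a configurable delimiter pattern.")
    public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {
        private final Config config;

        public Tokenizer(Config config) {
            this.config = config;
        }

        @Override
        public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                                   JavaRDD<StructuredRecord> input) {
            // Output schema has a single column holding the tokenized data.
            Schema outputSchema = Schema.recordOf(
                "output",
                Schema.Field.of(config.outputColumn,
                                Schema.arrayOf(Schema.of(Schema.Type.STRING))));
            String column = config.columnToBeTokenized;
            String pattern = config.patternSeparator;
            String outputColumn = config.outputColumn;
            return input.map(record -> {
                // Split the configured column on the pattern and emit the tokens.
                String value = record.get(column);
                return StructuredRecord.builder(outputSchema)
                    .set(outputColumn, Arrays.asList(value.split(pattern)))
                    .build();
            });
        }

        public static class Config extends PluginConfig {
            @Description("Column name on which tokenization is to be performed.")
            private String columnToBeTokenized;

            @Description("Delimiter pattern used to separate the tokens.")
            private String patternSeparator;

            @Description("Output column name for the tokenized data.")
            private String outputColumn;
        }
    }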


Input JSON:

{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "patternSeparator": "/",
      "outputColumn": "words"
    }
  }
}

 

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature