Introduction


       Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

 

Use-Case

  • Tokenize data on the basis of a delimiter.
  • The tokenized output is an array of the delimited tokens.
  • Example: a sentence is to be broken into tokens of words.
  • Source field name: e.g. sentence (Type: String)
  • Target field name: e.g. words (Type: String[])

    User Stories

  • User should be able to specify the column name on which tokenization is to be done.
  • User should be able to specify the output column name.
  • User should be able to specify the delimiter which will be used by Tokenizer.

    Conditions

    • A user's Hydrator pipeline receives data from a source; the user wants to tokenize one of the source columns using the specified delimiter and emit the tokenized data to the output.

    User Stories

    • As a Hydrator user, I want to tokenize the data in a column from the source schema and output the tokens into an output schema that has a single column holding the tokenized data.
    • As a Hydrator user, I want a configuration option to specify the input schema column on which tokenization is performed.
    • As a Hydrator user, I want a configuration option to specify the delimiter used for tokenization.
    • As a Hydrator user, I want a configuration option to specify the output column name to which the tokenized data is emitted.

    Conditions

    • The source field to be tokenized can only be of type string.
    • The user can tokenize only a single column from the source schema.
    • The output schema will have a single column of type string array.
    • Tokenized data will be converted to lower case.
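    The conditions above can be sketched as a small function. This is a minimal Python illustration, not the plugin's implementation: the name `tokenize_record` and the use of a dict to stand in for a record are assumptions made for the example. The page does not say whether tokens are trimmed or whether the delimiter is treated as a regular expression, so the sketch does a literal split and leaves whitespace untouched.

```python
from typing import Dict, List


def tokenize_record(record: Dict[str, object], source_column: str,
                    delimiter: str, output_column: str) -> Dict[str, List[str]]:
    """Hypothetical helper: tokenize one string column of a record."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    if source_column not in record:
        raise KeyError(f"column '{source_column}' not found in input record")

    value = record[source_column]
    # Condition: the source field to be tokenized can only be a string.
    if not isinstance(value, str):
        raise TypeError(f"column '{source_column}' must be of type string")

    # Condition: tokenized data is converted to lower case.
    # Split first, then lower-case, so the delimiter itself is matched as given.
    tokens = [token.lower() for token in value.split(delimiter)]

    # Condition: the output has a single column holding an array of strings.
    return {output_column: tokens}
```

    Only the configured source column is read, and the returned record contains only the output column, which mirrors the single-column output schema described above.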

    Example

    Input source:

    topic    sentence
    Java     Hello world / is the /basic application
    HDFS     HDFS/ is a /file system
    Spark    Spark /is engine for /bigdata processing

    Tokenizer:

      • The user wants to tokenize the sentence data using "/" as the delimiter.
      • Mandatory inputs from the user:
          • Column on which tokenization is to be done: "sentence"
          • Delimiter for tokenization: "/"
          • Output column name for tokenized data: "words"
      • The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data in "words" in the output.

    Output:

    words

    [hello world, is the, basic application]

    [hdfs, is a ,file system]

    [spark ,is engine for ,bigdata processing]
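    The example can be reproduced with a few lines of Python. This is a sketch under assumptions, not the plugin's code: `tokenize` is a hypothetical helper, and the `trim` flag exists only because the page does not state whether whitespace around tokens is removed. A literal split keeps the spaces that surround each "/" in the input; trimming is shown as one possible reading of the output table.

```python
# Input rows from the example: (topic, sentence)
rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is engine for /bigdata processing"},
]


def tokenize(sentence, delimiter="/", trim=False):
    """Hypothetical helper: split on a literal delimiter and lower-case tokens."""
    tokens = [token.lower() for token in sentence.split(delimiter)]
    # Trimming is an assumption; the page does not specify it.
    return [token.strip() for token in tokens] if trim else tokens


raw = [tokenize(row["sentence"]) for row in rows]
trimmed = [tokenize(row["sentence"], trim=True) for row in rows]

for words in trimmed:
    print(words)
# ['hello world', 'is the', 'basic application']
# ['hdfs', 'is a', 'file system']
# ['spark', 'is engine for', 'bigdata processing']
```

    Without trimming, the first row yields `['hello world ', ' is the ', 'basic application']`, so the choice affects any downstream step that compares tokens exactly.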

     

    Design

    Properties:

    • columnToBeTokenized: Column name on which tokenization is to be done
    • delimiter: Delimiter for tokenization
    • outputColumn: Output column name for tokenized data


    Input JSON:

    {
        "name": "Tokenizer",
        "plugin": {
            "name": "Tokenizer",
            "type": "sparkcompute",
            "label": "Tokenizer",
            "properties": {
                "columnToBeTokenized": "sentence",
                "delimiter": "/",
                "outputColumn": "words"
            }
        }
    }
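    A stage configuration of this shape can be checked before it is used. The following Python sketch parses the JSON with the standard `json` module and verifies that the three properties listed under Properties are present and non-empty. The function `validate_tokenizer_stage` is hypothetical and is not part of CDAP; it only illustrates the check. A strict JSON parser rejects a trailing comma after the last property, and a key with a leading space (such as " delimiter") is a different key from "delimiter".

```python
import json

# Property names taken from the Properties list in this design.
REQUIRED_PROPERTIES = ("columnToBeTokenized", "delimiter", "outputColumn")


def validate_tokenizer_stage(stage_json: str) -> dict:
    """Hypothetical check: return the plugin properties or raise ValueError."""
    stage = json.loads(stage_json)  # raises json.JSONDecodeError on invalid JSON
    properties = stage["plugin"]["properties"]
    # Treat absent or empty values as missing.
    missing = [name for name in REQUIRED_PROPERTIES if not properties.get(name)]
    if missing:
        raise ValueError("missing required properties: " + ", ".join(missing))
    return properties


stage_json = """
{
    "name": "Tokenizer",
    "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
            "columnToBeTokenized": "sentence",
            "delimiter": "/",
            "outputColumn": "words"
        }
    }
}
"""

properties = validate_tokenizer_stage(stage_json)
print(properties["columnToBeTokenized"], properties["delimiter"], properties["outputColumn"])
# sentence / words
```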

     

    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature