
Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

 

Use-Case

  • Tokenize the data on the basis of a delimiter
  • The tokenized output will be an array of delimited tokens
  • For example, if you want a sentence broken into word tokens (see the sketch after this list):
    • Source field name: e.g. sentence (Type: String)
    • Target field name: e.g. words (Type: String[])
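
For illustration, a minimal Java sketch of that field-level behavior; the split-based logic (and treating the delimiter as a literal via Pattern.quote) is an assumption about the implementation, not the actual plugin source:

import java.util.Arrays;
import java.util.regex.Pattern;

public class TokenizeFieldSketch {
    public static void main(String[] args) {
        String sentence = "Hello world / is the /basic application"; // source field (Type: String)
        String delimiter = "/";                                      // user-supplied delimiter

        // Split on the literal delimiter; String.split takes a regex,
        // so the delimiter is quoted first.
        String[] words = sentence.split(Pattern.quote(delimiter));   // target field (Type: String[])

        System.out.println(Arrays.toString(words));
        // [Hello world ,  is the , basic application]
    }
}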

User Stories

  • User should be able to specify the column name on which tokenization is to be done.
  • User should be able to specify the output column name.
  • User should be able to specify the delimiter to be used by the Tokenizer.

Conditions

  • The source field must be of type string.
  • The user can tokenize only a single column from the source schema.
  • The output schema will have a single column of type string array.
  • Tokenized data will be converted to lower case (see the sketch after this list).
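
These conditions could be expressed roughly as below, assuming the co.cask.cdap Schema API; this is a sketch of the implied behavior, not the actual plugin code:

import co.cask.cdap.api.data.schema.Schema;
import java.util.regex.Pattern;

public class TokenizerConditionsSketch {

    // Single output column of type string array, per the conditions above.
    static final Schema OUTPUT_SCHEMA = Schema.recordOf(
        "output",
        Schema.Field.of("words", Schema.arrayOf(Schema.of(Schema.Type.STRING))));

    // Lower-case before splitting, per the last condition.
    static String[] tokenize(String value, String delimiter) {
        return value.toLowerCase().split(Pattern.quote(delimiter));
    }
}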

Example

Input source:

topic  | sentence
-------|------------------------------------------
Java   | Hello world / is the /basic application
HDFS   | HDFS/ is a /file system
Spark  | Spark /is engine for /bigdata processing

Tokenizer:

  • User wants to tokenize the sentence data using “/” as a delimiter.
  • Mandatory inputs from the user:
    • Column on which tokenization is to be done: “sentence”
    • Delimiter for tokenization: “/”
    • Output column name for tokenized data: “words”
  • The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data in “words” in the output.

Output:

words
------------------------------------------
[hello world, is the, basic application]
[hdfs, is a, file system]
[spark, is engine for, bigdata processing]
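
A standalone Java sketch that reproduces the walkthrough above (independent of the plugin itself; tokens keep the whitespace that surrounds the delimiter, since the spec does not mention trimming):

import java.util.Arrays;
import java.util.regex.Pattern;

public class TokenizerExample {
    public static void main(String[] args) {
        String[] sentences = {
            "Hello world / is the /basic application",
            "HDFS/ is a /file system",
            "Spark /is engine for /bigdata processing"
        };
        String delimiter = "/";

        for (String sentence : sentences) {
            // Lower-case, then split on the literal delimiter.
            String[] words = sentence.toLowerCase().split(Pattern.quote(delimiter));
            System.out.println(Arrays.toString(words));
        }
        // [hello world ,  is the , basic application]
        // [hdfs,  is a , file system]
        // [spark , is engine for , bigdata processing]
    }
}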

 

Design

Properties:

  • columnToBeTokenized: Column name on which tokenization is to be done
  • delimiter: Delimiter for tokenization
  • outputColumn: Output column name for tokenized data
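
These properties would map to a plugin config class along the following lines; this is a sketch assuming the co.cask.cdap PluginConfig API, not the actual plugin source:

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.plugin.PluginConfig;

// Sketch of a config class for the three properties above.
public class TokenizerConfig extends PluginConfig {

    @Description("Column name on which tokenization is to be done")
    private String columnToBeTokenized;

    @Description("Delimiter for tokenization")
    private String delimiter;

    @Description("Output column name for tokenized data")
    private String outputColumn;
}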


Input JSON:

{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "delimiter": "/",
      "outputColumn": "words"
    }
  }
}
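
Given that configuration, the record-level work of the stage might look roughly like this, assuming the Hydrator SparkCompute API; everything beyond the names in the JSON above is an assumption, and a real implementation would read the hard-coded values from the config class:

import java.util.Arrays;
import java.util.regex.Pattern;

import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.etl.api.batch.SparkCompute;
import co.cask.cdap.etl.api.batch.SparkExecutionPluginContext;
import org.apache.spark.api.java.JavaRDD;

// Sketch only: hard-codes the property values from the JSON above.
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Tokenizer")
public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {

    private static final Schema OUTPUT_SCHEMA = Schema.recordOf(
        "output",
        Schema.Field.of("words", Schema.arrayOf(Schema.of(Schema.Type.STRING))));

    @Override
    public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                               JavaRDD<StructuredRecord> input) {
        return input.map(record -> {
            String sentence = record.get("sentence");                            // columnToBeTokenized
            String[] tokens = sentence.toLowerCase().split(Pattern.quote("/"));  // delimiter
            return StructuredRecord.builder(OUTPUT_SCHEMA)
                .set("words", Arrays.asList(tokens))                             // outputColumn
                .build();
        });
    }
}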

 


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature