
 

Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.
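
For illustration, a minimal sketch of delimiter-based tokenization in plain Java; the sentence and delimiter are taken from the example further down this page, and the class name is purely illustrative:

import java.util.regex.Pattern;

public class TokenizeSketch {
    public static void main(String[] args) {
        String sentence = "Hello world / is the /basic application";
        String delimiter = "/";
        // String.split expects a regex, so quote the delimiter to treat it literally.
        String[] words = sentence.split(Pattern.quote(delimiter));
        // words -> ["Hello world ", " is the ", "basic application"]
        for (String word : words) {
            System.out.println(word);
        }
    }
}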

 

Use-Case

A sentence needs to be broken into tokens of words:

  • Source field name: e.g. sentence (Type: String)
  • Target field name: e.g. words (Type: String[])

Conditions

  • The source field must be of type string
  • Only a single column from the source can be tokenized
  • The output schema will have a single column of type string array (see the schema sketch below)
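
As a hedged sketch of that last condition, the output schema could be expressed with CDAP's Schema API roughly as follows (the record and method names here are illustrative, and the exact package prefix depends on the CDAP version):

import co.cask.cdap.api.data.schema.Schema;

public class OutputSchemaSketch {
    // Single output column of type string array, as described above.
    static Schema outputSchema(String outputColumn) {
        return Schema.recordOf(
            "tokenizerOutput",
            Schema.Field.of(outputColumn, Schema.arrayOf(Schema.of(Schema.Type.STRING))));
    }
}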

Options

The following mandatory inputs must be provided by the user to configure the plugin; a sketch of how they could map to a plugin configuration class follows this list:

  • Column name on which tokenization is to be performed
  • Delimiter for tokenization
  • Output column name for the tokenized data
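
A hedged sketch of how these three options could be modeled as a plugin configuration class, assuming CDAP's PluginConfig and Description annotation; the field names mirror the properties in the design JSON below, and the exact package prefix depends on the CDAP version:

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.plugin.PluginConfig;

public class TokenizerConfig extends PluginConfig {

    // Fields are populated by CDAP from the plugin properties; kept package-private
    // so the plugin class sketched under Design can read them directly.

    @Description("Column name on which tokenization is to be performed")
    String columnToBeTokenized;

    @Description("Delimiter for tokenization")
    String delimiter;

    @Description("Output column name for the tokenized data")
    String outputColumn;
}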

Example

 

Input source:

topic | sentence
Java  | Hello world / is the /basic application
HDFS  | HDFS/ is a /file system
Spark | Spark /is an engine for /bigdata processing

Tokenizer:

The user wants to tokenize the "sentence" data using "/" as the delimiter.

Mandatory inputs from the user:

Column on which tokenization is to be performed: "sentence"

Delimiter for tokenization: "/"

Output column name for tokenized data: "words"

The Tokenizer plugin will tokenize the "sentence" data from the input source and place the tokenized data in the "words" column of the output.

Output:

words
{Hello world, is the, basic application}
{HDFS, is a, file system}
{Spark, is an engine for, bigdata processing}
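
The same example can be reproduced outside the plugin with Spark ML's RegexTokenizer; this is only a hedged sketch of one possible way to get the behaviour shown above, not the plugin's actual code:

import java.util.Arrays;

import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class TokenizerSparkExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("TokenizerSparkExample")
            .master("local[*]")
            .getOrCreate();

        // The input source from the example above: (topic, sentence) rows.
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("topic", DataTypes.StringType, false),
            DataTypes.createStructField("sentence", DataTypes.StringType, false)));
        Dataset<Row> input = spark.createDataFrame(Arrays.asList(
            RowFactory.create("Java", "Hello world / is the /basic application"),
            RowFactory.create("HDFS", "HDFS/ is a /file system"),
            RowFactory.create("Spark", "Spark /is an engine for /bigdata processing")),
            schema);

        // Tokenize the "sentence" column on "/" into a "words" array column.
        RegexTokenizer tokenizer = new RegexTokenizer()
            .setInputCol("sentence")
            .setOutputCol("words")
            .setPattern("/")          // the delimiter, interpreted as a regular expression
            .setToLowercase(false);   // keep the original casing shown in the output above

        Dataset<Row> output = tokenizer.transform(input);
        output.select("words").show(false);

        spark.stop();
    }
}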

 

Design

Input JSON:

{
    "name": "Tokenizer",
    "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
            "columnToBeTokenized": "sentence",
            "delimiter": "/",
            "outputColumn": "words"
        }
    }
}
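
For completeness, a hedged sketch of the SparkCompute plugin class this JSON would configure, assuming the CDAP Hydrator SparkCompute API (co.cask.cdap.etl.api.batch) and reusing the TokenizerConfig sketched under Options (both classes assumed to be in the same package). The class and member names are assumptions based on the properties above, not the actual implementation:

import java.util.Arrays;
import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaRDD;

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.etl.api.batch.SparkCompute;
import co.cask.cdap.etl.api.batch.SparkExecutionPluginContext;

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("Tokenizer")
@Description("Tokenizes one string column into a string array column using a delimiter.")
public class Tokenizer extends SparkCompute<StructuredRecord, StructuredRecord> {

    private final TokenizerConfig config;   // the config class sketched under Options

    public Tokenizer(TokenizerConfig config) {
        this.config = config;
    }

    @Override
    public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                               JavaRDD<StructuredRecord> input) throws Exception {
        // Capture plain strings so the Spark closure does not serialize the whole plugin.
        final String column = config.columnToBeTokenized;
        final String delimiter = config.delimiter;
        final String outputColumn = config.outputColumn;

        return input.map(record -> {
            // Single output column of type string array; built per record for brevity.
            Schema outputSchema = Schema.recordOf(
                "output",
                Schema.Field.of(outputColumn, Schema.arrayOf(Schema.of(Schema.Type.STRING))));
            String value = record.get(column);
            String[] tokens = value.split(Pattern.quote(delimiter));
            return StructuredRecord.builder(outputSchema)
                .set(outputColumn, Arrays.asList(tokens))
                .build();
        });
    }
}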

 

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature