
 

Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

Use-Case

You want a sentence to be broken into tokens (words):

  • Source field name: e.g. sentence (Type: String)
  • Target field name: e.g. words (Type: String[])

Conditions

  • The source field can only be of type string.
  • The user can tokenize only a single column from the source.
  • The output schema will have a single column of type string array.

    Options

    The following mandatory inputs are provided for the user to configure:

    • Name of the column on which tokenization is to be done
    • Delimiter for tokenization
    • Output column name for the tokenized data
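The conditions and mandatory inputs above could be checked before any data is processed. The following is a minimal sketch in Python, assuming a hypothetical schema represented as a dict of field name to type name. The function name `validate_config` and the error messages are illustrative and are not part of the plugin.

```python
def validate_config(schema, column, delimiter, output_column):
    """Return a list of error messages; an empty list means the inputs are usable.

    `schema` is a hypothetical mapping of source field name to type name.
    """
    errors = []
    if not column:
        errors.append("column to tokenize is required")
    elif column not in schema:
        errors.append(f"column '{column}' is not in the source schema")
    elif schema[column] != "string":
        # Condition from the design: only string fields can be tokenized.
        errors.append(
            f"column '{column}' must be of type string, got {schema[column]}"
        )
    if not delimiter:
        errors.append("delimiter is required")
    if not output_column:
        errors.append("output column name is required")
    return errors


source_schema = {"topic": "string", "sentence": "string", "length": "int"}
ok = validate_config(source_schema, "sentence", " ", "words")
bad = validate_config(source_schema, "length", "", "words")
print(ok)
print(bad)
```

Here `ok` is an empty list, while `bad` reports both the non-string column and the missing delimiter.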

    Example

     

    • The user wants to extract hashtags from Twitter feeds. The user tokenizes each sentence on spaces and can then identify the words that start with a hashtag (#).

      Input source:

      topic | sentence
      cask  | cask is #data application #platform

      Tokenizer:

        • The user wants to tokenize the sentence data using a space (“ ”) as the delimiter pattern

      Output:

      topic | sentence                            | words
      cask  | cask is #data application #platform | [cask,is,#data,application,#platform]
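The hashtag use case can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's implementation: the helper names `tokenize` and `hashtags` are hypothetical, and dropping empty tokens (produced by repeated delimiters) is an assumption that the design does not specify.

```python
def tokenize(sentence, delimiter):
    # Split on the literal delimiter; drop empty tokens (assumption).
    return [token for token in sentence.split(delimiter) if token]


def hashtags(tokens):
    # Keep only the tokens that start with '#'.
    return [token for token in tokens if token.startswith("#")]


words = tokenize("cask is #data application #platform", " ")
tags = hashtags(words)
print(words)  # ['cask', 'is', '#data', 'application', '#platform']
print(tags)   # ['#data', '#platform']
```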


    User Stories

    • As a Hydrator user, I want to tokenize the data in a column from the source schema and output the tokens into an output schema that has a single column holding the tokenized data.
    • As a Hydrator user, I want a configuration for specifying the column name from the input schema on which tokenization has to be performed.
    • As a Hydrator user, I want a configuration for specifying the delimiter to be used for tokenization.
    • As a Hydrator user, I want a configuration for specifying the output column name in which the tokenized data will be emitted.

    Conditions

    • The source field to be tokenized can only be of type string.
    • The user can tokenize only a single column from the source schema.

    Example

    Input source:

    topic | sentence
    Java  | Hello world / is the /basic application
    HDFS  | HDFS/ is a /file system
    Spark | Spark /is an engine for /bigdata processing

    Tokenizer:

      • The user wants to tokenize the sentence data using “/” as the delimiter.
      • Mandatory inputs from the user:
          • Column on which tokenization is to be done: “sentence”
          • Delimiter for tokenization: “/”
          • Output column name for the tokenized data: “words”
      • The Tokenizer plugin will tokenize the “sentence” data from the input source and put the tokenized data in “words” in the output.

    Output:

    topic | sentence                                    | words
    Java  | Hello world / is the /basic application     | [hello world, is the, basic application]
    HDFS  | HDFS/ is a /file system                     | [hdfs, is a, file system]
    Spark | Spark /is an engine for /bigdata processing | [spark, is an engine for, bigdata processing]
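The example above can be reproduced with a short Python sketch that takes the three mandatory inputs (column, delimiter, output column). It is an illustration only, not the plugin code. The function name `tokenize_records` is hypothetical. Lowercasing is inferred from the example output, and trimming whitespace around each token is an assumption, since the design does not say how surrounding spaces are handled.

```python
def tokenize_records(records, column, delimiter, output_column):
    """Return copies of the records with the tokens of `column` in `output_column`."""
    result = []
    for record in records:
        tokens = [
            part.strip().lower()  # assumptions: trim whitespace, lowercase
            for part in record[column].split(delimiter)
        ]
        # Keep the input fields and add the new token list.
        result.append({**record, output_column: tokens})
    return result


rows = [
    {"topic": "Java", "sentence": "Hello world / is the /basic application"},
    {"topic": "HDFS", "sentence": "HDFS/ is a /file system"},
    {"topic": "Spark", "sentence": "Spark /is an engine for /bigdata processing"},
]
out = tokenize_records(rows, "sentence", "/", "words")
for row in out:
    print(row["topic"], row["words"])
```

Each output record keeps its `topic` and `sentence` fields and gains a `words` list, matching the rows of the output table.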

     


    Design

    This is a plugin of type sparkcompute and is meant to work with Spark only.

    Properties:

    • columnToBeTokenized: Name of the column on which tokenization is to be done
    • patternSeparator: Pattern separator (delimiter) used for tokenization
    • outputColumn: Output column name for the tokenized data


    Input JSON:

    {
      "name": "Tokenizer",
      "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
          "columnToBeTokenized": "sentence",
          "patternSeparator": "/",
          "outputColumn": "words"
        }
      }
    }
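To make the shape of this configuration concrete, the sketch below parses a stage definition like the one above and checks that the three properties listed under Properties are present. It is a hedged illustration in Python: the helper `read_properties` and its error messages are hypothetical and do not reflect how the platform itself loads plugin configuration.

```python
import json

# Property names taken from the Properties list in this design.
REQUIRED = ("columnToBeTokenized", "patternSeparator", "outputColumn")

STAGE_JSON = """
{
  "name": "Tokenizer",
  "plugin": {
    "name": "Tokenizer",
    "type": "sparkcompute",
    "label": "Tokenizer",
    "properties": {
      "columnToBeTokenized": "sentence",
      "patternSeparator": "/",
      "outputColumn": "words"
    }
  }
}
"""


def read_properties(stage_json):
    """Parse a stage definition and return its three mandatory properties."""
    plugin = json.loads(stage_json)["plugin"]
    if plugin["type"] != "sparkcompute":
        raise ValueError(f"unexpected plugin type: {plugin['type']}")
    properties = plugin.get("properties", {})
    missing = [key for key in REQUIRED if not properties.get(key)]
    if missing:
        raise ValueError(f"missing mandatory properties: {missing}")
    return {key: properties[key] for key in REQUIRED}


props = read_properties(STAGE_JSON)
print(props)
```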

     


    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature