
Introduction


Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words) on the basis of a delimiter.

Use-Case

  • Tokenize the data on the basis of a delimiter.
  • The tokenized output will be an array of delimited tokens.
  • Use this plugin when a sentence should be broken into tokens of words.
  • Source field name: e.g. sentence (type: String)
  • Target field name: e.g. words (type: String[])

    User Stories

    • User should be able to specify the column name on which tokenization is to be done.
    • User should be able to specify the output column name.
    • User should be able to specify the delimiter which will be used by Tokenizer.

    Conditions

    • Source field

      The user wants to extract hashtags from Twitter feeds. The user tokenizes the sentence on spaces and can then identify the words that start with "#".

      Input source:

        topic | sentence
        ------+------------------------------------
        cask  | cask is #data application #platform

      Tokenizer:

        • The user wants to tokenize the sentence data using a space (" ") as the pattern.

      Output:

        topic | sentence                            | words
        ------+-------------------------------------+------------------------------------------
        cask  | cask is #data application #platform | [cask, is, #data, application, #platform]
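    The space-delimited case above can be sketched in plain Java. This is only an illustration of the expected behavior, not the plugin's actual code; the class and method names are hypothetical:

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class SpaceTokenizerSketch {
        // Split a sentence into tokens using a single space as the delimiter.
        static List<String> tokenize(String sentence) {
            return Arrays.asList(sentence.split(" "));
        }

        public static void main(String[] args) {
            List<String> words = tokenize("cask is #data application #platform");
            System.out.println(words);
            // prints [cask, is, #data, application, #platform]

            // Downstream, hashtags can be identified by filtering the tokens.
            List<String> hashtags = words.stream()
                    .filter(w -> w.startsWith("#"))
                    .collect(Collectors.toList());
            System.out.println(hashtags);
            // prints [#data, #platform]
        }
    }
    ```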


    User Stories

    • As a Hydrator user, I want to tokenize the data in a column from the source schema and output the tokens into the output schema, which will have a single column holding the tokenized data.
    • As a Hydrator user, I want configuration for specifying the column name from the input schema on which tokenization has to be performed.
    • As a Hydrator user, I want configuration to specify the delimiter to be used for tokenization.
    • As a Hydrator user, I want configuration to specify the output column name into which tokenized data will be emitted.

    Conditions

    • The source field to be tokenized can only be of type string.
    • The user can tokenize only a single column from the source schema.
    • The output schema will have a single column of type string array.

    Example

    Input source:

      topic | sentence
      ------+---------------------------------------------
      Java  | Hello world / is the /basic application
      HDFS  | HDFS/ is a /file system
      Spark | Spark /is an engine for /bigdata processing

    Tokenizer:

      • The user wants to tokenize the sentence data using "/" as the delimiter.
      • Mandatory inputs from the user:
        • Column on which tokenization is to be done: "sentence"
        • Delimiter for tokenization: "/"
        • Output column name for tokenized data: "words"
      • The Tokenizer plugin will tokenize the "sentence" data from the input source and put the tokenized data into "words" in the output.

    Output:

    topic | sentence                                    | words
    ------+---------------------------------------------+----------------------------------------------
    Java  | Hello world / is the /basic application     | [hello world, is the, basic application]
    HDFS  | HDFS/ is a /file system                     | [hdfs, is a, file system]
    Spark | Spark /is an engine for /bigdata processing | [spark, is an engine for, bigdata processing]
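    Judging from the example output, tokens are also trimmed and lower-cased. The per-row transformation might be sketched in plain Java as below; this is an assumption drawn from the example, not a statement of the plugin's implementation:

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class DelimiterTokenizerSketch {
        // Split on the delimiter, then trim surrounding whitespace and
        // lower-case each token, matching the example output above.
        // Pattern.quote treats the delimiter as a literal string.
        static List<String> tokenize(String sentence, String delimiter) {
            return Arrays.stream(sentence.split(Pattern.quote(delimiter)))
                    .map(t -> t.trim().toLowerCase())
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            System.out.println(tokenize("Hello world / is the /basic application", "/"));
            // prints [hello world, is the, basic application]
        }
    }
    ```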

     


    Design

    This is a sparkcompute type of plugin and is meant to work only with Spark pipelines.

    Properties:

    • columnToBeTokenized: Column name on which tokenization is to be done
    • patternSeparator: Pattern (delimiter) used to separate tokens
    • outputColumn: Output column name for tokenized data
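    A hedged sketch of how these three properties could drive the per-record logic, in plain Java (this is not the actual plugin class; the config holder and method names are illustrative):

    ```java
    import java.util.Arrays;
    import java.util.List;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class TokenizerConfigSketch {
        // Hypothetical config holder mirroring the three plugin properties.
        final String columnToBeTokenized;
        final String patternSeparator;
        final String outputColumn;

        TokenizerConfigSketch(String column, String separator, String output) {
            this.columnToBeTokenized = column;
            this.patternSeparator = separator;
            this.outputColumn = output;
        }

        // Apply the configured separator to one value of the source column.
        // Pattern.quote is used here on the assumption that the separator is
        // treated literally, so values like "|" or "." do not act as regex
        // metacharacters.
        List<String> apply(String value) {
            return Arrays.stream(value.split(Pattern.quote(patternSeparator)))
                         .map(String::trim)
                         .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            TokenizerConfigSketch cfg = new TokenizerConfigSketch("sentence", "/", "words");
            System.out.println(cfg.apply("HDFS/ is a /file system"));
            // prints [HDFS, is a, file system]
        }
    }
    ```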


    Input JSON:

    {
      "name": "Tokenizer",
      "plugin": {
        "name": "Tokenizer",
        "type": "sparkcompute",
        "label": "Tokenizer",
        "properties": {
          "columnToBeTokenized": "sentence",
          "patternSeparator": "/",
          "outputColumn": "words"
        }
      }
    }

     


    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature