Introduction


An n-gram is a sequence of n tokens (typically words) for some integer n.

The NGramTransform plugin is used to transform input features into n-grams.

Use-Case

  • User wants bigrams from text so that they can be used for statistical analysis.
    Example: User wants to tokenize the text on '.' and create bigrams (a minimal sketch follows this example).
    Text: hello my friend.how are you today bye
    Output for bigrams:
    hello my
    my friend
    how are
    are you
    you today
    today bye
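
    A minimal plain-Scala sketch of this use-case (illustrative only, not the plugin itself; the variable names are assumptions):

        // Tokenize on '.', split each piece into words, then emit word bigrams,
        // mirroring the output listed above.
        val text = "hello my friend.how are you today bye"

        val bigrams = text.split('.')
          .flatMap(sentence => sentence.trim.split("\\s+").toSeq.sliding(2))
          .filter(_.size == 2)          // a piece with fewer than 2 words yields no bigrams
          .map(_.mkString(" "))

        bigrams.foreach(println)
        // hello my
        // my friend
        // how are
        // are you
        // you today
        // today bye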

User Stories

  • A bio data scientist wants to study the sequence of nucleotides in an input stream of DNA sequences in order to identify the bonds.
    The input stream contains a DNA sequence, e.g. AGCTTCGA. The output contains the bigram sequence AG, GC, CT, TT, TC, CG, GA (a short sketch follows the output table below).

    Input source: 

    DNASequence
    AGCTTCGA

    Mandatory inputs from user for NGramTransform:

    • Field to be used to transform input features into n-grams: "DNASequence"
    • Number of terms in each n-gram: "2"
    • Transformed field for sequence of n-grams: "bigram"
    • Tokenization unit used to tokenize the input string before n-grams are created: "Character"

    Output: 

    DNASequence | bigram
    AGCTTCGA    | [AG, GC, CT, TT, TC, CG, GA]
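
    For intuition, a minimal plain-Scala sketch of the character-level case (illustrative only, not the plugin code):

        // "Character" tokenization: every character is a token; take windows of 2.
        val dna = "AGCTTCGA"
        val bigrams = dna.sliding(2).toSeq
        // Seq("AG", "GC", "CT", "TT", "TC", "CG", "GA")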

 


User Stories

 

  • As a Hydrator user, I want to transform input feature data from a column in the source schema into an output schema that contains a single column holding the n-gram data.
  • As a Hydrator user, I want a configuration for specifying the column name from the input schema on which the transformation is to be performed.
  • As a Hydrator user, I want a configuration to specify the number of terms to be used when transforming input features into n-grams.
  • As a Hydrator user, I want a configuration to specify the output column name to which the n-grams will be emitted.
  • As a Hydrator user, I want to specify the tokenization unit with which the input is tokenized before it is converted to n-grams.

Conditions

  • The source field to be transformed can only be of type string array.
  • The user can transform only a single field from the source schema.
  • The output schema will have a single field of type string array.
  • If the input sequence contains fewer than n strings, no output is produced.

Example

Input source:

topic | tokens
Java  | [hi, i, heard, about, spark]
HDFS  | [hdfs, is, file, system]
Spark | [spark, is, an, engine]

NGramTransform:

Mandatory inputs from user:

  • Field to be used to transform input features into n-grams: "tokens"
  • Number of terms in each n-gram: "2"
  • Transformed field for sequence of n-grams: "ngrams"

    Output:

    ngrams

    [hi i, i heard, heard about, about spark]
    [hdfs is, is file, file system]
    [spark is, is an, an engine]
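
    A hedged sketch of the same transformation written directly against Spark ML's NGram transformer (an assumption: the design above does not say the plugin wraps this class, but it provides the same behavior; column names match the example). Note that Spark ML joins the terms of each n-gram with a single space:

        import org.apache.spark.ml.feature.NGram
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("ngram-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        val df = Seq(
          ("Java",  Seq("hi", "i", "heard", "about", "spark")),
          ("HDFS",  Seq("hdfs", "is", "file", "system")),
          ("Spark", Seq("spark", "is", "an", "engine"))
        ).toDF("topic", "tokens")

        val ngram = new NGram()
          .setN(2)                 // number of terms in each n-gram
          .setInputCol("tokens")   // field to be transformed
          .setOutputCol("ngrams")  // transformed output field

        ngram.transform(df).select("ngrams").show(truncate = false)
        // [hi i, i heard, heard about, about spark]
        // [hdfs is, is file, file system]
        // [spark is, is an, an engine]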


    End-to-End Example Pipeline:

    Stream -> Tokenizer -> NGramTransform -> TPFSAvro

     

    Input source:

     

    topic | sentence
    java  | hi i heard about spark
    HDFS  | hdfs is a file system
    Spark | spark is an engine
    Tokenizer:

    Mandatory inputs from user:

      • Column on which tokenization is to be done: "sentence"
      • Delimiter for tokenization: " "
      • Output column name for tokenized data: "tokens"


    NGramTransform:

    Mandatory inputs from user:

      • Field to be used to transform input features into n-grams: "tokens"
      • Number of terms in each n-gram: "2"
      • Transformed field for sequence of n-grams: "ngrams"
      • Tokenization unit: "word"

    TPFSAvro Output

    topic | sentence               | ngrams
    java  | hi i heard about spark | [hi i, i heard, heard about, about spark]
    HDFS  | hdfs is a file system  | [hdfs is, is a, a file, file system]
    Spark | spark is an engine     | [spark is, is an, an engine]
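
    An illustrative end-to-end sketch of the Tokenizer -> NGramTransform steps using plain Spark ML as a stand-in for the Hydrator pipeline (stage and column names follow the configuration above; the TPFSAvro sink is not modelled):

        import org.apache.spark.ml.feature.{NGram, Tokenizer}
        import org.apache.spark.sql.SparkSession

        val spark = SparkSession.builder().appName("end-to-end-sketch").master("local[*]").getOrCreate()
        import spark.implicits._

        val source = Seq(
          ("java",  "hi i heard about spark"),
          ("HDFS",  "hdfs is a file system"),
          ("Spark", "spark is an engine")
        ).toDF("topic", "sentence")

        // Tokenizer stage: split "sentence" on whitespace into "tokens".
        val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("tokens")

        // NGramTransform stage: word bigrams from "tokens" into "ngrams".
        val ngram = new NGram().setN(2).setInputCol("tokens").setOutputCol("ngrams")

        val result = ngram.transform(tokenizer.transform(source))
        result.select("topic", "sentence", "ngrams").show(truncate = false)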

     

    Design

    This is a sparkcompute type of plugin and is meant to work with Spark only.

    Properties:

    • fieldToBeTransformed: Column to be used to transform input features into n-grams.
    • numberOfTerms: Number of terms in each n-gram.
    • outputField: Transformed column for the sequence of n-grams.
    • tokenizationUnit: Unit into which the input string will be tokenized (e.g. "word" or "Character").

    Input JSON:

    {
      "name": "NGramTransform",
      "type": "sparkcompute",
      "properties": {
        "fieldToBeTransformed": "tokens",
        "numberOfTerms": "2",
        "tokenizationUnit": "word",
        "outputField": "ngrams"
      }
    }
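
    As a hedged sketch of how these properties might map onto the underlying Spark transformation (the config class and helper below are hypothetical, not the actual plugin code):

        import org.apache.spark.ml.feature.NGram

        // Hypothetical config holder mirroring the JSON properties above.
        case class NGramTransformConfig(fieldToBeTransformed: String,
                                        numberOfTerms: Int,
                                        tokenizationUnit: String,
                                        outputField: String)

        // Hypothetical helper: build the Spark ML n-gram stage from the config.
        def buildNGram(config: NGramTransformConfig): NGram =
          new NGram()
            .setN(config.numberOfTerms)
            .setInputCol(config.fieldToBeTransformed)
            .setOutputCol(config.outputField)

        val config = NGramTransformConfig("tokens", 2, "word", "ngrams")
        val ngramStage = buildNGram(config)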



    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature

    ...