Feature Generator

Introduction 

The feature generator plugins will be used to generate text-based features from a string field.

Use-case

A user has training data in which various tweets are labeled as positive, neutral, or negative. The user wants to train a model (e.g., a decision tree) from the data, then use it to tag new tweets as positive, neutral, or negative.

User Stories

  • The user should be able to generate text-based features from a string field using HashingTF.

  • The user should be able to specify the number of features to use with HashingTF.

  • The user should be able to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

  • The user should be able to set the vector size, min count, number of partitions, number of iterations, and window size when training the skip-gram model.

  • The user should be able to set which FileSet and path to use when storing the skip-gram model.

  • The user should be able to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

  • The user should be able to use the generated features to train a model or for prediction (see the sketch below).
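For illustration only, the following is a minimal sketch of how generated features might feed a downstream classifier. It assumes a DataFrame with a string label column "sentiment" and a vector column "result" produced by one of the generators; these names, the helper function, and the use of a Spark ML Pipeline are assumptions, not part of the plugin API.

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// "tweets" is assumed to carry a string label column "sentiment"
// (positive/neutral/negative) and a Vector column "result" emitted by a
// feature generator plugin.
def trainSentimentModel(tweets: DataFrame): PipelineModel = {
  // Index the string label into a numeric "label" column.
  val labelIndexer = new StringIndexer()
    .setInputCol("sentiment")
    .setOutputCol("label")

  // Train a decision tree on the generated feature vectors.
  val decisionTree = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("result")

  new Pipeline().setStages(Array(labelIndexer, decisionTree)).fit(tweets)
}

The fitted PipelineModel can then be applied with transform() to tag new tweets.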

Example

Skip-Gram (Spark's Word2Vec)

The following is a simple example showing how Spark's Word2Vec can be used for text-based feature generation with a skip-gram model.

The SkipGramFeatureTrainer will fit the data for the specified input column with the parameters vectorSize: 3, minCount: 2, numPartitions: 1, numIterations: 1, and windowSize: 3, and save the resulting model into a FileSet.
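A minimal sketch, using Spark ML's Word2Vec directly, of roughly what the trainer does. The sample sentences, whitespace tokenization, column names, and save path are illustrative assumptions; note that the plugin's numIterations property corresponds to Word2Vec's setMaxIter.

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SkipGramTrainerSketch").getOrCreate()

// Illustrative training data: the string field tokenized on whitespace.
val documents = spark.createDataFrame(Seq(
  "Spark ML plugins".split(" "),
  "Classes in Java".split(" "),
  "Spark plugins in Java".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)     // vectorSize
  .setMinCount(2)       // minCount
  .setNumPartitions(1)  // numPartitions
  .setMaxIter(1)        // numIterations
  .setWindowSize(3)     // windowSize

val model = word2Vec.fit(documents)

// The plugin would write the model to the configured FileSet; a plain
// filesystem path stands in for it here.
model.save("/tmp/feature-generator/feature")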

Suppose the SkipGramFeatureGenerator receives the following input records:

offset | text
1      | Spark ML plugins
2      | Classes in Java

The SkipGramFeatureGenerator will use the saved model and generate records that contain all the input fields along with the output fields mentioned in ``outputColumnMapping``.

offset | text             | result
1      | Spark ML plugins | [0.040902843077977494, -0.010430609186490376, -0.04750693837801615]
2      | Classes in Java  | [-0.04352385476231575, 3.2448768615722656E-4, 0.02223073500208557]
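A minimal sketch of the generation step with Spark ML directly, assuming the model saved in the training sketch above and whitespace tokenization of the string field; the exact vector values depend on the trained model.

import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SkipGramGeneratorSketch").getOrCreate()

// The input records from the example above, tokenized on whitespace as the
// plugin is assumed to do for its string input column.
val records = spark.createDataFrame(Seq(
  (1L, "Spark ML plugins".split(" ")),
  (2L, "Classes in Java".split(" "))
)).toDF("offset", "text")

// Load the previously trained skip-gram model and append the "result"
// column holding the averaged word vectors for each record.
val model = Word2VecModel.load("/tmp/feature-generator/feature")
  .setInputCol("text")
  .setOutputCol("result")

model.transform(records).show(truncate = false)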

 

HashingTF Feature Generator:

Suppose the feature generator receives the following records:

offset | text
1      | Hi I heard about Spark
2      | Logistic regression models are neat

The HashingTF Feature Generator will transform column ``text`` to generate a fixed-length vector of size 10 and emit the generated sparse vector as a combination of three columns: result_size, result_indices, result_value.

offset | text                                | result_size | result_indices  | result_value
1      | Hi I heard about Spark              | 10          | [3, 6, 7, 9]    | [2.0, 1.0, 1.0, 1.0]
2      | Logistic regression models are neat | 10          | [0, 2, 4, 5, 8] | [1.0, 1.0, 1.0, 1.0, 1.0]
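A minimal sketch of the same transformation with Spark ML's Tokenizer and HashingTF, with numFeatures set to 10 as in the example; the intermediate column names are assumptions.

import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HashingTFSketch").getOrCreate()

val records = spark.createDataFrame(Seq(
  (1L, "Hi I heard about Spark"),
  (2L, "Logistic regression models are neat")
)).toDF("offset", "text")

// Tokenize the string column, then hash the tokens into a fixed-length
// (size 10) term-frequency vector.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("result")
  .setNumFeatures(10)

val featurized = hashingTF.transform(tokenizer.transform(records))
featurized.select("offset", "text", "result").show(truncate = false)
// The "result" column is a sparse vector; the plugin flattens it into
// result_size, result_indices, and result_value.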

 

Design 

SkipGramFeatureTrainer:

SparkSink to train and store a skip-gram model (Spark's Word2Vec) to use later for feature generation.

Properties:

    • fileSetName: The name of the FileSet to save the model to.
    • path: Path of the FileSet to save the model to.
    • vectorSize: The dimension of codes after transforming from words.
    • minCount: The minimum number of times a token must appear to be included in the Word2Vec model's vocabulary.
    • numPartitions: Number of partitions for sentences of words.
    • numIterations: Maximum number of iterations (>= 0).
    • windowSize: The window size (context words from [-window, window]). Default is 5.
    • inputCol: Input column used to train the skip-gram model (Spark's Word2Vec).

Input JSON Format

{
    "name": "FeatureTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "vectorSize": "3",
        "minCount": "2",
        "numPartitions": "1",
        "numIterations ": "1",
        "windowSize ": "3",
        "inputCol": "text"
    }
}

SkipGramFeatureGenerator:

SparkCompute to generate text-based features from a string field using a stored skip-gram model (Spark's Word2Vec).

The SparkCompute will emit records containing the original input fields along with the transformed columns mentioned in the ``outputColumnMapping``.

Properties:

    • fileSetName: The name of the FileSet to load the skip-gram model from.
    • path: Path of the FileSet to load the skip-gram model from.
    • outputColumnMapping: Input column to output column mapping, where each output column will contain the generated feature vector for the corresponding input field as a double array (see the parsing sketch after the example configuration below).

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "fileSetName": "feature-generator",
        "path": "feature",
        "outputColumnMapping": "text:result"
    }
}
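For illustration, a sketch of how the plugin might parse the ``outputColumnMapping`` property. The helper name and the comma-separated multi-column form are assumptions; only the single "input:output" pair appears in this document.

// Hypothetical parser for the outputColumnMapping property. A single pair
// ("text:result") is what the example uses; support for several
// comma-separated pairs is an assumption.
def parseOutputColumnMapping(mapping: String): Map[String, String] =
  mapping.split(",").map { entry =>
    val parts = entry.trim.split(":")
    require(parts.length == 2, s"Invalid mapping entry: '$entry'")
    parts(0).trim -> parts(1).trim
  }.toMap

// parseOutputColumnMapping("text:result") == Map("text" -> "result")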

 

HashingTFFeatureGenerator:

SparkCompute to generate text-based features from a string field using HashingTF.

The SparkCompute will emit records containing the original input fields along with three extra columns (representing the sparse vector of the value) for every transformed column mentioned in the ``outputColumnMapping``.

Properties:

    • numFeatures: Number of features to be used for HashingTF.
    • outputColumnMapping: Input column to output column mapping, where for each input column the output will contain 3 corresponding fields: <output>_size, <output>_indices, and <output>_value. The 3 columns combined give the sparse vector value for the input column (see the sketch after the example configuration below).

Input JSON Format

{
    "name": "FeatureGenerator",
    "type": "sparkcompute",
    "properties": {
        "numFeatures": "16"
        "outputColumnMapping": "text:result"
    }
}
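A sketch of how the three output fields could be derived from the HashingTF sparse vector; the helper name and return shape are assumptions.

import org.apache.spark.ml.linalg.Vector

// Hypothetical helper: flatten a sparse vector into the three values that
// back <output>_size, <output>_indices, and <output>_value.
def toSparseFields(vector: Vector): (Int, Seq[Int], Seq[Double]) = {
  val sparse = vector.toSparse
  (sparse.size, sparse.indices.toSeq, sparse.values.toSeq)
}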


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature