Introduction 

Spark plugins that train and classify data using Multinomial/Binary Logistic Regression.

Use-case

The plugin should support the following use case:

A manager wants to predict whether a customer will leave a tip, based on features of hotel food order data.

For this purpose, the manager wants to train a tip classifier on an order data feed, using the Starter and Dessert fields as features and the Tip field as the label.

Label → Tip: 1.0 if a tip was given, 0.0 otherwise.

Feature → {Starter, Dessert}
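
As a rough illustration of the label/feature split described above, the following Python sketch turns one order record into a (label, feature vector) pair. The function name `to_labeled_point` and the sample record are hypothetical and are not part of the plugin, which is expected to run on Spark.

```python
def to_labeled_point(record, feature_fields, label_field):
    """Split one order record into (label, feature vector)."""
    label = float(record[label_field])
    # The use case is binary: 1.0 means a tip was given, 0.0 means no tip.
    if label not in (0.0, 1.0):
        raise ValueError(f"label must be 0.0 or 1.0, got {label}")
    features = [float(record[name]) for name in feature_fields]
    return label, features


# Hypothetical order record: starter ordered, dessert ordered, tip given.
order = {"Starter": 1, "Dessert": 1, "Tip": 1.0}
label, features = to_labeled_point(order, ["Starter", "Dessert"], "Tip")
print(label, features)
```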

 

User Stories

  1. User should be able to train the data.
  2. User should be able to classify the test data using the model built during training.
  3. User should be able to provide the list of columns (features) to use for training.
  4. User should be able to provide the list of columns (features) to classify.
  5. User should be able to provide the column to be used as the prediction field while training/classification.
  6. User should be able to provide the number of features to be used while training/classification.
  7. User should be able to provide the number of classes to be used while training/classification.
  8. User should be able to provide the file set name to save the training model.
  9. User should be able to provide the path of the file set.

 

    Example

    Suppose the trainer plugin receives the records below to train the logistic regression model:

    Starter  Dessert  Tip
    1        0        0.0
    1        1        1.0
    0        1        0.0
    0        0        0.0


    Trained on the above records, the trainer plugin will create a logistic regression model and save it in a FileSet at the location provided by the user. The Logistic Regression Classifier will later use this model to predict the tip value.
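
To make the training step concrete, here is a minimal Python sketch that fits a binary logistic regression to the four example records. It uses plain batch gradient descent as a stand-in for the Spark trainer, and the function names, iteration count, and learning rate are assumptions chosen for illustration. Saving the model to a FileSet is left out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(points, iterations=5000, learning_rate=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    num_features = len(points[0][1])
    weights = [0.0] * num_features
    bias = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * num_features
        grad_b = 0.0
        for label, features in points:
            score = bias + sum(w * x for w, x in zip(weights, features))
            error = sigmoid(score) - label
            grad_b += error
            for i, x in enumerate(features):
                grad_w[i] += error * x
        # Average the gradient over all records before stepping.
        bias -= learning_rate * grad_b / len(points)
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(model, features):
    weights, bias = model
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 if sigmoid(score) >= 0.5 else 0.0


# (Tip, [Starter, Dessert]) for the four example records.
points = [
    (0.0, [1.0, 0.0]),
    (1.0, [1.0, 1.0]),
    (0.0, [0.0, 1.0]),
    (0.0, [0.0, 0.0]),
]
model = train(points)
print([predict(model, features) for _, features in points])
```

In this data a tip is given only when both a starter and a dessert were ordered, so the records are linearly separable and the fitted model reproduces the training labels.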

    Implementation Tips


    Design 

    Logistic Regression Trainer

    Input Json Format

    {
      "name": "LogisticRegressionTrainer",
      "type": "sparksink",
      "properties": {
        "fileSetName": "logical-regression-model",
        "path": "/home/cdap/model",
        "featureFields": "Starter,Dessert",
        "labelField": "Tip",
        "numFeatures": "2",
        "numClasses": "2"
      }
    }

     

    The plugin takes the above inputs from the user and trains the model, using the fields named in "featureFields" as features and the field named in "labelField" as the label.
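
The following Python sketch shows one way the trainer configuration above could be parsed and checked before training. The function name `parse_trainer_properties` and the validation rules (at least one feature field, a positive numFeatures, at least two classes) are assumptions for illustration; the actual plugin may validate differently.

```python
import json

TRAINER_JSON = """
{
  "name": "LogisticRegressionTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "logical-regression-model",
    "path": "/home/cdap/model",
    "featureFields": "Starter,Dessert",
    "labelField": "Tip",
    "numFeatures": "2",
    "numClasses": "2"
  }
}
"""


def parse_trainer_properties(text):
    """Parse the plugin JSON and convert string properties to typed values."""
    props = json.loads(text)["properties"]
    # featureFields arrives as one comma-separated string.
    feature_fields = [f.strip() for f in props["featureFields"].split(",") if f.strip()]
    if not feature_fields:
        raise ValueError("featureFields must name at least one field")
    # Numeric properties arrive as strings in the JSON.
    num_features = int(props["numFeatures"])
    num_classes = int(props["numClasses"])
    if num_features < 1:
        raise ValueError("numFeatures must be positive")
    if num_classes < 2:
        raise ValueError("numClasses must be at least 2")
    return {
        "fileSetName": props["fileSetName"],
        "path": props["path"],
        "featureFields": feature_fields,
        "labelField": props["labelField"],
        "numFeatures": num_features,
        "numClasses": num_classes,
    }


config = parse_trainer_properties(TRAINER_JSON)
print(config["featureFields"], config["numClasses"])
```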

    Properties:

    • fileSetName: The name of the FileSet to save the model to.
    • path: Path of the FileSet to save the model to.
    • featureFields: A comma-separated sequence of field names to be used as features for training.
    • labelField: The name of the column in the input structured record that contains the label for prediction.
    • numFeatures: The number of features to be used in HashingTF to generate features from string fields.
    • numClasses: The number of classes to be used in training the model. It should be of type integer.
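
The numFeatures property controls the length of the hashed feature vector. The Python sketch below illustrates the general hashing term-frequency idea: each string term is hashed to an index below numFeatures and its count is accumulated there. It uses `zlib.crc32` as a stand-in hash, which is an assumption; Spark's HashingTF uses its own hash function, so the indices will differ from what the plugin produces.

```python
import zlib


def hashing_tf(terms, num_features):
    """Map string terms to a fixed-length term-frequency vector."""
    vector = [0.0] * num_features
    for term in terms:
        # Stand-in hash; the bucket is the hash modulo the vector length.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] += 1.0
    return vector


# Hypothetical string field values for one order.
vector = hashing_tf(["soup", "cake", "soup"], 8)
print(vector)
```

With a small numFeatures, distinct terms can land in the same bucket, so a larger value reduces collisions at the cost of a longer vector.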

    The model generated by this plugin is then used by the Logistic Regression Classifier plugin to classify the input data.

     

    Logistic Regression Classifier

    Input Json Format

    {
      "name": "LogisticRegressionClassifier",
      "type": "sparkcompute",
      "properties": {
        "fileSetName": "logical-regression-model",
        "path": "/home/cdap/model",
        "fieldsToClassify": "Starter,Dessert",
        "predictionField": "Tip",
        "numFeatures": "2"
      }
    }

     

    The classifier plugin takes the above inputs from the user, loads the logistic regression model from the FileSet named in "fileSetName", and classifies each order record as tip or no tip.

    Properties:

    • fileSetName: The name of the FileSet that holds the model.
    • path: Path of the FileSet from which the model is retrieved.
    • fieldsToClassify: A comma-separated sequence of field names to be used as features for classification.
    • predictionField: The name of the column in which the prediction is saved.
    • numFeatures: The number of features to be used in HashingTF to generate features from string fields.
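
As a sketch of the classification step, the Python below scores one record with a binary logistic model and writes the result into the prediction field. The weights and bias are hypothetical placeholders for a model loaded from the FileSet, and the function name `classify` is an assumption rather than the plugin's API.

```python
import math


def classify(record, fields_to_classify, prediction_field, weights, bias):
    """Return a copy of the record with the predicted label added."""
    features = [float(record[name]) for name in fields_to_classify]
    score = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-score))
    output = dict(record)  # leave the input record unchanged
    output[prediction_field] = 1.0 if probability >= 0.5 else 0.0
    return output


# Hypothetical model parameters; a real run would read them from the FileSet.
weights, bias = [4.0, 4.0], -6.0
order = {"Starter": 1, "Dessert": 1}
print(classify(order, ["Starter", "Dessert"], "Tip", weights, bias))
```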

    Checklist

    •  User stories documented 
    •  User stories reviewed 
    •  Design documented 
    •  Design reviewed 
    •  Feature merged 
    •  Examples and guides 
    •  Integration tests 
    •  Documentation for feature 
    •  Short video demonstrating the feature