Introduction

Spark plugins that train a model and classify data using multinomial or binary logistic regression.

Use-case

A manager wants to predict whether a customer will leave a tip, based on features of hotel food order data.

For this purpose, the manager wants to train a tip classifier on the order data feed, using the Starter and Dessert fields as features and the Tip field as the label that records whether a tip was given.

Label → tip or no tip: 1.0 if a tip was given, 0.0 otherwise.

Feature → {Starter, Dessert}

User Stories

User should be able to train a model on the input data.
User should be able to classify the test data using the model built during training.
User should be able to provide the list of columns (features) to use for training.
User should be able to provide the list of columns (features) to classify.
User should be able to provide the column to be used as the prediction field for training/classification.
User should be able to provide the number of features to be used for training/classification.
User should be able to provide the number of classes to be used for training/classification.
User should be able to provide the name of the file set in which the trained model is saved.
User should be able to provide the path of the file set.

Example

Suppose the Trainer plugin receives the records below to train the logistic regression model:

Starter	Dessert	Tip
1	0	0.0
1	1	1.0
0	1	0.0
0	0	0.0

After training on the above records, the Trainer plugin saves the model in a file set. The Logistic Regression Classifier later uses the saved model to predict the tip value.
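The train, save, and reuse flow described above can be sketched in plain Python. This is only an illustration under stated assumptions: batch gradient descent stands in for whatever optimizer the plugin actually uses, a JSON file in a temporary directory stands in for the file set, and all function names (train, save_model, load_model, predict) are hypothetical.

```python
import json
import math
import os
import tempfile

# Training records from the example table: (Starter, Dessert) -> Tip.
RECORDS = [
    ((1.0, 0.0), 0.0),
    ((1.0, 1.0), 1.0),
    ((0.0, 1.0), 0.0),
    ((0.0, 0.0), 0.0),
]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(records, learning_rate=0.5, iterations=5000):
    """Fit a binary logistic regression model with batch gradient descent."""
    num_features = len(records[0][0])
    weights = [0.0] * num_features
    bias = 0.0
    n = len(records)
    for _ in range(iterations):
        grad_w = [0.0] * num_features
        grad_b = 0.0
        for features, label in records:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label  # derivative of log loss w.r.t. score
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return {"weights": weights, "bias": bias}


def predict(model, features):
    """Return 1.0 (tip) when the predicted probability is at least 0.5."""
    score = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    return 1.0 if sigmoid(score) >= 0.5 else 0.0


def save_model(model, directory, name="model.json"):
    """Hypothetical stand-in for writing the model to a file set."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        json.dump(model, handle)
    return path


def load_model(path):
    """Hypothetical stand-in for the classifier reading the saved model."""
    with open(path) as handle:
        return json.load(handle)


def train_save_and_classify():
    with tempfile.TemporaryDirectory() as directory:
        path = save_model(train(RECORDS), directory)
        model = load_model(path)
        return [predict(model, features) for features, _ in RECORDS]


if __name__ == "__main__":
    print(train_save_and_classify())
```

The four example records are linearly separable (a tip occurs only when both Starter and Dessert are ordered), so the reloaded model reproduces the Tip column when applied to the training records.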

Implementation Tips

Design

Logistic Regression Trainer

Input JSON Format

{
  "name": "LogisticRegressionTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "logical-regression-model",
    "path": "/home/cdap/model",
    "fieldsToClassify": "Starter,Dessert",
    "predictionField": "Tip",
    "numFeatures": "2",
    "numClasses": "2"
  }
}

The plugin takes the inputs above from the user and trains the model, using the "fieldsToClassify" columns as features and the "predictionField" column as the label.

"fieldsToClassify" can include multiple columns of the input structured record as features.

"predictionField" should be the column of the input structured record whose values are treated as labels for prediction.

The model generated by this plugin is then used by the Logistic Regression Classifier plugin to classify the input data.
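As a rough sketch of how "fieldsToClassify" and "predictionField" could map an input record to a label and a feature vector, the snippet below parses the configuration shown earlier and applies it to one record. The function name to_labeled_point and the dictionary representation of a record are assumptions made for illustration; they are not the plugin's API.

```python
import json

# Plugin configuration copied from the "Input JSON Format" example.
CONFIG_JSON = """
{
  "name": "LogisticRegressionTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "logical-regression-model",
    "path": "/home/cdap/model",
    "fieldsToClassify": "Starter,Dessert",
    "predictionField": "Tip",
    "numFeatures": "2",
    "numClasses": "2"
  }
}
"""


def to_labeled_point(record, properties):
    """Split a record into (label, features) using the plugin properties.

    Hypothetical helper: a record is modeled as a plain dict keyed by
    column name, which is an assumption of this sketch.
    """
    feature_names = [name.strip() for name in properties["fieldsToClassify"].split(",")]
    features = [float(record[name]) for name in feature_names]
    label = float(record[properties["predictionField"]])
    return label, features


if __name__ == "__main__":
    properties = json.loads(CONFIG_JSON)["properties"]
    record = {"Starter": 1, "Dessert": 1, "Tip": 1.0}
    print(to_labeled_point(record, properties))
```

Columns that are not listed in "fieldsToClassify" are simply ignored by this helper, so a record may carry extra fields without affecting the feature vector.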

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature