Introduction

Spark plugins that train a Linear Regression model on labeled data and use it to predict the label for new records.

Use-case

Assume there is a list of people and the user wants to predict each person's lung_capacity. The lung_capacity is the label in this case. Each person is described by age, height, smoke and gender. These are the features of a person.

User Stories

User should be able to train a model on the data.
User should be able to provide the list of columns (features) to use for training.
User should be able to provide the list of columns (features) to use for prediction.
User should be able to provide the column to be used as the label field while training.
User should be able to specify the number of iterations to use for training.
User should be able to provide the step size to use for training.
User should be able to provide the file set name to save the training model.
User should be able to provide the path of the file set.

Example

The following is a simple example showing how Linear Regression would predict each person's 'lung_capacity'. For each person, we have the following information:

+===================================+
| age  | height  | smoke  | gender  |
+===================================+
| 11   | 58.7    | no     | female  |
| 8    | 63.3    | no     | male    |
| 18   | 74.7    | yes    | female  |
+===================================+

The Linear Regression Trainer will train a model on the selected features, for example 'age' in the above case. The Trainer saves the model in a FileSet, which the Linear Regression Predictor later uses to predict 'lung_capacity'.

Output records will contain all the fields along with the predicted field:

+====================================================+
| age  | height  | smoke  | gender  | lung_capacity  |
+====================================================+
| 11   | 58.7    | no     | female  | 6.471          |
| 8    | 63.3    | no     | male    | 4.706          |
| 18   | 74.7    | yes    | female  | 10.025         |
+====================================================+

Conditions

Fields used as features for training and prediction must be of a simple numeric type: int, long, float, or double.
Fields used as features for training and prediction must not be null.
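
The two conditions above can be sketched as a small check. This is only an illustration in Python, not the plugin's code: the helper name validate_features and the representation of the schema as a dict of field name to type name are assumptions made for the example.

```python
# Hypothetical helper; the type names mirror the list in the Conditions section.
SIMPLE_TYPES = {"int", "long", "float", "double"}


def validate_features(schema, record, features):
    """Raise ValueError if a feature has a non-numeric type or a null value.

    schema: dict mapping field name -> type name (assumed representation)
    record: dict mapping field name -> value
    """
    for name in features:
        if name not in schema:
            raise ValueError(f"Feature '{name}' is not in the input schema")
        if schema[name] not in SIMPLE_TYPES:
            raise ValueError(
                f"Feature '{name}' has type '{schema[name]}'; "
                "expected int, long, float or double"
            )
        if record.get(name) is None:
            raise ValueError(f"Feature '{name}' must not be null")


# Field types assumed for the example table (smoke and gender hold text).
schema = {"age": "int", "height": "double", "smoke": "string", "gender": "string"}
record = {"age": 11, "height": 58.7, "smoke": "no", "gender": "female"}

validate_features(schema, record, ["age", "height"])  # passes silently
```

Under these assumed types, 'smoke' and 'gender' would fail the check, so in the example they would have to be left out of the feature list.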

Design

Linear Regression Trainer

Properties:

fileSetName: The name of the FileSet to save the model to.
path: Path of the FileSet to save the model to.
featuresToInclude: A comma-separated list of fields to use for training. If empty, all fields except the label will be used for training. Features must be of one of the following types: int, long, float or double. featuresToInclude and featuresToExclude cannot both be specified.
featuresToExclude: A comma-separated list of fields to exclude when training. If empty, all fields except the label will be used for training. featuresToInclude and featuresToExclude cannot both be specified.
labelField: The field that holds the label (the value to predict) in the training data. It must be of type double.
numIterations: The number of iterations to use for training the model. It must be an integer. Default is 100.
stepSize: The step size to use for training the model. It must be a double. Default is 1.0.
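
The interaction between featuresToInclude, featuresToExclude and labelField can be sketched as follows. This is a minimal Python illustration of the rules stated above; the function name resolve_features and its argument names are hypothetical and do not come from the plugin.

```python
def resolve_features(input_fields, label_field, include="", exclude=""):
    """Return the feature names to train on.

    include / exclude are comma-separated strings, as in the plugin properties.
    """
    def parse(csv):
        return [f.strip() for f in csv.split(",") if f.strip()]

    included, excluded = parse(include), parse(exclude)
    if included and excluded:
        raise ValueError(
            "featuresToInclude and featuresToExclude cannot both be specified"
        )
    if included:
        return included
    # Default: every input field except the label, minus any exclusions.
    return [f for f in input_fields if f != label_field and f not in excluded]


fields = ["age", "height", "lung_capacity"]
only_age = resolve_features(fields, "lung_capacity", include="age")
all_but_label = resolve_features(fields, "lung_capacity")
```

Here only_age holds just 'age', while all_but_label holds 'age' and 'height' because both properties were left empty.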

Input JSON Format:

{
     "name": "LinearRegressionTrainer",
     "type": "sparksink",
     "properties": {
         "fileSetName": "linear-regression-model",
         "path": "linearRegression",
         "featuresToInclude": "age",
         "labelField": "lung_capacity",
         "numIterations": "50",
         "stepSize": "0.001"
     }
 }
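
To show what numIterations and stepSize control, here is a minimal Python sketch of linear regression fitted by batch gradient descent on squared error. It assumes a model without an intercept term and uses synthetic data; the plugin itself runs on Spark and its actual optimizer may differ, so the function name and details are illustrative only.

```python
def train_linear_regression(rows, labels, num_iterations=100, step_size=1.0):
    """Fit weights w so that sum(w[j] * x[j]) approximates the label.

    rows: list of feature vectors (lists of floats)
    labels: list of floats, one per row
    """
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    for _ in range(num_iterations):
        # Average gradient of 0.5 * (prediction - label)^2 over all rows.
        gradient = [0.0] * n_features
        for x, y in zip(rows, labels):
            error = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                gradient[j] += error * xj / n
        # Move against the gradient, scaled by the step size.
        weights = [w - step_size * g for w, g in zip(weights, gradient)]
    return weights


# Synthetic data (not from the lung capacity example): label = 2 * feature.
rows = [[1.0], [2.0], [3.0]]
labels = [2.0, 4.0, 6.0]

converged = train_linear_regression(rows, labels, num_iterations=100, step_size=0.1)
cautious = train_linear_regression(rows, labels, num_iterations=50, step_size=0.001)
```

In this sketch, converged ends up very close to a weight of 2, while cautious is still well short of it after 50 small steps. A step that is too large overshoots instead: step_size=1.0 diverges on this data, which may be why the example configuration sets a small stepSize for an unscaled feature such as age.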

Linear Regression Predictor

Properties:

fileSetName: The name of the FileSet to load the model from.
path: Path of the FileSet to load the model from.
featuresToInclude: A comma-separated list of fields to use for prediction. If empty, all fields will be used for prediction. Features must be of one of the following types: int, long, float or double. featuresToInclude and featuresToExclude cannot both be specified.
featuresToExclude: A comma-separated list of fields to exclude when calculating the prediction. If empty, all fields will be used for prediction. featuresToInclude and featuresToExclude cannot both be specified.
predictionField: The field on which to set the prediction. It will be of type double.

Input JSON Format:

{
     "name": "LinearRegressionPredictor",
     "type": "sparkcompute",
     "properties": {
         "fileSetName": "linear-regression-model",
         "path": "linearRegression",
         "featuresToInclude": "age",
         "predictionField": "lung_capacity"
     }
 }
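
The Predictor's effect on a record can be sketched in Python as below. The weight value is made up for illustration and is not a model trained on the example data; the function name predict and the no-intercept model are likewise assumptions.

```python
def predict(record, features, weights, prediction_field):
    """Return a copy of the record with the prediction added as a double."""
    prediction = sum(w * float(record[f]) for f, w in zip(features, weights))
    output = dict(record)  # keep all input fields, as described for the output
    output[prediction_field] = prediction
    return output


# Hypothetical model: a single weight for 'age'.
features = ["age"]
weights = [0.5]

record = {"age": 11, "height": 58.7, "smoke": "no", "gender": "female"}
result = predict(record, features, weights, "lung_capacity")
```

The result keeps age, height, smoke and gender unchanged and adds lung_capacity, matching the shape of the output table in the Example section.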

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature