Introduction

Spark plugins that train on labeled data and predict labels using the Random Forest regression algorithm.

Use-case

A user wants to predict whether a flight will be delayed, based on some features of airline data:

Label → delayed (1.0 if the flight is delayed, 0.0 otherwise)
Features → {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, originId, destId}

User Stories

User should be able to train the data.
User should be able to provide the list of columns (features) to use for training.
User should be able to provide the list of columns (features) to be used for prediction.
User should be able to provide the column to be used as the prediction field while training/prediction.
User should be able to specify the maximum depth of the decision tree.
User should be able to specify the maximum number of bins used for splitting features.
User should be able to specify the cardinality for categorical features.
User should be able to specify the random seed for bootstrapping and choosing feature subsets.
User should be able to specify the number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird".
User should be able to provide the file set name to save the training model.
User should be able to provide the path of the file set.

Example

Following is a simple example showing how the Random Forest Trainer and Regression would work to predict whether a flight will be delayed.

For each flight, we have the following information:

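| Delayed | Day of Week | Carrier | TailNum | FlightNum | OriginId | DestinationId | Day of Month | Distance | Arrival Time | Departure Time |
|---------|-------------|---------|---------|-----------|----------|---------------|--------------|----------|--------------|----------------|
| 1.0     | 4           | 1.0     | N787AA  | 21        | 101      | 111           | 1            | 2475     | 1230         | 855            |
| 0.0     | 6           | 2.0     | N457ER  | 34        | 105      | 203           | 1            | 1589     | 1530         | 1700           |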

The Random Forest Trainer will train the data based on some features, for example: {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, originId, destId}.

The labels for the first and second rows will be set to 1.0 and 0.0 respectively (the value of the delayed column).

The trainer will save the model in a FileSet, which will be used later to predict the delayed value using regression.
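
To make the label/feature split concrete, here is a minimal Python sketch. It is not the plugin's code. The field names follow the feature list in this section, and the record is hypothetical: its values are loosely based on the first example row, and elapsedTime is a made-up placeholder because the example does not show that field.

```python
from typing import Dict, List, Tuple

# Field names follow the feature list in this section.
FEATURES = ["dayOfMonth", "weekday", "scheduledDepTime", "scheduledArrTime",
            "carrier", "elapsedTime", "originId", "destId"]
LABEL_FIELD = "delayed"


def to_labeled_point(record: Dict[str, float]) -> Tuple[float, List[float]]:
    """Split one record into (label, feature vector), keeping FEATURES order."""
    label = float(record[LABEL_FIELD])
    features = [float(record[name]) for name in FEATURES]
    return label, features


# Hypothetical record; elapsedTime is a placeholder value, not source data.
first_flight = {
    "delayed": 1.0, "dayOfMonth": 1, "weekday": 4,
    "scheduledDepTime": 855, "scheduledArrTime": 1230,
    "carrier": 1.0, "elapsedTime": 395,
    "originId": 101, "destId": 111,
}

label, features = to_labeled_point(first_flight)
print(label)
print(features)
```

The label is kept apart from the feature vector so that the same extraction can be reused at prediction time, when no label is available.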

Conditions

Fields to be used as features for training and prediction must be of a simple type: String, int, double, float, long, bytes, boolean.
Fields to be used as features for training and prediction must not be of type NULL.
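
The two conditions above could be checked along the following lines. This is only a sketch: representing the schema as a dictionary from field name to type name is an assumption made for illustration, and the field names and types in the usage lines are hypothetical.

```python
from typing import Dict, List, Optional

# Simple types listed in the conditions above, lower-cased for comparison.
SIMPLE_TYPES = {"string", "int", "double", "float", "long", "bytes", "boolean"}


def invalid_features(schema: Dict[str, Optional[str]],
                     features: List[str]) -> List[str]:
    """Return the features that are missing, null-typed, or not of a simple type."""
    bad = []
    for name in features:
        field_type = schema.get(name)
        if field_type is None or field_type.lower() not in SIMPLE_TYPES:
            bad.append(name)
    return bad


# Hypothetical schema: "tailNum" is null-typed and "route" is a complex type.
schema = {"dayOfMonth": "int", "carrier": "String",
          "tailNum": "null", "route": "array"}
rejected = invalid_features(schema, ["dayOfMonth", "carrier", "tailNum", "route"])
print(rejected)
```

Returning the offending field names, rather than failing on the first one, lets a caller report every problem in a single validation pass.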

Design

Random Forest Trainer:

Properties:

fileSetName : The name of the FileSet to save the model to.
path : Path of the FileSet to save the model to.
featuresToInclude : A comma-separated sequence of fields to use for training. If empty, all fields except the label field will be used for training. Features must be of a simple type: String, int, double, float, long, bytes, boolean. "featuresToInclude" and "featuresToExclude" cannot both be specified.
featuresToExclude : A comma-separated sequence of fields to be excluded when training. If empty, all fields except the label field will be used for training. "featuresToInclude" and "featuresToExclude" cannot both be specified.
cardinalityMapping : Mapping of the feature to the cardinality of that feature; required for categorical features.
labelField : The field from which to get the prediction. It must be of type double.
maxDepth : Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 10.
maxBins : Maximum number of bins used for splitting when discretizing continuous features. Random Forest requires maxBins to be at least as large as the number of values in each categorical feature. Default is 100.
seed : Random seed for bootstrapping and choosing feature subsets.
featureSubsetStrategy : Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "onethird".
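
The "auto" rule for featureSubsetStrategy can be sketched as below. Two things here are assumptions rather than part of this design: numTrees is not among the listed properties, so it is passed in as a plain argument, and the rounding used for "sqrt", "log2" and "onethird" (round up, with at least one feature for "log2") is one plausible reading that the underlying library may or may not share.

```python
import math

SUPPORTED = ("auto", "all", "sqrt", "log2", "onethird")


def resolve_strategy(strategy: str, num_trees: int) -> str:
    """Apply the 'auto' rule: one tree uses 'all', a forest uses 'onethird'."""
    if strategy not in SUPPORTED:
        raise ValueError(f"Unsupported featureSubsetStrategy: {strategy}")
    if strategy == "auto":
        return "all" if num_trees == 1 else "onethird"
    return strategy


def features_per_split(strategy: str, num_trees: int, num_features: int) -> int:
    """Number of features considered at each node (rounding is an assumption)."""
    resolved = resolve_strategy(strategy, num_trees)
    if resolved == "all":
        return num_features
    if resolved == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if resolved == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    return math.ceil(num_features / 3)  # "onethird"


# Eight features, as in the use case; the tree counts are illustrative.
print(features_per_split("auto", 1, 8))
print(features_per_split("auto", 10, 8))
```

With the eight features from the use case, a single tree would consider all eight at each node, while a forest under "auto" would consider a third of them, rounded up.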

Input JSON Format

{
  "name": "RandomForestTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier,elapsedTime,originId,destId",
    "labelField": "delayed",
    "maxBins": "100",
    "maxDepth": "9",
    "seed": "12345",
    "featureSubsetStrategy": "auto"
  }
}
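
The rules that tie these properties together (include and exclude lists are mutually exclusive, the label is never a feature, and maxBins must cover every categorical feature) could be validated roughly as follows. This is a sketch, not the plugin's implementation: the function names are hypothetical, the label field is passed in explicitly, and the cardinalities are supplied as an already-parsed dictionary because the text format of cardinalityMapping is not specified here.

```python
from typing import Dict, List


def parse_list(value: str) -> List[str]:
    """Split a comma-separated property value, ignoring blanks and spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def select_features(properties: Dict[str, str], all_fields: List[str],
                    label_field: str) -> List[str]:
    """Resolve the training features from the include/exclude properties."""
    include = parse_list(properties.get("featuresToInclude", ""))
    exclude = parse_list(properties.get("featuresToExclude", ""))
    if include and exclude:
        raise ValueError("Specify featuresToInclude or featuresToExclude, not both")
    if include:
        return include
    return [f for f in all_fields if f != label_field and f not in exclude]


def check_max_bins(max_bins: int, cardinalities: Dict[str, int]) -> None:
    """maxBins must be at least the cardinality of each categorical feature."""
    for feature, cardinality in cardinalities.items():
        if cardinality > max_bins:
            raise ValueError(f"maxBins={max_bins} is too small for {feature}")


# Hypothetical input fields and cardinalities, for illustration only.
fields = ["delayed", "dofM", "dofW", "carrier", "tailNum"]
print(select_features({"featuresToExclude": "tailNum"}, fields, "delayed"))
print(select_features({"featuresToInclude": "dofM, carrier"}, fields, "delayed"))
check_max_bins(100, {"dofM": 31, "dofW": 7})
```

Running these checks before training starts would surface a misconfigured pipeline early, instead of partway through a Spark job.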

Random Forest Predictor:

Properties:

fileSetName : The name of the FileSet to load the model from.
path : Path of the FileSet to load the model from.
featuresToInclude : A comma-separated sequence of fields to be used for Random Forest regression. If empty, all fields except the prediction field will be used for prediction. Features must be of one of the following types: int, long, float or double. "featuresToInclude" and "featuresToExclude" cannot both be specified.
featuresToExclude : A comma-separated sequence of fields to be excluded for prediction. If empty, all fields except the prediction field will be used for prediction. "featuresToInclude" and "featuresToExclude" cannot both be specified.
predictionField : The field on which to set the prediction. It will be of type double.

{
  "name": "RandomForestRegression",
  "type": "sparkcompute",
  "properties": {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier,elapsedTime,originId,destId",
    "predictionField": "delayed"
  }
}
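
Conceptually, the predictor loads the saved model, builds a feature vector for each record, and writes the result into predictionField as a double. The sketch below illustrates that flow under stated assumptions: each tree is stood in for by a plain Python function, the forest output is taken as the average of the tree outputs (the usual rule for random forest regression), and the record, field names and trees are hypothetical.

```python
from statistics import fmean
from typing import Callable, Dict, List, Sequence

# Stand-in for a trained tree: maps a feature vector to a predicted value.
Tree = Callable[[Sequence[float]], float]


def predict_record(record: Dict[str, float], features: List[str],
                   trees: List[Tree], prediction_field: str) -> Dict[str, float]:
    """Return a copy of the record with the averaged tree prediction added."""
    vector = [float(record[name]) for name in features]
    prediction = fmean([tree(vector) for tree in trees])
    output = dict(record)
    output[prediction_field] = float(prediction)
    return output


# Three toy "trees"; a real model would be loaded from the FileSet.
toy_trees: List[Tree] = [
    lambda v: 0.0,
    lambda v: 1.0,
    lambda v: 1.0 if v[0] < 900 else 0.0,
]

# Hypothetical record with placeholder values.
flight = {"scheduledDepTime": 855, "elapsedTime": 395}
result = predict_record(flight, ["scheduledDepTime", "elapsedTime"],
                        toy_trees, "delayed")
print(result["delayed"])
```

Because the output is an average, the predicted value can fall between 0.0 and 1.0; a downstream stage would need its own threshold to turn it into a delayed/not-delayed decision.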

Checklist

User stories documented
User stories reviewed
Design documented
Design reviewed
Feature merged
Examples and guides
Integration tests
Documentation for feature
Short video demonstrating the feature
