
Introduction 

Spark plugins that train a model and predict labels using the Random Forest Regression algorithm.

Use-case

A user wants to predict whether a flight will be delayed, based on some features of airline data:

Label → delayed: 1.0 if the flight is delayed, 0.0 otherwise
Features → {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, origin, dest}

User Stories

  1. User should be able to train a model on the data.
  2. User should be able to provide the list of columns (features) to use for training.
  3. User should be able to provide the list of columns (features) to use for prediction.
  4. User should be able to provide the column to be used as the prediction field for training and regression.
  5. User should be able to specify the maximum depth of the decision tree.
  6. User should be able to specify the maximum number of bins used for splitting features.
  7. User should be able to specify a random seed for bootstrapping and choosing feature subsets.
  8. User should be able to specify the number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird".
  9. User should be able to provide the name of the file set in which to save the trained model.
  10. User should be able to provide the path of the file set.

Example

The following simple example shows how the Random Forest Trainer and Regressor work together to predict whether a flight will be delayed.

For each flight, we have the following information:  

Delayed | Day of Week | Carrier | TailNum | FlightNum | Origin | Destination | Day of Month | Distance | Arrival Time | Departure Time
1.0 | 4 | AA | N787AA | 21 | JFK | LAX | 1 | 2475 | 1230 | 855
0.0 | 6 | EV | N457ER | 34 | ATL | JAX | 1 | 1589 | 1530 | 1700

 

The Random Forest Trainer will train a model on the data using a set of features, for example: {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, origin, dest}.

The labels for the first and second rows will be 1.0 and 0.0 respectively (the values of the delayed column).

The Trainer will save the model in a FileSet, which the Regressor will later load to predict the delayed value.
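The split of one record into a label and a list of feature values can be sketched in Python as follows. This is only an illustration: the record is a plain dict, the helper name split_label_and_features is hypothetical, and the sketch does not show how the plugin itself encodes string fields for Spark.

```python
def split_label_and_features(record, label_field, feature_fields):
    """Return (label, features) for one record.

    The label is read from label_field as a float; the features are
    returned in the order given by feature_fields.
    """
    label = float(record[label_field])
    features = [record[name] for name in feature_fields]
    return label, features


# Values taken from the first example row: a delayed flight from JFK to LAX.
row = {"delayed": 1.0, "weekday": 4, "carrier": "AA", "origin": "JFK", "dest": "LAX"}
label, features = split_label_and_features(
    row, "delayed", ["weekday", "carrier", "origin", "dest"]
)
```

Keeping the feature order fixed matters because the same order has to be used again at prediction time.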


Conditions

  • Fields used as features for training and prediction must be of a simple type: String, int, double, float, long, bytes, boolean.
  • Fields used as features for training and prediction must not be null.
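A check for these two conditions might look like the sketch below. It assumes a hypothetical schema representation (a dict mapping each field name to a pair of type name and nullable flag) and reads "not null" as "not nullable"; neither the representation nor the function name comes from the plugin.

```python
SIMPLE_TYPES = {"string", "int", "double", "float", "long", "bytes", "boolean"}


def validate_feature_fields(schema, features):
    """Raise ValueError if a feature field is missing, nullable, or not a simple type.

    schema maps a field name to a (type_name, nullable) pair; this shape
    is an assumption made for the sketch.
    """
    for name in features:
        if name not in schema:
            raise ValueError(f"Feature field '{name}' is not in the input schema")
        type_name, nullable = schema[name]
        if nullable:
            raise ValueError(f"Feature field '{name}' must not be nullable")
        if type_name not in SIMPLE_TYPES:
            raise ValueError(f"Feature field '{name}' has unsupported type '{type_name}'")


schema = {
    "carrier": ("string", False),
    "elapsedTime": ("int", False),
    "origin": ("string", True),             # nullable, so not usable as a feature
    "stops": ("array", False),              # not a simple type
}
validate_feature_fields(schema, ["carrier", "elapsedTime"])  # passes silently
```

Running such a check when the pipeline is configured surfaces a bad feature list before any training job starts.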


Design 

Random Forest Trainer:

Properties:

  • fileSetName : The name of the FileSet to save the model to.
  • path : Path of the FileSet to save the model to.
  • features : A comma-separated sequence of field names to use for training. Fields to be used for training must be of simple type : String, int, double, float, long, bytes, boolean.
  • predictionField : The field that holds the value to predict (the label). It must be of type double.
  • maxDepth : Maximum depth of the tree. For example, depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 10.
  • maxBins : Maximum number of bins used for splitting when discretizing continuous features. Random Forest requires maxBins to be at least as large as the number of values in each categorical feature. Default is 100.
  • seed : Random seed for bootstrapping and choosing feature subsets.
  • featureSubsetStrategy : Number of features to consider for splits at each node. Supported: "auto", "all", "sqrt", "log2", "onethird". If "auto" is set, this parameter is set based on numTrees: if numTrees == 1, set to "all"; if numTrees > 1 (forest) set to "onethird".
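The effect of maxDepth and featureSubsetStrategy can be made concrete with a small Python sketch. The "auto" rule is the one stated above. The rounding for "sqrt", "log2" and "onethird" (round up, with a minimum of 1 for "log2") is an assumption based on how Spark MLlib is commonly described, and numTrees is passed in explicitly because it is not among the properties listed here. Both function names are hypothetical.

```python
import math


def features_per_split(strategy, num_features, num_trees):
    """Number of features considered at each node for a given strategy."""
    if strategy == "auto":
        # Rule from the property description: a single tree uses all
        # features, a forest uses one third of them.
        strategy = "all" if num_trees == 1 else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError(f"Unsupported featureSubsetStrategy: {strategy}")


def max_tree_size(max_depth):
    """Upper bound on (internal nodes, leaf nodes) for a binary tree of max_depth."""
    return 2 ** max_depth - 1, 2 ** max_depth
```

With the 8 features of the airline use-case, "auto" would consider all 8 features per split for a single tree and 3 for a forest. A maxDepth of 10 allows at most 1024 leaf nodes per tree.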

Input Json Format

{
  "name": "RandomForestTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "features": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
    "predictionField": "delayed",
    "maxBins": "100",
    "maxDepth": "9",
    "seed": "12345",
    "featureSubsetStrategy": "auto"
  }
}
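Since every property in this configuration is a string, a consumer has to split the feature list and convert the numeric settings. The sketch below shows one way to do that in Python. The function name is hypothetical, the defaults for maxBins and maxDepth are the ones stated in the property list, and treating the remaining properties as required is an assumption.

```python
import json

TRAINER_CONFIG = """
{
  "name": "RandomForestTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "features": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
    "predictionField": "delayed",
    "maxBins": "100",
    "maxDepth": "9",
    "seed": "12345",
    "featureSubsetStrategy": "auto"
  }
}
"""


def parse_trainer_properties(config_text):
    """Turn the string-valued properties into typed Python values."""
    props = json.loads(config_text)["properties"]
    return {
        "fileSetName": props["fileSetName"],
        "path": props["path"],
        # Split on commas and trim whitespace around each field name.
        "features": [f.strip() for f in props["features"].split(",") if f.strip()],
        "predictionField": props["predictionField"],
        "maxBins": int(props.get("maxBins", "100")),
        "maxDepth": int(props.get("maxDepth", "10")),
        "seed": int(props["seed"]),
        "featureSubsetStrategy": props["featureSubsetStrategy"],
    }


parsed = parse_trainer_properties(TRAINER_CONFIG)
```

Trimming each name means that the space before elapsedTime in the features string does not end up in the field name.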

Random Forest Regressor:

Properties:

  • fileSetName : The name of the FileSet to load the model from.
  • path : Path of the FileSet to load the model from.
  • features : A comma-separated sequence of field names to use for regression. Fields to be used for prediction using regression must be of simple type : String, int, double, float, long, bytes, boolean.
  • predictionField : The field on which to set the prediction. It will be of type double.

     

Input Json Format

{
  "name": "RandomForestRegression",
  "type": "sparkcompute",
  "properties": {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "features": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
    "predictionField": "delayed"
  }
}
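The Regressor loads the model that the Trainer saved, so its fileSetName and path have to point at the same location. A simple consistency check between the two configurations could be sketched as below. The function name is hypothetical, and requiring the feature list to match as well is an assumption rather than a documented rule of the plugin.

```python
def mismatched_properties(trainer_props, regressor_props):
    """Return the shared property names whose values differ between the two stages."""
    shared = ("fileSetName", "path", "features")
    return [key for key in shared if trainer_props.get(key) != regressor_props.get(key)]


trainer = {
    "fileSetName": "random-forest-model",
    "path": "RandomForest",
    "features": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
}
regressor_ok = dict(trainer)
regressor_bad = dict(trainer, path="OtherPath")

ok_result = mismatched_properties(trainer, regressor_ok)
bad_result = mismatched_properties(trainer, regressor_bad)
```

An empty result means the Regressor is configured to read what the Trainer wrote; any returned name points at the property to fix.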

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature