Decision Tree Regression
- Bhushan Kawadkar
- Ananya Bhattacharya
Introduction
Spark plugins that train a model on labeled data and predict labels using Decision Tree Regression.
Use-case
A user wants to predict whether a flight will be delayed, based on some features of airline data:
Label → delayed: 1.0 if the flight is delayed, 0.0 otherwise
Features → {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, origin, dest}
User Stories
- User should be able to train the data.
- User should be able to provide the list of columns (features) to use for training.
- User should be able to provide the list of columns (features) to use for prediction.
- User should be able to provide the column to be used as the prediction field during training and prediction.
- User should be able to specify the maximum depth of the decision tree.
- User should be able to specify the maximum number of bins used for splitting features.
- User should be able to specify the cardinality for categorical features.
- User should be able to provide the file set name to save the training model.
- User should be able to provide the path of the file set.
Example
The following simple example shows how the Decision Tree Trainer and Decision Tree Regression work together to predict whether a flight will be delayed.
For each flight, we have the following information:
Delayed | Day of Week | Carrier | TailNum | FlightNum | OriginId | DestId | Day of Month | Distance | Arrival Time | Departure Time |
---|---|---|---|---|---|---|---|---|---|---|
1.0 | 4 | 1.0 | N787AA | 21 | 101 | 111 | 1 | 2475 | 1230 | 855 |
0.0 | 6 | 2.0 | N457ER | 34 | 105 | 203 | 1 | 1589 | 1530 | 1700 |
The Decision Tree Trainer will train the data based on some features, for example: {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, originId, destId}.
The labels for the first and second rows will be set to 1.0 and 0.0, respectively (the value of the Delayed column).
The trainer will save the model in a FileSet, which Decision Tree Regression will later use to predict the delayed value.
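As a rough illustration of the training input described above, the sketch below turns each flight record into a (label, feature vector) pair. The record keys and the helper name `to_labeled_point` are hypothetical and only mirror the table columns; the actual plugin operates on Spark records and is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical field names, chosen to mirror columns of the example table.
FEATURES = ["dayOfMonth", "dayOfWeek", "carrier", "originId", "destId"]
LABEL = "delayed"


def to_labeled_point(record: Dict[str, float],
                     features: List[str] = FEATURES,
                     label: str = LABEL) -> Tuple[float, List[float]]:
    """Return (label, feature vector), with features in the given order."""
    return float(record[label]), [float(record[name]) for name in features]


# The two example flights from the table, restricted to numeric columns.
rows = [
    {"delayed": 1.0, "dayOfWeek": 4, "carrier": 1.0,
     "originId": 101, "destId": 111, "dayOfMonth": 1},
    {"delayed": 0.0, "dayOfWeek": 6, "carrier": 2.0,
     "originId": 105, "destId": 203, "dayOfMonth": 1},
]
points = [to_labeled_point(row) for row in rows]
```

Non-numeric columns such as TailNum are left out of the feature vector here, since features must be numeric.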
Conditions
- Fields used as features for training and prediction with Decision Tree Regression must be of a numeric type: int, double, float, or long.
- Fields used as features for training and prediction with Decision Tree Regression must not be null.
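The two conditions above could be checked per record along the following lines. This is a minimal sketch, not the plugin's validation logic; the function name `invalid_features` and the dictionary-based record are assumptions made for illustration.

```python
from numbers import Real
from typing import Any, Dict, Iterable, List


def invalid_features(record: Dict[str, Any],
                     features: Iterable[str]) -> List[str]:
    """Return the names of features that are missing, null, or non-numeric."""
    bad = []
    for name in features:
        value = record.get(name)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if value is None or isinstance(value, bool) or not isinstance(value, Real):
            bad.append(name)
    return bad


record = {"carrier": 1.0, "origin": None, "dest": "SFO", "weekday": 4}
problems = invalid_features(record, ["carrier", "origin", "dest", "weekday"])
```

Here `problems` lists `origin` (null) and `dest` (a string), while `carrier` and `weekday` pass.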
Design
Decision Tree Trainer:
Properties:
- fileSetName: The name of the FileSet to save the model to.
- path: Path of the FileSet to save the model to.
- featuresToInclude: A comma-separated sequence of fields to use for training. If empty, all fields will be considered for training. Features must be of type int, double, float, or long.
- featuresToExclude: A comma-separated sequence of fields to be excluded when training. If empty, all fields will be considered for training. Specify either "featuresToInclude" or "featuresToExclude".
- cardinalityMapping: Mapping of each feature to the cardinality of that feature; required for categorical features.
- labelField: The field from which to get the prediction. It must be of type double.
- maxDepth: Maximum depth of the tree. For example, depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 10.
- maxBins: Maximum number of bins used for splitting when discretizing continuous features. Decision Tree requires maxBins to be at least as large as the number of values in each categorical feature. Default is 100.
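The interplay of these properties can be sketched as follows. The helper names are hypothetical, the cardinality mapping is passed as a plain dictionary because its serialized format is not specified here, and excluding the label field from the features is an assumption rather than documented behavior.

```python
from typing import Dict, List


def split_csv(value: str) -> List[str]:
    """Split a comma-separated property, dropping blanks and stray spaces."""
    return [item.strip() for item in value.split(",") if item.strip()]


def resolve_features(all_fields: List[str], label_field: str,
                     include: str = "", exclude: str = "") -> List[str]:
    """Pick the training features from the input fields."""
    included, excluded = split_csv(include), split_csv(exclude)
    if included and excluded:
        raise ValueError("Specify either featuresToInclude or featuresToExclude")
    if included:
        unknown = [f for f in included if f not in all_fields]
        if unknown:
            raise ValueError(f"Unknown fields: {unknown}")
        return [f for f in included if f != label_field]
    # Assumption: the label field is never used as a feature.
    return [f for f in all_fields if f != label_field and f not in excluded]


def check_max_bins(cardinality: Dict[str, int], max_bins: int = 100) -> None:
    """Raise if any categorical feature has more values than maxBins."""
    too_large = {f: c for f, c in cardinality.items() if c > max_bins}
    if too_large:
        raise ValueError(f"maxBins={max_bins} is too small for {too_large}")


def max_nodes(max_depth: int) -> int:
    """Upper bound on the node count of a binary tree of the given depth."""
    return 2 ** (max_depth + 1) - 1
```

For instance, `max_nodes(0)` is 1 (a single leaf) and `max_nodes(1)` is 3 (one internal node plus two leaves), matching the maxDepth description.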
Input JSON Format
{
  "name": "DecisionTreeTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "decision-tree-model",
    "path": "decisionTree",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
    "labelField": "delayed",
    "maxBins": "100",
    "maxDepth": "9"
  }
}
Decision Tree Predictor:
Properties:
- fileSetName: The name of the FileSet to load the model from.
- path: Path of the FileSet to load the model from.
- featuresToInclude: A comma-separated sequence of field names to use for Decision Tree Regression. Features must be of one of the following types: int, double, float, long.
- featuresToExclude: A comma-separated sequence of fields to be excluded when calculating the prediction. If empty, all fields will be considered for calculating the prediction. Specify either "featuresToInclude" or "featuresToExclude".
- predictionField: The field on which to set the prediction. It will be of type double.
Input JSON Format
{
  "name": "DecisionTreeRegression",
  "type": "sparkcompute",
  "properties": {
    "fileSetName": "decision-tree-model",
    "path": "decisionTree",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier, elapsedTime,origin,dest",
    "predictionField": "delayed"
  }
}
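Since the predictor loads the model that the trainer saved, the two stage configurations presumably need to agree on the FileSet and on the features. The sketch below compares the shared properties of the two configurations; the helper `config_mismatches` is hypothetical, and treating `featuresToInclude` as a property that must match is an assumption, not a documented requirement of the plugins.

```python
from typing import Any, Dict, List

# Properties assumed to be shared between trainer and predictor.
SHARED_KEYS = ("fileSetName", "path", "featuresToInclude")


def config_mismatches(trainer: Dict[str, Any],
                      predictor: Dict[str, Any]) -> List[str]:
    """List the shared properties on which the two stages disagree."""
    t, p = trainer["properties"], predictor["properties"]
    return [key for key in SHARED_KEYS if t.get(key) != p.get(key)]


# Abbreviated versions of the two stage configurations.
trainer = {
    "name": "DecisionTreeTrainer",
    "type": "sparksink",
    "properties": {
        "fileSetName": "decision-tree-model",
        "path": "decisionTree",
        "featuresToInclude": "dofM,dofW,carrier",
        "labelField": "delayed",
    },
}
predictor = {
    "name": "DecisionTreeRegression",
    "type": "sparkcompute",
    "properties": {
        "fileSetName": "decision-tree-model",
        "path": "decisionTree",
        "featuresToInclude": "dofM,dofW,carrier",
        "predictionField": "delayed",
    },
}
mismatches = config_mismatches(trainer, predictor)
```

With matching configurations, `mismatches` is empty; a differing `path` would show up as `["path"]`. The comparison is a plain string match, so differences in spacing inside the feature list would also be reported.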
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature