GD Tree Classifier
- Romy Khetan
Introduction
Spark plugins that train a Gradient Boosted Tree Classifier model and use it to predict the label for input data.
Use-case
A user wants to predict whether a flight will be delayed, based on some features of airline data:
Label → delayed: 1.0 if the flight is delayed, 0.0 otherwise
Features → {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, origin, dest}
User Stories
User should be able to train the data.
User should be able to classify the test data using the model built while training.
User should be able to provide the list of columns(features) to use for training.
User should be able to provide the list of columns(features) to be used for prediction.
User should be able to provide the column to be used as the prediction field while training/classification.
User should be able to specify the maximum depth of the Gradient Boosted tree.
User should be able to specify the maximum number of classes.
User should be able to specify the maximum number of iterations.
User should be able to provide the file set name to save the training model.
User should be able to provide the path of the file set.
Example
The following simple example shows how the GD Tree Trainer and Classifier would work to predict whether a flight will be delayed.
For each flight, we have the following information:
Delayed | Day of Week | Carrier | TailNum | FlightNum | Origin | Destination | Day of Month | Distance | Arrival Time | Departure Time |
---|---|---|---|---|---|---|---|---|---|---|
1.0 | 4 | AA | N787AA | 21 | JFK | LAX | 1 | 2475 | 1230 | 855 |
0.0 | 6 | EV | N457ER | 34 | ATL | JAX | 1 | 1589 | 1530 | 1700 |
The GD Tree Trainer will train the model on a set of features, for example: {dayOfMonth, weekday, scheduledDepTime, scheduledArrTime, carrier, elapsedTime, origin, dest}.
The labels for the first and second rows will be 1.0 and 0.0 respectively (the value of the Delayed column).
The Trainer will save the model in a FileSet, which the Classifier will later use to predict the delayed value.
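The mapping from an input record to a training example can be sketched as follows. This is a minimal Python illustration, not the plugin's code: the field names (`dayOfWeek`, `dayOfMonth`, `distance`, `arrTime`, `depTime`) and the helper `to_labeled_point` are assumptions chosen to mirror the numeric columns of the table above.

```python
from typing import Dict, List, Tuple

# Two sample rows from the table above, restricted to the numeric columns.
# The field names are illustrative assumptions, not the plugin's schema.
records = [
    {"delayed": 1.0, "dayOfWeek": 4, "dayOfMonth": 1,
     "distance": 2475, "arrTime": 1230, "depTime": 855},
    {"delayed": 0.0, "dayOfWeek": 6, "dayOfMonth": 1,
     "distance": 1589, "arrTime": 1530, "depTime": 1700},
]


def to_labeled_point(record: Dict[str, float],
                     label_field: str,
                     feature_fields: List[str]) -> Tuple[float, List[float]]:
    """Split one record into (label, feature vector), preserving feature order."""
    label = float(record[label_field])
    features = [float(record[name]) for name in feature_fields]
    return label, features


feature_fields = ["dayOfMonth", "dayOfWeek", "depTime", "arrTime", "distance"]
points = [to_labeled_point(r, "delayed", feature_fields) for r in records]

for label, features in points:
    print(label, features)
```

The order of `feature_fields` fixes the position of each value in the feature vector, so the same order would have to be used at training and at classification time.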
Conditions
Design
GD Tree Trainer
Input Json Format
{
  "name": "GDTreeTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "gd-tree-model",
    "path": "/home/cdap",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier,elapsedTime,origin,dest",
    "labelField": "delayed",
    "maxClass": "2",
    "maxDepth": "9",
    "maxIteration": "3"
  }
}
The plugin takes the above inputs from the user and trains the model, using the fields listed in "featuresToInclude" as features and the "labelField" field as the label.
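Since every property value in the JSON above is a string, the trainer has to convert them to typed values before use. The following Python sketch shows one way this could look; the function name `parse_trainer_properties` and the shape of the returned dictionary are assumptions for illustration, and the fallback of 10 for `maxDepth` is the default stated in the property list.

```python
import json

# The trainer stage configuration, copied from the example above.
stage_json = """
{
  "name": "GDTreeTrainer",
  "type": "sparksink",
  "properties": {
    "fileSetName": "gd-tree-model",
    "path": "/home/cdap",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier,elapsedTime,origin,dest",
    "labelField": "delayed",
    "maxClass": "2",
    "maxDepth": "9",
    "maxIteration": "3"
  }
}
"""


def parse_trainer_properties(stage: dict) -> dict:
    """Convert the string-valued properties into typed values (hypothetical helper)."""
    props = stage["properties"]
    return {
        "fileSetName": props["fileSetName"],
        "path": props["path"],
        # Comma-separated list -> list of trimmed, non-empty field names.
        "features": [f.strip() for f in props["featuresToInclude"].split(",") if f.strip()],
        "labelField": props["labelField"],
        "maxClass": int(props["maxClass"]),
        # Falls back to the documented default depth of 10 when not set.
        "maxDepth": int(props.get("maxDepth", "10")),
        "maxIteration": int(props["maxIteration"]),
    }


config = parse_trainer_properties(json.loads(stage_json))
print(config["features"], config["maxDepth"], config["maxIteration"])
```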
Properties:
- fileSetName: The name of the FileSet to save the model to.
- path: Path of the FileSet to save the model to.
- featuresToInclude: A comma-separated sequence of fields to use for training. Features must be of type int, double, float, or long.
- featuresToExclude: A comma-separated sequence of fields to be excluded when training.
- cardinalityMapping: Mapping of each feature to its cardinality; required for categorical features.
- labelField: The name of the column in the input structured record that contains the label to train against.
- maxClass: The number of classes to use when training the model. Must be an integer.
- maxDepth: Maximum depth of the tree. For example, depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Default is 10.
- maxIteration: The number of trees in the model. Each iteration produces one tree. Increasing the number of iterations improves accuracy on the training data.
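To make the roles of maxDepth and maxIteration concrete, here is a toy boosting loop in Python. It is a simplified sketch under stated assumptions, not the plugin's or Spark's implementation: it uses a single numeric feature, squared-error residuals, depth-1 trees (1 internal node + 2 leaf nodes), a made-up six-point dataset, and an assumed `learning_rate` parameter that is not one of the plugin properties. The underlying Spark classifier may use a different loss function.

```python
from typing import List, Tuple

Stump = Tuple[float, float, float]  # (threshold, left leaf value, right leaf value)


def fit_stump(xs: List[float], residuals: List[float]) -> Stump:
    """Fit a depth-1 tree (1 internal node + 2 leaves) to the residuals."""
    overall_mean = sum(residuals) / len(residuals)
    best = (float("inf"), xs[0], overall_mean, overall_mean)
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def stump_predict(stump: Stump, x: float) -> float:
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs: List[float], ys: List[float], max_iteration: int,
          learning_rate: float = 0.5) -> Tuple[List[Stump], List[float]]:
    """Each iteration fits one tree to the current residuals and adds it to the model."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean label
    trees: List[Stump] = []
    errors: List[float] = []
    for _ in range(max_iteration):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return trees, errors


# Made-up toy data: one numeric feature and a 0.0/1.0 label.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 1.0, 0.0, 1.0, 1.0]
trees, errors = boost(xs, ys, max_iteration=3)
print(len(trees), errors)
```

With `max_iteration=3` the model holds three trees, and the training error recorded after each iteration never increases. This is the sense in which more iterations improve accuracy on the training data; it says nothing about accuracy on unseen data.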
The model generated by this plugin is then used by the GD Tree Classifier plugin to classify the input data.
GD Tree Classifier
Input Json Format
{
  "name": "GDTreeClassifier",
  "type": "sparkcompute",
  "properties": {
    "fileSetName": "gd-tree-model",
    "path": "/home/cdap",
    "featuresToInclude": "dofM,dofW,scheduleDepTime,scheduledArrTime,carrier,elapsedTime,origin,dest",
    "predictionField": "delayed"
  }
}
The Classifier plugin takes the above inputs from the user, loads the GD Tree model from the FileSet named by "fileSetName", and predicts for each flight record whether the flight will be delayed.
Properties:
- fileSetName: The name of the FileSet that contains the model.
- path: Path of the FileSet from which the model is retrieved.
- featuresToInclude: A comma-separated sequence of fields to use for classification. Features must be of type int, double, float, or long.
- featuresToExclude: A comma-separated sequence of fields to be excluded during classification.
- predictionField: The name of the column in which the predicted value is saved.
*featuresToInclude* and *featuresToExclude* cannot both be specified at the same time.
If neither *featuresToInclude* nor *featuresToExclude* is provided, all fields except the label/prediction field are used as feature fields.
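The two rules above can be expressed as a small selection function. This is a Python sketch of the described behaviour only; the function name `resolve_feature_fields` and its signature are hypothetical, and it does not check that the listed fields exist in the input schema.

```python
from typing import List, Optional


def split_fields(value: Optional[str]) -> List[str]:
    """Turn a comma-separated property value into a list of trimmed field names."""
    return [f.strip() for f in (value or "").split(",") if f.strip()]


def resolve_feature_fields(input_fields: List[str],
                           label_or_prediction_field: str,
                           features_to_include: Optional[str] = None,
                           features_to_exclude: Optional[str] = None) -> List[str]:
    """Pick the feature fields according to the include/exclude rules."""
    include = split_fields(features_to_include)
    exclude = split_fields(features_to_exclude)
    if include and exclude:
        raise ValueError(
            "featuresToInclude and featuresToExclude cannot both be specified")
    if include:
        return include
    # Default: every field except the label/prediction field, minus any exclusions.
    return [f for f in input_fields
            if f != label_or_prediction_field and f not in exclude]


fields = ["delayed", "dofM", "dofW", "carrier"]
print(resolve_feature_fields(fields, "delayed"))
print(resolve_feature_fields(fields, "delayed", features_to_include="dofM, dofW"))
print(resolve_feature_fields(fields, "delayed", features_to_exclude="carrier"))
```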
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature