
Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

Machine learning and model building require supporting tooling, such as model management and deployment tools. A typical machine learning pipeline looks as follows:

Data scientists usually spend most of their time on tasks other than machine learning itself: data cleansing, sampling, and managing models and stats across different iterations. We can address this by providing Athena, a machine learning model management framework.

Goals

Build a machine learning model management framework, Athena, that provides a better experience to data scientists.

User Stories 

  • As a data scientist, I should be able to provide training and test data, or split a dataset into training and test datasets.
  • As a data scientist, I should be able to prepare data to create a machine learning model.
  • As a data scientist, I should be able to create experiments, apply directives, and tune machine learning model parameters. I should be able to do this iteratively in order to get an accurate model.
  • As a data scientist, I should be able to view and list different experiments, feature stats, and model stats.
  • As a data scientist, I should be able to deploy and use a finalized model for verification and prediction.

Discussions

The flow of the machine learning framework in CDAP will be as follows:

  1. Choose data (training/test data).
  2. Decide on a splitting method (random and others?).
  3. Create an experiment out of the split data. This means each experiment is associated with a dataset; after this point, the dataset cannot be changed.
  4. Prepare data using Data Prep and apply directives. These directives will be applied to the whole training and test datasets.
  5. View stats about the features.
  6. Build a model with some parameters.
  7. Evaluate, and repeat steps 4 to 7 until an accurate model is created.
  8. Once an accurate model is created, deploy the model.
  9. After the model is deployed, it is stored in a partitioned fileset, and data scientists can use it in pipelines via Predictor plugins.
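Steps 2 and 3 above can be sketched as follows. This is a minimal pure-Python illustration of a "random" 80/20 split (the real implementation would run in Spark; the `split_data` helper, its signature, and the fixed seed are assumptions for illustration):

```python
import random

def split_data(records, percent=80, seed=42):
    """Randomly split records into training and test sets.

    `percent` is the share of records assigned to training, mirroring
    the "percent": "80" split parameter in the API sketch below.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * percent // 100
    return shuffled[:cut], shuffled[cut:]

training, test = split_data(range(100), percent=80)
# the two partitions would then be written under
# <basepath>/training and <basepath>/test on HDFS
```

Once the split is materialized, the experiment is pinned to it, which is why the dataset cannot change after step 3.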

Design - WIP

The idea is to use a Spark program along with the Data Prep application to expose REST endpoints that calculate stats and manage models. When a user chooses a dataset and requests a split, the REST endpoint exposed by the Spark program is called. The dataset used for model building is usually small (a few MB), so this operation can be either synchronous or asynchronous. It splits the data and stores it under the <basepath>/training and <basepath>/test directories on HDFS. Once the experiment is created, the UI will deploy and run a pipeline along with the model parameters. We are implementing a Trainer plugin to execute and build the machine learning model. The model for that iteration is persisted on HDFS along with its metadata. Once the model is evaluated and finalized, it is deployed; the deployed model can later be used by the Predictor plugin to predict values through a Hydrator pipeline.

Implementation

Model Lifecycle:

 

Wrangler Changes:

  • Verify whether emitting stats for all features/columns takes a long time in the Spark program. We need to assess this because we rely on Spark's map and reduceByKey methods to compute stats for each column in the data. Test performance with 100k records.
  • The Spark program should also be able to execute directives and UDDs.
  • Define REST endpoints for listing models, listing experiments, getting stats, etc.
  • Design the schema for the datasets used to store stats and models.
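The map/reduceByKey approach to per-column stats can be sketched in pure Python using `functools.reduce`; the real implementation would be Spark, and the stat-record shape, column name, and `merge`/`column_stats` helpers here are assumptions:

```python
from functools import reduce

def merge(a, b):
    """Combine two partial stats, as Spark's reduceByKey would."""
    return {
        "count": a["count"] + b["count"],
        "numNull": a["numNull"] + b["numNull"],
        "sum": a["sum"] + b["sum"],
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
    }

def column_stats(rows, column):
    """Map each row to a partial stat record, then reduce over one column."""
    partials = [
        {
            "count": 1,
            "numNull": 1 if row[column] is None else 0,
            "sum": row[column] or 0,
            "min": row[column] if row[column] is not None else float("inf"),
            "max": row[column] if row[column] is not None else float("-inf"),
        }
        for row in rows
    ]
    stats = reduce(merge, partials)
    stats["mean"] = stats["sum"] / (stats["count"] - stats["numNull"])
    return stats

rows = [{"price": 10.0}, {"price": 30.0}, {"price": None}]
stats = column_stats(rows, "price")
```

Because the merge step is associative, each column's partials can be reduced independently per partition, which is what makes the reduceByKey formulation parallelize.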

Platform Changes:

  • Support a REST server in a Spark program.
  • Dynamic plugin loading, similar to services.

 

Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Request Body | Request Params | Response Code | Response
GET /experiments - list experiments

Request params:

  • srcpath=/path - returns only experiments that have that srcpath.
  • offset=0 - returns experiments starting from that offset.
  • limit=10 - limits the number of experiments returned to the specified limit.

When the limit and offset query parameters are used, TOTAL_COUNT is returned as part of the response headers to determine the total number of "pages" and give context for the current page.

 
[
  {
    "name": "",
    "description": "",
    "srcpath": "/path/to/files",
    "outcome": ""
  },
  ...
]
GET /experiments/<experiment-name> - get experiment. Response:
{
  "name": "",
  "description": "",
  "srcpath": "/path/to/files",
  "outcome": "price"
}
PUT /experiments/<experiment-name> - add experiment. Request body:
{
  "description": "",
  "srcpath": "/path/to/files",
  "outcome": "price",
  "outcomeType": "double",
  "directives": [ ... ]
}
Response code: 409 - if srcpath, outcome, or outcomeType is already defined as another value.

DELETE /experiments/<experiment-name> - delete experiment
GET /experiments/<experiment-name>/splits - list splits. Response (per split):
{
  "directives": [ ... ],
  "schema": {...},
  "type": "random" | "first",
  "parameters": {
    "percent": "80" 
  },
  "description": "",
  "trainingPath": "",
  "testPath": "",
  "models": [ ... ]
}
POST /experiments/<experiment-name>/splits - add a split. Request body:
{
  "directives": [ ... ],
  "schema": {...},
  "type": "random" | "first",
  "parameters": {
    "percent": "80" 
  },
  "description": ""
}
  
{
  "id": "123456"
}
GET /experiments/<experiment-name>/splits/<split-id> - get split info. Response:
{
  "id": "splitid",
  "directives": [ ... ],
  "schema": {...},
  "type": "random" | "first",
  "parameters": {
    "percent": "80" 
  },
  "description": "",
  "trainingPath": "",
  "testPath": "",
  "stats": [
    {
      "field": "numeric-field",
      "numTotal": {
        "train": ...,
        "test": ...,
      },
      "numNull": {
        "train": ...,
        "test": ...,
      },
      "mean": {
        "train": ...,
        "test": ...,
      },
      "min": {
        "train": ...,
        "test": ...,
      },
      "max": {
        "train": ...,
        "test": ...,
      },
      "stddev": {
        "train": ...,
        "test": ...,
      },
      "numZero": {
        "train": ...,
        "test": ...,
      },
      "numNegative": {
        "train": ...,
        "test": ...,
      },
      "numPositive": {
        "train": ...,
        "test": ...,
      },
      "histo": [
         {
           "bin": "[0,100)",
           "count": {
             "train": ...,
             "test": ...,
           },
         },
         ...
      ],
      "similarity": 0.88
    },
    {
      "field": "categorical-field",
      "numTotal": {
        "train": ...,
        "test": ...,
      },
      "numNull": {
        "train": ...,
        "test": ...,
      },
      "numEmpty": {
        "train": ...,
        "test": ...,
      },
      "unique": {
        "train": ...,
        "test": ...,
      },
      "histo": [
         {
           "bin": "cat1",
           "count": {
             "train": ...,
             "test": ...,
           },
         },
         ...
      ],
      "similarity": 0.88
    },
  ],
  "models": [ ... ]
}
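The per-field "histo" bins and the train/test "similarity" value in the response above could be computed as sketched below. This is an assumption-laden illustration: the bin edges are hypothetical, and the similarity metric shown is a simple histogram overlap coefficient, chosen for illustration only (the actual metric is not specified in this design):

```python
def histogram(values, edges):
    """Count values into [lo, hi) bins defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def overlap_similarity(train_counts, test_counts):
    """Overlap coefficient of two normalized histograms (1.0 = identical)."""
    t = sum(train_counts) or 1
    s = sum(test_counts) or 1
    return sum(min(a / t, b / s) for a, b in zip(train_counts, test_counts))
```

A similarity near 1.0 would indicate that the training and test distributions for a field agree; a low value flags a skewed split.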
GET /experiments/<experiment-name>/splits/<split-id>/status - get status of split
DELETE /experiments/<experiment-name>/splits/<split-id> - delete split
POST /experiments/<experiment-name>/models - create a model. If an existing split is included, the model state will be determined by the split state. Directives and a split cannot both be specified at the same time. Request body:
{
  "name": "",
  "description": "",
  "directives": [ ... ],
  "split": id // optional,
}
  
{
  "id": "woeifn"
}
GET /experiments/<experiment-name>/models - list models in an experiment. Response:
[
  {
    "id": "",
    "name": "",
    "description": "",
    "split": "id123456",
    "algorithm": "decision tree",
    "hyperparameters": {
      "maxBins": "3"
    },
    "evaluationMetrics": {
      "precision": 0.8,
      "recall": 0.5,
      "f1": 0.7,
      "rmse": ,
      "r2": ,
      "evariance": ,
      "mae": ,
    },
    "directives": [
      "parse-as-csv ...",
      ...
    ],
    "features": [ "f1", "f2", "f3" ],
    "deploytime": timestamp, // missing if not deployed
    "traintime": timestamp, // missing if not trained
    "createtime": timestamp
  },
  ...
]
POST /experiments/<experiment-name>/models/<model-id>/split - create a split for the model using the directives of the model. Request body:
{
  "schema": {...},
  "type": "random" | "first",
  "parameters": {
    "percent": "80" 
  },
  "description": ""
}
   
PUT /experiments/<experiment-name>/models/<model-id>/directives - modify the model directives. Deletes any split already assigned to the model. Request body:
{
  "directives": [ ... ]
}
Response code: 409 - if the model is not in the PREPARING, SPLIT_FAILED, or TRAINING_FAILED states.
POST /experiments/<experiment-name>/models/<model-id>/train - train a model. Request body:
{
    "algorithm": "decision tree",
    "hyperparameters": {
      "maxBins": "3"
    }
}
Response code: 409 - if the model is not in the TRAINING_FAILED or DATA_READY states.
DELETE /experiments/<experiment-name>/models/<model-id> - delete a model
GET /experiments/<experiment-name>/models/<model-id>/status - get model status
POST /experiments/<experiment-name>/models/<model-id>/deploy - deploy a model
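The 409 responses above imply a small state machine over the model lifecycle. The sketch below encodes only the transitions those two 409 conditions pin down; the action names and any states beyond PREPARING, SPLIT_FAILED, DATA_READY, and TRAINING_FAILED are assumptions:

```python
# Allowed states per action, inferred from the 409 conditions in the
# endpoint list above; "set_directives" and "train" are hypothetical
# action names for the PUT .../directives and POST .../train calls.
ALLOWED = {
    "set_directives": {"PREPARING", "SPLIT_FAILED", "TRAINING_FAILED"},
    "train": {"DATA_READY", "TRAINING_FAILED"},
}

def handle(action, state):
    """Return the HTTP status a request would get in the given model state."""
    return 200 if state in ALLOWED[action] else 409
```

For example, retraining after a failure (`train` in TRAINING_FAILED) is allowed, while training before the split has produced data (`train` in PREPARING) is rejected with 409.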

Deprecated REST API

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What's the impact on Authorization and how does the design take care of this aspect

Impact on Infrastructure Outages 

System behavior (if applicable - document impact on downstream [ YARN, HBase etc ] component failures) and how does the design take care of these aspect

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work
