Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

Machine learning and model building require supporting tooling, such as model management and deployment tools. A typical machine learning pipeline looks as follows:

Data scientists usually spend most of their time on non-machine-learning tasks: data cleansing, sampling, and managing models and statistics across iterations. Athena addresses this problem by providing a machine learning model management framework.

Goals

Build a machine learning model management framework that gives data scientists a better experience.

User Stories 

  • As a data scientist, I should be able to provide training and test data, or split a dataset into training and test datasets.
  • As a data scientist, I should be able to prepare data to create a machine learning model.
  • As a data scientist, I should be able to create experiments, apply directives, and tune machine learning model parameters. I should be able to do this repeatedly in order to get an accurate model.
  • As a data scientist, I should be able to view and list different experiments, feature stats, and model stats.
  • As a data scientist, I should be able to deploy and use a finalized model for verification and prediction.
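The first user story (providing data or splitting it into training and test datasets) can be sketched as a simple random split. The function name, split fraction, and seed below are illustrative only, not part of any proposed Athena API:

```python
import random

def random_split(rows, test_fraction=0.2, seed=42):
    """Randomly split rows into (training, test) lists.

    `test_fraction` is the fraction of rows held out for testing.
    A fixed seed keeps the split reproducible across runs, which
    matters when an experiment is later bound to the split data.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Example: split 10 rows 80/20.
train, test = random_split(range(10))
```

Other splitting strategies (stratified, time-based) would follow the same shape: take the full dataset, return two disjoint subsets.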

Design

In order to support a 

Discussions

The flow of the machine learning framework in CDAP will be as follows:

  1. Choose data (training/test data).
  2. Decide on a splitting method (random and others?).
  3. Create an experiment out of the split data. This means each experiment is associated with a dataset; after this point, the dataset cannot be changed.
  4. Prepare data using data prep and apply directives. These directives will be applied to the whole training and test datasets.
  5. Users can see stats about features.
  6. Build a model with some parameters.
  7. Evaluate, and repeat steps 4 to 7 until an accurate model is created.
  8. Once an accurate model is created, deploy the model.
  9. After the model is deployed, it is stored in a partitioned fileset, and data scientists can use it in pipelines via Predictor plugins.
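The flow above can be sketched as a minimal in-memory experiment lifecycle. Every class and method name here is hypothetical; the sketch only illustrates the sequencing rules implied by the steps (the dataset is frozen once the experiment is created, directives accumulate on the experiment, and a model must be evaluated before it can be deployed):

```python
class Experiment:
    """Hypothetical sketch of the experiment lifecycle described above."""

    def __init__(self, name, train, test):
        # Step 3: an experiment is bound to its split data at creation;
        # tuples make the bound data immutable from then on.
        self.name = name
        self._train = tuple(train)
        self._test = tuple(test)
        self.directives = []
        self.models = []

    def apply_directive(self, directive):
        # Step 4: directives apply to both training and test data.
        self.directives.append(directive)

    def build_model(self, **params):
        # Step 6: build a model with some parameters.
        model = {"params": params, "evaluated": False, "deployed": False}
        self.models.append(model)
        return model

    def evaluate(self, model):
        # Step 7: evaluate before deciding to iterate or deploy.
        model["evaluated"] = True

    def deploy(self, model):
        # Step 8: only an evaluated model may be deployed.
        if not model["evaluated"]:
            raise ValueError("evaluate the model before deploying it")
        model["deployed"] = True
```

In the real framework, deployment (step 9) would additionally persist the model to a partitioned fileset for use by Predictor plugins; that storage detail is omitted here.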


Approach

Approach #1

Approach #2

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path | Method | Description | Response Code | Response
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success; 404 - When application is not available; 500 - Any internal errors |
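As an illustration, the GET endpoint above could be exercised as follows. The base URL and application id are placeholders, and the error handling mirrors the listed response codes:

```python
import json
import urllib.request
import urllib.error

def get_app_spec(base_url, app_id):
    """Fetch the application spec for `app_id`.

    Returns the parsed JSON spec on 200, None on 404 (application
    is not available), and re-raises on any other HTTP error
    (e.g. a 500 for internal errors).
    """
    url = f"{base_url}/v3/apps/{app_id}"
    try:
        with urllib.request.urlopen(url) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise
```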

Deprecated REST API

Path | Method | Description
/v3/apps/<app-id> | GET | Returns the application spec for a given application

CLI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

UI Impact or Changes

  • Impact #1
  • Impact #2
  • Impact #3

Security Impact 

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages 

System behavior (if applicable, document the impact of failures in downstream components such as YARN or HBase) and how the design takes care of these aspects.

Test Scenarios

Test ID | Test Description | Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

  • Work #1
  • Work #2
  • Work #3

 

Future work