Hydrator Backend Application

The goal is to develop a back-end app that encapsulates business logic and acts as an intermediary between CDAP-UI and the CDAP backend. The back-end app simplifies developing new features in CDAP-UI because it translates business-logic requests and actions into the appropriate CDAP backend requests and actions, and returns the relevant information to the UI. This lets CDAP-UI focus on the UI aspects rather than on the business logic involved. Ideally, this back-end app will remove the need for "view in CDAP", as the UI will be able to get the required information from the back-end app.

Checklist

User stories documented (Shankar)
User stories reviewed (Nitin)
Design documented (Shankar)
Design reviewed (Terence/Andreas/Albert)
Feature merged (Shankar)
UI updated (Ajai/Edwin)
Documentation for feature (Shankar)

Use-cases

Case #1

User adds a database plugin to the pipeline and clicks on the database plugin to fill in the configuration.
User provides the JDBC string, table name or SELECT query, username and password.
User then clicks on the button to populate the schema.
UI makes a backend call to the Hydrator App to retrieve the schema, based on either the table or the SELECT query.
User then has the choice to include the schema as the output schema of the database plugin.
The schema associated with the database plugin is stored as part of the spec in the exported pipeline.

Case #2

User adds a database plugin to the pipeline and clicks on the database plugin to fill in the configuration.
User provides the JDBC string (including database and other configurations), username and password.
To select a table, the user clicks on the button to list the tables.
UI makes a backend call to retrieve the list of tables and shows it to the user.
User then selects a table, which automatically populates the schema as the output schema of the database plugin.

Case #3

Shankar is using a Hydrator Studio instance to build a batch pipeline for processing data from a Stream.
Albert is using the same instance of Hydrator Studio to build a real-time pipeline for processing data from Twitter.
Both Albert and Shankar have complex pipelines to build and want to ensure that their work is not lost, so they periodically save it as a draft.
When the two of them save drafts independently of each other, each user's drafts are visible to the other.

User Stories

There are Hydrator-specific functionalities which could leverage CDAP’s features.

Drafts
- User wants to add a new draft or save the pipeline he is working on as a draft
- User can update an existing draft of a pipeline as a new version; previous versions of the pipeline are saved (up to 20 versions)
- User can go back to the previous version, or to any other saved version, of a draft
- User wants to retrieve the latest version of the draft for a pipeline
- User wants to view all available pipeline drafts across all users
- User wants the ability to write a pipeline draft
- User has access only to those pipelines that are available in the namespace the user is in.
Plugin Output Schema
- User using DB-Source wants to enter a connection-string and table name and have the table schema information populated automatically.
List Field values
- User provides connection-string, user-name and password and expects the list of available tables to be returned in DB-Source.

Design

Option #1

Description

The Hydrator app needs to be able to read from and write to a dataset in order to store and retrieve drafts and other information about business logic. We can implement a Hydrator CDAP Application with a service that exposes REST endpoints to serve the required Hydrator functionalities. Enabling Hydrator in a namespace will deploy this Hydrator app and start the service. The Hydrator UI would check that this service is available before coming up. The back-end business-logic actions that need to use the CDAP service endpoints directly can be made generic.

Pros
- Everything (Drafts, etc) stored in the same namespace, proper cleanup when namespace is deleted.
Cons
- Every namespace will have an extra app for supporting Hydrator if Hydrator is enabled. Running this service will run 2 containers per namespace. We can add an option to enable/disable Hydrator for namespaces that are not using it. It might feel odd to see it listed as a user app, as the user didn't write or create this app.

Option #2

Description

We will still use a Hydrator CDAP app, but we create an "Extensions" namespace and deploy the "hydrator" app only in that namespace. This app would serve the Hydrator requests for all namespaces.

It will use a single dataset to store the drafts. Row keys can be namespaced for storing the drafts; when a namespace is deleted, the rows belonging to that namespace will be deleted from the dataset.
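
The namespaced row keys described above could look roughly like the following. This is a minimal sketch, not the actual dataset code: the class `DraftTable`, the `<namespace>:<draft-id>` key layout and the sorted in-memory map standing in for the dataset are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical stand-in for the single drafts dataset; a sorted map plays the role of the table. */
final class DraftTable {
  // Assumption: ':' never appears in a namespace id, so it can separate namespace and draft id.
  private static final char SEPARATOR = ':';
  private final NavigableMap<String, String> rows = new TreeMap<>();

  static String rowKey(String namespace, String draftId) {
    return namespace + SEPARATOR + draftId;
  }

  void put(String namespace, String draftId, String draftJson) {
    rows.put(rowKey(namespace, draftId), draftJson);
  }

  /** Lists the draft ids of one namespace by scanning only the rows that share its key prefix. */
  List<String> list(String namespace) {
    String prefix = namespace + SEPARATOR;
    List<String> ids = new ArrayList<>();
    for (String key : prefixRange(namespace).keySet()) {
      ids.add(key.substring(prefix.length()));
    }
    return ids;
  }

  /** Deletes every row that belongs to the namespace and returns how many were removed. */
  int deleteNamespace(String namespace) {
    NavigableMap<String, String> range = prefixRange(namespace);
    int removed = range.size();
    range.clear(); // clearing the view removes the rows from the backing map
    return removed;
  }

  // Keys starting with "<namespace>:" are exactly those in ["<namespace>:", "<namespace>;"),
  // because ';' is the character that follows ':'.
  private NavigableMap<String, String> prefixRange(String namespace) {
    String from = namespace + SEPARATOR;
    String to = namespace + (char) (SEPARATOR + 1);
    return rows.subMap(from, true, to, false);
  }
}
```

With a key layout like this, cleaning up a deleted namespace is a single prefix scan, and a namespace such as "ns2" is never touched when "ns" is deleted.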

Pros
- Fewer resources are used: only 2 containers rather than 2 containers per namespace, and only one dataset.
- Only one app serves Hydrator across all namespaces instead of one app per namespace, so there is less clutter.
- New extensions could be added to the same namespace to support other use cases in future.
Cons
- Is using a single dataset for storing all drafts across namespaces less secure?
- Users won't be able to create a new namespace called "Extensions", as it will be reserved.

Open Questions

How to delete the drafts when the namespace is deleted?
When to stop this service?
Availability of the service?
Security
- If we decide to add more capabilities to the Hydrator back-end app (e.g. pipeline validation, deploying apps, etc.), can the hydrator-service discover the appropriate cdap.service and call the appropriate endpoints in a secure environment?

Option #3 (based on discussion with Terence)

1) No new user-level apps are deployed. The config store is used to store user drafts of Hydrator apps.

2) A REST endpoint, 'configure', accepts a partial config and returns a config response with suggested values for fields in a plugin, along with any exceptions raised while configuring the plugin.

The user can choose a value from the suggestions for a field and call configure again.
The user can look at the exception, fix the issue in either the script or the configuration, and call configure again.
When all the required configs are provided and there are no exceptions, completionStatus is set to true for the plugin.
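
The configure round trip described above might be sketched as follows. Everything here is hypothetical: the class `ConfigureSketch`, its `Response` type, the property names `connectionString` and `tableName`, and the `knownTables` list standing in for what a real plugin would discover from the database.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ConfigureSketch {
  /** Hypothetical response: per-field suggestions, an optional exception and a completion flag. */
  static final class Response {
    final Map<String, List<String>> suggestions = new HashMap<>();
    String exception;   // stays null when configuring raised no error
    boolean isComplete; // true only when every required field is set and valid
  }

  /**
   * Accepts a partial config. Missing fields produce suggestions, invalid fields produce
   * an exception message, and a fully valid config is marked complete.
   */
  static Response configure(Map<String, String> properties, List<String> knownTables) {
    Response response = new Response();
    String connection = properties.get("connectionString");
    if (connection == null || !connection.startsWith("jdbc:")) {
      response.exception = "connectionString must be a JDBC URL";
      return response;
    }
    String table = properties.get("tableName");
    if (table == null) {
      // Partial config: offer the values the user can pick from, then expect another call.
      response.suggestions.put("tableName", new ArrayList<>(knownTables));
      return response;
    }
    if (!knownTables.contains(table)) {
      response.exception = "unknown table: " + table;
      return response;
    }
    response.isComplete = true;
    return response;
  }
}
```

A caller would invoke `configure` repeatedly, copying a chosen suggestion into the properties or fixing the reported problem between calls, until `isComplete` is true.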

Story 1 - Drafts

HTTP Request Type	Endpoint	Request Body	Response Status	Response Body
POST	/namespaces/{namespace-id}/configurations/{config-id}/	{ "config": {...} }	200 OK: config saved successfully; 409 CONFLICT: draft-name already exists; 500 Error: while saving the draft
PUT	/namespaces/{namespace-id}/configurations/{config-id}/	{ "config": {...} }	200 OK: config updated successfully; 404 Not Found: config doesn't exist, cannot be updated; 500 Error: while updating the config
GET	/namespaces/{namespace-id}/configurations/{config-id}/		200: returns all the versions of the config identified by config-id; 404: config not found; 500: error while getting config	[ { "timestamp": "...", "config": { "source": {....}, "transforms": [...], "sinks": [...], "connections": [..] } }, ... ]
GET	/namespaces/{namespace-id}/configurations/{config-id}/versions/{version-number} (-1 -> latest version)		200: returns the version of the config identified by config-id and version-number; 404: config with that version not found; 500: error while getting config	{ "timestamp": "...", "config": { "source": {....}, "transforms": [...], "sinks": [...], "connections": [..] } }
GET	/namespaces/{namespace-id}/configurations/		200: returns the list of names of all saved configs; 500: error	[ "streamToTPFS", "DBToHBase", ... ]
DELETE	/namespaces/{namespace-id}/configurations/		200: successfully deleted all configs; 500: error while deleting
DELETE	/namespaces/{namespace-id}/configurations/{config-id}		200: successfully deleted the specified config; 404: config does not exist; 500: error while deleting
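
The status codes of the draft endpoints above can be illustrated with a small in-memory handler. This is only a sketch under stated assumptions: `DraftEndpointSketch` and its methods are hypothetical, a plain map replaces the ConfigStore, versioning is left out, and config bodies are assumed to be non-null.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory handler; each method returns the HTTP status the endpoint would send. */
final class DraftEndpointSketch {
  // Key is "<namespace>/<config-id>"; value is the most recently saved config body.
  private final Map<String, String> drafts = new HashMap<>();

  private static String key(String namespace, String configId) {
    return namespace + "/" + configId;
  }

  /** POST: creates a draft and refuses to overwrite an existing one. */
  int post(String namespace, String configId, String config) {
    return drafts.putIfAbsent(key(namespace, configId), config) == null ? 200 : 409;
  }

  /** PUT: updates a draft that must already exist. */
  int put(String namespace, String configId, String config) {
    return drafts.replace(key(namespace, configId), config) != null ? 200 : 404;
  }

  /** DELETE on a single draft. */
  int delete(String namespace, String configId) {
    return drafts.remove(key(namespace, configId)) != null ? 200 : 404;
  }

  /** GET on the collection: sorted names of all drafts saved in the namespace. */
  List<String> list(String namespace) {
    String prefix = namespace + "/";
    List<String> names = new ArrayList<>();
    for (String k : drafts.keySet()) {
      if (k.startsWith(prefix)) {
        names.add(k.substring(prefix.length()));
      }
    }
    Collections.sort(names);
    return names;
  }
}
```

Keeping POST strict (409 on an existing name) and PUT strict (404 on a missing name) lets the UI tell "save as new draft" apart from "update this draft" without an extra lookup.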

The ConsoleSettingsHttpHandler currently makes use of ConfigStore. However, it is not namespaced and has a few other issues; it can be fixed and improved to store configs.

Along with pipeline drafts ConsoleSettingsHttpHandler also stores the following information currently:

Plugin Template Endpoints

// get the plugin template
GET namespaces/{namespace-id}/plugin-templates/{plugin-template-id}/
// create a new plugin template
POST namespaces/{namespace-id}/plugin-templates/{plugin-template-id}/ -d '@plugin-template.json' 
// update existing plugin template
PUT namespaces/{namespace-id}/plugin-templates/{plugin-template-id}/ -d '@plugin-template.json'
// delete the plugin template
DELETE namespaces/{namespace-id}/plugin-templates/{plugin-template-id}/

Defaults

// create/update defaults; this includes the user's plugin version preferences, etc.
PUT namespaces/{namespace-id}/defaults -d '@default.json'
// get the defaults
GET namespaces/{namespace-id}/defaults

Config Store:

Existing ConfigStore methods

void create(String namespace, String type, Config config) throws ConfigExistsException;

void createOrUpdate(String namespace, String type, Config config);

void delete(String namespace, String type, String id) throws ConfigNotFoundException;

List<Config> list(String namespace, String type);

Config get(String namespace, String type, String id) throws ConfigNotFoundException; 

void update(String namespace, String type, Config config) throws ConfigNotFoundException;

New ConfigStore methods

// get a particular version of an entry.
Config get(String namespace, String type, String id, int version) throws ConfigNotFoundException;
// get all the versions of an entry.
List<Config> getAllVersions(String namespace, String type, String id) throws ConfigNotFoundException;
// delete all entries of the specified type.
void delete(String namespace, String type);
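
One way the versioned methods could behave is sketched below. This is not the ConfigStore implementation: `VersionedConfigStoreSketch` is a hypothetical in-memory class, config bodies are plain strings, `NoSuchElementException` stands in for ConfigNotFoundException, version numbers are assumed to start at 0 and only grow, and the cap of 20 versions is taken from the Drafts user story.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class VersionedConfigStoreSketch {
  static final int MAX_VERSIONS = 20; // from the Drafts user story
  static final int LATEST = -1;       // mirrors "-1 -> latest version" on the REST endpoint

  // Key is "<namespace>/<type>/<id>" (assumes none of the parts contains '/');
  // value maps version number to config body.
  private final Map<String, TreeMap<Integer, String>> entries = new HashMap<>();

  private static String key(String namespace, String type, String id) {
    return namespace + "/" + type + "/" + id;
  }

  /** Stores a new version and returns its number; the oldest version is dropped beyond the cap. */
  int createOrUpdate(String namespace, String type, String id, String config) {
    TreeMap<Integer, String> versions =
        entries.computeIfAbsent(key(namespace, type, id), k -> new TreeMap<>());
    int next = versions.isEmpty() ? 0 : versions.lastKey() + 1;
    versions.put(next, config);
    if (versions.size() > MAX_VERSIONS) {
      versions.pollFirstEntry();
    }
    return next;
  }

  /** Returns one version of an entry; LATEST selects the newest one. */
  String get(String namespace, String type, String id, int version) {
    TreeMap<Integer, String> versions = versionsOf(namespace, type, id);
    String config = version == LATEST ? versions.lastEntry().getValue() : versions.get(version);
    if (config == null) {
      throw new NoSuchElementException("version " + version + " not found for " + id);
    }
    return config;
  }

  /** Returns all retained versions of an entry, oldest first. */
  List<String> getAllVersions(String namespace, String type, String id) {
    return new ArrayList<>(versionsOf(namespace, type, id).values());
  }

  /** Deletes all entries of the given type in the namespace. */
  void delete(String namespace, String type) {
    String prefix = namespace + "/" + type + "/";
    entries.keySet().removeIf(k -> k.startsWith(prefix));
  }

  private TreeMap<Integer, String> versionsOf(String namespace, String type, String id) {
    TreeMap<Integer, String> versions = entries.get(key(namespace, type, id));
    if (versions == null) {
      throw new NoSuchElementException("config not found: " + id);
    }
    return versions;
  }
}
```

Because version numbers are never reused in this sketch, a version that has been evicted keeps returning "not found" rather than silently resolving to a different draft.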

Open Questions :

1) ConfigStore stores the configs in "config.store.table". Currently the table properties don't enable versioning, and drafts would need versioning. Would this also need a CDAP upgrade to update the properties of the existing dataset?

2) Rename ConsoleSettingsHttpHandler to ConfigurationsHttpHandler?

Story 2 - Schema and field value suggestions :

REST API:

Request-Method : POST

Request-Endpoint : /namespaces/{namespace-id}/apps/{app-id}/configure

Request-Body

request.json

{
    "artifact": {
        "name": "cdap-etl-batch",
        "scope": "SYSTEM",
        "version": "3.4.0-SNAPSHOT"
    },
    "name": "pipeline",
    "config": {
        "source": {
            "name": "Stream",
            "plugin": {
                "name": "StreamSource",
                "artifact": {
                    "name": "core-plugins",
                    "version": "1.3.0-SNAPSHOT",
                    "scope": "SYSTEM"
                },
                "properties": {
                    "format": "syslog",
                    "name": "test",
                    "duration": "1d"
                }
            }
        },
        "sinks": [{..}],
        "transforms": [{..}, {...}]
    }
}

Response-Body

response.json

{
    "artifact": {
        "name": "cdap-etl-batch",
        "scope": "SYSTEM",
        "version": "3.4.0-SNAPSHOT"
    },
    "name": "pipeline",
    "config": {
        "source": {
            "name": "Stream",
            "plugin": {
                "name": "StreamSource",
                "artifact": {
                    "name": "core-plugins",
                    "version": "1.3.0-SNAPSHOT",
                    "scope": "SYSTEM"
                },
                "properties": {
                    "format": "syslog",
                    "name": "test",
                    "duration": "1d",
                    "suggestions": [{
                        "schema": [
                            {
                                "ts": "long",
                                "headers": "Map<String, String>",
                                "program": "string",
                                "message": "string",
                                "pid": "string"
                            }
                        ]
                    }],
                    "isComplete": "false"
                }
            }
        },
        "sinks": [{..}],
        "transforms": [{..}, {...}]
    }
}

PipelineConfigurable API Change

PipelineConfigurable

@Beta
public interface PipelineConfigurable {
  // change in return-type.
  ConfigResponse configurePipeline(PipelineConfigurer pipelineConfigurer) throws IllegalArgumentException; 
}

ConfigResponse

public class ConfigResponse extends Config {
  // list of suggestions for fields.
  List<Suggestion> suggestions;
  // exception raised while executing configure, if any.
  @Nullable
  String exception;
  // is the stage configuration complete?
  @DefaultValue("false")
  boolean isComplete;
}

Suggestion

public class Suggestion {
  // name of the field the suggestion applies to
  String fieldName;
  // list of possible values for the fieldName
  List<String> fieldValues;
}

ApplicationContext

@Beta
public interface ApplicationContext<T extends Config> {
  // existing
  T getConfig();
  // application will set a config response
  void setResponseConfig(T response);
  // get the response config
  T getResponseConfig();
}

Open Questions:

1) Would having setResponseConfig and getResponseConfig on ApplicationContext, alongside the input config, allow one CDAP program to set a config that other programs can read? Would that be an issue?

2) Databases have an information schema, which holds metadata about the column names and column types of tables.

However, we recently removed tableName from the DBSource plugin, so how would we figure out this information from just the query?
How do we get the schema for complex queries involving multiple tables? This would involve parsing the query to identify the fields and the tables they come from, and then querying the information schema for the types.
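
One possible alternative to parsing the query, offered here only as a sketch and not as the agreed design, is to ask the JDBC driver to describe the query's result columns through `PreparedStatement.getMetaData()`. Whether this works without executing the query depends on the driver, and the method may return null. The class `QuerySchemaSketch` and its method names are hypothetical.

```java
import java.sql.Connection;
import java.sql.JDBCType;
import java.sql.PreparedStatement;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

final class QuerySchemaSketch {
  /** Maps each result column label to its JDBC type name, in column order. */
  static Map<String, String> toSchema(ResultSetMetaData metaData) throws SQLException {
    Map<String, String> schema = new LinkedHashMap<>();
    for (int i = 1; i <= metaData.getColumnCount(); i++) { // JDBC columns are 1-based
      String typeName = JDBCType.valueOf(metaData.getColumnType(i)).getName();
      schema.put(metaData.getColumnLabel(i), typeName);
    }
    return schema;
  }

  /** Asks the driver to describe the query's result columns; works for joins as well. */
  static Map<String, String> schemaOf(Connection connection, String query) throws SQLException {
    try (PreparedStatement statement = connection.prepareStatement(query)) {
      ResultSetMetaData metaData = statement.getMetaData();
      if (metaData == null) {
        // The JDBC contract allows drivers that cannot describe a statement to return null.
        throw new SQLException("driver cannot describe the query without executing it");
      }
      return toSchema(metaData);
    }
  }
}
```

The JDBC type names would still need to be mapped to the plugin's schema types, which is outside this sketch.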

User Stories (3.5.0)

For the Hydrator use case, the back-end app should support the Hydrator-related functionalities listed below:
- Query for the plugins available for given artifacts and list them in the UI.
- Obtain the output schema of plugins, given the input configuration information.
- Deploy a pipeline and start/stop the pipeline.
- Query the status of a pipeline run, and the current status of execution if there are multiple stages.
- Get the next scheduled run; query metrics and logs for the pipeline runs.
- Create and save pipeline drafts.
- Get the input/output streams/datasets of the pipeline run and list them in the UI.
- Explore the data of streams/datasets used in the pipeline if they are explorable.
- Add new metadata about a pipeline and retrieve metadata by pipeline run, etc.
- Delete a Hydrator pipeline.
- The back-end app's functionality should be limited to Hydrator; it should not act as a proxy for CDAP.

Having these abilities removes the logic in CDAP-UI that makes the appropriate CDAP REST calls. This encapsulation simplifies the UI's interaction with the back-end and also helps in debugging potential issues faster. In the future, we could have more apps similar to the Hydrator app, so our back-end app should define and implement generic cases that can be used across these apps, and it should be extensible to support adding new features.

Generic Endpoints

HTTP Request Type	Endpoint	Request Body	Description	Response Body
GET	/extensions/{back-end}/status		200 OK: platform service is available; 404: service unavailable
GET	/extensions/{back-end}/program/{program-name}/runs		200 OK: runs of the program	[ "4as432-are425-..", "4az422-are425-.." .... ]
POST	/extensions/{back-end}/program/{program-name}/action		200: start/stop/status of the program
POST	/extensions/{back-end}/program/{program-name}/metrics/query (query params: startTime, endTime, scope)		config: time-range, tags; 200: returns metrics
GET	/extensions/{back-end}/program/{program-name}/logs/{log-level} (query params: startTime, endTime)		200: returns logs for a time-range
GET	/extensions/{back-end}/program/{program-name}/schedule		200: gets the next scheduled run-time	{ "timestamp":"1455832171" }
GET	/extensions/{back-end}/program/{program-name}/datasets		200: gets all the input/output datasets used in the program	[ purchases, history, .... ]
POST	/extensions/{back-end}/program/{program-name}/datasets/{dataset-name}/explore/{action}		performs the action {preview, download, next} for explore on the dataset; 200: explore result
POST	/extensions/{back-end}/program/{program-name}/metadata	{ "key" : "...", "value" : "..." }	stores the metadata supplied in JSON for this program; 200: ok
GET	/extensions/{back-end}/program/{program-name}/metadata		gets the metadata added for this program; 200: metadata result	{ "key" : "...", "value" : "..." }
DELETE	/extensions/{back-end}/program/{program-name}/metadata		200: successfully deleted the metadata added for the program
