Table of Contents

Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

Simplify and improve the user experience for cloud-based connection types such as Google Cloud Storage, Amazon S3, and BigQuery for CDAP data-prep in cloud environments.

Goals

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, data prep must be auto-configured to include cloud connection types such as Google Cloud Storage, BigQuery, or Amazon S3, based on the default credentials and project information, and must support browsing them by default.

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, data prep must hide connection types that are not relevant in that environment, such as the file browser.

User Stories

As a CDAP administrator, I want to configure DataPrep to have predefined connections.
As a CDAP administrator, I want to enforce that certain connection types are disabled, so that users cannot create new connections of those types or access existing ones.
As a CDAP user, if the type of my existing connection is disabled by an administrator, I expect data-prep to surface an appropriate error message.

Design

Background on Data Prep

Connection Type

Currently data prep has two sets of connection types.

Preconfigured connection type - File browser, which is configured to browse the local file system on the CDAP sandbox and HDFS on a cluster.
Configurable connection types - Database, Kafka, S3, GCS, and Google BigQuery.

Connection

In data-prep, users must explicitly create a connection for each configurable connection type (database, Kafka, GCS, S3, etc.) in order to explore it.

Adding a connection:

Code Block

POST : connections/create
Body :
{
  "name" : "connection_name",
  "type" : "connection_type",
  "properties": {
    ...
  }
}
Response :
{
  "values": [
    "connection id"
  ],
  "count": 1,
  "status": 200,
  "message": "Success"
}

To add a connection, users provide the following:

connection-name - required for all connections; currently the connection name has to be unique across all connections.
type - one of "upload", "file", "database", "table", "s3", "gcs", "bigquery", or "kafka".
Additional fields specific to the connection type.

Example - to add a Kafka connection, users have to provide:

connection-name
Kafka broker host and port to connect to

No Format
{ "name": "my-kafka", "type": "kafka", "properties": { "brokers": "localhost:9000", ... } }
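As a rough illustration of the request body described above, the sketch below assembles the JSON for a Kafka connection. The class `CreateConnectionRequest` and its methods are hypothetical helpers written for this example and are not part of DataPrep; only the field names (`name`, `type`, `properties`, `brokers`) come from the example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper (not part of DataPrep): builds the body for POST connections/create.
public class CreateConnectionRequest {
  private final String name;
  private final String type;
  // LinkedHashMap keeps properties in insertion order so the output is predictable.
  private final Map<String, String> properties = new LinkedHashMap<>();

  public CreateConnectionRequest(String name, String type) {
    this.name = name;
    this.type = type;
  }

  public CreateConnectionRequest property(String key, String value) {
    properties.put(key, value);
    return this;
  }

  // Minimal escaping of backslashes and quotes; a real client would use a JSON library.
  private static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  public String toJson() {
    StringJoiner props = new StringJoiner(", ", "{", "}");
    for (Map.Entry<String, String> e : properties.entrySet()) {
      props.add(quote(e.getKey()) + ": " + quote(e.getValue()));
    }
    return "{\"name\": " + quote(name) + ", \"type\": " + quote(type)
        + ", \"properties\": " + props + "}";
  }

  public static void main(String[] args) {
    String body = new CreateConnectionRequest("my-kafka", "kafka")
        .property("brokers", "localhost:9000")
        .toJson();
    System.out.println(body);
  }
}
```

The resulting string would be sent as the body of the POST request; how the request is issued (HTTP client, authentication, base URL) is left out because it depends on the deployment.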

New Features

Classifying default and configurable connection types:

DataPrep has to support allowing administrators to create default connections.
DataPrep has to support allowing administrators to disable connection types they wish to hide in DataPrep.
We will look into two approaches for supporting this in DataPrep

Approach #1 Data-prep app config to specify supported connection types and default connections (Preferred)

Administrators can specify a data-prep config JSON specific to their environment.

The config JSON specifies a ConnectionTypeConfig.

ConnectionTypeConfig can be used to specify connections that need to be created by default, along with the connection type and properties of each.
ConnectionTypeConfig can be used to specify the set of connection types that are disabled in DataPrep; all other connection types remain supported.

Code Block

language	java
title	ConnectionConfig

public class ConnectionTypeConfig {
  private Set<ConnectionType> disabledTypes;
  // list of connections to be created
  private List<Connection> connections;
  // which connection will be used by dataprep UI as default connection to show
  private Connection defaultConnection;
}


public class DataPrep extends AbstractApplication<ConnectionTypeConfig> {


}

Code Block

title	example-gcp-data-prep-config

{
   "disabledTypes":[
      "file"
   ],
   "connections":[
      {
         "name":"gcsDefault",
         "type":"gcs",
         "properties":{
            "serviceFilePath":"~/gcp/service-config.json",
            "project":"cdap"
         }
      },
      {
         "name":"bqDefault",
         "type":"bigquery",
         "properties":{ }
      },
      ...
   ],
   "defaultConnection":{
      "name":"gcsDefault",
      "type":"gcs"
   }
}

Any connections specified in the config are created on start up of the Wrangler service. They cannot be deleted.

Example explanation :

The 'file' connector type is in disabled connection type list, so no file connections are allowed.
There is a default GCS connection that uses pre-defined credentials and project information as specified in the properties map.
There is a default BigQuery connection without any properties. The BigQuery connection will read required credentials from the environment.
Kafka, Database and S3 connection types are configured to be visible, but no default connections are configured for them.

GCS is configured as the default connection, so the dataprep UI shows this connection as the default page in data-prep.
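The rules implied by the example config can be checked mechanically before the app is deployed. The following is a minimal sketch under stated assumptions: the class `DataPrepConfigCheck`, the simplified `Conn` type, and the decision to treat each inconsistency as a reportable problem are all hypothetical and not taken from the DataPrep code. The uniqueness rule for connection names comes from the background section; the other two checks are assumptions about what a consistent config looks like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: sanity-checks an administrator-supplied data-prep config.
public class DataPrepConfigCheck {

  // Simplified stand-in for a configured connection (name and type only).
  static final class Conn {
    final String name;
    final String type;

    Conn(String name, String type) {
      this.name = name;
      this.type = type;
    }
  }

  // Returns human-readable problems; an empty list means the config looks consistent.
  static List<String> validate(Set<String> disabledTypes, List<Conn> connections,
                               Conn defaultConnection) {
    List<String> problems = new ArrayList<>();
    Set<String> names = new HashSet<>();
    for (Conn c : connections) {
      // Assumption: a default connection of a disabled type is a configuration mistake.
      if (disabledTypes.contains(c.type)) {
        problems.add("connection '" + c.name + "' uses disabled type '" + c.type + "'");
      }
      // Connection names have to be unique across all connections.
      if (!names.add(c.name)) {
        problems.add("duplicate connection name '" + c.name + "'");
      }
    }
    // Assumption: defaultConnection must refer to one of the configured connections.
    if (defaultConnection != null && !names.contains(defaultConnection.name)) {
      problems.add("defaultConnection '" + defaultConnection.name + "' is not in connections");
    }
    return problems;
  }

  public static void main(String[] args) {
    List<Conn> connections = Arrays.asList(
        new Conn("gcsDefault", "gcs"), new Conn("bqDefault", "bigquery"));
    List<String> problems = validate(
        Collections.singleton("file"), connections, new Conn("gcsDefault", "gcs"));
    System.out.println(problems.isEmpty() ? "config is consistent" : problems.toString());
  }
}
```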

Surfacing the data-prep-config JSON to data-prep application

Data-prep config JSON can be specified through cdap-site.xml property

Option - 1

Code Block

<property>
<name>data.prep.config.json</name>
<value></value>
<description>
   data prep config JSON content, this config is used to specify the connection-types that are   
   available as default and the connection-types that have to be disabled in data-prep.
</description>
</property>

Example

Code Block

<property>
<name>data.prep.config.json</name>
<value>
{
  "disabledTypes": [ "file" ],
  "connections": [
    {
      "name": "gcsDefault",
      "type": "gcs",
      "properties": {
        "serviceFilePath": "~/gcp/service-config.json",
        "project": "cdap"
      }
    }, 
    {
      "name": "bqDefault",
      "type": "bigquery",
      "properties": { }
    },
    ...
  ]
}
</value>
</property>

Notes

We have to provide the JSON content instead of a JSON file path, as the CDAP UI service could be running on a different instance than the CDAP backend service.
Currently the data-prep application is deployed and run separately in each namespace. In the future, when data-prep is moved to the system namespace, it will be easier to have the data-prep application configured and deployed using bootstrap config files during CDAP startup instead of through a cdap-site.xml property.
By providing the JSON as the property value, the UI can read the content and use it directly as the application's config.

Option - 2

Since the UI is the one that creates the dataprep app, there can be a setting in the static UI configuration for the dataprep config object that should be used when creating the app. The UI did something similar for Tracker through "cdap-ui-config.json", so the structure is in place already.

Option - 3

Add a new CDAP endpoint that can be used to create an application based on a specified artifact and start programs in the application.

A config specifies the artifact, the application config, and the programs to start after the application is deployed. The path to this config file can be specified through a cdap-site.xml property.

This requires a platform change: a CDAP endpoint that reads the config and performs the following steps:

For the specified artifactName, find the latest artifact version if it is not provided.
Create an app with the identified artifact and the provided app-config.
Start the programs configured to be started.
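The first of these steps, picking the latest artifact version when none is provided, can be sketched as follows. This is only an illustration: the class `LatestArtifactVersion` is hypothetical, and it assumes purely numeric dotted versions such as "3.2.0". Versions with qualifiers (for example a snapshot suffix) are not handled here, and the platform's actual version ordering may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: choose the latest version from a list of numeric dotted versions.
public class LatestArtifactVersion {

  // Compares versions segment by segment as integers; missing segments count as 0,
  // so "3.2" and "3.2.0" compare as equal.
  static int compare(String a, String b) {
    String[] x = a.split("\\.");
    String[] y = b.split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  // Returns the highest version, or an empty Optional when no artifacts were found.
  static Optional<String> latest(List<String> versions) {
    return versions.stream().max(LatestArtifactVersion::compare);
  }

  public static void main(String[] args) {
    List<String> versions = Arrays.asList("3.1.0", "3.10.2", "3.2.5");
    System.out.println(latest(versions).orElse("no artifact found"));
  }
}
```

Comparing segments numerically rather than as strings matters here: a plain string comparison would rank "3.2.5" above "3.10.2".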

Code Block

language	java
title	CreateAppConfig

class CreateAppConfig {
	// name of the artifact used to create the app
	String artifactName;
	
	// artifact scope, if null - artifacts from both system scope and user artifacts in current namespace will be retrieved
	@Nullable
	String artifactScope;

	//artifact version, if null - latest artifact version will be used
	@Nullable
	String artifactVersion;
	
	// application config
	@Nullable
	String config;


	// list of programs to start
	List<StartProgramInfo> programs;
}


class StartProgramInfo {
	ProgramType programType;
	String programId;
}

Example config for data-prep

No Format

{
   "artifactName":"wrangler-service",
   "artifactScope":"system",
   "config":{
      "disabledTypes":[
         "file"
      ],
      "connections":[
         {
            "name":"gcsDefault",
            "type":"gcs",
            "properties":{
               "project":"cdap"
            }
         },
         {
            "name":"bqDefault",
            "type":"bigquery",
            "properties":{

            }
         }
      ]
   },
   "programs":[
      {
         "programType":"Service",
         "programId":"wrangler-service"
      }
   ]
}

The CreateAppConfig is specified as a JSON file. The path to this file can be configured using a cdap-site.xml property.

Code Block

<property>
<name>dataprep.config.path</name>
<value></value>
<description>
   Path to the configuration file for creating and configuring dataprep app
</description>
</property>

Notes

This removes the additional steps for enabling data-prep from the UI, as the backend can find the latest artifact, create the app, and start the programs.
Administrators can make changes to the config file, and the changes will be read on re-deploy of the app without requiring a CDAP restart.

Approach #2 Data prep code to handle connection type classification (considered)

The data-prep ConnectionType class will specify, for each connection type, the modes in which it is available by default and the modes in which it is not supported.

Code Block

language	java
title	ConnectionType.enum

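import java.util.Collections;
import java.util.List;

// Mode is assumed to be an enum of deployment environments (NATIVE, GCP, AWS) defined elsewhere.
public enum ConnectionType {
  FILE("file", Collections.singletonList(Mode.NATIVE), Collections.singletonList(Mode.GCP)),
  DATABASE("database"),
  TABLE("table"),
  S3("s3", Collections.singletonList(Mode.AWS), Collections.emptyList()),
  GCS("gcs", Collections.singletonList(Mode.GCP), Collections.emptyList()),
  BIGQUERY("bigquery", Collections.singletonList(Mode.GCP), Collections.emptyList()),
  KAFKA("kafka");

  private final String type;
  // modes in which a connection of this type is available by default
  private final List<Mode> defaultModes;
  // modes in which this connection type is disabled
  private final List<Mode> disabledModes;

  // types without mode-specific behavior are configurable in every mode
  ConnectionType(String type) {
    this(type, Collections.emptyList(), Collections.emptyList());
  }

  ConnectionType(String type, List<Mode> defaultModes, List<Mode> disabledModes) {
    // the default and configurable type maps per mode can be derived from these lists
    this.type = type;
    this.defaultModes = defaultModes;
    this.disabledModes = disabledModes;
  }
}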

Comparison of approaches

	Approach-1	Approach-2
Flexibility	Configuring data-prep through a config JSON specific to an environment makes it flexible to use and configure across environments.	Not flexible; changes to the default connection types would require code changes.
Extensibility	Extensible to new capabilities; offers administrators a way to disable connection types they would like to hide in data-prep in their cluster or environment.	Not extensible.
Ease of use	Requires administrators to make appropriate changes in cdap-site to configure data-prep connection types for a given environment.	No config changes required from the administrator.

Based on the above points, Approach #1 is the better approach for configuring connection types and connections.

Modifying Connections - Administrators and Users

Users will be allowed to edit or delete default connections, as with other connections.
Administrators can change the configuration of default connections; on re-deploy, any default connections that do not already exist will be created.
Changes to an existing default connection, or removal of a default connection by an administrator, won't affect connections that already exist for users in data-prep, as that could disrupt users who are actively using those connections.
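The re-deploy behavior described above amounts to a create-if-absent pass over the configured default connections. The sketch below illustrates that idea only; `DefaultConnectionReconciler` is a hypothetical class, and the connection store is modeled as a simple map from connection name to a properties string, which is an assumption made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: create configured default connections only when they are missing.
public class DefaultConnectionReconciler {

  // Adds configured defaults that are missing from the store and returns the names created.
  // Connections that already exist are left untouched, even if the config has changed,
  // and connections removed from the config are not deleted from the store.
  static List<String> reconcile(Map<String, String> store, Map<String, String> configured) {
    List<String> created = new ArrayList<>();
    for (Map.Entry<String, String> e : configured.entrySet()) {
      if (store.putIfAbsent(e.getKey(), e.getValue()) == null) {
        created.add(e.getKey());
      }
    }
    return created;
  }

  public static void main(String[] args) {
    Map<String, String> store = new LinkedHashMap<>();
    store.put("gcsDefault", "project=cdap");

    Map<String, String> configured = new LinkedHashMap<>();
    configured.put("gcsDefault", "project=other"); // changed by the administrator
    configured.put("bqDefault", "");               // newly added default

    System.out.println("created: " + reconcile(store, configured));
    System.out.println("gcsDefault: " + store.get("gcsDefault"));
  }
}
```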

Connection-types - Administrator enforcement

Administrators can disable connection types that are not appropriate for the environment the CDAP instance is running in.

When an administrator disables a connection type, the behavior is:

No new connections will be allowed to be created for that connection-type
Supported connection-types list will not include the disabled connection type
Existing connections using the disabled connection type will be filtered out from the connections list endpoint
Accessing workspaces using the disabled connection type will produce an error
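The enforcement rules listed above could be centralized in a single check, sketched below. The class `ConnectionTypeEnforcer`, its method names, and the choice of `IllegalStateException` are hypothetical. Comparing type names case-insensitively is also an assumption, made because the config uses lowercase names such as "gcs" while the list response shows "GCS".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of administrator enforcement; not the actual DataPrep code.
public class ConnectionTypeEnforcer {
  private final Set<String> disabledTypes = new HashSet<>();

  public ConnectionTypeEnforcer(Set<String> disabledTypes) {
    // Assumption: type names are compared case-insensitively ("gcs" vs "GCS").
    for (String type : disabledTypes) {
      this.disabledTypes.add(type.toLowerCase(Locale.ROOT));
    }
  }

  public boolean isDisabled(String type) {
    return disabledTypes.contains(type.toLowerCase(Locale.ROOT));
  }

  // Supported connection-types list: every known type except the disabled ones.
  // The same filter can be applied to the types of existing connections when listing them.
  public List<String> supportedTypes(List<String> allTypes) {
    List<String> supported = new ArrayList<>();
    for (String type : allTypes) {
      if (!isDisabled(type)) {
        supported.add(type);
      }
    }
    return supported;
  }

  // Intended to be called before creating a connection or opening a workspace of this type.
  public void checkEnabled(String type) {
    if (isDisabled(type)) {
      throw new IllegalStateException(
          "Connection type '" + type + "' is disabled by the administrator");
    }
  }

  public static void main(String[] args) {
    ConnectionTypeEnforcer enforcer =
        new ConnectionTypeEnforcer(new HashSet<>(Arrays.asList("file")));
    System.out.println(enforcer.supportedTypes(Arrays.asList("file", "gcs", "kafka")));
    try {
      enforcer.checkEnabled("FILE");
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```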

Listing Connections

No Format

Request : GET : v3/namespaces/dpdemo/apps/dataprep/services/service/methods/connections?type=*


Response:


{
	// existing fields
    "count": 2,
    "message": "Success",
    "status": 200,
    "values": [
        {
            "created": 1534807183,
            "description": null,
            "id": "mybq",
            "name": "mybq",
            "type": "BIGQUERY",
            "updated": 1534807183
        },
        {
            "created": 1534807183,
            "description": null,
            "id": "mygcs",
            "name": "mygcs",
            "type": "GCS",
            "updated": 1534807183
        }
    ],
    // newly added to surface information about which connection to display by default in DataPrep UI   
    "defaultConnection" : {"name" : "mygcs", "type" : "GCS"}
}

Error Handling

If a default connection fails the connection test, it won't be created and the error will be logged. Administrators can look into the error log of the data-prep application to debug the issue.
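One way to picture this behavior is a loop that tests each default connection and skips the ones that fail, logging the reason. The sketch below is hypothetical: `DefaultConnectionCreator` is not a DataPrep class, the connection test is passed in as a `Predicate` stand-in, and `java.util.logging` is used only to keep the example free of external dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: skip default connections that fail the connection test.
public class DefaultConnectionCreator {
  private static final Logger LOG =
      Logger.getLogger(DefaultConnectionCreator.class.getName());

  // Returns the names of the connections that passed the test and would be created.
  static List<String> createDefaults(List<String> names, Predicate<String> connectionTest) {
    List<String> created = new ArrayList<>();
    for (String name : names) {
      try {
        if (connectionTest.test(name)) {
          created.add(name);
        } else {
          LOG.log(Level.SEVERE, "Skipping default connection {0}: connection test failed", name);
        }
      } catch (RuntimeException e) {
        // A failing test must not prevent the remaining default connections from being created.
        LOG.log(Level.SEVERE, "Skipping default connection " + name, e);
      }
    }
    return created;
  }

  public static void main(String[] args) {
    // Stand-in test: pretend only connections whose name starts with "gcs" are reachable.
    List<String> created =
        createDefaults(Arrays.asList("gcsDefault", "bqDefault"), n -> n.startsWith("gcs"));
    System.out.println("created: " + created);
  }
}
```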

API changes

New REST APIs

DataPrep

Path: v3/apps/dataprep/services/service/methods/connectionTypes
Method: GET
Description: Returns the list of connection types available and if they are visible or not
Request Body: none
Response Code:
200 - On success
500 - Any internal errors
Response:

Code Block
[ { "type":"gcs" }, { "type":"bigquery" }, { "type":"kafka" }, { "type":"s3" }, { "type":"database" } ]

CDAP

Edit : Not considered as this is covered as part of bootstrapping design

Path: /v3/apps/{app-id}
Method: PUT
Description:
If the request body is not specified, load the file specified by the CDAP property "{app-id}.config.path" to load the configuration.
Based on the artifactName in the config, find the latest artifact version for the artifact.
With the identified artifact and app config, create the application with id {app-id} in the configured namespace.
Finally, start the programs configured to be started; if they are already running, restart the services.
Request Body:

No Format
{
   "namespace":"default",
   "artifact":{
      "name":"wrangler-service",
      "version":"[3.0-4.0)",
      "scope":"SYSTEM"
   },
   "config":{
      "disabledTypes":[
         "file"
      ],
      "connections":[
         {
            "name":"gcsDefault",
            "type":"gcs",
            "properties":{
               "project":"cdap"
            }
         },
         {
            "name":"bqDefault",
            "type":"bigquery",
            "properties":{

            }
         }
      ]
   },
   "programs":[
      {
         "type":"Service",
         "name":"service"
      }
   ]
}

Response Code:
200 - On success
400 - Bad request, if the request body is empty and no config property is found for {app-id}.config.path
404 - Configured file doesn't exist, (or) namespace does not exist, (or) no artifacts found with the specified artifact name
500 - Any internal errors
Response:

Response-1
[
   {
      "appId": "dataprep",
      "programType": "Service",
      "programId": "service",
      "statusCode": 200
   },
   {
      "appId": "dataprep",
      "programType": "Service",
      "programId": "service2",
      "statusCode": 409,
      "error": "Already running"
   }
]

UI Impact

The UI for adding a new connection will have to change to display only the connection types in the GET : connectionTypes result list.
The data-prep UI side bar should display only the connection types returned by GET : connectionTypes.
An existing workspace created using a connection type that is currently disabled will result in an error from the backend; the UI should handle the error appropriately and surface it to the user.

Test Scenarios

Test ID	Test Description	Expected Results
1	Disable a connection type for which there is an existing connection and workspace in dataprep. Reload and restart the dataprep app.	Connections of the disabled connection type should be filtered out by the backend and not displayed in the UI. Accessing the workspace should produce an error stating that the connection type is disabled, and the error should be surfaced in the UI.
