Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

Simplify and improve the user experience for cloud-based connection types such as Google Cloud Storage, Amazon S3, and BigQuery when CDAP Data Prep runs in a cloud environment.

Goals

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, Data Prep must be automatically configured to include the relevant cloud connection types (Google Cloud Storage, BigQuery, or Amazon S3) based on the default credentials and project information, and must support browsing them by default.

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, Data Prep must hide connection types that are not relevant there, such as the file browser.

User Stories 

  • As a CDAP administrator, I want to configure DataPrep to have predefined connections
  • As a CDAP administrator, I want to configure which DataPrep connection types are available

Design

Background on Data Prep

Connection Type

Currently Data Prep has two sets of connection types.

  1. Preconfigured connection type - File browser, which is configured to browse the local file system on the CDAP sandbox and HDFS on a cluster.

  2. Configurable connection types - Database, Kafka, S3, GCS and Google BigQuery.

Connection

In Data Prep, users must explicitly create a connection for the configurable connection types such as databases, Kafka, GCS, S3, etc., in order to explore them.

Adding a connection:

Code Block
POST : connections/create
Body :
{
  "name" : "connection_name",
  "type" : "connection_type",
  "properties": {
    ...
  }
}
Response :
{
  "values": [
    "connection id"
  ],
  "count": 1,
  "status": 200,
  "message": "Success"
}


To add a connection, users provide the following:

  • Connection-name - required for all connections; currently the connection name has to be unique across all connections.

  • type - One of "upload", "file", "database", "table", "s3", "gcs", "bigquery", or "kafka".

  • Additional fields specific to the connection-type

Example - To add a Kafka connection, users have to provide

  • connection-name

  • Kafka broker host and port to connect to

    No Format
    {
      "name": "my-kafka",
      "type": "kafka",
      "properties": {
        "brokers": "localhost:9000",
        ...
      }
    }
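
As a rough illustration of the request above, the sketch below builds (but does not send) the POST for a Kafka connection using the JDK's `java.net.http` API (Java 11+). The class name `CreateConnectionRequest` and the `baseUrl` parameter are hypothetical; only the `connections/create` path, the `kafka` type and the `brokers` property come from this page.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of DataPrep.
final class CreateConnectionRequest {
  // Builds the POST request for the connections/create endpoint without sending it.
  // baseUrl is an assumed DataPrep service root supplied by the caller.
  // Values are inserted as-is, so they must not contain characters that need JSON escaping.
  static HttpRequest kafka(String baseUrl, String name, String brokers) {
    String body = String.format(
      "{\"name\": \"%s\", \"type\": \"kafka\", \"properties\": {\"brokers\": \"%s\"}}",
      name, brokers);
    return HttpRequest.newBuilder(URI.create(baseUrl + "/connections/create"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build();
  }
}
```

A caller would pass the request to an `HttpClient` and read the connection id from the "values" array of the response.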

New Features

Classifying default and configurable connection types:

  • DataPrep has to allow administrators to create default connections.
  • DataPrep has to allow administrators to disable connection types they wish to hide in DataPrep.

We will look into two approaches for supporting this in DataPrep.


Approach #1 Data-prep app config to specify connection type classification (Preferred)


Administrators can specify a data-prep config JSON specific to their environment.

The config JSON can be used to specify a list of ConnectionTypeConfig 

  • ConnectionTypeConfig can be used to specify that a connection needs to be created by default for the connection type and the properties for that connection.

  • ConnectionTypeConfig can be used to specify that a connection type needs to be hidden in DataPrep


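Code Block (Java): ConnectionConfig
import java.util.List;
import java.util.Set;

// Config is CDAP's application config base class; AbstractApplication requires T extends Config.
public class ConnectionTypeConfig extends Config {
  private Set<ConnectionType> types;     // connection types visible in DataPrep
  private List<Connection> connections;  // connections created by default
}


public class DataPrep extends AbstractApplication<ConnectionTypeConfig> {
  @Override
  public void configure() {
    // getConfig() returns the ConnectionTypeConfig supplied when the app is created
  }
}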



Code Block: example-gcp-data-prep-config
{
  "types": [ "gcs", "bigquery", "kafka", "database", "s3" ],
  "connections": [
    {
      "name": "gcsDefault",
      "type": "gcs",
      "properties": {
        "serviceFilePath": "~/gcp/service-config.json",
        "project": "cdap"
      }
    }, 
    {
      "name": "bqDefault",
      "type": "bigquery",
      "properties": { }
    },
    ...
  ]
}

Any connections specified in the config are created when the Wrangler service starts up. They cannot be deleted.

Example explanation :

  • The 'file' connector type is not in the list, so no file connections are allowed.
  • There is a default GCS connection that uses pre-defined credentials and project information as specified in the properties map. 
  • There is a default BigQuery connection without any properties. The BigQuery connection will read required credentials from the environment.
  • Kafka, Database and S3 connection types are configured to be visible, but no default connections are configured for them.
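
The interpretation above can be sketched in Java as follows. This is a minimal illustration, not the DataPrep implementation: `ConnectionSpec`, `DataPrepConfigSketch` and their methods are hypothetical stand-ins that mirror only the "types" and "connections" fields of the example config.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a configured default connection (name and type only).
final class ConnectionSpec {
  final String name;
  final String type;

  ConnectionSpec(String name, String type) {
    this.name = name;
    this.type = type;
  }
}

// Hypothetical view over the config JSON: enabled types plus default connections.
final class DataPrepConfigSketch {
  private final Set<String> types;
  private final List<ConnectionSpec> connections;

  DataPrepConfigSketch(Set<String> types, List<ConnectionSpec> connections) {
    this.types = types;
    this.connections = connections;
  }

  // A connection type is usable only if it is listed in "types".
  boolean isTypeEnabled(String type) {
    return types.contains(type);
  }

  // Enabled types for which no default connection is configured.
  Set<String> typesWithoutDefault() {
    Set<String> result = new LinkedHashSet<>(types);
    for (ConnectionSpec c : connections) {
      result.remove(c.type);
    }
    return result;
  }

  // Mirrors the example GCP config shown above.
  static DataPrepConfigSketch gcpExample() {
    return new DataPrepConfigSketch(
      new LinkedHashSet<>(Arrays.asList("gcs", "bigquery", "kafka", "database", "s3")),
      Arrays.asList(new ConnectionSpec("gcsDefault", "gcs"),
                    new ConnectionSpec("bqDefault", "bigquery")));
  }
}
```

With the example config, `isTypeEnabled("file")` is false, and `typesWithoutDefault()` yields kafka, database and s3.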

Surfacing the data-prep-config JSON to data-prep application

Data-prep config JSON can be specified through cdap-site.xml property

Option - 1

Code Block
<property>
<name>data.prep.config.json</name>
<value></value>
<description>
   Data prep config JSON content. This config specifies the connection types that are
   available by default and the connection types that have to be disabled in data-prep.
</description>
</property>


Example 

Code Block
<property>
<name>data.prep.config.json</name>
<value>
{
  "types": [ "gcs", "bigquery", "kafka", "database", "s3" ],
  "connections": [
    {
      "name": "gcsDefault",
      "type": "gcs",
      "properties": {
        "serviceFilePath": "~/gcp/service-config.json",
        "project": "cdap"
      }
    }, 
    {
      "name": "bqDefault",
      "type": "bigquery",
      "properties": { }
    },
    ...
  ]
}
</value>
</property>


Notes

  • We have to provide the JSON content instead of a JSON file path, as the CDAP UI service could be running on a different instance than the CDAP backend service.

  • Currently the data-prep application is deployed and run separately in each namespace. In the future, when data-prep is moved to the system namespace, it will be easier to configure and deploy the data-prep application using bootstrap config files during CDAP startup instead of a cdap-site.xml property.

  • Because the JSON is provided as the property value, the UI can read the content and use it directly as the application's config.


Option - 2

Since the UI is the one that creates the dataprep app, there can be a setting in the static UI configuration for the dataprep config object that should be used when creating the app. The UI did something similar for Tracker through "cdap-ui-config.json", so the structure should already be in place.


Approach #2 Data prep code to handle connection type classification (considered)

The Data-prep ConnectionType class will contain the logic that specifies, per connection type, the modes in which it is available by default and the modes in which it is not supported.


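Code Block (Java): ConnectionType.enum
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Mode (NATIVE, GCP, AWS) is assumed to be an enum defined elsewhere.
public enum ConnectionType {
  FILE("file", Collections.singletonList(Mode.NATIVE), Collections.singletonList(Mode.GCP)),
  DATABASE("database"),
  TABLE("table"),
  S3("s3", Collections.singletonList(Mode.AWS), Collections.emptyList()),
  GCS("gcs", Collections.singletonList(Mode.GCP), Collections.emptyList()),
  BIGQUERY("bigquery", Collections.singletonList(Mode.GCP), Collections.emptyList()),
  KAFKA("kafka");

  private final String type;
  Map<Mode, List<String>> defaultTypes = new HashMap<>();
  Map<Mode, List<String>> configurableType = new HashMap<>();

  // Types with no mode-specific behavior are neither default nor disabled in any mode.
  ConnectionType(String type) {
    this(type, Collections.<Mode>emptyList(), Collections.<Mode>emptyList());
  }

  ConnectionType(String type, List<Mode> defaultModes, List<Mode> disabledModes) {
    this.type = type;
    // logic to update default and configurable type map
  }
}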


Comparison of approaches

Flexibility

  • Approach-1: Because data-prep is configured through a config JSON specific to an environment, it is very flexible to use and configure across environments.
  • Approach-2: Not very flexible; changes in default connection types would require code changes.

Extensibility

  • Approach-1: It is extensible to new capabilities and offers administrators a way to disable connection types they would like to hide in data-prep in their cluster or environment.
  • Approach-2: Not highly extensible.

Ease of use

  • Approach-1: Administrators need to make appropriate changes in cdap-site to configure data-prep connection types for a certain environment.
  • Approach-2: No config changes required from the administrator.

Based on the above points, Approach #1 with Option 1 seems like the better way to configure the classification of connection types and connections.

Listing connections

The listing connections endpoint returns all the connections across connection types.

Code Block: Listing Connections
GET : connections?type=<connection-type>


If we want to allow deleting default connections, no changes are required in the list connections response. If we want to disallow it, we need to add a flag that indicates whether the connection can be deleted.

Example


Code Block
[
  {
    // existing fields
    "created" : 1533856371,
    "description" : "kafka connection",
    "id" : "kafka2",
    "name" : "Kafka2",
    "type" : "kafka",
    "updated" : 1533856371,

    // newly added fields (false for default connections)
    "canDelete" : true,
    "canEdit" : true
  },
  {
    // existing fields
    "created" : 1533856371,
    "description" : "GCS connection",
    "id" : "gcsdefault",
    "name" : "gcsdefault",
    "type" : "gcs",
    "updated" : 1533856371,

    // newly added fields
    // disabling delete and edit
    "canDelete" : false,
    "canEdit" : false
  },
  ...
]
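
One way the two flags could be derived is sketched below, assuming the service knows the ids of the connections declared in the config. `ConnectionFlags` and its `of` method are hypothetical names used only for illustration.

```java
import java.util.Set;

// Hypothetical holder for the proposed "canDelete" and "canEdit" response fields.
final class ConnectionFlags {
  final boolean canDelete;
  final boolean canEdit;

  private ConnectionFlags(boolean canDelete, boolean canEdit) {
    this.canDelete = canDelete;
    this.canEdit = canEdit;
  }

  // Default connections (those declared in the DataPrep config) are treated as read-only;
  // every other connection can be edited and deleted.
  static ConnectionFlags of(String connectionId, Set<String> defaultConnectionIds) {
    boolean userManaged = !defaultConnectionIds.contains(connectionId);
    return new ConnectionFlags(userManaged, userManaged);
  }
}
```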


Error Handling

If a default connection fails the connection test, it will not be created and the error will be logged. Administrators can look into the error log of the data-prep application to debug the issue.
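
The startup behavior described above could look roughly like the following sketch. It assumes the connection test is available as a predicate over the connection name; `DefaultConnectionBootstrap` and `createDefaults` are hypothetical names, and `java.util.logging` stands in for whatever logger the service actually uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical startup helper, not the Wrangler service code.
final class DefaultConnectionBootstrap {
  private static final Logger LOG =
    Logger.getLogger(DefaultConnectionBootstrap.class.getName());

  // Returns the names of the default connections that passed the test.
  // Connections whose test fails or throws are skipped and logged, never created.
  static List<String> createDefaults(List<String> connectionNames,
                                     Predicate<String> connectionTest) {
    List<String> created = new ArrayList<>();
    for (String name : connectionNames) {
      try {
        if (connectionTest.test(name)) {
          created.add(name);
        } else {
          LOG.warning("Skipping default connection '" + name + "': test connection failed");
        }
      } catch (RuntimeException e) {
        LOG.log(Level.SEVERE, "Skipping default connection '" + name + "'", e);
      }
    }
    return created;
  }
}
```

Skipping a failed connection instead of aborting keeps one bad credential from blocking the remaining default connections.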

API changes

New REST APIs in Data-Prep

Path: v3/apps/dataprep/services/service/methods/connectionTypes
Method: GET
Description: Returns the list of connection types available and if they are visible or not
Request Body: (none)
Response Code:

  • 200 - On success
  • 500 - Any internal errors

Response:

Code Block
[
   {
      "type":"gcs"
   },
   {
      "type":"bigquery"
   },
   {
      "type":"kafka"
   },
   {
      "type":"s3"
   },
   {
      "type":"database"
   }
]

Path: v3/apps/dataprep
Method: PUT
Description: Creates DataPrep app with configuration
Request Body:

No Format
{
   "artifact":{
      "name":"wrangler-service",
      "version":"3.2.0-SNAPSHOT",
      "scope":"system"
   },
   "config":{
      "types":[
         "gcs",
         "bigquery",
         "kafka",
         "database",
         "s3"
      ],
      "connections":[
         {
            "name":"gcsDefault",
            "type":"gcs",
            "properties":{
               "project":"cdap"
            }
         },
         {
            "name":"bqDefault",
            "type":"bigquery",
            "properties":{

            }
         }
      ]
   }
}

UI Impact

  • UI will have to change to only display the connection types returned in the GET connectionTypes result list.
  • UI will have to get the data-prep config set by the administrator through the property "data.prep.config.json" in "cdap-site.xml", using the existing REST endpoint "/config/cdap", and provide it as the app config while creating the data-prep application.
  • UI will have to update the add connection widget to support adding connections only for connection types that are marked visible.
  • UI will have to disable or hide the delete/edit buttons for a connection if the connection has the property "canDelete" or "canEdit" set to false.