Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

The schema associated with HDFS files may vary based on the set of wrangler directives applied. The current dataset implementation does not apply wrangler directives when exploring datasets or when reading records through a hydrator pipeline. 

Goals

Better user experience: Apply wrangler directives when data is read from datasets, both through explore and through hydrator pipelines. 

User Stories 

  • HDFS files from an existing dataset can be sampled and wrangled through wrangler. The final output schema and wrangler directives should be applied when exploring the dataset or when reading records from it through a hydrator pipeline. Any further changes to the wrangler directives or output schema should be reflected in explore queries and hydrator pipelines. 
  • Records from a Table dataset can be sampled and wrangled through wrangler. The final output schema and wrangler directives should be applied when exploring the dataset or when reading records from it through a hydrator pipeline. Any further changes to the wrangler directives or output schema should be reflected in explore queries and hydrator pipelines. 
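To illustrate the stories above, the sketch below replays stored directives over a record at read time. This is a toy interpreter supporting only two wrangler-style directives (`drop <field>` and `rename <old> <new>`); the real wrangler grammar and execution engine are far richer, and the field names used here are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DirectiveSketch {

    // Applies each stored directive, in order, to a copy of the record.
    // Only "drop <field>" and "rename <old> <new>" are supported here.
    static Map<String, Object> apply(List<String> directives, Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(record);
        for (String directive : directives) {
            String[] parts = directive.trim().split("\\s+");
            switch (parts[0]) {
                case "drop":
                    out.remove(parts[1]);
                    break;
                case "rename":
                    out.put(parts[2], out.remove(parts[1]));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown directive: " + directive);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("body", "raw line");
        record.put("ts", 1234L);
        // Replaying the directives yields a record matching the final output schema.
        System.out.println(apply(Arrays.asList("rename ts timestamp", "drop body"), record));
        // → {timestamp=1234}
    }
}
```

In this model, whenever the directives or output schema change, re-running the same replay over the raw records is what keeps explore queries and pipeline reads consistent with the latest wrangling.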

    Users of CDAP may already have existing data in HDFS or HBase. Today, the only way to bring that data into CDAP is to create a data pipeline or a CDAP app that re-processes the data into new CDAP datasets for further analysis. The ability to use existing data without re-processing it would provide a much better user experience and a good on-ramp for customers with large amounts of legacy data.

    Goals

    Ease of adoption: Allow users to leverage their existing data in HDFS or HBase without having to re-process it.

    Usability: Create datasets from existing data in HDFS or HBase and provide a great user experience.

    User Stories 

     

    • As a user, I would like to create a dataset from existing data on HDFS (or HBase).
    • As a user, I would like to apply a schema to a dataset created from existing data on HDFS (or HBase).
    • As a user, I would like to apply transformations to data existing on HDFS (or HBase) to derive data with a pre-defined schema.
    • As a user, I would like to run explore queries on a dataset created from existing data on HDFS.
    • As a user, I would like to use the dataset as a source in data pipelines.

    Design

    Cover the assumptions made, the design alternatives considered, and the high-level design.

    Approach

    Approach #1

    Approach #2

    API changes

    New Programmatic APIs

    New Java APIs introduced (both user facing and internal)

    Deprecated Programmatic APIs

    New REST APIs

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application
    Responses:
      200 - On success
      404 - When the application is not available
      500 - On any internal error
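    Assuming the endpoint above, a client request could be formed as in the sketch below. The host and port are deployment-specific assumptions (11015 is the default CDAP router port), and the application id is hypothetical.

```java
import java.net.URI;

public class AppSpecRequest {

    // Builds the request URI for the /v3/apps/<app-id> endpoint described above.
    // Host and port are assumptions; adjust them for your deployment.
    static URI appSpecUri(String host, int port, String appId) {
        return URI.create(String.format("http://%s:%d/v3/apps/%s", host, port, appId));
    }

    public static void main(String[] args) {
        // A GET on this URI returns 200 with the application spec,
        // 404 if the application does not exist, or 500 on internal errors.
        System.out.println(appSpecUri("localhost", 11015, "PurchaseHistory"));
        // → http://localhost:11015/v3/apps/PurchaseHistory
    }
}
```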

     

         

    Deprecated REST API

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application

    CLI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    UI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    Security Impact 

    What is the impact on authorization, and how does the design address it?

    Impact on Infrastructure Outages 

    Describe the system behavior on downstream component failures (for example, YARN or HBase outages), if applicable, and how the design handles them.

    Test Scenarios

    Test ID | Test Description | Expected Results

    Releases

    Release X.Y.Z

    Release X.Y.Z

    Related Work

    • Work #1
    • Work #2
    • Work #3

     

    Future work