Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

The schema associated with HDFS files may vary based on the set of wrangler directives applied. The current dataset implementation does not apply wrangler directives when exploring datasets or when reading records through a hydrator pipeline. 

Goals

Better user experience: Apply wrangler directives when data is read from datasets, both through explore and through hydrator pipelines. 

User Stories 

  • HDFS files from an existing dataset can be sampled and wrangled through wrangler. The final output schema and wrangler directives should be applied when exploring the dataset or when reading records from it through a hydrator pipeline. Any further changes to the wrangler directives or output schema should be reflected in explore queries and hydrator pipelines. 
  • Records from a Table dataset can be sampled and wrangled through wrangler. The final output schema and wrangler directives should be applied when exploring the dataset or when reading records from it through a hydrator pipeline. Any further changes to the wrangler directives or output schema should be reflected in explore queries and hydrator pipelines. 
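To illustrate the stories above, the sketch below replays stored directives over a record at read time. This is a toy interpreter supporting only two wrangler-style directives (`drop <field>` and `rename <old> <new>`); the real wrangler grammar and execution engine are far richer, and the field names used here are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DirectiveSketch {

    // Applies each stored directive, in order, to a copy of the record.
    // Only "drop <field>" and "rename <old> <new>" are supported here.
    static Map<String, Object> apply(List<String> directives, Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>(record);
        for (String directive : directives) {
            String[] parts = directive.trim().split("\\s+");
            switch (parts[0]) {
                case "drop":
                    out.remove(parts[1]);
                    break;
                case "rename":
                    out.put(parts[2], out.remove(parts[1]));
                    break;
                default:
                    throw new IllegalArgumentException("Unknown directive: " + directive);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("body", "raw line");
        record.put("ts", 1234L);
        // Replaying the directives yields a record matching the final output schema.
        System.out.println(apply(Arrays.asList("rename ts timestamp", "drop body"), record));
        // → {timestamp=1234}
    }
}
```

In this model, whenever the directives or output schema change, re-running the same replay over the raw records is what keeps explore queries and pipeline reads consistent with the latest wrangling.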

    Users of CDAP may already have existing data in HDFS or HBase. Today, the only way to bring that data into CDAP is to create a data pipeline or a CDAP app that re-processes the data into new CDAP datasets for further analysis. The ability to use existing data without re-processing it would provide a much better user experience and a good on-ramp for customers with large amounts of legacy data.

    Goals

    Ease of adoption: Allow users to leverage their existing data in HDFS or HBase without having to re-process it.

    Usability: Create datasets from existing data in HDFS or HBase and provide a great user experience.

    User Stories 

     

    • As a user, I would like to create a dataset from existing data on HDFS (or HBase).
    • As a user, I would like to apply a schema to a dataset created from existing data on HDFS (or HBase).
    • As a user, I would like to apply transformations to data existing on HDFS (or HBase) to derive data with a pre-defined schema.
    • As a user, I would like to run explore queries on a dataset created from existing data on HDFS.
    • As a user, I would like to use the dataset as a source in data pipelines.

    Design

    Cover the assumptions made, the design alternatives considered, and the high-level design.

    Approach

    Approach #1

    Approach #2

    API changes

    New Programmatic APIs

    New Java APIs introduced (both user facing and internal)

    Deprecated Programmatic APIs

    New REST APIs

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application
    Responses:
      200 - On success
      404 - When the application is not available
      500 - On any internal error
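    Assuming the endpoint above, a client request could be formed as in the sketch below. The host and port are deployment-specific assumptions (11015 is the default CDAP router port), and the application id is hypothetical.

```java
import java.net.URI;

public class AppSpecRequest {

    // Builds the request URI for the /v3/apps/<app-id> endpoint described above.
    // Host and port are assumptions; adjust them for your deployment.
    static URI appSpecUri(String host, int port, String appId) {
        return URI.create(String.format("http://%s:%d/v3/apps/%s", host, port, appId));
    }

    public static void main(String[] args) {
        // A GET on this URI returns 200 with the application spec,
        // 404 if the application does not exist, or 500 on internal errors.
        System.out.println(appSpecUri("localhost", 11015, "PurchaseHistory"));
        // → http://localhost:11015/v3/apps/PurchaseHistory
    }
}
```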

     

         

    Deprecated REST API

    Path: /v3/apps/<app-id>
    Method: GET
    Description: Returns the application spec for a given application

    CLI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    UI Impact or Changes

    • Impact #1
    • Impact #2
    • Impact #3

    Security Impact 

    What is the impact on authorization, and how does the design address it?

    Impact on Infrastructure Outages 

    Describe the system behavior on downstream component failures (for example, YARN or HBase outages), if applicable, and how the design handles them.

    Test Scenarios

    Test ID | Test Description | Expected Results

    Releases

    Release X.Y.Z

    Release X.Y.Z

    Related Work

    • Work #1
    • Work #2
    • Work #3

     

    Future work