 

Goal

 

In CDAP 4.0, the main theme for Datasets is establishing proper, semantically sound dataset management. That includes the management of dataset types (code) and of dataset instances (actual data) throughout their lifecycle. The current dataset framework has various shortcomings that need to be addressed. This document discusses each area of improvement, lists end-to-end use cases and requirements, and finally presents the design that implements those requirements.

Checklist

  •  User stories documented (Andreas)
  •  User stories reviewed (Nitin)
  •  User stories reviewed (Todd)
  •  Requirements documented (Andreas)
  •  Requirements reviewed
  •  Mockups built
  •  Design built
  •  Design accepted

...

Discussion

Dataset Type Management

Currently, the major areas of concern are:

...

It should also be possible to enable or disable Explore for a dataset at any time during its lifecycle. Enabling Explore is not always a simple creation of a Hive table. For example, for a partitioned file set, it involves adding all the partitions that the dataset already has, which can require a long-running process. Again, this is better implemented by the dataset type itself than by the platform, and we need APIs that allow custom dataset types to provide an implementation.
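To make the idea of a type-provided Explore implementation concrete, here is a minimal sketch in Java. Every name in it (`ExploreHandler`, `ExploreTable`, `PartitionedFileSetExplore`) is hypothetical and chosen for illustration only; none of them is an existing CDAP API, and the Hive table is replaced by an in-memory stand-in. The sketch only shows that enabling Explore for a partitioned file set means creating the table and then registering every partition that already exists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a Hive table; it only records what was done to it.
final class ExploreTable {
  private final List<String> partitions = new ArrayList<>();
  private boolean exists;

  void create() {
    exists = true;
  }

  void drop() {
    exists = false;
    partitions.clear();
  }

  void addPartition(String partitionKey) {
    if (!exists) {
      throw new IllegalStateException("Explore table has not been created");
    }
    partitions.add(partitionKey);
  }

  boolean exists() {
    return exists;
  }

  List<String> partitions() {
    return new ArrayList<>(partitions);
  }
}

// Hypothetical hook that a dataset type would implement; not an existing CDAP interface.
interface ExploreHandler {
  void enableExplore(ExploreTable table);

  void disableExplore(ExploreTable table);
}

// Assumed example type: a partitioned file set must register each partition it already has.
final class PartitionedFileSetExplore implements ExploreHandler {
  private final List<String> existingPartitions;

  PartitionedFileSetExplore(List<String> existingPartitions) {
    this.existingPartitions = new ArrayList<>(existingPartitions);
  }

  @Override
  public void enableExplore(ExploreTable table) {
    table.create();
    // With many partitions this loop is the potentially long-running part.
    for (String partitionKey : existingPartitions) {
      table.addPartition(partitionKey);
    }
  }

  @Override
  public void disableExplore(ExploreTable table) {
    table.drop();
  }
}
```

In this sketch the platform would only call `enableExplore` or `disableExplore`; what those calls involve is left entirely to the dataset type.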

Scenarios

Scenario 1: Dataset Type Used Only by a Single Application

...

It is virtually impossible to list all possible scenarios, but it is important to realize that any combination of the above scenarios must work seamlessly. For example, a dataset may be maintained by multiple apps and still be shared with many others. Or a dataset may be created through a Hydrator pipeline but shared with many other pipelines or apps. That also means that the simplest of use cases (Scenario 1) must be interoperable with the most complex one (Scenario 3). Finally, whenever different apps, pipelines, plugins, or app store artifacts attempt to create the same dataset with different types, with conflicting versions, or with some other incompatibility, CDAP must detect the conflict and report it back to the user in a clear and easy-to-read way.
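The conflict-detection requirement can be illustrated with a small sketch, again in Java. `DatasetRegistry` and `DatasetConflictException` are hypothetical names invented for this example and do not describe how CDAP implements dataset creation. The sketch assumes a dataset is identified by name and that two creators agree when they request the same type and the same version string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical exception type; the message is meant to be readable by an end user.
final class DatasetConflictException extends RuntimeException {
  DatasetConflictException(String message) {
    super(message);
  }
}

// Hypothetical registry that remembers who first created each dataset and with what type.
final class DatasetRegistry {
  private static final class Spec {
    final String owner;
    final String type;
    final String version;

    Spec(String owner, String type, String version) {
      this.owner = owner;
      this.type = type;
      this.version = version;
    }
  }

  private final Map<String, Spec> specs = new HashMap<>();

  void register(String owner, String dataset, String type, String version) {
    Spec existing = specs.get(dataset);
    if (existing == null) {
      specs.put(dataset, new Spec(owner, type, version));
      return;
    }
    if (!existing.type.equals(type)) {
      throw new DatasetConflictException(String.format(
          "Dataset '%s' was created by '%s' with type '%s', but '%s' requests type '%s'",
          dataset, existing.owner, existing.type, owner, type));
    }
    if (!existing.version.equals(version)) {
      throw new DatasetConflictException(String.format(
          "Dataset '%s' uses version '%s' of type '%s' (created by '%s'), but '%s' requests version '%s'",
          dataset, existing.version, type, existing.owner, owner, version));
    }
    // Same type and version: the dataset is shared, so there is nothing to do.
  }
}
```

Two apps that register the same dataset with the same type and version share it without error, while a third creator that requests a different type or version gets an exception naming both parties.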

User Stories

[DTM] Dataset Type Management

...