Introduction

In CDAP 4.0, the main theme for Datasets is improving/establishing proper and semantically sound dataset management. That includes the management of dataset types (code), and the management of dataset instances (actual data) throughout their life cycle. The current dataset framework has various shortcomings that need to be addressed. This document will discuss each area of improvement, list end-to-end use cases and requirements, and finally address the design to implement the requirements.

Discussion

Dataset Type Management

Currently, the major areas of concern are:

Injection of dataset code: The dataset framework allows deploying the code for a dataset type. However, that only applies to Explore: For use in applications, we require that the application includes the code for the dataset type (unless it is provided by the system). There is no way to ensure that multiple applications sharing a dataset all use a compatible version of the code.
Artifact management for dataset code is completely different from how it is done for application and plugin artifacts. We should unify that to create predictability about runtime class loading.
Versioning of dataset code: Similarly, when updating the code for a dataset type, it again only applies to Explore. For apps using that type, every app needs to be recompiled and repackaged with the new dataset code, and then redeployed. For a deployed app, there is no insight into what version of the code it is using. Also, if the owner of a dataset changes its format (or schema, etc.), he has no way to enforce that all apps use a version of the code that supports the new format.
Dataset types are only identified by their name. That is, two apps can have entirely different code (and semantics) for the same dataset type. If these two apps share a dataset of that type, data corruption is inevitable.
Only one version of the code can exist in the system at the same time. It is therefore not possible to deploy a new version of the code without immediately affecting all datasets of that type. Ideally, one could deploy multiple versions of the code which coexist; and the dataset instances can be upgraded/migrated one by one over time.
APIs to define a dataset type are complex: one must implement a Dataset class, a DatasetAdmin, and a DatasetDefinition.

Datasets Design 4.0

Introduction

Discussion

Dataset Type Management