Goal
In CDAP 4.0, the main theme for Datasets is establishing proper, semantically sound dataset management. This covers the management of dataset types (code) and of dataset instances (actual data) throughout their life cycle. The current dataset framework has various shortcomings that need to be addressed. This document discusses each area of improvement, lists end-to-end use cases and requirements, and finally presents the design that implements the requirements.
...
- User stories documented (Andreas)
- User stories reviewed (Nitin)
- User stories reviewed (Todd)
- Requirements documented (Andreas)
- Requirements Reviewed
- Mockups Built
- Design Built
- Design Accepted
...
- As a user, I want to specify as part of dataset configuration whether it is explorable.
- As a user, I do not want to specify the explore schema (and format) as separate properties if they can be derived from other standard dataset properties.
- As a user, I want to specify the explore schema separately (for example, only include a subset of the fields of a table, or name fields differently).
- As a user, I expect that dataset creation fails if the dataset cannot be enabled for explore.
- As a user, I expect that dataset reconfiguration fails if the corresponding update of the explore table fails.
- As a user, I expect that a dataset operation fails if it fails to make its required changes to explore.
- As a user, I expect that an update of explore never leads to silent loss of data (or data available for explore). If, for example, partitions would be dropped from the explore table, I want to have the option to either cancel the update, or to be notified of the drop and have a tool to bring explore in sync with the data.
- As a user, I want to enable explore for a dataset that was not configured for explore initially.
- As a user, I want to disable explore for a dataset that was configured for explore initially.
...
Requirements
[DTM] Dataset Type Management
- Unification of artifact management for plugins, apps and dataset types
- Explicit versioning of dataset types
- Coexistence of multiple versions of the same dataset type
- Explicit dependency of a dataset instance on a specific version of its type
- Explicit upgrade of a dataset instance to a new version of its type
- Preserve the easy experience of self-contained apps using an implicit versioning scheme
- Apps can bundle dataset types and create dataset instances of those types
- At runtime, such dataset types are loaded from the program jar
- Seamless experience when redeploying such an app
- Dataset Admin needs a new method to upgrade to a new version of the type
- This method can reject the upgrade
- Hydrator Plugins can also contain dataset types
- Ability to take a dataset instance offline for a migration procedure
- Injection of dataset code at runtime
- Not for dataset types embedded in app artifact (see 2.)
- Always inject the version of the type that the dataset instance is tagged with
- No noticeable performance degradation (some degradation is expected due to code injection at first instantiation of a type)
- Backward-compatibility with existing dataset modules
- Versioning for system dataset types
- Core types always use latest system version
- Composite types: TBD
- New dataset admin API for performing custom actions
- New versioned REST and CLI methods for versioned type and instance management
- Maven archetype for dataset artifacts
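The versioning requirements above (coexisting versions, an instance pinned to one version of its type, and an explicit upgrade that the dataset admin may reject) can be illustrated with a minimal in-memory sketch. All names here (`DatasetTypeRegistry`, `UpgradeCheck`, `implementationFor`) are hypothetical and chosen for illustration only; they are not part of the CDAP API, and integer versions are an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical admin hook: decides whether an instance may move between type versions. */
interface UpgradeCheck {
  boolean accept(int fromVersion, int toVersion);
}

/** Hypothetical in-memory registry; an illustration, not the CDAP API. */
final class DatasetTypeRegistry {
  // type name -> (version -> implementation class name); several versions coexist
  private final Map<String, TreeMap<Integer, String>> types = new HashMap<>();
  // each instance is tagged with its type and one specific version of that type
  private final Map<String, String> instanceType = new HashMap<>();
  private final Map<String, Integer> instanceVersion = new HashMap<>();

  void addType(String type, int version, String implClass) {
    types.computeIfAbsent(type, k -> new TreeMap<>()).put(version, implClass);
  }

  void createInstance(String instance, String type, int version) {
    requireVersion(type, version);
    instanceType.put(instance, type);
    instanceVersion.put(instance, version);
  }

  /** Always resolves the code for the version the instance is tagged with. */
  String implementationFor(String instance) {
    String type = requireInstance(instance);
    return types.get(type).get(instanceVersion.get(instance));
  }

  /** Explicit upgrade; the check may reject it, in which case nothing changes. */
  void upgrade(String instance, int toVersion, UpgradeCheck check) {
    String type = requireInstance(instance);
    requireVersion(type, toVersion);
    int from = instanceVersion.get(instance);
    if (!check.accept(from, toVersion)) {
      throw new IllegalStateException("Upgrade rejected: v" + from + " -> v" + toVersion);
    }
    instanceVersion.put(instance, toVersion);
  }

  private String requireInstance(String instance) {
    String type = instanceType.get(instance);
    if (type == null) {
      throw new IllegalArgumentException("Unknown instance: " + instance);
    }
    return type;
  }

  private void requireVersion(String type, int version) {
    TreeMap<Integer, String> versions = types.get(type);
    if (versions == null || !versions.containsKey(version)) {
      throw new IllegalArgumentException("Unknown type version: " + type + " v" + version);
    }
  }
}
```

In this sketch, deploying a second version of a type leaves existing instances on the version they were created with until `upgrade` is called and accepted.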
[DIC] Dataset Instance Configuration
- New dataset API to retrieve the properties accepted by a type
- what the accepted values are
- whether they are mutable
- whether they are required
- what the default value is
- Schema as a standardized system property
- Validation of schema
- Specify schema in Avro or SQL style
- All system datasets to use new schema property
- New API to update or remove a single property of a dataset
- Ability to "merge" new properties into a dataset's existing properties; the merge fails if it would change the value of an existing property
- Dataset Management Operations are atomic
- Always leave behind a consistent state
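As a rough illustration of the configuration requirements above, the following sketch shows a property descriptor, a single-property update that respects mutability, and a "merge" that fails instead of changing an existing value. The class and method names (`PropertySpec`, `DatasetProperties.merge`, `DatasetProperties.update`) are assumptions made for this example, not an existing API. Both operations build a new map and leave the input untouched, so a failure leaves the previous state behind unchanged.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical descriptor of one property accepted by a dataset type. */
final class PropertySpec {
  final String name;
  final boolean required;
  final boolean mutable;
  final String defaultValue; // null means "no default"

  PropertySpec(String name, boolean required, boolean mutable, String defaultValue) {
    this.name = name;
    this.required = required;
    this.mutable = mutable;
    this.defaultValue = defaultValue;
  }
}

/** Hypothetical helpers; each returns a new map and never modifies its inputs. */
final class DatasetProperties {
  private DatasetProperties() { }

  /** Adds incoming properties; fails if any would change an existing value. */
  static Map<String, String> merge(Map<String, String> existing, Map<String, String> incoming) {
    Map<String, String> result = new HashMap<>(existing);
    for (Map.Entry<String, String> e : incoming.entrySet()) {
      String current = existing.get(e.getKey());
      if (current != null && !current.equals(e.getValue())) {
        throw new IllegalArgumentException("Merge would change property: " + e.getKey());
      }
      result.put(e.getKey(), e.getValue());
    }
    return Collections.unmodifiableMap(result);
  }

  /** Updates a single property; rejects unknown keys and changes to immutable ones. */
  static Map<String, String> update(Map<String, String> existing, Map<String, PropertySpec> specs,
                                    String key, String value) {
    PropertySpec spec = specs.get(key);
    if (spec == null) {
      throw new IllegalArgumentException("Property not accepted by type: " + key);
    }
    String current = existing.get(key);
    if (!spec.mutable && current != null && !current.equals(value)) {
      throw new IllegalArgumentException("Property is immutable: " + key);
    }
    Map<String, String> result = new HashMap<>(existing);
    result.put(key, value);
    return Collections.unmodifiableMap(result);
  }
}
```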
[EI] Explore Integration
- Simplification of explore configuration
- Whether explore is enabled is an explicit property
- All other explore properties derived from dataset properties if possible
- Explore failure also fails the DTM operation that called it
- Ability to communicate warnings to the user for successful explore operations
- Enable/Disable explore as dataset management operations
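One possible reading of the explore requirements above is sketched here: explore is switched on by an explicit property, the explore schema is derived from the standard schema property unless given separately, and a failure in explore fails the dataset operation and leaves no partial state behind. The property keys (`explore.enabled`, `explore.schema`, `schema`) and the types `ExploreClient` and `DatasetManager` are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the explore service. */
interface ExploreClient {
  void enableExplore(String dataset, String schema) throws Exception;
}

/** Hypothetical manager; an illustration of the failure semantics, not the CDAP API. */
final class DatasetManager {
  // Assumed property keys, not actual CDAP property names.
  static final String EXPLORE_ENABLED = "explore.enabled";
  static final String EXPLORE_SCHEMA = "explore.schema";
  static final String SCHEMA = "schema";

  private final Map<String, Map<String, String>> instances = new HashMap<>();
  private final ExploreClient explore;

  DatasetManager(ExploreClient explore) {
    this.explore = explore;
  }

  void create(String name, Map<String, String> props) {
    if (instances.containsKey(name)) {
      throw new IllegalStateException("Dataset already exists: " + name);
    }
    instances.put(name, new HashMap<>(props));
    if (!Boolean.parseBoolean(props.get(EXPLORE_ENABLED))) {
      return; // explore was not requested, nothing more to do
    }
    // Use the separate explore schema if given, otherwise derive it from the dataset schema.
    String schema = props.containsKey(EXPLORE_SCHEMA) ? props.get(EXPLORE_SCHEMA) : props.get(SCHEMA);
    try {
      if (schema == null) {
        throw new IllegalArgumentException("No schema to derive the explore table from");
      }
      explore.enableExplore(name, schema);
    } catch (Exception e) {
      instances.remove(name); // roll back so that a consistent state is left behind
      throw new IllegalStateException("Cannot enable explore for dataset " + name, e);
    }
  }

  boolean exists(String name) {
    return instances.containsKey(name);
  }
}
```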