...
- Injection of dataset code: The dataset framework allows deploying the code for a dataset type. However, that only applies to Explore: For use in applications, we require that the application includes the code for the dataset type (unless it is provided by the system). There is no way to ensure that multiple applications sharing a dataset all use a compatible version of the code.
- Artifact management for dataset code is completely different from how it is done for application and plugin artifacts. We should unify that to create predictability about runtime class loading.
- Versioning of dataset code: Similarly, when updating the code for a dataset type, it again only applies to Explore. For apps using that type, every app needs to be recompiled and repackaged with the new dataset code, and then redeployed. For a deployed app, there is no insight into what version of the code it is using. Also, if the owner of a dataset changes its format (or schema, etc.), he has no way to enforce that all apps use a version of the code that supports the new format.
- Dataset types are only identified by their name. That is, two apps can have entirely different code (and semantics) for the same dataset type. If these two apps share a dataset of that type, data corruption is inevitable.
- Only one version of the code can exist in the system at the same time. It is therefore not possible to deploy a new version of the code without immediately affecting all datasets of that type. Ideally, one could deploy multiple versions of the code which coexist; and the dataset instances can be upgraded/migrated one by one over time.
- APIs to define a dataset type are complex: one must implement a Dataset class, a DatasetAdmin, and a DatasetDefinition.
Dataset Instance Management
The two areas of concern here are configuration and management of datasets over their lifetime.
Dataset Configuration
A dataset instance is configured by passing a set of properties (that is, string-to-string pairs) to the configure() method of the dataset type. However:
- Common properties such as schema are not standardized across dataset types
- There is no way (other than reading documentation) to find out what properties a dataset type accepts. For a wizard-driven UI we would need a programmatic API to list all config. For plugins and apps, we have a very good way to include that in the implementation of the plugin. Datasets should have something similar.
- Reconfiguration of a dataset can be problematic. Sometimes the change of a property is not compatible with existing data in a dataset (for example, changing the schema). There is no easy way to find out what properties can be changed.
- Also, a reconfiguration may require a data migration or other long-running process to implement the change. The current dataset framework has no APIs to implement that.
Dataset Management over its life time
The dataset framework defines five administrative APIs: create(), exists(), drop(), truncate() and update() (and upgrade() which is broken). However, many dataset types have specific administrative procedures that are not common across types. For example, an HBase table may require compaction, which is not supported by other dataset types. We need a way to implement such actions as part of the dataset administration interface.
Also, the current implementation of dataset admin execution is not transactional: If it fails, it may leave behind partial artifacts of data creation. For example, if a composite dataset embeds two datasets, creation of the first succeeds, but the second fails, then the first one remains as a leftover in the the physical storage - without any clue in CDAP meta data about its existence. Similar for dropping and reconfiguring datasets.
Explore Integration
This is related to configuration but goes beyond that. To begin with, the configuration of how a dataset is made is explorable is separate from the rest of the dataset configuration, and every dataset may use a different set of properties. For example, a Table requires a schema and a rowkey property to make it explorable, whereas a file set requires a format and an exploreSchema. As a consequence, enabling explore is implemented in the platform (explore service) code, which has special treatment for all known types of explorable datasets. Instead, it would make more sense to delegate the generation of Hive DDL commands to the dataset type code: each dataset type implementation knows exactly how to create a corresponding Hive table. At the same time, we should standardize on a set of explore properties that are used across all dataset types, for example, the schema.
It should also be possible to enable or disable Explore for a dataset at any time during its lifecycle. That is not always a simple creation of a Hive table. For example, for a partitioned file set, this involves adding all the partitions that the dataset already has, and that can require a long running process. Again, this is better implemented by the dataset type itself than by the platform, and we need APIs that allow custom dataset types to provide an implementation.