...
The dataset framework defines five administrative APIs: create(), exists(), drop(), truncate(), and update() (plus upgrade(), which is broken). However, many dataset types have specific administrative procedures that are not common across types. For example, an HBase table may require compaction, which is not supported by other dataset types. We need a way to implement such actions as part of the dataset administration interface, with these goals:
- In the simple case, the app should only need to define the Dataset API itself (similar to the current AbstractDataset)
- If a dataset type requires special administrative operations (say, "rebalance"), then this operation can be performed from the app itself, as well as through REST/CLI/UI.
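One way to picture type-specific admin operations is a small extension point that dispatches on an operation name, so that the app, REST, CLI, and UI can all invoke it the same way. The following is a minimal sketch only: `CustomAdminOperations`, `TableLikeAdmin`, and the operation name `"compact"` are hypothetical names chosen for illustration and are not part of the existing dataset framework.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

/** Hypothetical extension point: named admin operations beyond the common ones. */
interface CustomAdminOperations {
  /** Names of the type-specific operations this dataset type offers. */
  Set<String> supportedOperations();

  /** Runs the named operation; callers could be the app itself or a REST/CLI/UI handler. */
  void execute(String operation, Map<String, String> arguments);
}

/** Illustrative admin for a table-like dataset that supports compaction only. */
final class TableLikeAdmin implements CustomAdminOperations {
  private final Map<String, Consumer<Map<String, String>>> operations = new HashMap<>();
  private int compactions; // stand-in for real compaction work

  TableLikeAdmin() {
    operations.put("compact", args -> compactions++);
  }

  @Override
  public Set<String> supportedOperations() {
    return Collections.unmodifiableSet(operations.keySet());
  }

  @Override
  public void execute(String operation, Map<String, String> arguments) {
    Consumer<Map<String, String>> handler = operations.get(operation);
    if (handler == null) {
      // Operations that this type does not define are rejected explicitly.
      throw new UnsupportedOperationException("Unsupported admin operation: " + operation);
    }
    handler.accept(arguments);
  }

  int compactionCount() {
    return compactions;
  }
}
```

Dispatching by name keeps the common interface small while still letting a generic endpoint list and invoke whatever a given type supports.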
Also, the current implementation of dataset admin execution is not transactional: if it fails, it may leave behind partial artifacts of data creation. For example, if a composite dataset embeds two datasets and creation of the first succeeds but the second fails, then the first one remains as a leftover in the physical storage, without any record of its existence in CDAP metadata. The same applies to dropping and reconfiguring datasets.
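The composite-dataset failure described above could be avoided by undoing completed steps when a later step fails. The sketch below illustrates that idea with compensating drops; it is not a true transaction, and `SimpleAdmin`, `InMemoryAdmin`, and `CompositeAdmin` are hypothetical, simplified stand-ins rather than the framework's actual admin classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified admin interface used only for this sketch. */
interface SimpleAdmin {
  void create() throws Exception;
  void drop() throws Exception;
}

/** Test double: "physical storage" is just a set of dataset names. */
final class InMemoryAdmin implements SimpleAdmin {
  private final String name;
  private final Set<String> storage;
  private final boolean failOnCreate;

  InMemoryAdmin(String name, Set<String> storage, boolean failOnCreate) {
    this.name = name;
    this.storage = storage;
    this.failOnCreate = failOnCreate;
  }

  @Override
  public void create() throws Exception {
    if (failOnCreate) {
      throw new Exception("Cannot create " + name);
    }
    storage.add(name);
  }

  @Override
  public void drop() {
    storage.remove(name);
  }
}

/** Creates embedded datasets in order; on failure, drops the ones already created. */
final class CompositeAdmin implements SimpleAdmin {
  private final List<SimpleAdmin> embedded;

  CompositeAdmin(List<SimpleAdmin> embedded) {
    this.embedded = new ArrayList<>(embedded);
  }

  @Override
  public void create() throws Exception {
    Deque<SimpleAdmin> created = new ArrayDeque<>();
    try {
      for (SimpleAdmin admin : embedded) {
        admin.create();
        created.push(admin); // most recently created first
      }
    } catch (Exception e) {
      // Compensate in reverse creation order; keep the original failure as the cause.
      for (SimpleAdmin admin : created) {
        try {
          admin.drop();
        } catch (Exception dropFailure) {
          e.addSuppressed(dropFailure);
        }
      }
      throw e;
    }
  }

  @Override
  public void drop() throws Exception {
    for (SimpleAdmin admin : embedded) {
      admin.drop();
    }
  }
}
```

If a compensating drop itself fails, leftovers are still possible, so a complete design would also need to record intent in metadata before touching storage.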
Explore Integration
This is related to configuration but goes beyond that. To begin with, the configuration of how a dataset is made explorable is separate from the rest of the dataset configuration, and every dataset may use a different set of properties. For example, a Table requires a schema and a rowkey property to make it explorable, whereas a file set requires a format and an exploreSchema. As a consequence, enabling explore is implemented in the platform (explore service) code, which has special treatment for all known types of explorable datasets. Instead, it would make more sense to delegate the generation of Hive DDL commands to the dataset type code: each dataset type implementation knows exactly how to create a corresponding Hive table. At the same time, we should standardize on a set of explore properties that are used across all dataset types, for example, the schema.
It should also be possible to enable or disable Explore for a dataset at any time during its lifecycle. That is not always a simple creation of a Hive table. For example, for a partitioned file set, this involves adding all the partitions that the dataset already has, and that can require a long-running process. Again, this is better implemented by the dataset type itself than by the platform, and we need APIs that allow custom dataset types to provide an implementation.
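As a rough illustration of delegating DDL generation to the dataset type, the sketch below lets a partitioned, file-based type produce the Hive statements needed to enable and disable Explore, including one statement per existing partition. Everything named here is an assumption made for the example: the `ExploreHandler` interface, the property keys `schema` and `format`, the single string partition column `batch`, and the convention that the schema property holds a Hive column list. Values are concatenated without escaping, which real code would need to handle.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical SPI: the dataset type, not the platform, produces the Hive DDL. */
interface ExploreHandler {
  List<String> enableExplore(String tableName, Map<String, String> properties);
  List<String> disableExplore(String tableName);
}

/** Illustrative handler for a file-based dataset partitioned by one string column. */
final class PartitionedFilesExploreHandler implements ExploreHandler {
  // Existing partitions: partition value -> storage location (insertion order kept).
  private final Map<String, String> partitionLocations;

  PartitionedFilesExploreHandler(Map<String, String> partitionLocations) {
    this.partitionLocations = new LinkedHashMap<>(partitionLocations);
  }

  @Override
  public List<String> enableExplore(String tableName, Map<String, String> properties) {
    // "schema" and "format" are assumed standardized explore properties.
    String schema = require(properties, "schema");
    String format = require(properties, "format");
    List<String> statements = new ArrayList<>();
    statements.add("CREATE EXTERNAL TABLE IF NOT EXISTS " + tableName
        + " (" + schema + ") PARTITIONED BY (batch STRING) STORED AS " + format);
    // Enabling later in the lifecycle must also register partitions that already exist.
    for (Map.Entry<String, String> partition : partitionLocations.entrySet()) {
      statements.add("ALTER TABLE " + tableName + " ADD IF NOT EXISTS PARTITION (batch='"
          + partition.getKey() + "') LOCATION '" + partition.getValue() + "'");
    }
    return statements;
  }

  @Override
  public List<String> disableExplore(String tableName) {
    return Collections.singletonList("DROP TABLE IF EXISTS " + tableName);
  }

  private static String require(Map<String, String> properties, String key) {
    String value = properties.get(key);
    if (value == null) {
      throw new IllegalArgumentException("Missing explore property: " + key);
    }
    return value;
  }
}
```

Returning a list of statements rather than executing them leaves the platform free to run them as a long-running, resumable job when a dataset has many partitions.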
Scenarios
Scenario 1. Dataset Type Used Only by a Single Application
This can almost be viewed as a private utility class of that application, except that the dataset may be explorable, and the dataset type's code and configuration may evolve over time along with the application. This is also the simplest and most common use case, and we want to make it very easy, as follows:
- Dataset Type code is part of the application
- Upon deployment of the app, the dataset type is also deployed, and the dataset(s) of this type can be created as part of the same deployment step.
- When the app is redeployed, the dataset type is updated to the latest version of the code, and so are the datasets of this type.
- The developer/devops never needs to worry explicitly about versioning of the dataset or manually upgrading a dataset.
- Explore works seamlessly: It always picks up the latest version of the dataset code.
- If there are multiple versions of the application artifact (see Application Versioning Design), each application uses the version of the dataset type defined by its version of the artifact.
Scenario 2. Dataset Type Shared by Multiple Applications, no Data Sharing
This case is very similar to scenario 1. However, we need to solve the problem of distributing the code of the dataset type: in scenario 1, we would simply include it in the application code, but now this code is shared between multiple apps. Including the code in each app would mean code duplication and, over time, divergence. If divergence is desired (which is possible), then it is wiser to simply use different type names in each app, so that we have multiple instances of scenario 1. However, in most cases it will be desirable to share one implementation of the dataset code across all apps. There are two major alternatives:
- The dataset type is implemented as a separate library that is available as a Maven dependency to both apps:
- Both apps include this dataset type in their jar
- Every time one of the two apps is deployed, the dataset type is updated to that version of the code.
- The problem with this is that one application may use an older version of the dataset code than the one currently deployed. In that case:
- The update of the dataset type overrides the type's code with an outdated version.
- Because this code is used by Explore, queries for datasets created with a newer version of the code may not work any more.
- However, for ease of use, it should be possible for the developer(s) to deploy either app at any time without impacting other apps using the same dataset type.
- This is similar to the case of scenario 1, where multiple versions of the same dataset type coexist in different versions of the app artifact.
- The dataset type has an interface and an implementation:
- The interface is available to developers as a Maven dependency, whereas the implementation is deployed as a separate artifact in the dataset framework.
- In order to compile and package their apps, developers only need the interface.
- At runtime, CDAP injects the implementation of the dataset type into the programs.
- This means that the dataset type is not bundled with the apps any longer, and the deployment of an app has no effect on the code of a dataset type.
- However, it means increased complexity for app and dataset developers: both the interface in Maven and the dataset module in CDAP must be kept in sync.
- Note that this approach allows for separation of roles and skills in a larger organization: dataset types can be developed and deployed independently from applications.
This scenario suggests that we need some kind of versioning for dataset types (and with that, dataset instances are bound to a specific version of the type).
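To make the versioning idea concrete, the following sketch keys each deployed type by name and version and binds every dataset instance to one such key, so deploying an older version of a type cannot override the code that existing instances resolve to. `TypeVersion`, `VersionedTypeRegistry`, and the use of an implementation class name as the stored value are all hypothetical simplifications, not an existing CDAP API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical key: a dataset type name together with a version. */
final class TypeVersion {
  final String typeName;
  final String version;

  TypeVersion(String typeName, String version) {
    this.typeName = Objects.requireNonNull(typeName);
    this.version = Objects.requireNonNull(version);
  }

  @Override
  public boolean equals(Object other) {
    if (!(other instanceof TypeVersion)) {
      return false;
    }
    TypeVersion that = (TypeVersion) other;
    return typeName.equals(that.typeName) && version.equals(that.version);
  }

  @Override
  public int hashCode() {
    return Objects.hash(typeName, version);
  }
}

/** Illustrative registry in which versions of one type coexist side by side. */
final class VersionedTypeRegistry {
  // (type, version) -> implementation class name, standing in for the deployed code.
  private final Map<TypeVersion, String> implementations = new HashMap<>();
  // dataset instance name -> the exact type version it was created with.
  private final Map<String, TypeVersion> instances = new HashMap<>();

  void deployType(String typeName, String version, String implementationClass) {
    implementations.put(new TypeVersion(typeName, version), implementationClass);
  }

  void createInstance(String instanceName, String typeName, String version) {
    TypeVersion key = new TypeVersion(typeName, version);
    if (!implementations.containsKey(key)) {
      throw new IllegalArgumentException("Unknown type " + typeName + " version " + version);
    }
    instances.put(instanceName, key);
  }

  String implementationFor(String instanceName) {
    TypeVersion key = instances.get(instanceName);
    if (key == null) {
      throw new IllegalArgumentException("Unknown dataset instance: " + instanceName);
    }
    return implementations.get(key);
  }
}
```

With this binding, an app that ships an older library version adds another entry instead of replacing the newer one, and Explore can load the code that matches each instance.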
Scenario 3. A Dataset is Maintained by a Single Organization and Shared with Many Applications
Scenario 1 can still be kept very simple by using implicit versioning (for example, using the artifact's version as the dataset type version).