Introduction
In CDAP 4.0, the main theme for Datasets is improving/establishing proper and semantically sound dataset management. That includes the management of dataset types (code), and the management of dataset instances (actual data) throughout their life cycle. The current dataset framework has various shortcomings that need to be addressed. This document will discuss each area of improvement, list end-to-end use cases and requirements, and finally address the design to implement the requirements.
Discussion
Dataset Type Management
Currently, the major areas of concern are:
- Injection of dataset code: The dataset framework allows deploying the code for a dataset type. However, that only applies to Explore: For use in applications, we require that the application includes the code for the dataset type (unless it is provided by the system). There is no way to ensure that multiple applications sharing a dataset all use a compatible version of the code.
- Artifact management for dataset code is completely different from how it is done for application and plugin artifacts. We should unify that to create predictability about runtime class loading.
- Versioning of dataset code: Similarly, when updating the code for a dataset type, it again only applies to Explore. For apps using that type, every app needs to be recompiled and repackaged with the new dataset code, and then redeployed. For a deployed app, there is no insight into what version of the code it is using. Also, if the owner of a dataset changes its format (or schema, etc.), he has no way to enforce that all apps use a version of the code that supports the new format.
- Dataset types are only identified by their name. That is, two apps can have entirely different code (and semantics) for the same dataset type. If these two apps share a dataset of that type, data corruption is inevitable.
- Only one version of the code can exist in the system at the same time. It is therefore not possible to deploy a new version of the code without immediately affecting all datasets of that type. Ideally, one could deploy multiple versions of the code which coexist; and the dataset instances can be upgraded/migrated one by one over time.
- APIs to define a dataset type are complex: one must implement a Dataset class, a DatasetAdmin, and a DatasetDefinition.
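The difference between identifying a type by name only and identifying it by name and version can be illustrated with a small sketch. This is not CDAP code: TypeRegistrySketch, its methods, and the use of a plain string as a stand-in for "the code" are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these names are part of the CDAP API.
final class TypeRegistrySketch {
  // Name-only registry: a later deployment silently replaces the earlier code.
  private final Map<String, String> byName = new HashMap<>();
  // Name-and-version registry: several versions of the same type can coexist.
  private final Map<String, Map<String, String>> byNameAndVersion = new HashMap<>();

  /** Registers code under the name only; returns the code that was replaced, or null. */
  String deployByName(String typeName, String codeId) {
    return byName.put(typeName, codeId);
  }

  /** Registers code under (name, version); rejects different code for an existing version. */
  void deployVersioned(String typeName, String version, String codeId) {
    Map<String, String> versions =
        byNameAndVersion.computeIfAbsent(typeName, k -> new HashMap<>());
    String existing = versions.putIfAbsent(version, codeId);
    if (existing != null && !existing.equals(codeId)) {
      throw new IllegalStateException(
          "Type " + typeName + " version " + version + " already exists with different code");
    }
  }

  /** Returns the code registered for (name, version), or null if there is none. */
  String lookup(String typeName, String version) {
    Map<String, String> versions = byNameAndVersion.get(typeName);
    return versions == null ? null : versions.get(version);
  }
}
```

With the name-only map, a second deployByName call for the same type overwrites the first without any error, which is the data-corruption risk described above. With the versioned map, both versions remain available and a conflicting redeployment is rejected.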
Dataset Instance Management
The two areas of concern here are configuration and management of datasets over their lifetime.
Dataset Configuration
A dataset instance is configured by passing a set of properties (that is, string-to-string pairs) to the configure() method of the dataset type. However:
- Common properties such as schema are not standardized across dataset types
- There is no way (other than reading documentation) to find out what properties a dataset type accepts. For a wizard-driven UI we would need a programmatic API to list all config. For plugins and apps, we have a very good way to include that in the implementation of the plugin. Datasets should have something similar.
- Reconfiguration of a dataset can be problematic. Sometimes the change of a property is not compatible with existing data in a dataset (for example, changing the schema). There is no easy way to find out what properties can be changed.
- Also, a reconfiguration may require a data migration or other long-running process to implement the change. The current dataset framework has no APIs to implement that.
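One possible shape for a programmatic description of properties is sketched below. It assumes that a dataset type declares each property with a default and an "updatable" flag; PropertySpec and ConfigValidator are hypothetical names and do not exist in CDAP.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; PropertySpec and ConfigValidator are illustrative names only.
final class PropertySpec {
  final String name;
  final String defaultValue; // null means "no default"
  final boolean updatable;   // may the value change once the dataset exists?

  PropertySpec(String name, String defaultValue, boolean updatable) {
    this.name = name;
    this.defaultValue = defaultValue;
    this.updatable = updatable;
  }
}

final class ConfigValidator {
  private final Map<String, PropertySpec> specs = new LinkedHashMap<>();

  ConfigValidator(List<PropertySpec> declared) {
    for (PropertySpec spec : declared) {
      specs.put(spec.name, spec);
    }
  }

  /** Lists the declared properties, for example to render a form in a wizard-driven UI. */
  List<PropertySpec> describe() {
    return new ArrayList<>(specs.values());
  }

  /** Returns the requested properties that are unknown, or that changed but are not updatable. */
  List<String> rejectedChanges(Map<String, String> current, Map<String, String> requested) {
    List<String> rejected = new ArrayList<>();
    for (Map.Entry<String, String> e : requested.entrySet()) {
      PropertySpec spec = specs.get(e.getKey());
      boolean changed = !Objects.equals(e.getValue(), current.get(e.getKey()));
      if (spec == null || (changed && !spec.updatable)) {
        rejected.add(e.getKey());
      }
    }
    return rejected;
  }
}
```

A real design would also need to express allowed values and compatibility rules (for example, a schema change that only adds fields), which this sketch leaves out.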
Dataset Management over its Lifetime
The dataset framework defines five administrative APIs: create(), exists(), drop(), truncate() and update() (and upgrade() which is broken). However, many dataset types have specific administrative procedures that are not common across types. For example, an HBase table may require compaction, which is not supported by other dataset types. We need a way to implement such actions as part of the dataset administration interface.
- In the simple case, the app should only need to define the Dataset API itself (similar to the current AbstractDataset)
- If a dataset type requires special administrative operations (say, "rebalance"), then this operation can be performed from the app itself, as well as through REST/CLI/UI.
Also, the current implementation of dataset admin execution is not transactional: If it fails, it may leave behind partial artifacts of data creation. For example, if a composite dataset embeds two datasets and creation of the first succeeds but creation of the second fails, then the first one remains as a leftover in the physical storage, without any clue in CDAP metadata about its existence. The same applies to dropping and reconfiguring datasets.
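The composite-dataset failure described above could be avoided by undoing completed steps when a later step fails. The following is a minimal sketch of that idea using compensation (it is not a true transaction, and it does not cover CDAP metadata). EmbeddedAdmin is a hypothetical stand-in, not the CDAP DatasetAdmin interface.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for the admin of one embedded dataset.
interface EmbeddedAdmin {
  void create() throws Exception;
  void drop() throws Exception;
}

final class CompositeCreateSketch {
  /** Creates all embedded datasets; if one fails, drops those that were already created. */
  static void createAll(List<EmbeddedAdmin> admins) throws Exception {
    Deque<EmbeddedAdmin> created = new ArrayDeque<>();
    try {
      for (EmbeddedAdmin admin : admins) {
        admin.create();
        created.push(admin);
      }
    } catch (Exception e) {
      // Undo in reverse order of creation; keep cleanup failures attached to the original error.
      while (!created.isEmpty()) {
        try {
          created.pop().drop();
        } catch (Exception dropFailure) {
          e.addSuppressed(dropFailure);
        }
      }
      throw e;
    }
  }
}
```

If the cleanup itself fails, leftovers can still remain, so a complete design would additionally need to record pending operations somewhere durable.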
Explore Integration
This is related to configuration but goes beyond that. To begin with, the configuration of how a dataset is made explorable is separate from the rest of the dataset configuration, and every dataset may use a different set of properties. For example, a Table requires a schema and a rowkey property to make it explorable, whereas a file set requires a format and an exploreSchema. As a consequence, enabling explore is implemented in the platform (explore service) code, which has special treatment for all known types of explorable datasets. Instead, it would make more sense to delegate the generation of Hive DDL commands to the dataset type code: each dataset type implementation knows exactly how to create a corresponding Hive table. At the same time, we should standardize on a set of explore properties that are used across all dataset types, for example, the schema.
It should also be possible to enable or disable Explore for a dataset at any time during its lifecycle. That is not always a simple creation of a Hive table. For example, for a partitioned file set, this involves adding all the partitions that the dataset already has, and that can require a long running process. Again, this is better implemented by the dataset type itself than by the platform, and we need APIs that allow custom dataset types to provide an implementation.
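Delegating DDL generation to the dataset type could look roughly like the sketch below. ExploreSupport and SimpleColumnsExplore are hypothetical names; the generated statement is intentionally minimal and omits the storage-specific clauses that a real Hive table definition would need.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical interface that a dataset type could implement; not part of CDAP.
interface ExploreSupport {
  /** Returns the Hive statement that makes this dataset queryable under the given table name. */
  String createTableStatement(String tableName, Map<String, String> columns);
}

// Illustrative implementation for a dataset whose explore schema is a flat list of columns.
final class SimpleColumnsExplore implements ExploreSupport {
  @Override
  public String createTableStatement(String tableName, Map<String, String> columns) {
    StringJoiner cols = new StringJoiner(", ", "(", ")");
    for (Map.Entry<String, String> column : columns.entrySet()) {
      cols.add(column.getKey() + " " + column.getValue());
    }
    // Storage-specific clauses (storage handler, location, partitions) are left out on purpose.
    return "CREATE EXTERNAL TABLE IF NOT EXISTS " + tableName + " " + cols;
  }
}
```

With such an interface, the explore service would only execute what the type returns, instead of containing special cases for every known dataset type.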
Scenarios
Scenario 1. Dataset Type Used Only by a Single Application
This can almost be viewed as a private utility class of that application, except that the dataset may be explorable, and the dataset type's code and configuration may evolve over time along with the application. This is also the most simple and most common use case, and we want to make it super easy as follows:
- Dataset Type code is part of the application
- Upon deployment of the app, the dataset type is also deployed, and the dataset(s) of this type can be created as part of the same deployment step.
- When the app is redeployed, the dataset type is updated to the latest version of the code, and so are the datasets of this type.
- The developer/devops never needs to worry explicitly about versioning of the dataset or manually upgrading a dataset.
- Explore works seamlessly: It always picks up the latest version of the dataset code.
- If there are multiple versions of the application artifact (see Application Versioning Design), each application uses the version of the dataset type defined by its version of the artifact.
Scenario 2. Dataset Type Shared by Multiple Applications, no Data Sharing
This case is very similar to scenario 1, however, we need to solve the problem of distributing the code of the dataset type: In scenario 1, we would simply include it in the application code, but now this code is shared between multiple apps. Including the code in each app would mean code duplication, and, over time, divergence. If that is desired (which is possible), then it is wiser to simply use different type names in each app, and we have multiple instances of scenario 1. However, in most cases it will be desirable to share one implementation of the dataset code across all apps. There are two major alternatives:
- The dataset type is implemented as a separate library that is available as maven dependency to both apps:
- Both apps include this dataset type in their jar
- Every time one of the two apps is deployed, the dataset type is updated to that version of the code.
- The problem with this is that one application may use an older version of the dataset code than the one currently deployed. In that case:
- The update of the dataset type overrides the type's code with an outdated version.
- Because this code is used by Explore, queries for datasets created with a newer version of the code may not work any more.
- However, for ease of use, it should be possible for the developer(s) to deploy either app at any time without impacting other apps using the same dataset type.
- This is similar to the case of scenario 1, where multiple versions of the same dataset type coexist in different versions of the app artifact.
- The dataset type has an interface and an implementation:
- The interface is available to developers as maven dependency, whereas the implementation is deployed as a separate artifact in the dataset framework.
- In order to compile and package their apps, developers only need the interface.
- At runtime, CDAP injects the implementation of the dataset type into the programs.
- This means that the dataset type is not bundled with the apps any longer, and the deployment of an app has no effect on the code of a dataset type.
- However, it means increased complexity for app and dataset developers: Both the interface in maven and the dataset module in CDAP must be kept in sync.
- Note that this approach allows for separation of roles and skills in a larger organization: Dataset types can be developed and deployed independently from applications.
This scenario suggests that we need some kind of versioning for dataset types (and with that, dataset instances are bound to a specific version of the type).
Scenario 3. A Dataset is Maintained by a Single Organization and Shared with Many Applications
For example, a CustomerDirectory dataset is maintained by organization X in an enterprise. This dataset is used by many applications to look up customers. This dataset has a custom type with various methods to maintain its data; however, most of the applications only need one API: CustomerInfo getCustomer(String id).
- Applications that use this dataset need to include a dependency customer-api-1.0 in their pom in order to compile and package. (See the discussion of scenario 2 for why this should be a maven dependency).
- This actual dataset type implements the CustomerDirectory API, say using a class TableBasedCustomerDirectory in artifact customer-table-1.3.1.
- At runtime, when the app calls getDataset(), CDAP determines that the dataset instance has that type and version, and loads the class from that artifact.
- The actual dataset type has more methods in its API, including one that allows adding new customers. Therefore, the app that maintains this dataset includes the implementing artifact in its pom file.
- The implementation can be updated without changing the API. In this case, X deploys a new artifact customer-table-1.3.2 and upgrades the dataset to this version. The maintaining app must now pick up the new artifact the next time it runs. (Whether this requires recompiling/packaging the app is up for detailed design). No change is needed for the other applications that use this dataset, because CDAP always injects the correct version of the dataset type.
- The implementation can be updated with an interface change, for example, adding a new field to the CustomerInfo. To make this update seamless, a new artifact customer-table-1.4.0 is deployed, and both the dataset and the maintaining app are upgraded to this version. Then a new version of the API, customer-api-1.1, is deployed, and apps may now upgrade to this version. If they don't, then they will not see the new field, but that is fine for existing apps because their code does not use this field. Note that this requires that CustomerInfo is an interface (consisting mainly of getters) that has an implementation in the customer-table artifact. Similarly, a new method could be added to the interface, and applications that do not use this new interface do not require recompile and redeploy.
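The split between the published interface and the implementation in this scenario can be sketched as follows. The names CustomerDirectory, CustomerInfo, getCustomer(String id) and TableBasedCustomerDirectory come from the scenario; the getters on CustomerInfo, the addCustomer method, and the in-memory map standing in for a table are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// --- Would live in the API artifact (customer-api) ---
interface CustomerInfo {
  String getId();
  String getName(); // hypothetical field, for illustration only
}

interface CustomerDirectory {
  CustomerInfo getCustomer(String id);
}

// --- Would live in the implementation artifact (customer-table) ---
final class TableBasedCustomerDirectory implements CustomerDirectory {
  // An in-memory map stands in for the underlying table in this sketch.
  private final Map<String, CustomerInfo> rows = new HashMap<>();

  /** Maintenance method (hypothetical); not part of the published interface. */
  void addCustomer(final String id, final String name) {
    rows.put(id, new CustomerInfo() {
      @Override public String getId() { return id; }
      @Override public String getName() { return name; }
    });
  }

  @Override
  public CustomerInfo getCustomer(String id) {
    return rows.get(id);
  }
}
```

A consuming app would hold the dataset as a CustomerDirectory and call only getCustomer, so it compiles against the API artifact alone. Because CustomerInfo is an interface, adding a getter in a later API version does not break apps compiled against the earlier one.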
This scenario is one of the most complex, but the complexity is limited to the app that maintains the dataset as a service for others, who only need to know the published interface. This scenario also poses some important questions:
- what is the deployment mechanism for the two artifacts (customer-api and customer-table)?
- how does CDAP know that customer-table implements customer-api? Does it have to know?
- how can X migrate the dataset to a new data format without having control over the apps that consume it? Even after upgrading the dataset to a new version, X does not know when all apps have picked that up, because they may have long-running programs such as a flow or service that need to be restarted for picking up the new version.
Scenario 4. A Dataset is Created and Maintained by a Hydrator Pipeline
This is very similar to Scenario 3, but the dataset type and the dataset instance are defined by a Hydrator plugin. The plugin may embed the code for the dataset type in its own code, or it may depend on a dataset type artifact that was deployed separately. In either case, the dataset is subsequently available to Hydrator pipelines, applications and Explore, and it can be maintained using REST, CLI or UI.
The important distinction here is that the user does not write code (although somebody wrote the code for the plugin). The user should be able to deploy the dataset (and its type) without knowing about the mechanics of dataset (type) management.
Scenario 5. A Dataset is Created through the App Store or Marketplace
Again, this is very similar to Scenario 4, except that this time the user does not even interact with CDAP or Hydrator. He gets a dataset (and a pipeline that feeds it) from the Market with the click of a button, and the dataset is available to Explore, and also to other pipelines and apps.
Other Scenarios
It is virtually impossible to list all possible scenarios, but it is important to realize that any combination of the above scenarios must work seamlessly. For example, a dataset may be maintained by multiple apps, and still shared with many others. Or a dataset may be created through a Hydrator pipeline but shared with many other pipelines or apps. That also means that the simplest of use cases (Scenario 1) must be interoperable with the most complex one (Scenario 3). Also, any time there is a conflict between different apps, pipelines, plugins, or app store artifacts that attempt to create the same dataset, but with different types, or with a version conflict, etc., this conflict must be detected by CDAP and reported back to the user in a clear and easy-to-read way.
User Stories
[DTM] Dataset Type Management
- As an app developer, I want to include the code of a dataset type in my app artifact, and create a dataset of that type when deploying the app.
- As an app developer, I want to deploy a new version of a dataset type as part of deploying a new version of the app that includes it, and I expect that all dataset instances of that type that were created as part of the app deployment start using the new code.
- As an app developer, I want to share a dataset type that I had previously deployed as part of an app.
- As an app developer, I want to deploy a new version of a dataset type as part of an app artifact, without affecting other datasets of this type.
- As an app developer, I want to explore a dataset instance of a type that was deployed as part of an app.
- As an app developer, I expect that deploying an artifact without creating an app will not create any dataset types or instances (that is, this only happens when creating an app).
- As an app developer, I want to share a dataset type across multiple applications that include the dataset type's code in their artifacts.
- As an app developer, when deploying a new version of an app that includes a shared dataset type, I expect that all dataset instances created by this app start using the new code, but all dataset instances created by other apps remain unchanged.
- As an app developer, I want to deploy a new version of an app that includes an older version of a dataset type deployed by another app, and I expect that the dataset instances created by this app use the dataset type code included in this app.
- As an app developer, when I deploy a new version of an app that includes a different version of a dataset type deployed by another app, and this app shares a dataset instance of this type with the other app, the deployment will fail with a version conflict error. (Because otherwise I might "downgrade" the instance to an older version, making it incompatible with the other app).
Note: This use case needs discussion. What is proper behavior? How can we prevent data corruption due to unintentional "downgrade" without restricting ease of use too much?
- As a dataset developer, I want to deploy a dataset type independent from any app, and allow apps to create and use dataset instances of that type.
- As a dataset developer, I want to separate the interface from the implementation of a dataset type.
- As an app developer, I want to only depend on the interface of a dataset type in my app, and have the system inject the implementation at runtime.
- As an app developer, I want to write unit tests for an app that depends on the interface of a dataset type. (This means I need an extra dependency with test scope in my pom.xml)
- As a dataset developer, I want to assign explicit versions to the code of a dataset type.
- As a dataset developer, I want to deploy a new version of a dataset type without affecting the dataset instances of that type.
- As an app developer, I want to create a dataset instance with a specific version of a dataset type.
- As a dataset developer, I want to have the option of implementing an "upgrade step" for when a dataset instance is upgraded to a new version of the dataset type.
- As a dataset developer, I want to have a way to reject an upgrade of a dataset instance to a newer version of its type, if the upgrade is not compatible.
- As a dataset developer, I want to have the option of implementing a migration procedure that can be run after an upgrade of a dataset instance to a new version of its type. This can be a long-running (background) process.
- As a dataset developer, I want to implement custom administrative operations (such as "compaction", or "rebalance") that are not common to all dataset types.
- As an app developer, I want to perform custom administrative operations on dataset instances from my app, the CLI, REST, or the UI.
- As a dataset developer, I want to explore a dataset instance created from a dataset type that was deployed by itself.
- As a dataset developer, I want to delete outdated versions of a dataset type. I expect this to fail if there are any dataset instances with that version of the type.
- As a dataset developer, I want to list all dataset instances that use a dataset type, or a specific version of a type.
- As a data scientist or app developer, I want to be able to create a dataset instance of an existing dataset type without writing code.
- As a data scientist or app developer, I want to be able to upgrade a dataset instance to a new version of its code.
- As a hydrator user, I want to create a pipeline that reads or writes an existing dataset instance.
- As a hydrator user, I want to create a pipeline that reads or writes a new dataset instance, and I want to create that dataset instance as part of pipeline creation.
- As a hydrator user, I want to specify an explicit version of the dataset types of the dataset instances created by my pipeline, and I expect pipeline creation to fail (similar to app creation) if that results in incompatible upgrade of an existing dataset instance that is shared with other apps or pipelines.
- As a hydrator user, I want to explore the datasets created by my pipeline.
- As a hydrator user, I expect all dataset instances created by apps to be available as sinks and sources for pipelines (if there is a corresponding plugin).
- As an app developer, I expect all dataset instances created by Hydrator pipelines to be accessible to the app.
- As a plugin developer, I want to include the code for a dataset type in the plugin artifact. When a pipeline using this plugin is created, a dataset instance of that type is created, and it is explorable and available to apps.
- As a plugin developer, I want to use a custom dataset type (that was deployed independently or as part of an app) inside the plugin.
- As a plugin developer, I want to upgrade the code of a dataset type used by a dataset instance created by that plugin, when I deploy a new version of the plugin and update the pipeline to use that version.
- As a pipeline developer, I want to upgrade a dataset instance to a newer version of the code after the pipeline was created.
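Two of the stories above, the version conflict on a shared instance and the refusal to delete a type version that is still in use, can be expressed as simple checks. The sketch below is one possible reading of those stories, not a decided behavior (the conflict story is explicitly marked as needing discussion). InstanceVersionTracker is a hypothetical name, and the sketch tracks versions of a single dataset type only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; tracks which type version each dataset instance is bound to.
final class InstanceVersionTracker {
  private final Map<String, String> instanceVersion = new HashMap<>();
  private final Map<String, Set<String>> instanceUsers = new HashMap<>();

  /**
   * Binds an instance to a type version on behalf of an app. Fails if the version differs
   * from the current one and at least one other app also uses the instance.
   */
  void bind(String app, String instance, String typeVersion) {
    String current = instanceVersion.get(instance);
    Set<String> users = instanceUsers.computeIfAbsent(instance, k -> new HashSet<>());
    boolean sharedWithOthers = users.stream().anyMatch(user -> !user.equals(app));
    if (current != null && !current.equals(typeVersion) && sharedWithOthers) {
      throw new IllegalStateException("Version conflict on " + instance + ": bound to "
          + current + ", but " + app + " requires " + typeVersion);
    }
    instanceVersion.put(instance, typeVersion);
    users.add(app);
  }

  /** A type version may be deleted only if no instance is bound to it. */
  boolean canDeleteVersion(String typeVersion) {
    return !instanceVersion.containsValue(typeVersion);
  }
}
```

In this reading, an app that is the sole user of an instance may move it to any version, while a second app that brings a different version is rejected at deployment time.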
[DIC] Dataset Instance Configuration
[Note: "As a user" refers to app developers, data scientists, dev-ops, Hydrator users, or pipeline developers]
- As a user, when creating a dataset instance, I want to find out what properties are supported by the dataset type, what values are allowed, and what the defaults are.
- As a user, I want to specify the schema of a dataset in a uniform way across all dataset types.
- As a user, I want to specify schema as a JSON string (verbose, Avro-style).
- As a user, I want to specify schema as a SQL schema string (brief, Hive-style).
- As a user, I want to configure time-to-live (TTL) in a uniform way across all dataset types.
- As a user, I want to see the properties that were used to configure a dataset instance.
- As a user, I want to find out what properties of a dataset can be updated.
- As a user, I want to update the properties of a dataset instance. I expect this to fail if the new properties are not compatible, with a meaningful error message.
- As a user, I want to update a single property of a dataset instance, without knowing all other properties. For example, set the TTL without having to know the schema.
- As a user, I want to remove a single property of a dataset instance, without knowing all other properties. For example, remove the TTL without having to know the schema.
- As a user, I want to trigger a migration process for a dataset if updating its properties requires that.
- As a user, I expect that if reconfiguration of a dataset fails, then no changes have taken effect. In other words, all steps required to reconfigure a dataset must be done as a single atomic action.
- As an app developer, I expect that application creation fails if any of its datasets cannot be created.
- As an app developer, I expect that application redeployment fails if any of its datasets cannot be reconfigured (if the new app spec specifies different configuration).
- As an app developer, when creating a dataset as part of app deployment, I want to tolerate existing datasets if their properties are different but compatible. For example, I can configure the dataset schema, but leave the existing TTL of a table untouched.
- As a pipeline designer, I want to use an existing dataset as a sink or source. If the schema (or any other property) of the dataset is incompatible with what the pipeline requires, I expect that pipeline creation fails with a meaningful error message.
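The stories about updating or removing a single property, and about reconfiguration being all-or-nothing, suggest a patch-style update. The following is a minimal sketch under two assumptions that are not part of the source: a null value in the change set means "remove this property", and the set of non-updatable keys is known up front. PropertyPatchSketch is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a partial, all-or-nothing property update.
final class PropertyPatchSketch {
  /**
   * Applies the changes to a copy of the current properties and returns the result.
   * A null value removes the property. If any change touches an immutable key, an
   * exception is thrown and the current properties are left untouched.
   */
  static Map<String, String> patch(Map<String, String> current,
                                   Map<String, String> changes,
                                   Set<String> immutableKeys) {
    Map<String, String> updated = new HashMap<>(current);
    for (Map.Entry<String, String> change : changes.entrySet()) {
      String key = change.getKey();
      if (immutableKeys.contains(key)) {
        throw new IllegalArgumentException("Property cannot be changed: " + key);
      }
      if (change.getValue() == null) {
        updated.remove(key);
      } else {
        updated.put(key, change.getValue());
      }
    }
    return Collections.unmodifiableMap(updated);
  }
}
```

Because the changes are applied to a copy, a caller can set or remove the TTL without supplying the schema, and a rejected change leaves the existing configuration as it was.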
[EI] Explore Integration
- As a user, I want to specify as part of dataset configuration whether it is explorable.
- As a user, I do not want to specify the explore schema (and format) as separate properties if they can be derived from other standard dataset properties.
- As a user, I want to specify the explore schema separately (for example, only include a subset of the fields of a table, or name fields differently).
- As a user, I expect that dataset creation fails if the dataset cannot be enabled for explore.
- As a user, I expect that dataset reconfiguration fails if the corresponding update of the explore table fails.
- As a user, I expect that a dataset operation fails if it fails to make its required changes to explore.
- As a user, I expect that an update of explore never leads to silent loss of data (or data available for explore). If, for example, partitions would be dropped from the explore table, I want to have the option to either fail the update, or to be notified of the drop and have a tool to bring explore in sync with the data.
- As a user, I want to enable explore for a dataset that was not configured for explore initially.
- As a user, I want to disable explore for a dataset that was configured for explore initially.