...

Also, the current implementation of dataset admin execution is not transactional: If it fails, it may leave behind partial artifacts of data creation. For example, if a composite dataset embeds two datasets, and creation of the first succeeds but the second fails, then the first one remains as a leftover in the physical storage, without any clue in CDAP metadata about its existence. The same applies to dropping and reconfiguring datasets.
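One way to picture the leftover problem, and a possible mitigation, is a compensating rollback: if creation of a later embedded dataset fails, the ones already created are dropped again. The following is a minimal sketch only. EmbeddedAdmin, InMemoryAdmin and CompositeCreate are hypothetical names introduced for illustration and are not part of the CDAP API. Note that this is best-effort cleanup, not a true transaction, because a drop can itself fail.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical admin interface for one embedded dataset (not the CDAP API). */
interface EmbeddedAdmin {
    void create() throws Exception;
    void drop() throws Exception;
}

/** In-memory stand-in for physical storage, used only to demonstrate the idea. */
final class InMemoryAdmin implements EmbeddedAdmin {
    private final String name;
    private final List<String> storage;
    private final boolean failOnCreate;

    InMemoryAdmin(String name, List<String> storage, boolean failOnCreate) {
        this.name = name;
        this.storage = storage;
        this.failOnCreate = failOnCreate;
    }

    @Override
    public void create() throws Exception {
        if (failOnCreate) {
            throw new IllegalStateException("cannot create " + name);
        }
        storage.add(name);
    }

    @Override
    public void drop() {
        storage.remove(name);
    }
}

final class CompositeCreate {
    /** Creates all embedded datasets; on failure, drops the ones already created. */
    static void createAll(List<EmbeddedAdmin> admins) throws Exception {
        Deque<EmbeddedAdmin> created = new ArrayDeque<>();
        try {
            for (EmbeddedAdmin admin : admins) {
                admin.create();
                created.push(admin);
            }
        } catch (Exception e) {
            // Undo in reverse order of creation; the original failure stays the primary error.
            while (!created.isEmpty()) {
                try {
                    created.pop().drop();
                } catch (Exception dropFailure) {
                    e.addSuppressed(dropFailure);
                }
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        List<String> storage = new ArrayList<>();
        List<EmbeddedAdmin> admins = new ArrayList<>();
        admins.add(new InMemoryAdmin("first", storage, false));
        admins.add(new InMemoryAdmin("second", storage, true));
        try {
            createAll(admins);
        } catch (Exception e) {
            System.out.println("creation failed, leftovers in storage: " + storage);
        }
    }
}
```

In this sketch the first embedded dataset is created and then dropped again when the second one fails, so nothing remains in the simulated storage.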

...

It should also be possible to enable or disable Explore for a dataset at any time during its lifecycle. That is not always a simple creation of a Hive table. For example, for a partitioned file set, this involves adding all the partitions that the dataset already has, and that can require a long-running process. Again, this is better implemented by the dataset type itself than by the platform, and we need APIs that allow custom dataset types to provide an implementation.
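To illustrate what such an API might look like, here is a minimal sketch in which the dataset type, not the platform, decides how Explore is enabled. All names (ExploreTables, ExploreAware, PartitionedDataset, RecordingExploreTables) are hypothetical and chosen for this example only. They do not describe an existing CDAP interface.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical view of the Explore (Hive) side; not an actual CDAP interface. */
interface ExploreTables {
    void createTable(String table);
    void addPartition(String table, String partitionKey);
}

/** Hypothetical hook that a custom dataset type could implement. */
interface ExploreAware {
    void enableExplore(ExploreTables explore, String table);
}

/** A partitioned dataset that registers every existing partition when Explore is enabled. */
final class PartitionedDataset implements ExploreAware {
    private final List<String> partitions = new ArrayList<>();

    void addPartition(String key) {
        partitions.add(key);
    }

    @Override
    public void enableExplore(ExploreTables explore, String table) {
        explore.createTable(table);
        // With many partitions this loop is the potentially long-running part.
        for (String key : partitions) {
            explore.addPartition(table, key);
        }
    }
}

/** In-memory stand-in that only records the calls it receives. */
final class RecordingExploreTables implements ExploreTables {
    final List<String> calls = new ArrayList<>();

    @Override
    public void createTable(String table) {
        calls.add("create " + table);
    }

    @Override
    public void addPartition(String table, String partitionKey) {
        calls.add("add " + table + "/" + partitionKey);
    }
}
```

A plain table type could implement the same hook with a single createTable call, which is why the decision is better left to the dataset type.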

Scenarios

Scenario 1: Dataset Type Used Only by a Single Application

This can almost be viewed as a private utility class of that application, except that the dataset may be explorable, and the dataset type's code and configuration may evolve over time along with the application. This is also the simplest and most common use case, and we want to make it super-easy as follows:

  • Dataset Type code is part of the application
  • Upon deployment of the app, the dataset type is also deployed, and the dataset(s) of this type can be created as part of the same deployment step. 
  • When the app is redeployed, the dataset type is updated to the latest version of the code, and so are the datasets of this type. 
  • The developer/devops never needs to worry explicitly about versioning of the dataset or manually upgrading a dataset. 
  • Explore works seamlessly: It always picks up the latest version of the dataset code. 
  • If there are multiple versions of the application artifact (see Application Versioning Design), each application uses the version of the dataset type defined by its version of the artifact. 

Scenario 2: Dataset Type Shared by Multiple Applications, no Data Sharing

This case is very similar to scenario 1. However, we need to solve the problem of distributing the code of the dataset type: in scenario 1, we would simply include it in the application code, but now this code is shared between multiple apps. Including the code in each app would mean code duplication, and, over time, divergence. If that is desired (which is possible), then it is wiser to simply use different type names in each app, and we have multiple instances of scenario 1. However, in most cases it will be desirable to share one implementation of the dataset code across all apps. There are two major alternatives:

  1. The dataset type is implemented as a separate library that is available as a maven dependency to both apps:
    • Both apps include this dataset type in their jar
    • Every time one of the two apps is deployed, the dataset type is updated to that version of the code. 
    • The problem with this is that one application may use an older version of the dataset code than the one currently deployed. In that case: 
      • The update of the dataset type overrides the type's code with an outdated version. 
      • Because this code is used by Explore, queries for datasets created with a newer version of the code may not work any more. 
    • However, for ease of use, it should be possible for the developer(s) to deploy either app at any time without impacting other apps using the same dataset type. 
    • This is similar to the case of scenario 1, where multiple versions of the same dataset type coexist in different versions of the app artifact. 

  2. The dataset type has an interface and an implementation:
    • The interface is available to developers as a maven dependency, whereas the implementation is deployed as a separate artifact in the dataset framework. 
    • In order to compile and package their apps, developers only need the interface. 
    • At runtime, CDAP injects the implementation of the dataset type into the programs. 
    • This means that the dataset type is not bundled with the apps any longer, and the deployment of an app has no effect on the code of a dataset type.
    • However, it means increased complexity for app and dataset developers: both the interface in maven and the dataset module in CDAP must be kept in sync.
    • Note that this approach allows for separation of roles and skills in a larger organization: dataset types can be developed and deployed independently from applications.

This scenario suggests that we need some kind of versioning for dataset types (and with that, dataset instances are then bound to a specific version of the type).
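A minimal sketch of what binding instances to a type version could mean is shown below. VersionedTypeRegistry and its methods are hypothetical and exist only for illustration. Code is keyed by type name and version, so deploying an older version from one app does not replace the code used by instances that were created with a newer version.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical registry; the "code" of a type is represented by a plain string. */
final class VersionedTypeRegistry {
    private final Map<String, String> codeByTypeVersion = new HashMap<>();
    private final Map<String, String> typeVersionByInstance = new HashMap<>();

    private static String key(String type, String version) {
        return type + ":" + version;
    }

    /** Deploys one version of a type without touching any other version. */
    void deployType(String type, String version, String code) {
        codeByTypeVersion.put(key(type, version), code);
    }

    /** Binds a new dataset instance to one specific version of its type. */
    void createInstance(String instance, String type, String version) {
        String k = key(type, version);
        if (!codeByTypeVersion.containsKey(k)) {
            throw new IllegalArgumentException("unknown dataset type " + k);
        }
        typeVersionByInstance.put(instance, k);
    }

    /** Returns the code of the type version the instance is bound to. */
    String codeFor(String instance) {
        String k = typeVersionByInstance.get(instance);
        if (k == null) {
            throw new IllegalArgumentException("unknown dataset instance " + instance);
        }
        return codeByTypeVersion.get(k);
    }
}
```

With such a binding, Explore would resolve the code through the instance and would not be affected when another app deploys an outdated version of the same type.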

Scenario 3: A Dataset is Maintained by a Single Organization and Shared with Many Applications

For example, a CustomerDirectory dataset is maintained by organization X in an enterprise. This dataset is used by many applications to look up customers. This dataset has a custom type with various methods to maintain its data; however, most of the applications only need one API: CustomerInfo getCustomer(String id).  

  • Applications that use this dataset need to include a dependency customer-api-1.0 in their pom in order to compile and package. (See the discussion of scenario 2 for why this should be a maven dependency). 
  • The actual dataset type implements the CustomerDirectory API, say using a class TableBasedCustomerDirectory in artifact customer-table-1.3.1.
  • At runtime, when the app calls getDataset(), CDAP determines that the dataset instance has that type and version, and loads the class from that artifact. 
  • The actual dataset type has more methods in its API, including one that allows adding new customers. Therefore, the app that maintains this dataset includes the implementing artifact in its pom file.
  • The implementation can be updated without changing the API. In this case, X deploys a new artifact customer-table-1.3.2 and upgrades the dataset to this version. The maintaining app must now pick up the new artifact the next time it runs. (Whether this requires recompiling/packaging the app is up for detailed design). No change is needed for the other applications that use this dataset, because CDAP always injects the correct version of the dataset type.
  • The implementation can be updated with an interface change, for example, adding a new field to CustomerInfo. To make this update seamless, a new artifact customer-table-1.4.0 is deployed, and both the dataset and the maintaining app are upgraded to this version. Then a new version of the API, customer-api-1.1, is deployed, and apps may now upgrade to this version. If they don’t, then they will not see the new field, but that is fine for existing apps because their code does not use this field. Note that this requires that CustomerInfo be an interface (consisting mainly of getters) that has an implementation in the customer-table artifact. Similarly, a new method could be added to the interface, and applications that do not use the new version of the interface will not need to be recompiled and redeployed.
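The interface evolution described in the bullets above can be sketched as follows. CustomerDirectory, CustomerInfo, getCustomer and TableBasedCustomerDirectory come from the scenario. Everything else is an assumption made for illustration: the getters, the email field, addCustomer, and in particular CustomerInfoV11. In practice customer-api-1.1 would add the getter to CustomerInfo itself. A separate name is used here only so that both shapes fit into one file.

```java
import java.util.HashMap;
import java.util.Map;

/** Shape of CustomerInfo as an app compiled against customer-api-1.0 sees it (getters assumed). */
interface CustomerInfo {
    String getId();
    String getName();
}

/** Stand-in for CustomerInfo in customer-api-1.1, which adds one field (hypothetical). */
interface CustomerInfoV11 extends CustomerInfo {
    String getEmail();
}

/** The published read API that most applications use. */
interface CustomerDirectory {
    CustomerInfo getCustomer(String id);
}

/** Implementation as it might look in customer-table-1.4.0, backed by a map instead of a table. */
final class TableBasedCustomerDirectory implements CustomerDirectory {
    private final Map<String, CustomerInfoV11> rows = new HashMap<>();

    /** Maintenance method that only the maintaining app uses (hypothetical signature). */
    void addCustomer(final String id, final String name, final String email) {
        rows.put(id, new CustomerInfoV11() {
            @Override public String getId() { return id; }
            @Override public String getName() { return name; }
            @Override public String getEmail() { return email; }
        });
    }

    @Override
    public CustomerInfo getCustomer(String id) {
        return rows.get(id);
    }
}

final class LegacyApp {
    /** Written against the 1.0 shape: it never calls the new getter, so it keeps working. */
    static String greet(CustomerDirectory directory, String id) {
        CustomerInfo info = directory.getCustomer(id);
        return info == null ? "unknown customer" : "Hello " + info.getName();
    }
}
```

Because callers depend only on the getters they invoke, adding a getter to the interface leaves existing callers binary compatible in Java, which is what allows apps to upgrade to the new API version at their own pace.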

This scenario is one of the most complex, but the complexity is limited to the app that maintains the dataset as a service for others, who only need to know the published interface. This scenario also poses some important questions:

  • What is the deployment mechanism for the two artifacts (customer-api and customer-table)?
  • How does CDAP know that customer-table implements customer-api? Does it have to know?
  • How can X migrate the dataset to a new data format without having control over the apps that consume it? Even after upgrading the dataset to a new version, X does not know when all apps have picked it up, because they may have long-running programs, such as a flow or service, that need to be restarted to pick up the new version.

Scenario 4: A Dataset is Created and Maintained by a Hydrator Pipeline

This is very similar to Scenario 3, but the dataset type and the dataset instance are defined by a Hydrator plugin. The plugin may embed the code for the dataset type in its own code, or it may depend on a dataset type artifact that was deployed separately. In either case, the dataset is subsequently available to Hydrator pipelines, applications and Explore, and it can be maintained using REST, CLI, or UI. 

The important distinction here is that the user does not write code (although somebody wrote the code for the plugin). The user should be able to deploy the dataset (and its type) without knowing about the mechanics of dataset (type) management.

Scenario 5: A Dataset is Created through the App Store or Marketplace

Again, this is very similar to Scenario 4, except that this time the user does not even interact with CDAP or Hydrator. The user gets a dataset (and a pipeline that feeds it) from the Marketplace with the click of a button, and the dataset is available to Explore, and also to other pipelines and apps.

...

  1. As a user, I want to specify as part of dataset configuration whether it is explorable.
  2. As a user, I do not want to specify the explore schema (and format) as separate properties if they can be derived from other standard dataset properties.
  3. As a user, I want to specify the explore schema separately (for example, only include a subset of the fields of a table, or name fields differently).
  4. As a user, I expect that dataset creation fails if the dataset cannot be enabled for explore.
  5. As a user, I expect that dataset reconfiguration fails if the corresponding update of the explore table fails.
  6. As a user, I expect that a dataset operation fails if it fails to make its required changes to explore.
  7. As a user, I expect that an update of explore never leads to silent loss of data (or data available for explore). If, for example, partitions would be dropped from the explore table, I want to have the option to either cancel the update, or to be notified of the drop and have a tool to bring explore in sync with the data.
  8. As a user, I want to enable explore for a dataset that was not configured for explore initially.
  9. As a user, I want to disable explore for a dataset that was configured for explore initially.
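Stories 1 through 4 can be read as a small resolution rule over the dataset properties. The sketch below is one possible reading, not a specification. The property names explore.enabled, explore.schema and schema are hypothetical, as is the assumption that a dataset is not explorable unless it says so.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper that decides which schema, if any, Explore should use. */
final class ExploreSettings {
    // All three property names are assumptions made for this example.
    static final String EXPLORE_ENABLED = "explore.enabled";
    static final String EXPLORE_SCHEMA = "explore.schema";
    static final String SCHEMA = "schema";

    static Optional<String> resolveExploreSchema(Map<String, String> props) {
        // Story 1: explorability is part of the dataset configuration (absent means false here).
        if (!Boolean.parseBoolean(props.get(EXPLORE_ENABLED))) {
            return Optional.empty();
        }
        // Story 3: an explicitly configured explore schema takes precedence.
        String explicit = props.get(EXPLORE_SCHEMA);
        if (explicit != null) {
            return Optional.of(explicit);
        }
        // Story 2: otherwise derive the explore schema from the standard schema property.
        String derived = props.get(SCHEMA);
        if (derived != null) {
            return Optional.of(derived);
        }
        // Story 4: fail dataset creation instead of silently skipping explore.
        throw new IllegalArgumentException("dataset is explorable but no schema can be determined");
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put(EXPLORE_ENABLED, "true");
        props.put(SCHEMA, "id STRING, name STRING");
        System.out.println(resolveExploreSchema(props).orElse("not explorable"));
    }
}
```

Stories 5 through 9 concern updates over the lifecycle of a dataset and are not covered by this sketch.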

...