...

...


Goal

In CDAP 4.0, the main theme for Datasets is improving/establishing proper and semantically sound dataset management. That includes the management of dataset types (code), and the management of dataset instances (actual data) throughout their life cycle. The current dataset framework has various shortcomings that need to be addressed. This document will discuss each area of improvement, list end-to-end use cases and requirements, and finally address the design to implement the requirements.

...

  •  User stories documented (Andreas)
  •  User stories reviewed (Nitin)
  •  User stories reviewed (Todd)
  •  Requirements documented (Andreas)
  •  Requirements Reviewed
  •  Mockups Built
  •  Design Built
  •  Design Accepted

...

It is virtually impossible to list all possible scenarios, but it is important to realize that any combination of the above scenarios must work seamlessly. For example, a dataset may be maintained by multiple apps, and still shared with many others. Or a dataset may be created through a Hydrator pipeline but shared with many other pipelines or apps. That also means that the simplest of use cases (Scenario 1) must be interoperable with the most complex one (Scenario 3). Also, any time there is a conflict between different apps, pipelines, plugins, or app store artifacts that attempt to create the same dataset, but with different types, or with a version conflict, etc., this conflict must be detected by CDAP and reported back to the user in a clear and easy-to-read way.

User Stories

This collection of stories represents the vision that we have for dataset management. It is a living document and will be maintained over time. In each release, we need to determine and prioritize which of these stories are in scope. 

[DTM] Dataset Type Management

  1. As an app developer, I want to include the code of a dataset type in my app artifact, and create a dataset of that type when deploying the app.
  2. As an app developer, I want to deploy a new version of a dataset type as part of deploying a new version of the app that includes it, and I expect that all dataset instances of that type that were created as part of the app deployment start using the new code. 
  3. As an app developer, I want to deploy a new version of a dataset type as part of an app artifact, without affecting other datasets of this type.
  4. As an app developer, I want to explore a dataset instance of a type that was deployed as part of an app.
  5. As an app developer, I expect that deploying an artifact without creating an app will not create any dataset types or instances (that is, this only happens when creating an app).
  6. As an app developer, I want to share a dataset type across multiple applications that include the dataset type's code in their artifacts.
  7. As an app developer, when deploying a new version of an app that includes a shared dataset type, I expect that all dataset instances created by this app start using the new code, but all dataset instances created by other apps remain unchanged.
  8. As an app developer, I want to deploy a new version of an app that includes an older version of a dataset type deployed by another app, and I expect that the dataset instances created by this app use the dataset type code included in this app.
  9. As an app developer, when I deploy a new version of an app that includes a different version of a dataset type deployed by another app, and this app shares a dataset instance of this type with the other app, I expect the deployment to fail with a version conflict error. (Otherwise I might "downgrade" the instance to an older version, making it incompatible with the other app.)
    Note: This use case needs discussion. What is the proper behavior? How can we prevent data corruption due to an unintentional "downgrade" without restricting ease of use too much?

  10. As an app developer, I want to share a dataset type that I had previously deployed as part of an app.
  11. As a dataset developer, I want to deploy a dataset type independent from any app, and allow apps to create and use dataset instances of that type.
  12. As a dataset developer, I want to have the option of forcing applications to have the dataset code injected at runtime (that gives me control over what version of the code apps use).
  13. As a dataset developer, I need an archetype that helps me package my dataset type properly.
  14. As a dataset developer, I want to separate the interface from the implementation of a dataset type.
  15. As an app developer, I want to only depend on the interface of a dataset type in my app, and have the system inject the implementation at runtime. 
  16. As an app developer, I want to write unit tests for an app that depends on the interface of a dataset type. (This means I need an extra dependency with test scope in my pom.xml.)
  17. As a dataset developer, I want to assign explicit versions to the code of a dataset type.
  18. As a dataset developer, I want to deploy a new version of a dataset type without affecting the dataset instances of that type.
  19. As an app developer, I want to create a dataset instance with a specific version of a dataset type. 
  20. As a dataset developer, I want to explore a dataset instance created from a dataset type that was deployed by itself.
  21. As a dataset developer, I want to delete outdated versions of a dataset type. I expect this to fail if there are any dataset instances with that version of the type.
  22. As a dataset developer, I want to list all dataset instances that use a dataset type, or a specific version of a type.
  23. As a data scientist or app developer, I want to be able to create a dataset instance of an existing dataset type without writing code.
  24. As a data scientist or app developer, I want to be able to upgrade a dataset instance to a new version of its code.
  25. As a hydrator user, I want to create a pipeline that reads or writes an existing dataset instance.
  26. As a hydrator user, I want to create a pipeline that reads or writes a new dataset instance, and I want to create that dataset instance as part of pipeline creation.
  27. As a hydrator user, I want to specify an explicit version of the dataset types of the dataset instances created by my pipeline, and I expect pipeline creation to fail (similar to app creation) if that results in an incompatible upgrade of an existing dataset instance that is shared with other apps or pipelines.
  28. As a hydrator user, I want to explore the datasets created by my pipeline.
  29. As a hydrator user, I expect all dataset instances created by apps to be available as sinks and sources for pipelines (if there is a corresponding plugin).
  30. As an app developer, I expect all dataset instances created by Hydrator pipelines to be accessible to the app.
  31. As a plugin developer, I want to include the code for a dataset type in the plugin artifact. When a pipeline using this plugin is created, a dataset instance of that type is created, and it is explorable and available to apps.
  32. As a plugin developer, I want to use a custom dataset type (that was deployed independently or as part of an app) inside the plugin.
  33. As a plugin developer, I want to upgrade the code of a dataset type used by a dataset instance created by that plugin, when I deploy a new version of the plugin and update the pipeline to use that version.
  34. As a pipeline developer, I want to upgrade a dataset instance to a newer version of the code after the pipeline was created.
  35. As a dataset developer, I want to have the option of implementing an "upgrade step" for when a dataset instance is upgraded to a new version of the dataset type.
  36. As a dataset developer, I want to have a way to reject an upgrade of a dataset instance to a newer version of its type, if the upgrade is not compatible.
  37. As a dataset developer, I want to have the option of implementing a migration procedure that can be run after an upgrade of a dataset instance to a new version of its type. This can be a long-running (background) process.
  38. As a developer, I want to take a dataset "offline" so that I can perform a long-running maintenance or migration procedure.
  39. As a dataset developer, I want to implement custom administrative operations (such as "compaction", or "rebalance") that are not common to all dataset types.
  40. As an app developer, I want to perform custom administrative operations on dataset instances from my app, the CLI, REST, or the UI.

[DIC] Dataset Instance Configuration

[Note: "As a user" refers to app developers, data scientists, dev-ops, Hydrator users, or pipeline developers]

  1. As a user, when creating a dataset instance, I want to find out what properties are supported by the dataset type, what values are allowed, and what the defaults are. 
  2. As a user, I want to specify the schema of a dataset in a uniform way across all dataset types.
  3. As a user, I want to specify schema as a JSON string (verbose, Avro-style).
  4. As a user, I want to specify schema as a SQL schema string (brief, Hive-style).
  5. As a user, I want to configure time-to-live (TTL) in a uniform way across all dataset types. 
  6. As a user, I want to see the properties that were used to configure a dataset instance.
  7. As a user, I want to find out what properties of a dataset can be updated.  
  8. As a user, I want to update the properties of a dataset instance. I expect this to fail if the new properties are not compatible, with a meaningful error message.
  9. As a user, I want to update a single property of a dataset instance, without knowing all other properties. For example, set the TTL without having to know the schema. 
  10. As a user, I want to remove a single property of a dataset instance, without knowing all other properties. For example, remove the TTL without having to know the schema. 
  11. As a user, I want to trigger a migration process for a dataset if updating its properties requires that.
  12. As a user, I expect that if reconfiguration of a dataset fails, then no changes have taken effect. In other words, all steps required to reconfigure a dataset must be done as a single atomic action.
  13. As an app developer, I expect that application creation fails if any of its datasets cannot be created.
  14. As an app developer, I expect that application redeployment fails if any of its datasets cannot be reconfigured (if the new app spec specifies different configuration). 
  15. As an app developer, when creating a dataset as part of app deployment, I want to tolerate existing datasets if their properties are different but compatible. For example, I can configure the dataset schema, but leave the existing TTL of a table untouched.
  16. As a pipeline designer, I want to use an existing dataset as a sink or source. If the schema (or any other property) of the dataset is incompatible with what the pipeline requires, I expect that pipeline creation fails with a meaningful error message. 

...

[DTM] Dataset Type Management

  1. Unification of artifact management for plugins, apps and dataset types
    1. Explicit versioning of dataset types
    2. Coexistence of multiple versions of the same dataset type
    3. Explicit dependency of a dataset instance on a specific version of its type
    4. Explicit upgrade of a dataset instance to a new version of its type
  2. Preserve the easy experience of self-contained apps using an implicit versioning scheme
    1. Apps can bundle dataset types and create dataset instances of those types
    2. At runtime, such dataset types are loaded from the program jar
    3. Seamless experience when redeploying such an app
  3. Dataset Admin needs a new method to upgrade to a new version of the type
    1. This method can reject the upgrade
  4. Hydrator Plugins can also contain dataset types
  5. Ability to take a dataset instance offline for a migration procedure
  6. Injection of dataset code at runtime
    1. Not for dataset types embedded in app artifact (see 2.)
    2. Always inject the version of the type that the dataset instance is tagged with
    3. No noticeable performance degradation (some degradation is expected due to code injection at first instantiation of a type)
  7. Backward-compatibility with existing dataset modules
  8. Versioning for system dataset types
    1. Core types always use latest system version
    2. Composite types: TBD
  9. New dataset admin API for performing custom actions
  10. New versioned REST and CLI methods for versioned type and instance management
  11. Maven archetype for dataset artifacts

[DIC] Dataset Instance Configuration

  1. New dataset API to retrieve the properties accepted by a type
    1. what the accepted values are
    2. whether they are mutable
    3. whether they are required
    4. what the default value is
  2. Schema as a standardized system property
    1. Validation of schema
    2. Specify schema in Avro or SQL style
    3. All system datasets to use new schema property
  3. New API to update or remove a single property of a dataset
  4. Ability to "merge" dataset properties without changing existing ones, failing if an existing property would have to change
  5. Dataset Management Operations are atomic
    1. Always leave behind a consistent state
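
To make requirement 1 above more concrete, here is a minimal sketch of what a property descriptor and a validation pass over it could look like. This is not an existing CDAP API: the names PropertyDescriptor and PropertyValidator, the property names, and the allowed values are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical descriptor for one property accepted by a dataset type. */
final class PropertyDescriptor {
  final String name;
  final Set<String> allowedValues; // empty set means "any value"
  final boolean mutable;           // whether the property may be updated later
  final boolean required;
  final String defaultValue;       // null means "no default"

  PropertyDescriptor(String name, Set<String> allowedValues,
                     boolean mutable, boolean required, String defaultValue) {
    this.name = name;
    this.allowedValues = allowedValues;
    this.mutable = mutable;
    this.required = required;
    this.defaultValue = defaultValue;
  }
}

/** Hypothetical validator that checks instance properties against descriptors. */
final class PropertyValidator {
  private PropertyValidator() { }

  /** Returns one error message per violated descriptor; an empty list means valid. */
  static List<String> validate(List<PropertyDescriptor> descriptors,
                               Map<String, String> properties) {
    List<String> errors = new ArrayList<>();
    for (PropertyDescriptor d : descriptors) {
      // Fall back to the declared default when the property is not given.
      String value = properties.containsKey(d.name) ? properties.get(d.name) : d.defaultValue;
      if (value == null) {
        if (d.required) {
          errors.add("Missing required property: " + d.name);
        }
        continue;
      }
      if (!d.allowedValues.isEmpty() && !d.allowedValues.contains(value)) {
        errors.add("Value '" + value + "' is not allowed for property: " + d.name);
      }
    }
    return errors;
  }
}
```

A UI or CLI could use the same descriptors to list supported properties and their defaults before a dataset instance is created.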

[EI] Explore Integration

...

  1. Whether explore is enabled is an explicit property
  2. All other explore properties derived from dataset properties if possible 

...

1. Replacing the Dataset Type Manager implementation based on the Artifact Repo

The first and necessary part of the work is to unify the current dataset module/type management with the existing artifact repository. This can be done in a way that does not make versioning explicit (since the current dataset framework has no versioning, we could switch over without introducing that). The current requirement is that all dataset code must be included in the program artifact. That is, dataset type code is not injected by the platform, but always loaded from the program class loader. We can mimic this by using a specific version string - say "embedded" - that means loading from the program class loader. The work to do that breaks down as follows:

  • Deploying a dataset type (or module) is implemented as deployment of an artifact with version "embedded"
    • what does this mean for configuration, recording of dependencies? 
  • Version "embedded" is treated like a snapshot version, that is, it can be redeployed any time. For now this is the only version we use. 
  • Creating a dataset instance tags that instance with version "embedded"
  • New implementation of dataset framework (for explore only) that loads the code from the artifact repo. 
  • In programs, since the only version is "embedded", dataset code is still loaded using the program class loader.
  • No introduction of new or versioned APIs

This implements user stories DTM 1-9, but no new user stories that were not implemented by the existing framework. Instead, it lays the foundation for those stories.
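
The class loader choice described in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions: DatasetClassLoaderSelector and its parameters are hypothetical names, not part of CDAP, and the branch for versions other than "embedded" anticipates the explicit versioning discussed in the next section.

```java
/** Hypothetical helper, not a CDAP API: picks the class loader for a dataset type. */
final class DatasetClassLoaderSelector {

  /** The special version string suggested in the text. */
  static final String EMBEDDED = "embedded";

  private DatasetClassLoaderSelector() { }

  /**
   * @param typeVersion         version the dataset instance is tagged with
   * @param inProgram           true inside a running program, false for explore
   * @param programClassLoader  class loader of the program artifact
   * @param artifactClassLoader class loader built from the artifact repository
   */
  static ClassLoader select(String typeVersion, boolean inProgram,
                            ClassLoader programClassLoader,
                            ClassLoader artifactClassLoader) {
    if (inProgram && EMBEDDED.equals(typeVersion)) {
      // Embedded types ship inside the program jar, so nothing is injected.
      return programClassLoader;
    }
    // Explore has no program class loader, so it loads from the artifact repo.
    // Any explicit (non-embedded) version would also be loaded from there.
    return artifactClassLoader;
  }
}
```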

2. Introducing explicit versioning for dataset types

Next we can implement user stories DTM 10-34 that require explicit dataset type versions, along with the ability to deploy a dataset outside of an app.

  • Maven archetype for dataset artifacts
  • Explicit versioning of dataset types. This will be the same as the artifact version.
    • Coexistence of multiple versions of the same dataset type
    • This comes (almost) for free after the migration to the artifact repository
  • Explicit dependency of a dataset instance on a specific version of its type
    • When creating a dataset, an explicit version of the type can be given
    • Otherwise the latest version will be used
    • The dataset metadata (spec) will contain this version
  • Injection of dataset code at runtime, from the artifact of that version.
  • No noticeable performance degradation (some degradation is expected due to code injection at first instantiation of a type).
  • Explicit upgrade of a dataset instance to a new version of its type.
  • For dataset types deployed as part of app deployment, we will keep using "embedded" as the version.
  • Hydrator Plugins can also contain dataset types. The dataset will be loaded from the plugin artifact at runtime.
  • New versioned REST and CLI methods for versioned type and instance management
  • At this point, we may remove the REST endpoints for deploying dataset modules (that can now be done through the artifact repo APIs) 

This still keeps the dataset APIs unchanged. 
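
The rule "use the explicit version if given, otherwise the latest" could be sketched like this. TypeVersionResolver is a hypothetical name, and the comparison assumes purely numeric, dot-separated versions; real artifact versions (for example with a snapshot suffix) would need a richer comparison.

```java
import java.util.List;

/** Hypothetical resolver, not a CDAP API: picks the type version for a new instance. */
final class TypeVersionResolver {
  private TypeVersionResolver() { }

  /**
   * @param deployedVersions versions of the dataset type present in the artifact repo
   * @param requestedVersion explicit version, or null to use the latest one
   */
  static String resolve(List<String> deployedVersions, String requestedVersion) {
    if (requestedVersion != null) {
      if (!deployedVersions.contains(requestedVersion)) {
        throw new IllegalArgumentException("Version not deployed: " + requestedVersion);
      }
      return requestedVersion;
    }
    return deployedVersions.stream()
        .max(TypeVersionResolver::compare)
        .orElseThrow(() -> new IllegalArgumentException("No version deployed"));
  }

  /** Compares numeric dotted versions segment by segment; missing segments count as 0. */
  static int compare(String a, String b) {
    String[] x = a.split("\\.");
    String[] y = b.split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }
}
```

The resolved version is what would be recorded in the dataset spec, so that the same code is injected at runtime every time.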

3. Introducing new Dataset APIs

At this point, we need to decide whether we want to keep the existing APIs and enhance them, or whether we want to come up with a new set of APIs. Some considerations:

  • Deploying a dataset module currently invokes Java code (the module's register() method). This is used to declare dependencies in a programmatic way. 
  • All other artifacts, however, declare their dependencies through a configuration file included in the artifact. 
  • For a true unification of the artifact management, we should probably change datasets to follow that pattern. 
  • Up for discussion.

New APIs to be added:

  • A new dataset admin method to upgrade to a new version of the type
    • This method can reject the upgrade
  • New dataset admin APIs for performing custom actions
  • Ability to take a dataset instance offline for a migration procedure

This addresses user stories DTM 35-40. 
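
As a starting point for that discussion, an upgrade method that can reject the upgrade might look like the sketch below. Everything here is hypothetical: the interface name, the exception, and in particular the policy of rejecting upgrades across major versions, which only serves as an example of a compatibility check.

```java
/** Hypothetical exception thrown when an admin rejects an upgrade. */
class IncompatibleUpgradeException extends Exception {
  IncompatibleUpgradeException(String message) {
    super(message);
  }
}

/** Hypothetical extension of a dataset admin with an upgrade step. */
interface UpgradableDatasetAdmin {
  /** Upgrades the instance, or throws to reject an incompatible upgrade. */
  void upgrade(String fromVersion, String toVersion) throws IncompatibleUpgradeException;
}

/** Example policy (an assumption): only upgrades within the same major version pass. */
final class SameMajorVersionAdmin implements UpgradableDatasetAdmin {
  private String currentVersion;

  SameMajorVersionAdmin(String initialVersion) {
    this.currentVersion = initialVersion;
  }

  String currentVersion() {
    return currentVersion;
  }

  @Override
  public void upgrade(String fromVersion, String toVersion) throws IncompatibleUpgradeException {
    if (!currentVersion.equals(fromVersion)) {
      throw new IncompatibleUpgradeException(
          "Instance is at version " + currentVersion + ", not " + fromVersion);
    }
    if (!major(fromVersion).equals(major(toVersion))) {
      throw new IncompatibleUpgradeException(
          "Cannot upgrade from " + fromVersion + " to " + toVersion);
    }
    // The version tag only changes after all checks have passed.
    currentVersion = toVersion;
  }

  private static String major(String version) {
    int dot = version.indexOf('.');
    return dot < 0 ? version : version.substring(0, dot);
  }
}
```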

[DIC] Dataset Instance Configuration

None of the following depends on the migration to the artifact repo. However, if we implement these in the current Dataset APIs, we may have to redo them if or after we switch to a new set of APIs (DTM 3).

  1. New dataset API to retrieve the properties accepted by a type
    1. what the accepted values are
    2. whether they are mutable
    3. whether they are required
    4. what the default value is
  2. Schema as a standardized system property
    1. Validation of schema
    2. Specify schema in Avro or SQL style
    3. All system datasets to use new schema property
  3. New API to update or remove a single property of a dataset
  4. Ability to "merge" dataset properties without changing existing ones, failing if an existing property would have to change
  5. Dataset Management Operations are atomic
    1. Always leave behind a consistent state
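
The "merge" behavior in item 4 can be illustrated with a small sketch. DatasetPropertyMerger is a hypothetical name, and treating any differing value as a conflict is a simplifying assumption; a real implementation would let the dataset type decide which changes are compatible.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not a CDAP API: merges requested properties into existing ones. */
final class DatasetPropertyMerger {
  private DatasetPropertyMerger() { }

  /**
   * Adds properties that are not set yet and keeps existing ones untouched.
   * Fails if a requested value differs from an existing value.
   */
  static Map<String, String> merge(Map<String, String> existing,
                                   Map<String, String> requested) {
    Map<String, String> result = new HashMap<>(existing);
    for (Map.Entry<String, String> entry : requested.entrySet()) {
      String current = existing.get(entry.getKey());
      if (current != null && !current.equals(entry.getValue())) {
        throw new IllegalArgumentException(
            "Property '" + entry.getKey() + "' is already set to '" + current + "'");
      }
      result.put(entry.getKey(), entry.getValue());
    }
    // The inputs are never modified, so a failed merge leaves no partial state.
    return result;
  }
}
```

With this, an app that only specifies a schema can be deployed against a table that already has a TTL, and the TTL stays as it is.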

[EI] Explore Integration

None of the following depends on the migration to the artifact repo. However, if we implement these in the current Dataset APIs, we may have to redo them if or after we switch to a new set of APIs (DTM 3).

  1. Simplification of explore configuration
    1. Whether explore is enabled is an explicit property
    2. All other explore properties derived from dataset properties if possible 
  2. Explore failure also fails the DTM operation that called it
  3. Ability to communicate warnings to the user for successful explore operations
  4. Enable/Disable explore as dataset management operations 
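
Item 2 above, together with the requirement that management operations are atomic, suggests a create operation that rolls back when explore fails. The following is a minimal in-memory sketch under that assumption; ExploreHook and AtomicDatasetCreator are hypothetical names, and a real implementation would have to undo storage and metadata changes rather than a set entry.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the call that enables explore for a dataset. */
interface ExploreHook {
  void enableExplore(String datasetName) throws Exception;
}

/** Hypothetical sketch: dataset creation that fails as a whole if explore fails. */
final class AtomicDatasetCreator {
  private final Set<String> datasets = new HashSet<>();
  private final ExploreHook explore;

  AtomicDatasetCreator(ExploreHook explore) {
    this.explore = explore;
  }

  boolean exists(String name) {
    return datasets.contains(name);
  }

  void create(String name) throws Exception {
    if (!datasets.add(name)) {
      throw new IllegalStateException("Dataset already exists: " + name);
    }
    try {
      explore.enableExplore(name);
    } catch (Exception e) {
      // Undo the creation so that no partial state is left behind.
      datasets.remove(name);
      throw new Exception("Creating dataset '" + name + "' failed in explore", e);
    }
  }
}
```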

Proposed Scope for 4.0

  1. Minimal work to remove artifact management from DatasetTypeManager
    1. Remove the (experimental) REST API to deploy a dataset module by itself
    2. For dataset types/modules deployed from an app, remove the generation of an artifact. Instead, record the app artifact that it was created from
    3. Same as the previous item, for dataset types included in plugins
    4. For apps, load dataset types from program class loader. For explore, load from the artifact recorded for the type
    5. May require some changes in artifact repository
  2. Simplify configuration of datasets
    1. Schema and format as system properties with validation
    2. TTL as a system property
  3. New API for a dataset type to declare what configuration it accepts (needed for Resource Center)
    1. Properties (instance configuration)
    2. Arguments (runtime configuration)
  4. Make dataset lifecycle methods (create, update, drop) consistent
    1. In case of failure, do not leave partial/inconsistent state behind
    2. Do not silently ignore explore failures: they must fail the entire operation
  5. Simplify configuration of explore properties CDAP-2790 
    1. Derive all explore properties from schema+format when possible.
    2. Allow configuring the detailed explore properties (as today) for power users.
  6. Improved control over transactions for programs CDAP-7319
    1. Configure transaction timeout as a runtime argument / preference at namespace, app, program level CDAP-6103
    2. Programmatic APIs for programs that allow executing a transaction with custom timeout CDAP-7193, CDAP-7320, CDAP-7322
    3. Add a way to access datasets (and call non-transactional methods) CDAP-7323
    4. Fix the transactional behavior of WorkerContext.execute() CDAP-6837