Datasets Revamp

Goals

Our current dataset system has both operational and functional shortcomings: performance is subpar, reliability is below expectations, and the APIs are complex for developers yet fail to define important properties such as schema or versioning. Our goal is to improve this on both the operational and the functional side, by

  • Defining better APIs
  • Addressing code quality in the current framework

"Better APIs" means that we will have to break compatibility with existing dataset APIs, and therefore we will have to support both the old and the new APIs for some time. Ideally we can introduce the new APIs as "beta" in the next release (3.5), so that we can deprecate the old ones in 4.0 and eventually remove them. We will invest minimal effort in stabilizing the old APIs and focus on framework correctness, reliability and performance for the new APIs. Here "minimal effort" still requires some fixing in the current framework.

New APIs will be introduced without disruption or deprecation of existing dataset types. We will follow an approach similar to what we have done with BatchReadable, RecordScannable, etc.: a dataset can expose a capability by implementing an API, and the system will use reflection to determine its capabilities. Alternatively, a dataset type can use optional annotations. This will allow existing dataset types to remain valid while at the same time introducing new capabilities.
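
As a rough illustration of this pattern, the sketch below defines a hypothetical capability interface and shows how the framework could probe a dataset instance for it. The names (Scannable, MyKeyValueDataset, CapabilityCheck) are illustrative assumptions, not existing CDAP APIs.

    import java.util.Collections;
    import java.util.Iterator;

    // Hypothetical capability interface: a dataset that implements it can be
    // scanned in batch, comparable in spirit to BatchReadable/RecordScannable.
    interface Scannable<T> {
      Iterator<T> scan(byte[] startRow, byte[] stopRow);
    }

    // An existing dataset type opts into the capability simply by implementing it.
    class MyKeyValueDataset implements Scannable<String> {
      @Override
      public Iterator<String> scan(byte[] startRow, byte[] stopRow) {
        return Collections.emptyIterator(); // real code would read the underlying table
      }
    }

    class CapabilityCheck {
      // The framework detects the capability at runtime instead of requiring it.
      static boolean isScannable(Object dataset) {
        return dataset instanceof Scannable;
      }
    }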

New API Objectives

  • Clean distinction between interfaces to plug in different implementations for core storage abstractions such as Table and File (Storage Provider Interface, SPI) and interfaces to define data access patterns on top of these abstractions (Composite Datasets). Even though we can still refer to both of these as "Datasets", they are two completely different concerns:
    • Storage Providers are configured at the platform level, as a system-wide capability. Applications can choose what implementations they require, but can't provide the code. Composite datasets, however, can be defined by applications and included in the app artifacts.
    • SPIs must implement core capabilities such as security, transactions, schema, indexing, etc., whereas composite datasets can rely on the existence of these capabilities.
    • Developing a storage provider is advanced expert use of CDAP and can expose significant complexity to the developer, whereas implementing a custom composite dataset should be as simple as possible. 
  • Platform support for core data concepts: In today's dataset framework, concepts such as schema and indexing are unknown to the framework, and implemented by every dataset type in its own way (for example, we have half a dozen different ways of expressing schema). Here is a list of concepts the APIs should be able to express:
    • Schema (and evolution)
    • Indexing
    • Querying
    • Snapshots
    • Scanning (batch read)
    • Upgrade
    • Transactions
    • Read vs. Write Access

Not every dataset type can implement every one of these capabilities, but the platform should define standard interfaces that datasets must implement if they have that capability. The dataset framework must have adequate ways to make use of these interfaces: for example, if a dataset implements Snapshots, then the framework should have a standard way to take a snapshot that is the same for all types with that capability.
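
For illustration, a standard snapshot capability could look roughly like the following; the interface name and methods are hypothetical and only sketch the idea of one uniform snapshot operation per capability.

    import java.io.IOException;

    // Hypothetical standard capability interface for snapshots. Any dataset type
    // that has this capability implements it, and the framework can then expose
    // one uniform "take snapshot" operation for all such types.
    interface Snapshottable {
      // Takes a consistent snapshot of the dataset and returns its identifier.
      String takeSnapshot() throws IOException;

      // Restores the dataset to the state captured by a previous snapshot.
      void restoreFromSnapshot(String snapshotId) throws IOException;
    }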

In addition to that, the framework needs to implement:

    • Versioning
    • Aliasing (multiple names for a type)
    • Upgrade and Migration
  • Non-Transaction-Centric datasets. In today's dataset framework, everything revolves around transactions, but certain types of datasets (e.g., file sets) and also certain programming paradigms (e.g., Spark) are not transactional in their nature. We need to be able to express that in a natural way. Also, we should separate the transactions performed by the system (MDS, Lineage, Audit, etc.) from the transactions performed by applications; today they share the same transaction space.
  • Ability to express new core storage abstractions such as Graph, Time Series, or Message Bus. These are data types that can be expressed on top of tables or files, but there are storage engines that are optimized for this kind of data and that expose specific primitive data operations. Note that "Text Search" is not a storage abstraction. It is more of a system service, similar to Explore.
  • Platform services available to Datasets. Certain capabilities are better implemented by the platform, with APIs available to datasets for leveraging them. For example, today we already have Explore as a platform service available to all datasets. However, it is not generic enough; for example, it does not have a clean way to manipulate the Hive tables from within a composite dataset. That is required, for example, when adding a partition to a partitioned dataset. Other platform services that will be useful (see the sketch after this list):
    • Explore
    • Text Indexing and Search
    • Recording metadata
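
As a sketch of what such a platform service could look like from inside a composite dataset, the example below shows a hypothetical Explore facade used when adding a partition. The names (ExploreFacade, PartitionedFileSetExample) are assumptions for illustration, not current CDAP APIs.

    // Hypothetical facade for the Explore platform service, made available to
    // composite datasets that need to keep their Hive table in sync.
    interface ExploreFacade {
      void addPartition(String datasetName, String partitionPath) throws Exception;
      void dropPartition(String datasetName, String partitionPath) throws Exception;
    }

    // A partitioned composite dataset would call the facade when it adds a partition.
    class PartitionedFileSetExample {
      private final ExploreFacade explore;

      PartitionedFileSetExample(ExploreFacade explore) {
        this.explore = explore;
      }

      void addPartition(String partitionPath) throws Exception {
        // ... first record the partition in the dataset's own metadata, then:
        explore.addPartition("myPartitionedDataset", partitionPath);
      }
    }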

Work Planned for 3.5

(bold = must have, italic = implied by must-have, regular = stretch)

  • Stabilization of existing dataset framework:
    • improve performance and throughput
    • better error handling - atomicity of dataset admin
    • reduce footprint on transaction system
  • Definition of new dataset APIs
    • Dataset capability interfaces: @Read, @Write, @ReadWrite (see the sketch after this list)
    • Dataset admin APIs: "Updatable"
      • create in configure(): if the dataset already exists, it is an update
      • update with compatibility check
      • distinguish update from upgrade
      • Implementation of the new APIs for existing system datasets:
        • Table, FileSet
    • Schema as a system property
  • Transactions:
    • customizable transaction timeout in programs
    • long transactions in programs
    • dataset access without transaction
    • read-only transactions
  • Major Bugs
    • fix in-memory table
    • remove buffering in MR/Spark
  • Management
    • dataset types should have aliases and be registered only once
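
To make the capability annotations listed above ("Dataset capability interfaces: @Read, @Write, @ReadWrite") more concrete, here is a minimal sketch of what method-level @Read and @Write annotations might look like on a dataset type. This is an assumption about the shape of the API; the actual 3.5 APIs may differ.

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // Hypothetical capability annotations; the platform could inspect them to
    // enforce read vs. write access and to pick read-only transactions.
    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Read {}

    @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD)
    @interface Write {}

    // A dataset type declares the access pattern of each operation.
    class AnnotatedKeyValueDataset {
      @Read
      public byte[] get(byte[] key) {
        return null; // real code would read from the underlying table
      }

      @Write
      public void put(byte[] key, byte[] value) {
        // real code would write to the underlying table
      }
    }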
