...
"Better APIs" means that we will have to break compatibility with existing dataset APIs, and therefore we will have to support both the old and the new APIs for some time. Ideally we can introduce the new APIs as "beta" in the next release (3.5), so that we can deprecate the old ones in 4.0 and eventually remove them. We will invest minimal effort in stabilizing the old APIs and focus on framework correctness, reliability and performance for the new APIs. Here "minimal effort" still requires some fixing in the current framework.
New APIs will be introduced without disrupting or deprecating existing dataset types. We will follow an approach similar to what we have done with BatchReadable, RecordScannable, etc.: a dataset can expose a capability by implementing an interface, and the system uses reflection to determine its capabilities. Alternatively, a dataset type can use optional annotations. This allows existing dataset types to remain valid while new capabilities are introduced.
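As a rough illustration, a capability could be detected either through an implemented interface or an optional annotation. The interface and annotation below are simplified stand-ins for illustration only, not the actual CDAP APIs:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Capability interface: implementing it signals that the dataset supports batch reads.
interface BatchReadable<K, V> {
  Iterable<K> getSplits();
}

// Optional annotation: an alternative way for a dataset type to declare a capability.
@Target(ElementType.TYPE)
@Retention(RetentionPolicy.RUNTIME)
@interface SupportsIndexing {}

@SupportsIndexing
class MyDataset implements BatchReadable<byte[], byte[]> {
  @Override
  public Iterable<byte[]> getSplits() {
    return java.util.Collections.emptyList();
  }
}

class CapabilityInspector {
  // The system discovers capabilities by reflection instead of explicit registration.
  static boolean canBatchRead(Object dataset) {
    return dataset instanceof BatchReadable;
  }

  static boolean supportsIndexing(Object dataset) {
    return dataset.getClass().isAnnotationPresent(SupportsIndexing.class);
  }

  public static void main(String[] args) {
    Object ds = new MyDataset();
    System.out.println(canBatchRead(ds));     // true
    System.out.println(supportsIndexing(ds)); // true
  }
}
```

Because both mechanisms are optional, a dataset type that implements neither remains a valid dataset; it simply exposes no additional capabilities.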
New API Objectives
- Clean distinction between interfaces to plug in different implementations for core storage abstractions such as Table and File (Storage Provider Interface, SPI) and interfaces to define data access patterns on top of these abstractions (Composite Datasets). Even though we can still refer to both of these as "Datasets", they are two completely different concerns:
- Storage Providers are configured at the platform level, as a system-wide capability. Applications can choose which implementations they require, but cannot provide the code. Composite datasets, however, can be defined by applications and included in the app artifacts (see the sketch below).
- SPIs must implement core capabilities such as security, transactions, schema, indexing, etc., whereas composite datasets can rely on the existence of these capabilities.
- Developing a storage provider is an advanced, expert-level use of CDAP and can expose significant complexity to the developer, whereas implementing a custom composite dataset should be as simple as possible.
- Platform support for core data concepts: in the current dataset framework, concepts such as schema and indexing are not known to the platform and are implemented by every dataset type in its own way (for example, we have half a dozen different ways of expressing schema). Here is a list of concepts the APIs should be able to express:
- Schema (and evolution)
- Indexing
- Querying
- Snapshots
- Scanning (batch read)
- Upgrade
- Transactions
- Read vs. Write Access
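The SPI/composite split described above might look like the following sketch. The Table methods and the CounterSet composite dataset are illustrative assumptions, not proposed signatures: the SPI is configured at the platform level and hides core capabilities such as transactions, while the composite dataset is plain application code that only composes the SPI.

```java
import java.nio.charset.StandardCharsets;

// SPI: configured at the platform level; the implementation behind it is
// responsible for core capabilities such as transactions and security.
interface Table {
  byte[] get(byte[] row, byte[] column);
  void put(byte[] row, byte[] column, byte[] value);
}

// Composite dataset: defined by an application and shipped in the app artifact.
// It only composes SPIs and can rely on their capabilities.
class CounterSet {
  private static final byte[] COUNT_COLUMN = { 'c' };
  private final Table table; // provided by the platform

  CounterSet(Table table) {
    this.table = table;
  }

  public void increment(String counterName) {
    byte[] row = counterName.getBytes(StandardCharsets.UTF_8);
    byte[] current = table.get(row, COUNT_COLUMN);
    long value = current == null ? 0L : Long.parseLong(new String(current, StandardCharsets.UTF_8));
    table.put(row, COUNT_COLUMN, Long.toString(value + 1).getBytes(StandardCharsets.UTF_8));
  }
}
```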
...
- Stabilization of existing dataset framework:
- improve performance and throughput
- better error handling
- atomicity of dataset admin operations
- reduce footprint on transaction system
- Definition of new dataset APIs
- SPIs - similar to plugins
- Dataset capability interfaces
- Composite Dataset definition: @Read, @Write, @ReadWrite (see the sketch below)
- Dataset admin APIs: "Updatable"
- Schema as a system property
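The following sketch shows how the access annotations and the "Updatable" admin hook from the bullets above might fit together. The names follow the bullets, but the exact signatures are assumptions:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Map;

@Target(ElementType.METHOD) @Retention(RetentionPolicy.RUNTIME) @interface Read {}
@Target(ElementType.METHOD) @Retention(RetentionPolicy.RUNTIME) @interface Write {}
@Target(ElementType.METHOD) @Retention(RetentionPolicy.RUNTIME) @interface ReadWrite {}

// Admin capability: a dataset type that knows how to adjust itself when its
// properties (for example, its schema) change.
interface Updatable {
  void update(Map<String, String> oldProperties);
}

class KeyValueDataset implements Updatable {
  @Read
  public byte[] get(byte[] key) {
    return null; // read path omitted in this sketch
  }

  @Write
  public void put(byte[] key, byte[] value) {
    // write path omitted in this sketch
  }

  @ReadWrite
  public byte[] swap(byte[] key, byte[] newValue) {
    return null; // reads the old value and writes the new one
  }

  @Override
  public void update(Map<String, String> oldProperties) {
    // reconcile existing data with the new dataset properties
  }
}
```

With method-level annotations like these, the platform can enforce read vs. write access (for example, rejecting writes from a read-only context) without the dataset author writing any enforcement code.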
- Implementation of the new APIs for existing system datasets:
- Table, FileSet as SPIs
- Schema, indexed tables as capabilities of the SPI
- Object-Mapped tables, ObjectStore as composite on top
- TimeSeries
- PartitionedFileSet
- Implementation of platform services
- Explore
- Search
- Meta Data
- Transactions (see the sketch below):
- long transactions in programs
- dataset access without transaction
- read-only transactions
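A hypothetical sketch of how a program might request the transaction modes listed above; the Transactional interface and its method names here are assumptions for illustration, not the CDAP API:

```java
interface DatasetContext {
  <T> T getDataset(String name);
}

interface TxRunnable {
  void run(DatasetContext context) throws Exception;
}

interface Transactional {
  // short transaction (today's default): bounded timeout, full conflict detection
  void execute(TxRunnable runnable);

  // long transaction: no timeout, intended for batch programs such as MapReduce or Spark
  void executeLongTransaction(TxRunnable runnable);

  // read-only transaction: a stable snapshot, but no writes to the transaction system
  void executeReadOnly(TxRunnable runnable);

  // no transaction: direct dataset access without snapshot isolation or conflict detection
  void executeNonTransactional(TxRunnable runnable);
}
```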
- Major Bugs
- fix in-memory table
- remove buffering in MR/Spark
- Management
- create in configure(): if the dataset already exists, this is an update (see the sketch below)
- update with compatibility check
- distinguish update from upgrade
- dataset types should have aliases and be registered only once
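The management behavior described above could follow a flow like this sketch, where DatasetSpec and the compatibility rule are illustrative placeholders: configure() creates the dataset if it does not exist, otherwise performs an update only after a compatibility check, and anything incompatible requires an explicit upgrade.

```java
import java.util.HashMap;
import java.util.Map;

class DatasetSpec {
  final String type;
  final Map<String, String> properties;

  DatasetSpec(String type, Map<String, String> properties) {
    this.type = type;
    this.properties = properties;
  }
}

class DatasetAdmin {
  private final Map<String, DatasetSpec> existing = new HashMap<>();

  // Called for every dataset declared in configure().
  void createOrUpdate(String name, DatasetSpec newSpec) {
    DatasetSpec oldSpec = existing.get(name);
    if (oldSpec == null) {
      existing.put(name, newSpec);   // plain create
    } else if (isCompatible(oldSpec, newSpec)) {
      existing.put(name, newSpec);   // in-place update; existing data remains readable
    } else {
      // incompatible change: requires an explicit upgrade, not a silent update
      throw new IllegalArgumentException("Incompatible change for dataset " + name);
    }
  }

  private boolean isCompatible(DatasetSpec oldSpec, DatasetSpec newSpec) {
    // placeholder rule: same type; real compatibility rules (e.g., schema evolution) TBD
    return oldSpec.type.equals(newSpec.type);
  }
}
```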