...
- Non-Transaction-Centric datasets. In today's dataset framework, everything revolves around transactions, but certain types of datasets (e.g., file sets) and certain programming paradigms (e.g., Spark) are not transactional in nature. We need to be able to express that in a natural way. Also, we should separate the transactions performed by the system (MDS, Lineage, Audit, etc.) from the transactions performed by applications; today they share the same transaction space.
- Ability to express new core storage abstractions such as Graph, Time Series, or Message Bus. These are data types that can be expressed on top of tables or files, but there are storage engines that are optimized for this kind of data and that expose specific primitive data operations. Note that "Text Search" is not a storage abstraction. It is more of a system service, similar to Explore.
- Platform services available to Datasets. Certain capabilities are better implemented by the platform, with APIs available to datasets for leveraging them. For example, today we already have Explore as a platform service available to all datasets. However, it is not generic enough: it does not have a clean way to manipulate the Hive tables from within a composite dataset. That is required, for example, when adding a partition to a partitioned dataset. Other platform services that will be useful:
- Explore
- Text Indexing and Search
- Recording metadata
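To make the Explore example concrete, here is a minimal sketch of what a platform service exposed to datasets could look like. All names in it (`ExploreService`, `RecordingExploreService`, `PartitionedDatasetSketch`) are hypothetical and are not part of the existing framework. The stand-in service only records the Hive-style `ALTER TABLE ... ADD PARTITION` statement it would issue, so the sketch runs without a Hive installation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical platform service handed to datasets; not an existing API. */
interface ExploreService {
  void addPartition(String datasetName, Map<String, String> partitionKey, String location);
}

/** In-memory stand-in that records the statements it would send to Hive. */
class RecordingExploreService implements ExploreService {
  final List<String> statements = new ArrayList<>();

  @Override
  public void addPartition(String datasetName, Map<String, String> partitionKey, String location) {
    // Sort the key fields so the generated statement is deterministic.
    StringBuilder spec = new StringBuilder();
    for (Map.Entry<String, String> e : new TreeMap<>(partitionKey).entrySet()) {
      if (spec.length() > 0) {
        spec.append(", ");
      }
      spec.append(e.getKey()).append("='").append(e.getValue()).append("'");
    }
    statements.add("ALTER TABLE " + datasetName + " ADD PARTITION (" + spec
        + ") LOCATION '" + location + "'");
  }
}

/** Hypothetical composite dataset that keeps its Hive table in sync via the service. */
class PartitionedDatasetSketch {
  private final String name;
  private final ExploreService explore;
  private final Map<String, String> partitions = new HashMap<>();

  PartitionedDatasetSketch(String name, ExploreService explore) {
    this.name = name;
    this.explore = explore;
  }

  void addPartition(Map<String, String> key, String location) {
    // Record the partition in the dataset, then ask the platform to update Hive.
    partitions.put(new TreeMap<>(key).toString(), location);
    explore.addPartition(name, key, location);
  }

  int partitionCount() {
    return partitions.size();
  }
}
```

The point of the sketch is the direction of the dependency: the dataset calls a narrow platform API rather than talking to Hive itself, which is what would let a composite dataset manage the table of an embedded dataset.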
Work Planned for 3.5
(bold = must have, italic = implied by must-have, regular = stretch)
- Stabilization of existing dataset framework:
- improve performance and throughput
- better error handling - atomicity of dataset admin
- reduce footprint on transaction system
- Definition of new dataset APIs
- Dataset capability interfaces: @Read, @Write, @ReadWrite
- Dataset admin APIs: "Updatable"
- creating a dataset in configure() when the dataset already exists is treated as an update
- update with compatibility check
- distinguish update from upgrade
- Implementation of new APIs for existing system datasets:
- Table, FileSet
- Schema as a system property
- Transactions:
- customizable transaction timeout in programs
- long transactions in programs
- dataset access without transaction
- read-only transactions
- Major Bugs
- fix in-memory table
- remove buffering in MR/Spark
- Management
- dataset types should have aliases, register only once
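As an illustration of the capability interfaces and the "Updatable" admin API listed above, the following sketch shows one way they might fit together. The annotation names come from the list; their retention, their targets, the `Updatable` signature, `IncompatibleUpdateException`, and the rule that the `schema` property must not change are all assumptions made for this example, not the planned design.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Capability markers; names are from the plan, everything else is assumed.
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Read {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Write {}
@Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface ReadWrite {}

/** Hypothetical exception for an update that fails the compatibility check. */
class IncompatibleUpdateException extends Exception {
  IncompatibleUpdateException(String message) {
    super(message);
  }
}

/** Hypothetical admin capability: the dataset can be reconfigured in place. */
interface Updatable {
  void update(Map<String, String> newProperties) throws IncompatibleUpdateException;
}

/** Toy key-value dataset that declares what each operation does to the data. */
class KeyValueSketch implements Updatable {
  static final String SCHEMA = "schema"; // assumed immutable property
  private final Map<String, String> data = new HashMap<>();
  private Map<String, String> properties;

  KeyValueSketch(Map<String, String> properties) {
    this.properties = new HashMap<>(properties);
  }

  @Read
  public String get(String key) {
    return data.get(key);
  }

  @Write
  public void put(String key, String value) {
    data.put(key, value);
  }

  @ReadWrite
  public boolean putIfAbsent(String key, String value) {
    return data.putIfAbsent(key, value) == null;
  }

  @Override
  public void update(Map<String, String> newProperties) throws IncompatibleUpdateException {
    // Compatibility check: reject the update if the schema would change.
    if (!Objects.equals(properties.get(SCHEMA), newProperties.get(SCHEMA))) {
      throw new IncompatibleUpdateException("schema must not change in an update");
    }
    properties = new HashMap<>(newProperties);
  }

  String property(String name) {
    return properties.get(name);
  }
}

/** Reads the declared capability of a method by name, as a platform might. */
final class Capabilities {
  static String accessOf(Class<?> type, String methodName) {
    for (Method m : type.getDeclaredMethods()) {
      if (!m.getName().equals(methodName)) {
        continue;
      }
      if (m.isAnnotationPresent(ReadWrite.class)) return "read-write";
      if (m.isAnnotationPresent(Read.class)) return "read";
      if (m.isAnnotationPresent(Write.class)) return "write";
    }
    return "none";
  }
}
```

With declarations like these, the platform could inspect a dataset method before calling it, for example to decide whether a read-only transaction is sufficient, and could reject an incompatible reconfiguration before any state changes.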
...