Author: Andreas Neumann
Created: November 2018
Last updated: January 2019
Status: Draft
Main Document: Decoupling CDAP System Storage from Hadoop
Tl;dr
This document details the design of the Metadata SPI for CDAP 6.x and offers a choice of implementation for the case when an Elasticsearch instance is available.
...
It is readily available on Kubernetes, and many OSS customers already run Elasticsearch.
Elasticsearch is scalable and operationally hardened by a large community.
It is the solution with the least anticipated development effort.
Design
Requirements
Ability to configure different storage providers for Metadata storage, retrieval, and search
Store and modify the metadata for an entity
Retrieve the metadata for an entity, by lookup on entity id
Find entities by searching their names, properties, descriptions and schema.
Secure search that returns only what the user has access to.
...
The Metadata SPI will also use this representation.
Metadata Representation
Metadata consists of tags and properties associated with an entity. Each property or tag has a scope: SYSTEM for system-generated metadata (for the most part, technical metadata), and USER for user-defined metadata (a.k.a. business metadata). In CDAP 5.x, the two metadata scopes were implemented as two separate tables, and the API reflected that design by using a metadata record that could hold the tags and properties of only one scope. It would be better to hide that implementation detail and use a more generic metadata representation, as follows:
...
Why not throw a NotFoundException? That would mean that we can distinguish between an existing entity that has no metadata, and an entity that does not exist. However, the metadata store does not know what entities exist. It cannot rely on the clients (the CDAP system and users) to create every entity explicitly before adding metadata. Therefore this API is agnostic to the existence of entities: it only manages the metadata.
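To illustrate the existence-agnostic behavior, here is a minimal in-memory sketch. It assumes that a read for an entity with no stored metadata yields an empty metadata object rather than an exception. The names MetadataSketch and InMemoryMetadataStore, and the use of a plain string as the entity id, are hypothetical and are not part of the Metadata SPI. Scopes are left out for brevity.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified metadata holder (illustration only, scopes omitted).
final class MetadataSketch {
  static final MetadataSketch EMPTY =
    new MetadataSketch(Collections.<String>emptySet(), Collections.<String, String>emptyMap());

  private final Set<String> tags;
  private final Map<String, String> properties;

  MetadataSketch(Set<String> tags, Map<String, String> properties) {
    // Defensive copies so that stored metadata cannot be mutated by callers.
    this.tags = Collections.unmodifiableSet(new HashSet<>(tags));
    this.properties = Collections.unmodifiableMap(new HashMap<>(properties));
  }

  Set<String> getTags() {
    return tags;
  }

  Map<String, String> getProperties() {
    return properties;
  }

  boolean isEmpty() {
    return tags.isEmpty() && properties.isEmpty();
  }
}

// Hypothetical store: it tracks metadata only and never checks whether an entity exists.
final class InMemoryMetadataStore {
  private final Map<String, MetadataSketch> byEntity = new HashMap<>();

  // Writing metadata does not require the entity to have been created first.
  void write(String entityId, MetadataSketch metadata) {
    byEntity.put(entityId, metadata);
  }

  // An unknown entity and an entity without metadata look the same: both yield EMPTY.
  MetadataSketch read(String entityId) {
    return byEntity.getOrDefault(entityId, MetadataSketch.EMPTY);
  }
}
```

With this shape, callers never need to handle a not-found case when reading metadata. The trade-off is that they cannot use the metadata store to find out whether an entity exists.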
Entity creation: When an entity is created, or replaced with a new version, the system generates technical metadata for this entity. Examples include:
The type and schema of a dataset
The tag "batch" for a MapReduce program
The creation time of the entity
Entity update: When an entity is created, this metadata is written for the first time. When the entity is updated, the system metadata is replaced. However, some of the original metadata must be preserved. In CDAP 5.x, the preserved metadata is:
...