Author: Andreas Neumann
Created: November 2018
Last updated: January 2019
Status: Draft
Main Document: Decoupling CDAP System Storage from Hadoop

Tl;dr

This document details the design of the Metadata SPI for CDAP 6.x, which allows different implementations of the metadata backend, and describes an implementation for the case when an Elasticsearch instance is available.

Motivation

The current CDAP (5.x) metadata store is built using CDAP tables and Tephra transactions. This has the following drawbacks:

...

  1. Allow using a full-blown search engine for indexing and searching the metadata

  2. Improve the metadata representation to be more generic, capable of expressing unknown/foreign entities.

Choice of Search Engine

To use a search engine in CDAP, there are several options:

...

The following sections will discuss these options.

Develop a System Service

Developing a CDAP system service that embeds a search library and maintains its indexes in persistent storage has the advantage that it would work almost unmodified in Hadoop and in any non-Hadoop setup. Similar to other system services, the CDAP master will configure and manage this service, and CDAP will have full control over all its characteristics, similar to the existing Metadata Service in CDAP 5.x.

...

While this gives complete control over what the search engine does, the development effort is high, and much of that effort would go into features such as scale-out and availability that projects such as Solr and Elasticsearch have already implemented. There is little reason to reinvent the wheel.

Deploy a stand-alone search service

Assuming that there is an external search service that can store, index, and query CDAP’s metadata, CDAP can communicate with that search service via RPC. The search service can be managed by the customer outside of CDAP, or, when running in Kubernetes, the CDAP master can deploy it in Kubernetes. In both cases, CDAP will treat this service as an external service.

...

The advantage of this approach is that it can work well in Hadoop and non-Hadoop environments. On-prem users of Open Source CDAP can stand up an Elasticsearch instance or reuse an existing one, and connect CDAP to it. For Cloud or Kubernetes setups, the CDAP Master, or the agent that installs CDAP, can deploy an Elasticsearch instance for CDAP to use.

Elasticsearch for CDAP Metadata

As detailed above, Elasticsearch is the most suitable (and only viable) implementation for a CDAP metadata store.

  • It is readily available on Kubernetes and many OSS customers already run Elasticsearch

  • Elasticsearch is scalable and operationally hardened by a large community

  • It is the solution with the least anticipated development effort.  

Design

Requirements

  1. Ability to configure different storage providers for Metadata storage, retrieval, and search

  2. Store and modify the metadata for an entity

  3. Retrieve the metadata for an entity, by lookup on entity id

  4. Find entities by searching their names, properties, descriptions and schema.

  5. Secure search that returns only what the user has access to.
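
To make requirements 1 through 4 concrete, the following is a minimal sketch of what a pluggable storage interface could look like, with a toy in-memory provider. The names MetadataStorageSketch and InMemoryMetadataStorage, the string entity ids, and the substring-based search are assumptions made for illustration only; they are not the SPI defined by this design, and secure search (requirement 5) is not covered.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical provider interface: store, look up by id, and search. */
interface MetadataStorageSketch {
  void put(String entityId, Map<String, String> properties);
  Map<String, String> get(String entityId);
  List<String> search(String term);
}

/** Toy provider that keeps everything in memory; a real one would call a search engine. */
final class InMemoryMetadataStorage implements MetadataStorageSketch {
  private final Map<String, Map<String, String>> store = new LinkedHashMap<>();

  @Override
  public void put(String entityId, Map<String, String> properties) {
    // Copy defensively so later changes by the caller do not leak into the store.
    store.put(entityId, new LinkedHashMap<>(properties));
  }

  @Override
  public Map<String, String> get(String entityId) {
    Map<String, String> found = store.get(entityId);
    if (found == null) {
      return Collections.emptyMap();
    }
    return Collections.unmodifiableMap(found);
  }

  @Override
  public List<String> search(String term) {
    // Case-insensitive substring match over property values, in insertion order.
    String needle = term.toLowerCase(Locale.ROOT);
    List<String> hits = new ArrayList<>();
    for (Map.Entry<String, Map<String, String>> entry : store.entrySet()) {
      boolean match = entry.getValue().values().stream()
          .anyMatch(v -> v.toLowerCase(Locale.ROOT).contains(needle));
      if (match) {
        hits.add(entry.getKey());
      }
    }
    return hits;
  }
}
```

Because callers depend only on the interface, a different provider (for example, one backed by Elasticsearch) could be selected by configuration without changing the calling code.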

Not in Scope

  1. Store and retrieve the history of an entity’s metadata

  2. Structured search across entities, especially parent/child and siblings. For example, search on the description of a dataset AND a property of one of its fields.

API

Entity Identification

In CDAP 5.x, an entity is represented as a MetadataEntity object, which uses an ordered sequence of key-value pairs, for example:

...

The Metadata SPI will also use this representation.  
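
As an illustration of why the order of the pairs matters, here is a minimal sketch of an id built from an ordered sequence of key-value pairs. EntityIdSketch is a hypothetical stand-in, not the actual MetadataEntity class, and the keys "namespace" and "dataset" are illustrative values chosen for this example.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical immutable id: an ordered sequence of key-value pairs. */
final class EntityIdSketch {
  private final List<Map.Entry<String, String>> parts;

  private EntityIdSketch(List<Map.Entry<String, String>> parts) {
    this.parts = Collections.unmodifiableList(parts);
  }

  static EntityIdSketch empty() {
    return new EntityIdSketch(new ArrayList<>());
  }

  /** Returns a new id with the pair appended at the end; the receiver is unchanged. */
  EntityIdSketch append(String key, String value) {
    List<Map.Entry<String, String>> copy = new ArrayList<>(parts);
    copy.add(new SimpleImmutableEntry<>(key, value));
    return new EntityIdSketch(copy);
  }

  /** List equality is order-sensitive, so two ids with the same pairs in a different order differ. */
  @Override
  public boolean equals(Object o) {
    return o instanceof EntityIdSketch && parts.equals(((EntityIdSketch) o).parts);
  }

  @Override
  public int hashCode() {
    return parts.hashCode();
  }

  @Override
  public String toString() {
    return parts.stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(","));
  }
}
```

For instance, EntityIdSketch.empty().append("namespace", "default").append("dataset", "purchases") renders as namespace=default,dataset=purchases and is not equal to an id that appends the same two pairs in the opposite order.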

Metadata Representation

Metadata consists of tags and properties associated with an entity. Each property or tag has a scope: SYSTEM for system-generated metadata (for the most part, technical metadata), and USER for user-defined metadata (a.k.a. business metadata). In CDAP 5.x, the two metadata scopes were implemented as two separate tables, and the API reflected that design by using a metadata record that can hold the tags and properties of only one scope. It would be better to hide that implementation detail and use a more generic metadata representation, as follows:

...

  • Metadata from both scopes can be represented in a single object

  • Both tags and properties can be extended in the future with optional fields (e.g., creation date or modification-date)

  • If needed, more scopes can be added by extending the Scope enum, without further code changes
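
The properties listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the representation defined by this design: apart from the Scope enum and its SYSTEM and USER values, the names ScopedName and MetadataSketch and all method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Adding a scope only requires a new constant here. */
enum Scope { SYSTEM, USER }

/** Hypothetical name qualified by its scope; optional fields could be added later. */
final class ScopedName {
  final Scope scope;
  final String name;

  ScopedName(Scope scope, String name) {
    this.scope = Objects.requireNonNull(scope);
    this.name = Objects.requireNonNull(name);
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ScopedName)) {
      return false;
    }
    ScopedName other = (ScopedName) o;
    return scope == other.scope && name.equals(other.name);
  }

  @Override
  public int hashCode() {
    return Objects.hash(scope, name);
  }
}

/** Hypothetical single object holding tags and properties from all scopes. */
final class MetadataSketch {
  private final Set<ScopedName> tags = new HashSet<>();
  private final Map<ScopedName, String> properties = new HashMap<>();

  MetadataSketch addTag(Scope scope, String tag) {
    tags.add(new ScopedName(scope, tag));
    return this;
  }

  MetadataSketch setProperty(Scope scope, String key, String value) {
    properties.put(new ScopedName(scope, key), value);
    return this;
  }

  /** Tags that belong to the given scope. */
  Set<String> tags(Scope scope) {
    Set<String> result = new HashSet<>();
    for (ScopedName tag : tags) {
      if (tag.scope == scope) {
        result.add(tag.name);
      }
    }
    return result;
  }

  /** Properties that belong to the given scope, keyed by unqualified name. */
  Map<String, String> properties(Scope scope) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<ScopedName, String> e : properties.entrySet()) {
      if (e.getKey().scope == scope) {
        result.put(e.getKey().name, e.getValue());
      }
    }
    return result;
  }
}
```

Because the scope is part of the key, the same tag or property name can exist independently in the SYSTEM and USER scopes of one object, and the per-scope accessors never need to know how many scopes exist.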

Manipulating Metadata

The scenarios in which metadata is created, changed or deleted are:

...

  • Search. The current search API in CDAP 5.x is sufficient. However, the results will be returned in the form of Metadata objects. Details TBD.

Security

TBD