Overview

This page covers the requirements, design, and implementation of the metadata and data discovery features in CDAP 3.3.

High Level Requirements

  1. Metadata search
  2. Schema as metadata
  3. System metadata
  4. CLI, Test Framework Support for metadata
  5. UI for Metadata Search
  6. UI for Lineage
  7. UI for Adding/Updating metadata properties/tags
  8. Lineage based on Type of Dataset Access
  9. Monitoring/Logs for Metadata Service

Scope

  1. Schema as metadata
  2. System metadata
  3. Metadata CLI
  4. Test Framework support for Metadata
  5. UI... (needs to be finalized)

User Stories

Id | Description | Comments
U1 | As a user, I should be able to search Datasets containing the specified fields | List the kinds of queries that will be supported
U2 | As a CDAP system, I should be able to annotate CDAP entities with system metadata automatically | List all the system tags that should be annotated: kind of entity (dataset, app, program, program type, stream)?; artifact name
U3 | As a user, I should be able to access and update CDAP metadata using the CDAP CLI |
U4 | As a developer, I should be able to access and update CDAP metadata using the CDAP Test Framework |
U5 | As a user, I should be able to search CDAP entities based on metadata using the CDAP UI |
U6 | As a user, I should be able to view the lineage of a CDAP dataset/stream in a specified time window using the CDAP UI |

 

System Metadata

Kinds of system metadata:

Applications

  • Artifact name

Programs

  • Type of program

Datasets

  • Type of dataset
  • Schema
  • RecordScannable/BatchWritable/RecordWritable/BatchReadable
  • Other properties

Streams

  • Format

Views

  • Format

Design Considerations

Storage

System Metadata will be stored in a separate dataset for the following reasons:

  1. Only the CDAP system can update System Metadata.  
  2. System Metadata may have different authorization and retention policies than Business Metadata.
  3. System Metadata can be updated only at specific times, whereas users can update Business Metadata at any time.

As a result, the metadata system will have to manage two different datasets. The storage format of both datasets (keys and values) will be identical; they will simply be written to separate tables.

A higher-level construct (TBD, possibly the BusinessMetadataStore or MetadataAdmin) will have to be extended to interact with the two separate datasets.
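As a rough illustration of the storage design above, the following sketch routes reads and writes to one of two identically shaped tables based on a scope. Every name in it (MetadataScope, ScopedMetadataStore, the entity id strings) is hypothetical and not an existing CDAP API, and in-memory maps stand in for the two datasets.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; none of these are CDAP APIs.
enum MetadataScope { SYSTEM, BUSINESS }

class ScopedMetadataStore {
    // scope -> entity id -> properties. Each inner map stands in for one dataset (table).
    // Both tables use the same key/value layout; only the target table differs.
    private final Map<MetadataScope, Map<String, Map<String, String>>> tables =
        new EnumMap<>(MetadataScope.class);

    ScopedMetadataStore() {
        for (MetadataScope scope : MetadataScope.values()) {
            tables.put(scope, new HashMap<>());
        }
    }

    // Writes a property to the table that belongs to the given scope.
    void setProperty(MetadataScope scope, String entityId, String key, String value) {
        tables.get(scope).computeIfAbsent(entityId, id -> new HashMap<>()).put(key, value);
    }

    // Reads the properties of an entity from the table that belongs to the given scope.
    Map<String, String> getProperties(MetadataScope scope, String entityId) {
        Map<String, String> props = tables.get(scope).get(entityId);
        return props == null
            ? Collections.<String, String>emptyMap()
            : Collections.unmodifiableMap(props);
    }
}
```

In this sketch, a write to the SYSTEM scope never touches the BUSINESS table, which is what would allow the two datasets to carry separate authorization and retention policies.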

History

We will re-use the same pattern that the Business Metadata Dataset uses to store history.

Runtime

For interacting with the System Metadata Dataset, we will introduce a SystemMetadataUpdater interface, which will be injected at the various stages outlined below to add, update, or delete system metadata.

System Metadata will be added when:

  1. An app is deployed - We will add a SystemMetadataUpdater stage in the deployment pipeline that will update system metadata for the app, as well as all the programs in the app.
  2. A new dataset instance is created - The LineageWriterDatasetFramework will be passed a SystemMetadataUpdater, to add system metadata in the addDatasetInstance call.
  3. A new stream is created - The StreamAdmin will be passed a SystemMetadataUpdater as well, to add system metadata in the create API.

System Metadata will be updated when:

  1. A dataset instance's properties are updated - The LineageWriterDatasetFramework's updateInstance method will use the SystemMetadataUpdater to update the passed properties
  2. A stream's config is updated - The StreamAdmin's updateConfig method will use the SystemMetadataUpdater to update the stream's system metadata

System Metadata will be deleted when:

  1. An app is deleted - The ApplicationLifecycleService will use the SystemMetadataUpdater to delete system metadata for the application
  2. A program is removed from an existing app - The DeletedProgramsHandlerStage will use the SystemMetadataUpdater to delete system metadata for the programs
  3. A dataset instance is deleted - The LineageWriterDatasetFramework's deleteInstance method will use the SystemMetadataUpdater to delete system metadata for the dataset instance
  4. A stream is deleted - The StreamAdmin's drop method will use the SystemMetadataUpdater to delete system metadata for the stream 
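The add/update/delete flow above can be sketched for the stream case as follows. This is a minimal sketch under stated assumptions: the method names on SystemMetadataUpdater, the in-memory implementation, the "stream:" entity id prefix, and the simplified StreamAdminSketch class are all hypothetical, since the page only names the interface and the call sites.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the interface; the method names are assumptions.
interface SystemMetadataUpdater {
    void add(String entityId, Map<String, String> properties);
    void update(String entityId, Map<String, String> properties);
    void delete(String entityId);
}

// In-memory stand-in for the System Metadata Dataset.
class InMemorySystemMetadataUpdater implements SystemMetadataUpdater {
    private final Map<String, Map<String, String>> store = new HashMap<>();

    @Override
    public void add(String entityId, Map<String, String> properties) {
        store.put(entityId, new HashMap<>(properties));
    }

    @Override
    public void update(String entityId, Map<String, String> properties) {
        // Merges the new values into whatever is already stored for the entity.
        store.computeIfAbsent(entityId, id -> new HashMap<>()).putAll(properties);
    }

    @Override
    public void delete(String entityId) {
        store.remove(entityId);
    }

    Map<String, String> get(String entityId) {
        Map<String, String> props = store.get(entityId);
        return props == null
            ? Collections.<String, String>emptyMap()
            : Collections.unmodifiableMap(props);
    }
}

// Simplified stand-in for a stream admin that has been passed an updater.
class StreamAdminSketch {
    private final SystemMetadataUpdater updater;

    StreamAdminSketch(SystemMetadataUpdater updater) {
        this.updater = updater;
    }

    void create(String stream, String format) {
        updater.add("stream:" + stream, Collections.singletonMap("format", format));
    }

    void updateConfig(String stream, String format) {
        updater.update("stream:" + stream, Collections.singletonMap("format", format));
    }

    void drop(String stream) {
        updater.delete("stream:" + stream);
    }
}
```

The same injection pattern would apply to the deployment pipeline stage and to the dataset framework; only the entity type and the properties written would differ.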

System Metadata Updates

Only the CDAP system can update system metadata for entities. This capability will not be exposed to users. Given this design choice, however, users will need a way to discover all the system tags/properties in CDAP. Initially, this can be exposed via a simple API that lists all tags/properties. It can later be extended with full-text search once CDAP has a more comprehensive search capability that goes beyond IndexedTables and prefix lookups.
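A minimal sketch of what such a listing could compute, assuming system metadata is available as a map from entity id to its properties. The class and method names are hypothetical, and the real API shape is not defined on this page.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; not a CDAP API.
class SystemMetadataCatalogSketch {
    // Returns the distinct property keys used across all entities, in sorted order.
    static SortedSet<String> listPropertyKeys(Map<String, Map<String, String>> systemMetadata) {
        SortedSet<String> keys = new TreeSet<>();
        for (Map<String, String> properties : systemMetadata.values()) {
            keys.addAll(properties.keySet());
        }
        return keys;
    }
}
```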

REST APIs

The add/update/delete APIs for system metadata will be neither documented nor accessible from the Router. Internally, the SystemMetadataUpdater will preferably interact directly with the transactional store for system metadata.

If REST APIs are absolutely necessary (TBD):

  • The REST APIs for adding/updating/deleting system metadata will not be documented, and will not be exposed via the Router
  • The SystemMetadataUpdater will use service discovery to discover the Metadata Service and make REST calls.

Schema as Metadata

Schema as metadata is meant to add the capability in CDAP for users to be able to retrieve datasets/streams with a field X optionally of type Y.

To store schema as system metadata, we will use the existing metadata properties mechanism. One option is to store every field in the schema as a metadata property:

Key: 

field^A<fieldName>

Value:

<fieldType>

Note: We may have to reverse this, depending on the indexing mechanisms available in the System Metadata Dataset. If it supports key:value and value-only searches, we may have to swap the key and value above so that two types of searches can be supported:

  1. All Datasets with the field field1
  2. All Datasets with the field field1 of type int
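The encoding and the two searches above can be sketched as follows, using the non-reversed layout (field name in the key, field type in the value). The class and method names are hypothetical, the index is an in-memory map rather than the System Metadata Dataset, and the sketch assumes that "^A" in the key format denotes the Ctrl-A control character (code point 1).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; not a CDAP API.
class SchemaMetadataSketch {
    // Assumption: "^A" in the key format is the Ctrl-A control character (code point 1).
    static final char SEPARATOR = '\u0001';
    static final String FIELD_PREFIX = "field";

    // Builds the property key for a schema field: the prefix, the separator, then the field name.
    static String keyFor(String fieldName) {
        return FIELD_PREFIX + SEPARATOR + fieldName;
    }

    // Converts a schema, given as field name -> field type, into metadata properties.
    static Map<String, String> toProperties(Map<String, String> fieldTypes) {
        Map<String, String> properties = new LinkedHashMap<>();
        for (Map.Entry<String, String> field : fieldTypes.entrySet()) {
            properties.put(keyFor(field.getKey()), field.getValue());
        }
        return properties;
    }

    // Returns the ids of entities that have the field; if fieldType is non-null,
    // the field must also have that type.
    static SortedSet<String> search(Map<String, Map<String, String>> index,
                                    String fieldName, String fieldType) {
        String key = keyFor(fieldName);
        SortedSet<String> matches = new TreeSet<>();
        for (Map.Entry<String, Map<String, String>> entity : index.entrySet()) {
            String actualType = entity.getValue().get(key);
            if (actualType != null && (fieldType == null || fieldType.equals(actualType))) {
                matches.add(entity.getKey());
            }
        }
        return matches;
    }
}
```

With this layout, the first search is a key lookup and the second is a key:value match. If the dataset can only index values, the key and value would be swapped as described in the note.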

Views

Up until 3.2, users could not associate metadata with stream views. We will need to add this capability in 3.3. However, there will be no parent-child relationship between a view and its stream as far as metadata is concerned. A view will be a separate entity from its stream and will show up separately in search results. Metadata of a stream will not automatically be available as metadata of a view.

Implementation

REST APIs

Questions

 

 

 

 
