Overview

This page covers the requirements, design, and implementation of the metadata and data discovery features in CDAP 3.3.

High Level Requirements

Metadata search
Schema as metadata
System metadata
CLI, Test Framework Support for metadata
UI for Metadata Search
UI for Lineage
UI for Adding/Updating metadata properties/tags
Lineage based on Type of Dataset Access
Monitoring/Logs for Metadata Service

Scope

Schema as metadata
System metadata
Metadata CLI
Test Framework support for Metadata
UI... (needs to be finalized)

User Stories

Id	Description	Comments
U1	As a user, I should be able to search Datasets containing the specified fields	List the kinds of queries that will be supported
U2	As the CDAP system, I should be able to annotate CDAP entities with system metadata automatically	List all the system tags that should be annotated. Kind of entity (dataset, app, program, program type, stream)? Artifact name?
U3	As a user, I should be able to access and update CDAP metadata using the CDAP CLI
U4	As a developer, I should be able to access and update CDAP metadata using the CDAP Test Framework
U5	As a user, I should be able to search CDAP entities based on metadata using the CDAP UI
U6	As a user, I should be able to view the lineage of a CDAP dataset/stream in a specified time window using the CDAP UI

System Metadata

Kinds of system metadata:

Applications

Artifact name

Programs

Type of program

Datasets

Type of dataset
Creation time - property
Last update time? - property
RecordScannable/BatchWritable/RecordWritable/BatchReadable
Other properties

Streams

Format
View

Schema as Metadata

Schema as metadata lets CDAP users retrieve the datasets/streams whose schema contains a field X, optionally restricted to fields of type Y.
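
The lookup described above can be illustrated with a minimal in-memory sketch. It is not the CDAP implementation: the class name SchemaFieldIndex, the "field" and "field:type" key layout, and the case-insensitive matching are all assumptions made for illustration, and a schema is simplified to a flat map of field name to type name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory index from schema fields to the entities that contain them. */
class SchemaFieldIndex {
  // Key is either "field" or "field:type" (assumed layout); value is the set of entity names.
  private final Map<String, Set<String>> index = new HashMap<>();

  /** Indexes every field of an entity's schema, given as field name -> type name. */
  void addSchema(String entity, Map<String, String> fields) {
    for (Map.Entry<String, String> field : fields.entrySet()) {
      String name = normalize(field.getKey());
      String type = normalize(field.getValue());
      // One entry for "has field X", one for "has field X of type Y".
      index.computeIfAbsent(name, k -> new TreeSet<>()).add(entity);
      index.computeIfAbsent(name + ":" + type, k -> new TreeSet<>()).add(entity);
    }
  }

  /** Returns the entities that have the given field; a null type matches any type. */
  Set<String> search(String fieldName, String type) {
    String key = normalize(fieldName);
    if (type != null) {
      key = key + ":" + normalize(type);
    }
    Set<String> matches = index.getOrDefault(key, Collections.<String>emptySet());
    return Collections.unmodifiableSet(matches);
  }

  // Assumption: field and type names are matched case-insensitively.
  private static String normalize(String s) {
    return s.trim().toLowerCase(Locale.ROOT);
  }
}
```

With this layout, search("price", null) answers "which entities have a field named price", while search("price", "double") narrows the result to fields of that type. Field names that themselves contain a colon would need escaping, which the sketch does not handle.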

Design Considerations

Storage

There is a case for storing System Metadata in a separate dataset for the following reasons:

Only the CDAP system can update System Metadata.
System Metadata may have different authorization and retention policies from Business Metadata.
System Metadata can be updated at specific times only, whereas users can update Business Metadata at any time.

However, if System Metadata is stored in a separate dataset, the metadata system will have to manage two different datasets. APIs may need filters, etc. (TODO: details)
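
As a rough sketch of what managing two datasets behind one API could look like, the example below keeps one in-memory map per scope and takes an optional scope filter on reads. The MetadataScope enum, the ScopedMetadataStore class, and the callerIsSystem flag are hypothetical names introduced here for illustration. They are not part of the design on this page.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical scopes, one per underlying dataset. */
enum MetadataScope { SYSTEM, BUSINESS }

/** Hypothetical facade over two separate metadata datasets (modelled here as in-memory maps). */
class ScopedMetadataStore {
  // One independent "dataset" per scope, each mapping entity id -> tags.
  private final Map<MetadataScope, Map<String, Set<String>>> datasets =
      new EnumMap<>(MetadataScope.class);

  ScopedMetadataStore() {
    for (MetadataScope scope : MetadataScope.values()) {
      datasets.put(scope, new HashMap<>());
    }
  }

  /** Adds a tag; writes to the SYSTEM scope are rejected unless the caller is the system. */
  void addTag(MetadataScope scope, String entityId, String tag, boolean callerIsSystem) {
    if (scope == MetadataScope.SYSTEM && !callerIsSystem) {
      throw new SecurityException("Only the CDAP system can update system metadata");
    }
    datasets.get(scope).computeIfAbsent(entityId, k -> new TreeSet<>()).add(tag);
  }

  /** Returns the tags of an entity; a null scope means no filter, so both datasets are merged. */
  Set<String> getTags(String entityId, MetadataScope scope) {
    Set<String> result = new TreeSet<>();
    for (MetadataScope s : MetadataScope.values()) {
      if (scope == null || scope == s) {
        result.addAll(datasets.get(s).getOrDefault(entityId, Collections.<String>emptySet()));
      }
    }
    return result;
  }
}
```

The filter parameter is one possible shape for the "APIs may need filters" point: callers that do not care about the split pass no scope and see the merged view.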

Storing History - same pattern as Business Metadata

Runtime

System Metadata will be added/updated when:

An app is deployed - We will add a SystemMetadataUpdater stage in the deployment pipeline that will update system metadata for the app, as well as all the programs in the app.
A new dataset instance is created - The LineageWriterDatasetFramework can be extended to update system metadata when a dataset is added.
A new stream is created -

The corresponding deletes for all of the above must be handled as well.
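
A minimal sketch of the app deployment case follows, including the matching delete. Only the name SystemMetadataUpdater comes from this page. The store class, the method names, the entity id format ("app:..." and "program:..."), and the property keys "artifact.name" and "program.type" are assumptions for illustration, and the real deployment pipeline stage interface is not modelled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for the system metadata dataset. */
class SystemMetadataStore {
  // Entity id -> (property key -> property value).
  private final Map<String, Map<String, String>> properties = new HashMap<>();

  void setProperty(String entityId, String key, String value) {
    properties.computeIfAbsent(entityId, k -> new TreeMap<>()).put(key, value);
  }

  void removeAll(String entityId) {
    properties.remove(entityId);
  }

  Map<String, String> get(String entityId) {
    return properties.getOrDefault(entityId, Collections.<String, String>emptyMap());
  }
}

/** Sketch of a SystemMetadataUpdater stage; the method signatures are assumptions. */
class SystemMetadataUpdater {
  private final SystemMetadataStore store;

  SystemMetadataUpdater(SystemMetadataStore store) {
    this.store = store;
  }

  /** Records the artifact name for the app and the program type for each of its programs. */
  void onAppDeployed(String appName, String artifactName, Map<String, String> programTypes) {
    store.setProperty("app:" + appName, "artifact.name", artifactName);
    for (Map.Entry<String, String> program : programTypes.entrySet()) {
      String programId = "program:" + appName + "." + program.getKey();
      store.setProperty(programId, "program.type", program.getValue());
    }
  }

  /** Removes the system metadata of the app and of the listed programs. */
  void onAppDeleted(String appName, Iterable<String> programNames) {
    store.removeAll("app:" + appName);
    for (String programName : programNames) {
      store.removeAll("program:" + appName + "." + programName);
    }
  }
}
```

Dataset and stream creation would follow the same pattern with their own properties, for example the dataset type or the stream format listed under "Kinds of system metadata".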

System Metadata Updates

Only the CDAP system can update system metadata for entities. This capability will not be exposed to users. Given this design choice, however, users will need a way to discover all the system tags/properties. Initially, this can be exposed through a simple API that lists all tags/properties. It can later be extended with full-text search once CDAP has a more comprehensive search capability that goes beyond IndexedTables and prefix lookups.
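
The listing API and a prefix lookup can be sketched over a sorted set, as below. The class SystemTagCatalog and its methods are hypothetical, and the sketch does not use IndexedTable. It only illustrates the two operations: list everything, and list by prefix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical read-only catalog of system tags, kept in sorted order. */
class SystemTagCatalog {
  private final NavigableSet<String> tags = new TreeSet<>();

  /** Called by the system when it annotates an entity with a tag. */
  void register(String tag) {
    tags.add(tag);
  }

  /** Lists every known system tag in sorted order. */
  List<String> listAll() {
    return new ArrayList<>(tags);
  }

  /** Lists the tags that start with the given prefix. */
  List<String> withPrefix(String prefix) {
    List<String> result = new ArrayList<>();
    // In sorted order, all tags sharing a prefix are contiguous and start at the
    // first tag >= prefix, so the scan can stop at the first non-matching tag.
    for (String tag : tags.tailSet(prefix, true)) {
      if (!tag.startsWith(prefix)) {
        break;
      }
      result.add(tag);
    }
    return result;
  }
}
```

A prefix scan of this kind only finds tags by their leading characters, which is the limitation the full-text search extension is meant to address.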

Metadata and Data Discovery 3.3