Overview
This page covers the requirements, design, and implementation of the metadata and data discovery features in CDAP 3.3.
High Level Requirements
- Metadata search
- Schema as metadata
- System metadata
- CLI, Test Framework Support for metadata
- UI for Metadata Search
- UI for Lineage
- UI for Adding/Updating metadata properties/tags
- Lineage based on Type of Dataset Access
- Monitoring/Logs for Metadata Service
Scope
- Schema as metadata
- System metadata
- Metadata CLI
- Test Framework support for Metadata
- UI... (needs to be finalized)
User Stories
Id | Description | Requirements Fulfilled | Comments |
---|---|---|---|
U1 | As a user, I should be able to search Datasets containing the specified fields | List the kinds of queries that will be supported | |
U2 | As a CDAP system, I should be able to annotate CDAP entities with system metadata automatically | List all the system tags that should be annotated | |
U3 | As a user, I should be able to access and update CDAP metadata using the CDAP CLI | | |
U4 | As a developer, I should be able to access and update CDAP metadata using the CDAP Test Framework | | |
U5 | As a user, I should be able to search CDAP entities based on metadata using the CDAP UI | | |
U6 | As a user, I should be able to view the lineage of a CDAP dataset/stream in a specified time window using the CDAP UI | | |
System Metadata
Kinds of system metadata:
Applications
- Artifact name
Programs
- Type of program
Datasets
- Type of dataset
- Creation time - property
- Last update time? - property
- RecordScannable/BatchWritable/RecordWritable/BatchReadable
- Other properties
Streams
- Format
Schema as Metadata
Schema as metadata gives CDAP users the ability to retrieve datasets/streams that have a field X, optionally restricted to type Y.
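One way to picture the "field X, optionally of type Y" lookup is an index that stores each schema field under two keys: the field name alone, and the field name plus its type. The sketch below is illustrative only: `SchemaFieldIndex`, its methods, and the `name:type` key format are assumptions made for this example, not CDAP classes or the planned storage layout.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory index for illustration; not a CDAP class. */
final class SchemaFieldIndex {
  // index key -> names of the entities whose schema has a matching field
  private final Map<String, Set<String>> index = new HashMap<>();

  /** Indexes every field of an entity's schema, given as field name -> type name. */
  void addSchema(String entity, Map<String, String> fields) {
    for (Map.Entry<String, String> field : fields.entrySet()) {
      put(key(field.getKey(), null), entity);             // supports "field X"
      put(key(field.getKey(), field.getValue()), entity); // supports "field X of type Y"
    }
  }

  /** Returns the entities that have the field; a null type matches any type. */
  Set<String> search(String fieldName, String type) {
    Set<String> hits = index.get(key(fieldName, type));
    return hits == null
        ? Collections.<String>emptySet()
        : Collections.unmodifiableSet(hits);
  }

  private void put(String key, String entity) {
    Set<String> entities = index.get(key);
    if (entities == null) {
      entities = new TreeSet<>();
      index.put(key, entities);
    }
    entities.add(entity);
  }

  // Assumed key format: "name" or "name:type", lower-cased for case-insensitive lookups.
  private static String key(String fieldName, String type) {
    String name = fieldName.toLowerCase(Locale.ROOT);
    return type == null ? name : name + ":" + type.toLowerCase(Locale.ROOT);
  }
}
```

With this layout both query forms are exact key lookups, which fits a store that only supports indexed and prefix access. The sketch does not escape `:` inside field names; a real key format would need to.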
Design Considerations
Storage
There is a case for storing System Metadata in a separate dataset for the following reasons:
- Only the CDAP system can update System Metadata.
- System Metadata may have different authorization and retention policies from Business Metadata.
- System Metadata can be updated only at specific times, whereas users can update Business Metadata at any time.
However, if System Metadata is stored in a separate dataset, the metadata system will have to manage two different datasets. APIs may need filters, etc. - TODO: Details
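As a rough illustration of what "APIs may need filters" could mean, the sketch below keeps one backing table per kind of metadata and lets a read either target one of them or merge both. `ScopedMetadataStore`, the `Scope` enum, and the method names are hypothetical; the in-memory maps stand in for the two datasets, and the actual API details are still TODO.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical store for illustration; not a CDAP class. */
final class ScopedMetadataStore {
  enum Scope { SYSTEM, BUSINESS }

  // One backing table per scope, standing in for one dataset per scope.
  private final Map<Scope, Map<String, Set<String>>> tables = new EnumMap<>(Scope.class);

  ScopedMetadataStore() {
    for (Scope scope : Scope.values()) {
      tables.put(scope, new HashMap<String, Set<String>>());
    }
  }

  /** Adds a tag to an entity in the given scope. */
  void addTag(Scope scope, String entity, String tag) {
    Map<String, Set<String>> table = tables.get(scope);
    Set<String> tags = table.get(entity);
    if (tags == null) {
      tags = new TreeSet<>();
      table.put(entity, tags);
    }
    tags.add(tag);
  }

  /** Returns an entity's tags; a null scope means "no filter" and merges both tables. */
  Set<String> getTags(String entity, Scope scope) {
    Set<String> result = new TreeSet<>();
    for (Scope candidate : Scope.values()) {
      if (scope != null && scope != candidate) {
        continue; // filtered out by the caller
      }
      Set<String> tags = tables.get(candidate).get(entity);
      if (tags != null) {
        result.addAll(tags);
      }
    }
    return result;
  }
}
```

The cost noted above shows up in `getTags`: an unfiltered read has to consult both tables and merge the results.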
Runtime
System Metadata will be added/updated when:
- An app is deployed - We will add a SystemMetadataUpdater stage in the deployment pipeline that will update system metadata for the app, as well as all the programs in the app.
- A new dataset instance is created - The LineageWriterDatasetFramework can be extended to update system metadata when a dataset is added.
- A new stream is created -
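The app-deployment case could be sketched as a function that derives system properties for the app and for each of its programs. Only the stage name `SystemMetadataUpdater` comes from the text above; the method signature, the entity key format, and the property names `artifact-name` and `program-type` are assumptions chosen to mirror the "Kinds of system metadata" list.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical shape for the deployment stage; only the class name comes from the design text. */
final class SystemMetadataUpdater {
  private SystemMetadataUpdater() { }

  /**
   * Computes the system properties to write when an app is deployed.
   *
   * @param programTypes program name -> program type for every program in the app
   * @return entity key -> system properties, app first, then programs in input order
   */
  static Map<String, Map<String, String>> onAppDeployed(
      String appName, String artifactName, Map<String, String> programTypes) {
    Map<String, Map<String, String>> updates = new LinkedHashMap<>();
    // Applications carry the artifact name.
    updates.put("app:" + appName,
                Collections.singletonMap("artifact-name", artifactName));
    // Each program carries its program type.
    for (Map.Entry<String, String> program : programTypes.entrySet()) {
      updates.put("program:" + appName + "." + program.getKey(),
                  Collections.singletonMap("program-type", program.getValue()));
    }
    return updates;
  }
}
```

Returning the computed updates, rather than writing them directly, keeps the sketch independent of whichever metadata store the stage ends up using.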
System Metadata Updates
Only the CDAP system can update system metadata for entities; this capability will not be exposed to users. Given this design choice, users will need a way to discover all the system tags/properties. Initially, this can be exposed via a simple API that lists all tags/properties. It can later be extended with full-text search once CDAP has search capabilities that go beyond IndexedTables and prefix lookups.
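A minimal sketch of such a discovery API is shown below, assuming the known system tags are kept in a sorted set so that listing and prefix lookup are both cheap. `SystemTagCatalog` and its methods are hypothetical names for illustration, not a proposed CDAP interface.

```java
import java.util.Collections;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical catalog of system tags for illustration; not a CDAP class. */
final class SystemTagCatalog {
  // Sorted so that all tags sharing a prefix are adjacent.
  private final NavigableSet<String> tags = new TreeSet<>();

  /** Called by the system whenever it annotates an entity with a tag. */
  void register(String tag) {
    tags.add(tag);
  }

  /** Lists every known system tag in sorted order (read-only view). */
  SortedSet<String> listAll() {
    return Collections.unmodifiableSortedSet(tags);
  }

  /** Lists the tags that start with the given prefix, using a range scan. */
  SortedSet<String> withPrefix(String prefix) {
    // Range [prefix, prefix + '\uffff') covers the strings that begin with prefix.
    SortedSet<String> range = tags.subSet(prefix, prefix + Character.MAX_VALUE);
    return Collections.unmodifiableSortedSet(new TreeSet<>(range));
  }
}
```

The range scan in `withPrefix` is the same access pattern a prefix lookup on a sorted table provides, which is why it works before any full-text search exists.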
Questions