Overview
This page covers the requirements, design, and implementation of the metadata and data discovery features in CDAP 3.3.
High Level Requirements
- Metadata search
- Schema as metadata
- System metadata
- CLI, Test Framework Support for metadata
- UI for Metadata Search
- UI for Lineage
- UI for Adding/Updating metadata properties/tags
- Lineage based on Type of Dataset Access
- Monitoring/Logs for Metadata Service
Scope
- Schema as metadata
- System metadata
- Metadata CLI
- Test Framework support for Metadata
- UI... (needs to be finalized)
User Stories
Id | Description | Requirements Fulfilled | Comments |
---|---|---|---|
U1 | As a user, I should be able to search for datasets containing the specified fields | List the kinds of queries that will be supported | |
U2 | As a CDAP system, I should be able to annotate CDAP entities with system metadata automatically | List all the system tags that should be annotated | |
U3 | As a user, I should be able to access and update CDAP metadata using the CDAP CLI | ||
U4 | As a developer, I should be able to access and update CDAP metadata using the CDAP Test Framework | ||
U5 | As a user, I should be able to search CDAP entities based on metadata using the CDAP UI | ||
U6 | As a user, I should be able to view the lineage of a CDAP dataset/stream in a specified time window using the CDAP UI | ||
System Metadata
Kinds of system metadata:
Applications
- Artifact name
Programs
- Type of program
Datasets
- Type of dataset
- Creation time - property
- Last update time? - property
- RecordScannable/BatchWritable/RecordWritable/BatchReadable
- Other properties
Streams
- Format
- View
Schema as Metadata
Schema as metadata adds the capability for CDAP users to retrieve datasets/streams that contain a field X, optionally restricted to fields of type Y.
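The lookup described above can be sketched as an index keyed by field name and by field name plus type. This is only an illustration under stated assumptions: `SchemaFieldIndex`, its methods, the `name:TYPE` key format, and the entity names are hypothetical and are not part of CDAP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory index from schema fields to the entities that contain them. */
class SchemaFieldIndex {
  // Key is "fieldName" or "fieldName:TYPE" (assumed format); value is the set of entity names.
  private final Map<String, Set<String>> index = new HashMap<>();

  /** Indexes every field of an entity's schema under both the name key and the name:type key. */
  void indexSchema(String entity, Map<String, String> fieldTypes) {
    for (Map.Entry<String, String> field : fieldTypes.entrySet()) {
      String name = field.getKey();
      String type = field.getValue();
      index.computeIfAbsent(name, k -> new TreeSet<>()).add(entity);
      index.computeIfAbsent(name + ":" + type, k -> new TreeSet<>()).add(entity);
    }
  }

  /** Returns entities with the given field; a non-null type restricts matches to that type. */
  Set<String> search(String fieldName, String type) {
    String key = (type == null) ? fieldName : fieldName + ":" + type;
    // Copy the result so callers cannot modify the index.
    return new TreeSet<>(index.getOrDefault(key, new TreeSet<>()));
  }
}
```

With this layout, a search for field `id` alone returns every entity that has the field, while a search for `id` of type `LONG` returns only entities where the field has that type.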
Design Considerations
Storage
There is a case for storing System Metadata in a separate dataset for the following reasons:
- Only the CDAP system can update System Metadata.
- System Metadata may have different authorization and retention policies than Business Metadata.
- System Metadata can be updated only at specific times, whereas users can update Business Metadata at any time.
However, if System Metadata is stored in a separate dataset, the metadata system will have to manage two different datasets. APIs may need filters, etc. - TODO: Details
Storing History - same pattern as Business Metadata
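To make the two-dataset trade-off concrete, the following is a minimal sketch of a facade that keeps System and Business Metadata in separate stores and exposes a scope filter on reads. All names here (`MetadataScope`, `ScopedMetadataStore`, the method signatures) are assumptions for illustration; the actual API details are still a TODO in this design.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical scope marker for the two kinds of metadata. */
enum MetadataScope { SYSTEM, BUSINESS }

/** Hypothetical facade over two separate stores, one per scope. */
class ScopedMetadataStore {
  // One in-memory map per scope stands in for the two separate datasets.
  private final Map<MetadataScope, Map<String, Map<String, String>>> stores =
      new EnumMap<>(MetadataScope.class);

  ScopedMetadataStore() {
    for (MetadataScope scope : MetadataScope.values()) {
      stores.put(scope, new HashMap<>());
    }
  }

  void setProperty(MetadataScope scope, String entity, String key, String value) {
    stores.get(scope).computeIfAbsent(entity, e -> new TreeMap<>()).put(key, value);
  }

  /**
   * Returns the properties of an entity. A null scope acts as "no filter" and merges
   * both scopes; on a key collision the BUSINESS value overwrites the SYSTEM value
   * in this sketch, which a real API would likely avoid by keying results by scope.
   */
  Map<String, String> getProperties(String entity, MetadataScope scope) {
    Map<String, String> result = new TreeMap<>();
    for (MetadataScope s : MetadataScope.values()) {
      if (scope == null || scope == s) {
        result.putAll(stores.get(s).getOrDefault(entity, new TreeMap<>()));
      }
    }
    return result;
  }
}
```

The scope parameter is the kind of filter the APIs might need: callers that only care about one kind of metadata pass a scope, and callers that want everything pass none.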
Runtime
System Metadata will be added/updated when:
- An app is deployed - We will add a SystemMetadataUpdater stage in the deployment pipeline that will update system metadata for the app, as well as all the programs in the app.
- A new dataset instance is created - The LineageWriterDatasetFramework can be extended to update system metadata when a dataset is added.
- A new stream is created -
Deletes of any of the above entities must be handled as well.
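The lifecycle hooks listed above could look roughly like the sketch below. It is not the actual `SystemMetadataUpdater` stage or `LineageWriterDatasetFramework` extension: the class name, method names, entity id format, and property keys are all hypothetical, and the stream hook is omitted because its trigger is not yet specified here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of system metadata maintenance on entity lifecycle events. */
class SystemMetadataUpdaterSketch {
  // Entity id -> system properties; stands in for the system metadata dataset.
  private final Map<String, Map<String, String>> systemMetadata = new HashMap<>();

  /** Records the artifact name for the app and the program type for each of its programs. */
  void onAppDeployed(String app, String artifactName, Map<String, String> programTypes) {
    put("app:" + app, "artifact", artifactName);
    for (Map.Entry<String, String> program : programTypes.entrySet()) {
      put("program:" + app + "." + program.getKey(), "type", program.getValue());
    }
  }

  /** Records the dataset type and creation time when a dataset instance is created. */
  void onDatasetCreated(String dataset, String datasetType, long creationTimeMillis) {
    put("dataset:" + dataset, "type", datasetType);
    put("dataset:" + dataset, "creation-time", Long.toString(creationTimeMillis));
  }

  /** Removes system metadata for the app and for all programs that belong to it. */
  void onAppDeleted(String app) {
    systemMetadata.remove("app:" + app);
    systemMetadata.keySet().removeIf(id -> id.startsWith("program:" + app + "."));
  }

  void onDatasetDeleted(String dataset) {
    systemMetadata.remove("dataset:" + dataset);
  }

  /** Returns a copy of the system properties for an entity, or an empty map if none exist. */
  Map<String, String> get(String entityId) {
    return new TreeMap<>(systemMetadata.getOrDefault(entityId, new TreeMap<>()));
  }

  private void put(String entityId, String key, String value) {
    systemMetadata.computeIfAbsent(entityId, id -> new TreeMap<>()).put(key, value);
  }
}
```

Note that none of these methods is reachable from a user-facing API in this sketch, which matches the intent that only the CDAP system writes system metadata.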
System Metadata Updates
Only the CDAP system can update system metadata for entities. This capability will not be exposed to users. However, given this design choice, users will need a capability in CDAP to discover all the system tags/properties. To start off with, this can be exposed via a simple API that lists all tags/properties. It can later be extended via full-text search capabilities when CDAP has a more comprehensive search capability that extends beyond IndexedTables and prefix lookups.
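As a rough illustration of the simple listing API, the sketch below keeps the known system tags in sorted order and supports both a full listing and a prefix lookup. `SystemTagCatalog` and its methods are hypothetical names, and the sorted in-memory set only approximates what a prefix scan over an IndexedTable would provide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical read-only catalog that lets users discover system tags. */
class SystemTagCatalog {
  // Sorted set so that all tags sharing a prefix are adjacent.
  private final TreeSet<String> tags = new TreeSet<>();

  /** Called by the system only; users never register tags in this sketch. */
  void register(String tag) {
    tags.add(tag);
  }

  /** Lists every known system tag in sorted order. */
  List<String> listAll() {
    return new ArrayList<>(tags);
  }

  /** Lists tags starting with the given prefix by scanning the range [prefix, prefix + '\uffff'). */
  List<String> listByPrefix(String prefix) {
    return new ArrayList<>(tags.subSet(prefix, prefix + Character.MAX_VALUE));
  }
}
```

A full-text search capability could later replace `listByPrefix` without changing how tags are registered.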
Questions