Overview
This page covers the requirements, design, and implementation of metadata and data discovery features in CDAP 3.3.
High Level Requirements
- Schema as metadata
- System metadata
- CLI, Test Framework Support for metadata
- UI for Metadata Search
- UI for Lineage
- UI for Adding/Updating metadata properties/tags
- Metadata search
- Lineage based on Type of Dataset Access
- Monitoring/Logs for Metadata Service
Scope
- Schema as metadata
- System metadata
- Metadata CLI
- Test Framework support for Metadata
- UI (needs to be finalized)
User Stories
Id | Description | Comments |
---|---|---|
U1 | As a user, I should be able to search Datasets containing the specified fields | List the kinds of queries that will be supported |
U2 | As a CDAP system, I should be able to annotate CDAP entities with system metadata automatically | System metadata for each entity is listed below |
U3 | As a user, I should be able to access and update CDAP metadata using the CDAP CLI | |
U4 | As a developer, I should be able to access and update CDAP metadata using the CDAP Test Framework | |
U5 | As a user, I should be able to search CDAP entities based on metadata using the CDAP UI | |
U6 | As a user, I should be able to view the lineage of a CDAP dataset/stream in a specified time window using the CDAP UI | |
New Metadata Search:
Metadata Storage and Search to support:
- Key-Value Metadata:
  - Codename: Alpha Tango Charlie
  - Supported searches:
    - Whole Key-Value (complete or partial): Codename: Alpha Tango Charlie or Codename: Alpha Tang*
    - Key with Part of Value (complete or partial): Codename: Alpha or Codename: Tango or Codename: Charlie or Codename: Alp*
    - Whole Value (complete or partial): Alpha Tango Charlie or Alpha* or Alpha Tan*
    - Parts of Value (complete or partial): Alpha or Tango or Charlie or Alph* or Tan* or Ch*
- Tags Metadata:
  - Tags: Tag1, Tag22
  - Supported searches:
    - With tags key and a tag value (complete or partial): Tags: Tag1 or Tags: Tag*
    - With tag value (complete or partial): Tag22 or Tag2*
- Schema Metadata: This is a key-value pair where the key is schema and the value is the schema's fields, but it needs special indexing to support searches by fieldName and fieldName:fieldType.
  - Schema: {EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}} (Note: This is a nested schema)
  - Supported searches:
    - FieldName with FieldType: EmpName: String or EmpTel: Integer or EmpAddr: String
    - FieldName: EmpName, EmpTel, EmpAddr
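The special indexing needed for schema metadata can be illustrated with a minimal sketch. It assumes the schema is available as a nested mapping and that only leaf fields are indexed (the container field EmpContact is skipped, matching the FieldName list above). The function name `schema_index_entries` and the exact `name:type` string format are hypothetical, not part of the design.

```python
def schema_index_entries(schema):
    """Flatten a (possibly nested) schema into searchable index strings.

    `schema` maps a field name to either a type name (str) or a nested
    schema (dict). Each leaf field yields 'name' and 'name:type'.
    Assumption: nested containers themselves are not indexed.
    """
    entries = []
    for name, field_type in schema.items():
        if isinstance(field_type, dict):
            # Nested record: index its fields, not the container itself.
            entries.extend(schema_index_entries(field_type))
        else:
            entries.append(name)
            entries.append(f"{name}:{field_type}")
    return entries


# The nested example schema from the list above.
schema = {
    "EmpName": "String",
    "EmpContact": {"EmpTel": "Integer", "EmpAddr": "String"},
}
schema_entries = schema_index_entries(schema)
```

With these entries, a search for EmpTel and a search for EmpTel:Integer both resolve against the same index.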
Storage:
Main Table: This table stores the metadata for the entity. It will be used when a user wants to get the metadata of an entity. This table is not used for searching.
Key: Entity with key | Value: Value of Metadata |
---|---|
<Entity-Id><Codename> | Alpha Tango Charlie |
<Entity-Id><Tags> | {Tag1, Tag22} |
<Entity-Id><Schema-Id> | {EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}} |
Index Table: This table will be used for searching and it will use IndexedTable.
Key: Entity with index | Value: Index |
---|---|
<Entity-Id><Codename: Alpha Tango Charlie> | Codename: Alpha Tango Charlie |
<Entity-Id><Codename: Alpha> | Codename: Alpha |
<Entity-Id><Codename: Tango> | Codename: Tango |
<Entity-Id><Codename: Charlie> | Codename: Charlie |
<Entity-Id><Alpha Tango Charlie> | Alpha Tango Charlie |
<Entity-Id><Alpha> | Alpha |
<Entity-Id><Tango> | Tango |
<Entity-Id><Charlie> | Charlie |
<Entity-Id><Tags: Tag1> | Tags: Tag1 |
<Entity-Id><Tags: Tag22> | Tags: Tag22 |
<Entity-Id><Tag1> | Tag1 |
<Entity-Id><Tag22> | Tag22 |
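The index rows above can be derived mechanically from a property or a tag list. The following is a rough sketch only: the function names are hypothetical, values are assumed to be split on whitespace, and the linear scan in `matches` merely stands in for the prefix lookup that an IndexedTable would provide.

```python
def property_index_entries(key, value):
    """Index strings for one key-value property, mirroring the Index Table rows."""
    entries = [f"{key}: {value}", value]
    tokens = value.split()
    if len(tokens) > 1:
        # Index each word of a multi-word value, with and without the key.
        for token in tokens:
            entries.append(f"{key}: {token}")
            entries.append(token)
    return entries


def tag_index_entries(tags):
    """Index strings for a set of tags: 'Tags: <tag>' and the bare tag."""
    entries = []
    for tag in tags:
        entries.append(f"Tags: {tag}")
        entries.append(tag)
    return entries


def matches(query, entries):
    """True if the query (optionally ending in '*') matches any index entry."""
    if query.endswith("*"):
        prefix = query[:-1]
        return any(entry.startswith(prefix) for entry in entries)
    return query in entries


codename_entries = property_index_entries("Codename", "Alpha Tango Charlie")
tag_entries = tag_index_entries(["Tag1", "Tag22"])
```

Under these assumptions, every supported search listed earlier (for example Codename: Alpha Tang*, Tan*, or Tag2*) is either an exact match or a prefix match against one of the generated entries.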
System Metadata
Kinds of system metadata:
Artifacts
TBD
Applications
- Artifact name
Programs
- Type of program
Datasets
- Type of dataset
- Schema
- RecordScannable/BatchWritable/RecordWritable/BatchReadable
- Other properties
Streams
- Format
Views
- Format
Design Considerations
Storage
System Metadata will be stored in a separate dataset for the following reasons:
- Only the CDAP system can update System Metadata.
- System Metadata may have different authorization and retention policies than Business Metadata.
- System Metadata can be updated only at specific times, whereas users can update Business Metadata at any time.
As a result, the metadata system will have to manage two different datasets. The storage format of both datasets (both keys and values) will be identical; they will simply write to separate tables.
A higher-level construct, MetadataStore, will have the ability to interact with the two separate datasets. It will use a MetadataScope object (possible values USER and SYSTEM) to distinguish operations that should go to the business metadata dataset from those that should go to the system metadata dataset.
The MetadataStore class is chosen to have the ability to interact with both metadata datasets because it is the API used across CDAP (LineageWriterDatasetFramework, Lineage classes, StreamAdmin, ApplicationLifecycleService, and DeletedProgramHandlerStage, to name a few) to interact with metadata. An alternative was to put this ability in the MetadataAdmin object and keep each MetadataStore local to a specific dataset, which may have made the MetadataStore class itself cleaner. However, the downstream classes (users of MetadataStore) would then have needed to handle multiple MetadataStores, which is not clean. Also, the MetadataAdmin is currently used only by the MetadataHttpHandler. As a result, we cannot move this logic to the MetadataAdmin class, since not all clients of the metadata system have access to it.
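A minimal sketch of the scope-based routing described above follows. Plain dictionaries stand in for the two datasets, and the method names `set_property` and `get_properties` are hypothetical; the actual MetadataStore API may differ.

```python
from enum import Enum


class MetadataScope(Enum):
    USER = "user"
    SYSTEM = "system"


class MetadataStore:
    """Single entry point that routes each operation to the dataset for its scope."""

    def __init__(self):
        # One dict per scope stands in for the two separate datasets.
        self._datasets = {scope: {} for scope in MetadataScope}

    def set_property(self, scope, entity_id, key, value):
        self._datasets[scope].setdefault(entity_id, {})[key] = value

    def get_properties(self, entity_id, scope=None):
        # With no scope given, return metadata from both datasets.
        scopes = [scope] if scope is not None else list(MetadataScope)
        return {s: dict(self._datasets[s].get(entity_id, {})) for s in scopes}


store = MetadataStore()
store.set_property(MetadataScope.SYSTEM, "dataset:purchases", "type", "table")
store.set_property(MetadataScope.USER, "dataset:purchases", "owner", "finance")
user_only = store.get_properties("dataset:purchases", MetadataScope.USER)
both = store.get_properties("dataset:purchases")
```

Callers hold one MetadataStore and pass a scope, rather than juggling one store per dataset, which is the trade-off argued for above.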
The MetadataAdmin class is currently in app-fabric, because it needs access to the AppMetadataStore to check whether entities exist. This is not ideal, but fixing it requires splitting cdap-app-fabric, which is well beyond the scope of the Metadata work.
History
We will re-use the same pattern that the Business Metadata Dataset uses to store history. There will however be one update to not serialize the MetadataScope in the history, as described in
Runtime
For interacting with the System Metadata Dataset, we will introduce a SystemMetadataUpdater interface, which will be injected at the various stages outlined below to add, update, or delete system metadata.
System Metadata will be added when:
- An app is deployed - We will add a SystemMetadataUpdater stage in the deployment pipeline that will update system metadata for the app, as well as all the programs in the app.
- A new dataset instance is created - The LineageWriterDatasetFramework will be passed a SystemMetadataUpdater, to add system metadata in the addDatasetInstance call.
- A new stream is created - StreamAdmins will be passed a SystemMetadataUpdater as well, to add system metadata in the create API.
System Metadata will be updated when:
- A dataset instance's properties are updated - The LineageWriterDatasetFramework's updateInstance method will use the SystemMetadataUpdater to update the passed properties.
- A stream's config is updated - The StreamAdmin's updateConfig method will use the SystemMetadataUpdater to update the stream's system metadata.
System Metadata will be deleted when:
- An app is deleted - The ApplicationLifecycleService will use the SystemMetadataUpdater to delete system metadata for the application.
- A program is removed from an existing app - The DeletedProgramHandlerStage will use the SystemMetadataUpdater to delete system metadata for the programs.
- A dataset instance is deleted - The LineageWriterDatasetFramework's deleteInstance method will use the SystemMetadataUpdater to delete system metadata for the dataset instance.
- A stream is deleted - The StreamAdmin's drop method will use the SystemMetadataUpdater to delete system metadata for the stream.
System Metadata Updates
Only the CDAP system can update system metadata for entities. This capability will not be exposed to users. However, given this design choice, users will need a capability in CDAP to discover all the system tags/properties. To start off with, this can be exposed via a simple API that lists all tags/properties. It can later be extended via full-text search capabilities when CDAP has a more comprehensive search capability that extends beyond IndexedTables and prefix lookups.
REST APIs
The add/update/delete APIs for system metadata will be neither documented nor accessible from the Router. Internally, the SystemMetadataUpdater will preferably interact directly with the transactional store for system metadata.
If REST APIs are absolutely necessary (TBD):
- The REST APIs for adding/updating/deleting system metadata will not be documented, and will not be exposed via the Router.
- The SystemMetadataUpdater will use service discovery to discover the Metadata Service and make REST calls.
Schema as Metadata
Schema as metadata is meant to add the capability in CDAP for users to be able to retrieve datasets/streams with a field X optionally of type Y.
For storing schema as system metadata, we will use the existing metadata properties mechanism. One option is to store every field in the schema as a metadata property:
Key | Value |
---|---|
field ^A <fieldName> | <fieldType> |
Note: We may have to reverse this, based on the indexing mechanisms available in the System Metadata Dataset. If it supports key:value and value type searches, then we may have to swap the key and value above, so that two types of searches can be supported:
- All Datasets with the field field1
- All Datasets with the field field1 of type int
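The two searches above can be sketched against the field-per-property layout. This sketch assumes that ^A denotes the Ctrl-A character (`\x01`) used as a separator and that the key and value are not swapped; the helper names and the sample catalog are hypothetical.

```python
SEPARATOR = "\x01"  # Assumption: ^A in the key layout is Ctrl-A


def schema_to_properties(fields):
    """Encode each schema field as a property: 'field^A<fieldName>' -> '<fieldType>'."""
    return {f"field{SEPARATOR}{name}": ftype for name, ftype in fields.items()}


def has_field(properties, name, ftype=None):
    """True if the entity has the field, optionally also checking its type."""
    key = f"field{SEPARATOR}{name}"
    if key not in properties:
        return False
    return ftype is None or properties[key] == ftype


def find_datasets(catalog, name, ftype=None):
    """Names of all datasets whose schema properties contain the field."""
    return sorted(ds for ds, props in catalog.items() if has_field(props, name, ftype))


# Hypothetical catalog of two datasets and their schema properties.
catalog = {
    "purchases": schema_to_properties({"field1": "int", "field2": "string"}),
    "users": schema_to_properties({"field1": "string"}),
}
with_field1 = find_datasets(catalog, "field1")
with_int_field1 = find_datasets(catalog, "field1", "int")
```

A real implementation would answer both queries from the index rather than by scanning every entity; the scan here only demonstrates the semantics.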
Views
Up until 3.2, users could not associate metadata with stream views. We will need to add this capability in 3.3. However, there will be no parent-child relationship between a view and its stream as far as metadata is concerned. A view will be a separate entity from its stream, and will show up separately in search results. Metadata of a stream will not be automatically available as metadata of a view.
Upgrade
The BusinessMetadataDataset dataset type introduced in 3.2 will be renamed to MetadataDataset, since it will also serve system metadata in 3.3. For existing CDAP installations, we will need an upgrade step to change the type of the existing "business.metadata" dataset in the "datasets.instance" table.
New REST APIs
- The Metadata REST APIs to retrieve properties and tags will be updated to accept a scope query parameter. It will support the values user and system. If scope is not specified, the API will return all metadata across both scopes.
- New APIs will be added for View and artifacts:
Purpose | API | Body | Response | Routable | Comments | Approved? |
---|---|---|---|---|---|---|
Annotate business metadata for view | POST /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/properties | { "key1" : "value1", "key2" : "value2", //... } | 200: Successful; 404: View not found in specified namespace | Yes | | |
Retrieve business metadata for view | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/properties | N/A | 200: Successful { "key1" : "value1", "key2" : "value2", //... }; 404: View not found in specified namespace | Yes | | |
Delete all business metadata for view | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/properties | | 200: Successful; 404: View not found in specified namespace | Yes | | |
Delete selected key from business metadata for view | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/properties/{key} | | 200: Successful; 404: View not found in specified namespace | Yes | | |
Search views containing business metadata | GET /v3/namespaces/{namespace-id}/metadata/search?query=term&target=view | N/A | 200: Successful ["view1", "view2"] | Yes | | |
Add tags to a view | POST /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/tags | ["tag1", "tag2"] | 200: Successful; 404: View not found in specified namespace | Yes | | |
Retrieve view tags | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/tags | N/A | ["tag1", "tag2"] | Yes | | |
Remove all view tags | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/tags | | 200: Successful; 404: View not found in specified namespace | Yes | | |
Remove specified view tag | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata/tags/{tag} | | 200: Successful; 404: View not found in specified namespace | Yes | | |
Get all business metadata for a view | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/views/{view-id}/metadata | | 200: Successful; 404: View not found in specified namespace | Yes | Retrieves all properties and tags for a view. | |
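As a rough illustration of how a client might address the view endpoints in the table above, the sketch below builds the request path and optionally appends the scope query parameter. It assumes that the scope parameter described for the retrieval APIs also applies to view endpoints; the function name `view_metadata_path` is hypothetical.

```python
from urllib.parse import quote, urlencode

VALID_SCOPES = ("user", "system")


def view_metadata_path(namespace, stream, view, kind=None, scope=None):
    """Build the path for a view metadata endpoint.

    kind: None for all metadata, or 'properties' / 'tags'.
    scope: None (both scopes), 'user', or 'system'.
    """
    path = (
        f"/v3/namespaces/{quote(namespace, safe='')}"
        f"/streams/{quote(stream, safe='')}"
        f"/views/{quote(view, safe='')}/metadata"
    )
    if kind is not None:
        path += f"/{kind}"
    if scope is not None:
        if scope not in VALID_SCOPES:
            raise ValueError(f"scope must be one of {VALID_SCOPES}")
        path += "?" + urlencode({"scope": scope})
    return path


tags_path = view_metadata_path("default", "purchases", "v1", kind="tags")
system_props_path = view_metadata_path(
    "default", "purchases", "v1", kind="properties", scope="system"
)
```

Omitting the scope leaves the query string off entirely, which corresponds to the "all metadata across both scopes" default.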
Existing/Changed REST APIs and CLI Commands:
Note: Changes are in blue
Purpose | API | CLI Command | Body | Response | Comments | Approved? |
---|---|---|---|---|---|---|
Annotate business metadata for datasets | POST /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/properties | set metadata properties datasets <dataset-id> | { "key1" : "value1", "key2" : "value2", //... } | 200: Successful; 404: Dataset not found in specified namespace | | |
Annotate business metadata for apps | POST /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/properties | set metadata properties apps <app-id> | { "key1" : "value1", "key2" : "value2", //... } | 200: Successful; 404: App not found in specified namespace | | |
Annotate business metadata for programs | POST /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/properties | set metadata properties apps <app-id> program-type <program-id> | { "key1" : "value1", "key2" : "value2", //... } | 200: Successful; 404: Program not found in specified namespace | | |
Annotate business metadata for streams | POST /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/properties | set metadata properties streams <stream-id> | { "key1" : "value1", "key2" : "value2", //... } | 200: Successful; 404: Stream not found in specified namespace | | |
Retrieve business metadata for datasets | GET /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/properties | get metadata properties scope datasets <dataset-id> | N/A | 200: Successful { "key1" : "value1", "key2" : "value2", //... }; 404: Dataset not found in specified namespace | | |
Retrieve business metadata for apps | GET /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/properties | get metadata properties scope apps <app-id> | N/A | 200: Successful { "key1" : "value1", "key2" : "value2", //... }; 404: App not found in specified namespace | | |
Retrieve business metadata for programs | GET /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/properties | get metadata properties scope apps <app-id> program-type <program-id> | N/A | 200: Successful { "key1" : "value1", "key2" : "value2", //... }; 404: Program not found in specified namespace | | |
Retrieve business metadata for streams | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/properties | get metadata properties scope streams <stream-id> | N/A | 200: Successful { "key1" : "value1", "key2" : "value2", //... }; 404: Stream not found in specified namespace | | |
Delete all business metadata for datasets | DELETE /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/properties | delete metadata properties datasets <dataset-id> | N/A | 200: Successful; 404: Dataset not found in specified namespace | | |
Delete selected key from business metadata for datasets | DELETE /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/properties/{key} | delete metadata properties datasets <dataset-id> <key> | N/A | 200: Successful; 404: Dataset not found in specified namespace | | |
Delete all business metadata for apps | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/properties | delete metadata properties apps <app-id> | | 200: Successful; 404: App not found in specified namespace | | |
Delete selected key from business metadata for apps | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/properties/{key} | delete metadata properties apps <app-id> <key> | | 200: Successful; 404: App not found in specified namespace | | |
Delete all business metadata for programs | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/properties | delete metadata properties apps <app-id> program-type <program-id> | | 200: Successful; 404: Program not found in specified namespace | | |
Delete selected key from business metadata for programs | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/properties/{key} | delete metadata properties apps <app-id> program-type <program-id> <key> | | 200: Successful; 404: Program not found in specified namespace | | |
Delete all business metadata for streams | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/properties | delete metadata properties streams <stream-id> | | 200: Successful; 404: Stream not found in specified namespace | | |
Delete selected key from business metadata for streams | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/properties/{key} | delete metadata properties streams <stream-id> <key> | | 200: Successful; 404: Stream not found in specified namespace | | |
Search entities containing business metadata | GET /v3/namespaces/{namespace-id}/metadata/search?query=term&target=<target-type> (target-type => dataset, app, program, stream, view) | search metadata scope <search-query> <target> | N/A | 200: Successful ["entity1", "entity2"] | | |
View Dataset Lineage | GET /v3/namespaces/{namespace-id}/datasets/{dataset-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | get lineage datasets <dataset-id> <startTs> <endTs> <maxLevels> | N/A | 200: Successful; response TBD, but will contain a DAG representation | | |
View Stream Lineage | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/lineage?start=<start-ts>&end=<end-ts>&maxLevels=<max-levels> | get lineage streams <stream-id> <startTs> <endTs> <maxLevels> | N/A | 200: Successful; response TBD, but will contain a DAG representation | | |
View Run Id Accesses | GET /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/runs/{run-id}/metadata | get metadata apps <app-id> program-type <program-id> runs <run-id> | N/A | 200: Successful; response body TBD | | |
Add tags to a dataset | POST /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/tags | add metadata tags datasets <dataset-id> | ["tag1", "tag2"] | 200: Successful; 404: Dataset not found in specified namespace | | |
Add tags to an app | POST /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/tags | add metadata tags apps <app-id> | ["tag1", "tag2"] | 200: Successful; 404: App not found in specified namespace | | |
Add tags to a program | POST /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/tags | add metadata tags apps <app-id> program-type <program-id> | ["tag1", "tag2"] | 200: Successful; 404: Program not found in specified namespace | | |
Add tags to a stream | POST /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/tags | | ["tag1", "tag2"] | 200: Successful; 404: Stream not found in specified namespace | | |
Retrieve dataset tags | GET /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/tags | get metadata tags datasets <dataset-id> | N/A | ["tag1", "tag2"] | | |
Retrieve app tags | GET /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/tags | | N/A | ["tag1", "tag2"] | | |
Retrieve program tags | GET /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/tags | | N/A | ["tag1", "tag2"] | | |
Retrieve stream tags | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/tags | | N/A | ["tag1", "tag2"] | | |
Remove all dataset tags | DELETE /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/tags | delete metadata tags datasets <dataset-id> | N/A | 200: Successful; 404: Dataset not found in specified namespace | | |
Remove specified dataset tag | DELETE /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata/tags/{tag} | | N/A | 200: Successful; 404: Dataset not found in specified namespace | | |
Remove all app tags | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/tags | | N/A | 200: Successful; 404: App not found in specified namespace | | |
Remove specified app tag | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/metadata/tags/{tag} | | N/A | 200: Successful; 404: App not found in specified namespace | | |
Remove all program tags | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/tags | | N/A | 200: Successful; 404: Program not found in specified namespace | | |
Remove specified program tag | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata/tags/{tag} | | N/A | 200: Successful; 404: Program not found in specified namespace | | |
Remove all stream tags | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/tags | | | 200: Successful; 404: Stream not found in specified namespace | | |
Remove specified stream tag | DELETE /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata/tags/{tag} | | | 200: Successful; 404: Stream not found in specified namespace | | |
Remove all business metadata for a dataset | DELETE /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata | | | 200: Successful; 404: Dataset not found in specified namespace | Removes all properties and tags from a dataset. Will not happen in 3.2 | |
Remove all business metadata for an app | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/metadata | | | 200: Successful; 404: App not found in specified namespace | Removes all properties and tags from an app. Will not happen in 3.2 | |
Remove all business metadata for a program | DELETE /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata | | | 200: Successful; 404: Program not found in specified namespace | Removes all properties and tags from a program. Will not happen in 3.2 | |
Get all business metadata for a dataset | GET /v3/namespaces/{namespace-id}/datasets/{dataset-id}/metadata?scope=system/user | | | 200: Successful; 404: Dataset not found in specified namespace | Retrieves all properties and tags for a dataset. Will not happen in 3.2 | |
Get all business metadata for an app | GET /v3/namespaces/{namespace-id}/apps/{app-id}/metadata | | | 200: Successful; 404: App not found in specified namespace | Retrieves all properties and tags for an app. Will not happen in 3.2 | |
Get all business metadata for a program | GET /v3/namespaces/{namespace-id}/apps/{app-id}/{program-type}/{program-id}/metadata | | | 200: Successful; 404: Program not found in specified namespace | Retrieves all properties and tags for a program. Will not happen in 3.2 | |
Get all business metadata for a stream | GET /v3/namespaces/{namespace-id}/streams/{stream-id}/metadata | | | 200: Successful; 404: Stream not found in specified namespace | Retrieves all properties and tags for a stream. Will not happen in 3.2 | |
Questions
- The REST APIs to retrieve metadata will accept an additional scope parameter. If the scope is not specified, the API will now return all metadata, and not just business metadata as it did in 3.2. Is this considered a backward-incompatible change?