Goals:

Improve Metadata Search: This requires a redesign of how we store metadata. The design is proposed below.
- Fix the bug in Metadata Search for tags, where search works only for the first tag in the list and not for the others
- Support tokenized search, where a user can search with any word from the value
Schema Search:
- CDAP Schema for Datasets, Streams and Views should be stored as metadata and be searchable by field name.
We will also take Bhooshan's System Metadata work to completion.
  - Work done by Bhooshan Mogal:
    - Doc: Metadata and Data Discovery 3.3
    - PR: https://github.com/caskdata/cdap/pull/4683
  - Things to do:
    - Emit more metadata from system entities
    - Merge: https://github.com/caskdata/cdap/pull/4683

Checklist

User stories documented (Rohit/Poorna)
User stories reviewed (Nitin)
Design documented (Rohit/Poorna)
Design reviewed (Andreas)
Feature merged (Rohit)
Examples and guides (Rohit)
Integration tests (Rohit)
Documentation for feature (Rohit)
Blog post

User Stories:

Users should be able to search key-value metadata using any of the following, in full or as a prefix:
- key-value
- key with part of value
- value
- Individual words in the value
  
  Example:
  User stores a key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity
- User should be able to search for this entity with the following queries:
  - key-value
    1. Codename: Alpha Tango Charlie
    2. Codename: Alpha Tang*
  - key with part of value
    1. Codename: Alpha
    2. Codename: Tango
    3. Codename: Charlie
    4. Codename: Alp*
  - value
    1. Alpha Tango Charlie
    2. Alpha*
    3. Alpha Tan*
      Design Decision:
      1. We have decided not to support queries that contain only part of a multi-word value, for example "Tango Charlie". Users can search for the whole value, for a prefix of it, or for a single word (we plan to tokenize on whitespace).
  - parts of value
    1. Alpha
    2. Tango
    3. Charlie
    4. Alph*
    5. Tan*
    6. Ch*
Users should be able to search tags metadata using any of the following, in full or as a prefix:
- tags key and a tag value
- a tag value
  
  Example:
  User tags an entity with the following tags "Tag1, Tag22"
- User should be able to search for this entity with the following queries:
  - tag key and a tag value:
    1. tags: Tag1
    2. tags: Tag*
  - a tag value
    1. tag22
    2. tag2*
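
The queries in these stories suggest which strings need to be indexed for each metadata entry. The following is a minimal Python sketch, assuming whitespace tokenization as planned above. The function names `index_terms` and `tag_index_terms` are hypothetical and are not part of CDAP.

```python
def index_terms(key, value):
    """Return the strings under which one key-value entry could be indexed."""
    # The whole value plus each whitespace-separated word.
    # dict.fromkeys removes duplicates while keeping the original order.
    values = list(dict.fromkeys([value] + value.split()))
    # Every value form is indexed twice: scoped to its key, and on its own.
    scoped = [f"{key}: {v}" for v in values]
    return scoped + values


def tag_index_terms(tags):
    """Return the index strings for a comma-separated list of tags."""
    names = [t.strip() for t in tags.split(",") if t.strip()]
    return [f"tags: {t}" for t in names] + names


codename_terms = index_terms("Codename", "Alpha Tango Charlie")
tag_terms = tag_index_terms("Tag1, Tag22")
```

Note that "Tango Charlie" is never produced as a term, which is consistent with the decision not to support searches on arbitrary parts of a value.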

======================================================================================================================

Use Cases:

Key-Value Metadata:
1. Codename: Alpha Tango Charlie
  Use case: User should be able to search with
  1. Whole Key-Value (complete or prefix)
    1. Codename: Alpha Tango Charlie
    2. Codename: Alpha Tang*
  2. Key with Part of Value (complete or prefix)
    1. Codename: Alpha
    2. Codename: Tango
    3. Codename: Charlie
    4. Codename: Alp*
  3. Whole Value (complete or prefix):
    1. Alpha Tango Charlie
    2. Alpha*
    3. Alpha Tan*
      Design Decision:
      1. We have decided not to support queries that contain only part of a multi-word value, for example "Tango Charlie". Users can search for the whole value, for a prefix of it, or for a single word (we plan to tokenize on whitespace).
  4. Parts of value (complete or prefix):
    1. Alpha
    2. Tango
    3. Charlie
    4. Alph*
    5. Tan*
    6. Ch*
Tags Metadata:
1. tags: Tag1, Tag22
  Use case: User should be able to search with
  1. With tags key and a tag value (complete or prefix):
    1. tags: Tag1
    2. tags: Tag*
  2. With tag value (complete or prefix):
    1. tag22
    2. tag2*
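
One way to read the use cases above is as a matching rule between a query and the indexed strings: an exact match, or a prefix match when the query ends with `*`. The sketch below illustrates that rule in Python. The names `matches` and `search` are hypothetical, and case-insensitive comparison is an assumption inferred from the example in which `tag22` finds `Tag22`.

```python
# Index strings for the two examples above (key-value and tags).
INDEX = [
    "Codename: Alpha Tango Charlie", "Codename: Alpha",
    "Codename: Tango", "Codename: Charlie",
    "Alpha Tango Charlie", "Alpha", "Tango", "Charlie",
    "tags: Tag1", "tags: Tag22", "Tag1", "Tag22",
]


def normalize(text):
    """Lower-case and collapse runs of whitespace (assumed behaviour)."""
    return " ".join(text.lower().split())


def matches(query, term):
    """Exact match, or prefix match if the query ends with '*'."""
    q, t = normalize(query), normalize(term)
    if q.endswith("*"):
        return t.startswith(q[:-1])
    return t == q


def search(query, index=INDEX):
    """True if any indexed string satisfies the query."""
    return any(matches(query, term) for term in index)
```

With this rule, `search("Codename: Alpha Tang*")` and `search("tag2*")` succeed, while `search("Tango Charlie")` finds nothing because that string is neither an indexed term nor a prefix query.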
Schema Metadata: This is just key-value metadata where the key is schema and the value is the schema's fields, but it needs special indexing to support searches by fieldName. (Note: we don't plan to support schema searches by fieldType.)
1. Nested Schema
```
{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}
```
  Use case: User should be able to search with
  1. FieldName scoped to schema (complete or prefix):
    1. schema: EmpName
    2. schema: EmpContact
    3. schema: EmpTel
    4. schema: EmpAddr
    5. schema: Emp*
  2. FieldName (complete or prefix):
    1. EmpName
    2. EmpContact
    3. EmpTel
    4. EmpAddr
    5. Emp*
  3. Searching for everything which has schema
    1. schema:*
      
      Design Decisions:
      1. We don't plan to support schema searches with fieldType.
      2. If a user searches with a query that is not scoped to schema, the search will by default cover schema fields in addition to the normal key-value metadata and tags.
        Open questions:
        What if an entity has multiple schemas (for example, a transform that has an input schema and an output schema)?
        We can index its fields under input schema and output schema, and we expect the user to specify whether they are searching the input schema or the output schema.
        What about entities which have more than one schema?
        We are thinking of storing them as input and output schemas with an identifier.
        How will a user search for a fieldName across input and output schemas?
        One way is, besides indexing the fields as input and output schema, to also index every field as just schema so that we can perform such queries.
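
The schema use case above can be sketched as a recursive walk that collects field names from the nested schema and indexes each one both scoped and unscoped. This is an illustration only: it assumes the simplified dictionary form of the schema used in the example, the function names are hypothetical, and the `input schema` scope is just one possible answer to the open questions.

```python
SCHEMA = {
    "EmpName": "String",
    "EmpContact": {
        "EmpTel": "Integer",
        "EmpAddr": "String",
    },
}


def field_names(schema):
    """Collect field names depth-first; field types are ignored."""
    names = []
    for name, field_type in schema.items():
        names.append(name)
        if isinstance(field_type, dict):  # nested record
            names.extend(field_names(field_type))
    return names


def schema_index_terms(schema, scopes=("schema",)):
    """Index each field name under every scope, and also unscoped."""
    names = field_names(schema)
    scoped = [f"{scope}: {name}" for scope in scopes for name in names]
    return scoped + names


terms = schema_index_terms(SCHEMA)
# Hypothetical variant for an entity with an input schema: index under
# a specific scope and under plain "schema" so both query styles work.
input_terms = schema_index_terms(SCHEMA, scopes=("input schema", "schema"))
```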

New Design:

Storage:

Metadata Storage Format:

Key Column	Value Column
<VRPrefix><Entity-Id><Key>	Value
<VRPrefix><Entity-Id><Tags>	Tag1, Tag2, Tag3....
<VRPrefix><Entity-Id><Schema>	{Some Schema}

Index Storage Format:

Key Column	Index Column
<IRPrefix><Entity-Id><Key><Index>	Index
<IRPrefix><Entity-Id><Tags><Index>	Index
<IRPrefix><Entity-Id><Schema><Index>	Index

Sample table that stores the above metadata and its indexes together. The Index Column contains every string that search queries are matched against.

Key: Entity with key	Value Column: Value of Metadata (Not Indexed)	Index Column: Indexed value (Indexed)
<VRPrefix><Entity-Id><Codename>	Alpha Tango Charlie
<VRPrefix><Entity-Id><Tags>	Tag1, Tag22
<VRPrefix><Entity-Id><Schema>	{EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}}
<IRPrefix><Entity-Id><Codename><Codename: Alpha Tango Charlie>		Codename: Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Codename: Alpha>		Codename: Alpha
<IRPrefix><Entity-Id><Codename><Codename: Tango>		Codename: Tango
<IRPrefix><Entity-Id><Codename><Codename: Charlie>		Codename: Charlie
<IRPrefix><Entity-Id><Codename><Alpha Tango Charlie>		Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Alpha>		Alpha
<IRPrefix><Entity-Id><Codename><Tango>		Tango
<IRPrefix><Entity-Id><Codename><Charlie>		Charlie
<IRPrefix><Entity-Id><tags><tags: Tag1>		tags: Tag1
<IRPrefix><Entity-Id><tags><tags: Tag22>		tags: Tag22
<IRPrefix><Entity-Id><tags><Tag1>		Tag1
<IRPrefix><Entity-Id><tags><Tag22>		Tag22
<IRPrefix><Entity-Id><schema><schema: EmpName>		schema: EmpName
<IRPrefix><Entity-Id><schema><schema: EmpContact>		schema: EmpContact
<IRPrefix><Entity-Id><schema><schema: EmpTel>		schema: EmpTel
<IRPrefix><Entity-Id><schema><schema: EmpAddr>		schema: EmpAddr
<IRPrefix><Entity-Id><schema><EmpName>		EmpName
<IRPrefix><Entity-Id><schema><EmpContact>		EmpContact
<IRPrefix><Entity-Id><schema><EmpTel>		EmpTel
<IRPrefix><Entity-Id><schema><EmpAddr>		EmpAddr
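
As a rough illustration of the row layout in the sample table, the sketch below builds one value row and its index rows for a single metadata entry. The prefix values `"v"` and `"i"`, the tuple form of the row key, the entity id `dataset:employees`, and the function name `build_rows` are all assumptions made for this example; the actual byte encoding of row keys is not specified here.

```python
VR_PREFIX = "v"  # hypothetical stand-in for <VRPrefix> (ValueRowPrefix)
IR_PREFIX = "i"  # hypothetical stand-in for <IRPrefix> (IndexRowPrefix)


def build_rows(entity_id, key, value, terms):
    """Return (row_key, value_column, index_column) tuples for one entry.

    Row keys are modelled as tuples so that no separator character can
    clash with the contents of a key or an index term.
    """
    # One value row: holds the metadata value, index column left empty.
    rows = [((VR_PREFIX, entity_id, key), value, None)]
    # One index row per term: value column empty, index column searchable.
    rows += [((IR_PREFIX, entity_id, key, term), None, term) for term in terms]
    return rows


rows = build_rows(
    "dataset:employees",            # hypothetical entity id
    "Codename",
    "Alpha Tango Charlie",
    ["Codename: Alpha", "Alpha"],   # a subset of the index terms
)
```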

We will be using the indexedTable as before, but the keys of rows that store values will now be prefixed with a special VRPrefix (ValueRowPrefix), and we will store the value in the value column. The indexes will also be stored in the same table, with keys prefixed with IRPrefix (IndexRowPrefix). The value column for such rows will be empty, and the index column will hold the index value, which is indexed for search.

Another possibility was to store the real key-value pairs in a separate table and the indexes in the indexedTable. This would avoid the empty column values in a row, but it would lead to 6 tables in total (3 each for system and business metadata), hence we have decided against it.

Search Result:

Metadata search will return Entities with the following details depending upon the type of the Entity.

Entity Type	Search Details	Note
Application	Type
	Name
	Metadata: Tags and Properties
	App Description
Program	Type	If Type=Workflow then also show all programs under the workflow
	Name
	Metadata: Tags and Properties
	App it belongs to
Artifact	Type
	Name
Dataset	Type
	Name
Stream	Name
	Type
View	Name
	Type
	Stream Name

Design Decision:

In the search result for an entity, we will also return all the metadata for that entity.
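
A small in-memory sketch of this decision: a search first finds matching entities through the index rows, then returns every value row for each of those entities. The row layout, the prefix values `"v"` and `"i"`, the entity ids, and the function name `search` are assumptions for illustration only.

```python
# (row_key, value_column, index_column); "v" marks value rows and
# "i" marks index rows. Both prefixes are hypothetical.
ROWS = [
    (("v", "stream:events", "Codename"), "Alpha Tango Charlie", None),
    (("v", "stream:events", "tags"), "Tag1, Tag22", None),
    (("i", "stream:events", "Codename", "Alpha"), None, "Alpha"),
    (("i", "stream:events", "tags", "Tag22"), None, "Tag22"),
    (("v", "dataset:users", "tags"), "Tag3", None),
    (("i", "dataset:users", "tags", "Tag3"), None, "Tag3"),
]


def search(term, rows=ROWS):
    """Return {entity_id: {key: value}} for entities whose index matches."""
    # Step 1: entity ids whose index column equals the search term.
    hits = {key[1] for key, _, index in rows
            if key[0] == "i" and index == term}
    # Step 2: for each hit, gather all of its value rows.
    return {
        entity: {key[2]: value for key, value, _ in rows
                 if key[0] == "v" and key[1] == entity}
        for entity in sorted(hits)
    }


result = search("Alpha")
```

Here a match on the single term `Alpha` returns both the `Codename` property and the `tags` entry of the matching entity, not only the property that matched.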

Open Question:

Please suggest other details that we can add to the search results for the different entity types.

System Metadata:

Here is a list of the System Metadata that we are planning to emit from different entities. If you have any suggestions as to what other information would be useful as system metadata, please comment below.

Artifacts

Version

Applications

ArtifactId
Plugins
- Plugin Type
- Plugin Name
Schedule
Programs

Programs

Type: Flow, MapReduce etc
- Workflow
  - Nodes under this workflow
Mode: Batch, Realtime

Datasets

Schema
RecordScannable/BatchWritable/RecordWritable/BatchReadable
Type: KVTable, FileSet etc
ttl

Streams

Schema
ttl

Views

Schema

Open Questions:

Please suggest other things which we can add to the different system metadata entries.
Nitin Motgi: Can we call "business metadata" "user metadata", and call the table which stores it the userMetadata table rather than business, to keep it consistent with other things like metrics?

Copy of Improvements to Metadata Search and System Metadata