Goals:

...

  • Fix the bug in Metadata Search for tags, where search works only for the first tag in the list and not for the others
  • Support tokenized search, where a user can search with any word from the value

...

  • The CDAP Schema for Datasets, Streams and Views should be stored as metadata and be searchable by field name.

...

Jira: CDAP-4274 (Cask Community Issue Tracker)

...

Checklist

  •  User stories documented (Rohit/Poorna)
  •  User stories reviewed (Nitin)
  •  Design documented (Rohit/Poorna)
  •  Design reviewed (Andreas)
  •  Feature merged (Rohit)
  •  Examples and guides (Rohit)
  •  Integration tests (Rohit) 
  •  Documentation for feature (Rohit)
  •  Blog post 

...

User Stories: 

 Goal 1 and 2: 

...

  • key-value
    1. Codename: Alpha Tango Charlie
    2. Codename: Alpha Tang*
  • key with part of value
    1. Codename: Alpha
    2. Codename: Tango
    3. Codename: Charlie
    4. Codename: Alp*
  • value
    1. Alpha Tango Charlie
    2. Alpha*
    3. Alpha Tan*
      Note:
      1. Queries made up of several words from the middle or end of a value, for example "Tango Charlie", are not supported. A user can search for the whole value, for a prefix of it, or for a single word (we plan to tokenize on whitespace).
  • parts of value
    1. Alpha
    2. Tango
    3. Charlie
    4. Alph*
    5. Tan*
    6. Ch*
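
The key-value queries above can be illustrated with a minimal sketch. It assumes whitespace tokenization, case-sensitive matching, and a trailing `*` meaning "prefix"; the function names `index_terms` and `matches` are hypothetical and not part of CDAP.

```python
def index_terms(key, value):
    """Return the searchable terms for one key-value pair (illustrative only)."""
    tokens = value.split()             # tokenize on whitespace
    candidates = [value] + tokens      # the whole value, then each single word
    terms = []
    for candidate in candidates:
        # Each candidate is searchable both scoped by key and on its own.
        for term in (f"{key}: {candidate}", candidate):
            if term not in terms:
                terms.append(term)
    return terms


def matches(query, terms):
    """Exact match, or prefix match when the query ends with '*'."""
    if query.endswith("*"):
        prefix = query[:-1]
        return any(term.startswith(prefix) for term in terms)
    return query in terms


terms = index_terms("Codename", "Alpha Tango Charlie")
for query in ["Codename: Alpha Tang*", "Codename: Tango", "Alpha Tan*", "Ch*", "Tango Charlie"]:
    print(query, "->", matches(query, terms))
```

Under these assumptions "Tango Charlie" finds nothing, because it is neither the whole value, a prefix of the whole value, nor a single word.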

...

  • tag key and a tag value:
    1. tags: Tag1
    2. tags: Tag*
  • a tag value
    1. tag22 
    2. tag2*

...

Nested Schema:
{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}

User should be able to search for this dataset with the following queries:

...

  1. EmpName
  2. EmpContact
  3. EmpTel 
  4. EmpAddr
  5. Emp*

...

  1. schema: EmpName
  2. schema: EmpContact
  3. schema: EmpTel
  4. schema: EmpAddr 
  5. schema: Emp*
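
As a rough sketch of how the field names in these queries could be derived, the snippet below walks the nested schema example and collects every field name, ignoring field types. The dictionary representation and the `field_names` helper are assumptions made for illustration, not the CDAP Schema API.

```python
def field_names(schema):
    """Collect field names at every nesting level, ignoring field types."""
    names = []
    for name, field_type in schema.items():
        names.append(name)
        if isinstance(field_type, dict):   # nested record: descend into it
            names.extend(field_names(field_type))
    return names


schema = {
    "EmpName": "String",
    "EmpContact": {"EmpTel": "Integer", "EmpAddr": "String"},
}

names = field_names(schema)
# Each field name is searchable on its own and scoped with "schema:".
terms = names + [f"schema: {name}" for name in names]
print(terms)
```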

...

  • We don't plan to support schema searches by field type. If a user's query is not scoped with schema, the search will by default cover schema fields in addition to the normal key-value pairs and tags.
    Open questions:
    • What if an entity has multiple schemas (for example, a transform which has an input and an output schema)?
      • Maybe we can index its fields under input and output schema, and expect a user to specify whether they are looking for something in the input schema or the output schema.
    • What about entities which have more than one schema?
      • Maybe we can store them as input or output, with an identifier.
    • How will a user search for a field name across both input and output schema?
      • One way is to index every field as just schema, in addition to indexing it under input or output schema, so that we can perform such queries.
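
The last option can be sketched as follows. This only illustrates the idea raised in the open question and is not a decided design; the scope labels (`input schema:`, `output schema:`) and the `schema_terms` helper are hypothetical.

```python
def schema_terms(schemas):
    """Build index terms for an entity with several schemas.

    `schemas` maps a role such as "input" or "output" to a list of field names.
    """
    terms = set()
    for role, fields in schemas.items():
        for field in fields:
            terms.add(f"{role} schema: {field}")   # scoped to one schema
            terms.add(f"schema: {field}")          # matches across all schemas
    return terms


terms = schema_terms({
    "input": ["EmpName", "EmpTel"],
    "output": ["EmpName"],
})
print(sorted(terms))
```

With both kinds of terms indexed, a query scoped to the output schema would not match `EmpTel`, while a query scoped only with `schema:` would.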

 Goal 3:

  1. A user should be able to see all metadata of an entity in the results of a metadata search.
  2. A user should be able to see other relevant information about the entity.

 

 Goal 4:

 

  1. No known user stories at this point.

 

Design

Metadata Search and Storage (Goal 1 and 2)

Storage:

We are going to keep using the IndexedTable, which we already use today. In the new storage design we will have two kinds of rows:

  1. Value Row: This row is keyed by the entity id and the metadata key, and stores the metadata value in the value column.
  2. Index Row: This row is keyed by the entity id and the metadata key (as above) with the index term appended. The same index term is also stored in the index column, which is the column used for indexing.

 

Metadata Storage Format:

Key Column | Value Column
<VRPrefix><Entity-Id><Key> | Value
<VRPrefix><Entity-Id><Tags> | Tag1, Tag2, Tag3....
<VRPrefix><Entity-Id><Schema> | {Some Schema}

Index Storage Format:

Key Column | Index Column
<IRPrefix><Entity-Id><Key><Index> | Index
<IRPrefix><Entity-Id><Tags><Index> | Index
<IRPrefix><Entity-Id><Schema><Index> | Index

 

The table below shows how we plan to store the key-value, tags and schema examples discussed above. The Index Column contains every term that is indexed for search.

Key: Entity with key | Value Column: Value of Metadata (Not Indexed) | Index Column: Indexed value (Indexed)
<VRPrefix><Entity-Id><Codename> | Alpha Tango Charlie |
<VRPrefix><Entity-Id><Tags> | Tag1, Tag22 |
<VRPrefix><Entity-Id><Schema> | {EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}} |
<IRPrefix><Entity-Id><Codename><Codename: Alpha Tango Charlie> | | Codename: Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Codename: Alpha> | | Codename: Alpha
<IRPrefix><Entity-Id><Codename><Codename: Tango> | | Codename: Tango
<IRPrefix><Entity-Id><Codename><Codename: Charlie> | | Codename: Charlie
<IRPrefix><Entity-Id><Codename><Alpha Tango Charlie> | | Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Alpha> | | Alpha
<IRPrefix><Entity-Id><Codename><Tango> | | Tango
<IRPrefix><Entity-Id><Codename><Charlie> | | Charlie
<IRPrefix><Entity-Id><tags><tags: Tag1> | | tags: Tag1
<IRPrefix><Entity-Id><tags><tags: Tag22> | | tags: Tag22
<IRPrefix><Entity-Id><tags><Tag1> | | Tag1
<IRPrefix><Entity-Id><tags><Tag22> | | Tag22
<IRPrefix><Entity-Id><schema><schema: EmpName> | | schema: EmpName
<IRPrefix><Entity-Id><schema><schema: EmpContact> | | schema: EmpContact
<IRPrefix><Entity-Id><schema><schema: EmpTel> | | schema: EmpTel
<IRPrefix><Entity-Id><schema><schema: EmpAddr> | | schema: EmpAddr
<IRPrefix><Entity-Id><schema><EmpName> | | EmpName
<IRPrefix><Entity-Id><schema><EmpContact> | | EmpContact
<IRPrefix><Entity-Id><schema><EmpTel> | | EmpTel
<IRPrefix><Entity-Id><schema><EmpAddr> | | EmpAddr

We will use the IndexedTable as before, but the keys of rows which store values will now be prefixed with a special VRPrefix (ValueRowPrefix), and the value will be stored in the value column. The indexes will be stored in the same table, with keys prefixed with IRPrefix (IndexRowPrefix). For these rows the value column will be empty, and the index column will hold the index term which is indexed for search.
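
The row layout described above might be modelled like this. It is a plain in-memory sketch, not the IndexedTable API: the prefix values `"v"` and `"i"`, the tuple-shaped row keys, the entity id `"entity-1"` and the `build_rows` helper are all assumptions made for illustration.

```python
VR_PREFIX = "v"   # hypothetical stand-in for <VRPrefix>
IR_PREFIX = "i"   # hypothetical stand-in for <IRPrefix>


def build_rows(entity_id, key, value, index_terms):
    """Return one value row plus one index row per index term."""
    # Value row: the value column is filled, nothing is indexed.
    rows = [((VR_PREFIX, entity_id, key), {"value": value, "index": None})]
    for term in index_terms:
        # Index row: the term is part of the row key and also sits in the
        # index column; the value column stays empty.
        rows.append(((IR_PREFIX, entity_id, key, term), {"value": None, "index": term}))
    return rows


rows = build_rows(
    "entity-1",
    "tags",
    "Tag1, Tag22",
    ["tags: Tag1", "tags: Tag22", "Tag1", "Tag22"],
)
for row_key, columns in rows:
    print(row_key, columns)
```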

Another possibility was to store the real key-value pairs in a separate table and only the indexes in the IndexedTable. This would avoid the empty column values in each row, but it would lead to 6 tables in total (3 each for system and business metadata), so we have decided against it.

 

Search Result (Goal 3)

Metadata search will return Entities with the following details depending upon the type of the Entity.

...

Type

...

 

 

Design Decision: 

  • Each entity in the search results will also carry all the metadata for that entity.
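
A minimal sketch of this decision, assuming the value-row and index-row layout from the storage section: a query is matched against index rows, and every value row of each matching entity is then returned. The row shapes, the prefixes `"v"` and `"i"`, the entity ids and the `search` function are hypothetical.

```python
rows = [
    (("v", "stream1", "Codename"), {"value": "Alpha Tango Charlie"}),
    (("v", "stream1", "tags"), {"value": "Tag1, Tag22"}),
    (("i", "stream1", "Codename", "Alpha"), {"index": "Alpha"}),
    (("i", "stream1", "tags", "Tag1"), {"index": "Tag1"}),
    (("v", "dataset1", "tags"), {"value": "Tag3"}),
    (("i", "dataset1", "tags", "Tag3"), {"index": "Tag3"}),
]


def search(query, rows):
    """Return all metadata of every entity that has an index row matching the query."""
    hits = {key[1] for key, cols in rows if key[0] == "i" and cols["index"] == query}
    result = {}
    for key, cols in rows:
        if key[0] == "v" and key[1] in hits:
            # key[1] is the entity id, key[2] is the metadata key.
            result.setdefault(key[1], {})[key[2]] = cols["value"]
    return result


print(search("Alpha", rows))
```

Here a match on a single indexed word returns both the Codename and the tags of the matching entity, not only the entry that matched.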

Open Question: 

  • Please suggest other details which we can add to the search results for the different entity types.

System Metadata (Goal 4)

Here is the list of System Metadata which we are planning to emit from different entities. If you have suggestions for other information that would be useful as system metadata, please comment below.

Artifacts

  • Version

Applications

  • ArtifactId
  • Plugins
    • Plugin Type
    • Plugin Name
  • Schedule
  • Programs

Programs

  • Type: Flow, MapReduce etc
    • Workflow
      • Nodes under this workflow
  • Mode: Batch, Realtime

Datasets

  • Schema
  • RecordScannable/BatchWritable/RecordWritable/BatchReadable
  • Type: KVTable, FileSet etc
  • ttl

Streams

  • Schema
  • ttl

Views

  • Schema

Open Questions:

  • Please suggest other details which we can add to the different system metadata entries.
  • Nitin Motgi: Can we call "business metadata" "user metadata", and name the table which stores it the userMetadata table rather than business, to keep it consistent with other parts of the system such as metrics?