Goals:

  1. Improve Metadata Search: This requires a redesign of how we store metadata. The design is proposed below.
    • Fix the bug in Metadata Search for tags, where search works only for the first tag and not for the other tags in the list
    • Support tokenized search, where a user can search with any word from the value
  2. Schema Search:
    • CDAP Schema for Datasets, Streams and Views should be stored as metadata and be searchable through field names.
  3. Metadata Search Results:
    • CDAP-4274 (Cask Community Issue Tracker)
    • In addition to the above, we would also like to return some other relevant information. Please see the details below.

  4. Minor: We will also take Bhoosan's work on System Metadata to completion.

Checklist

  •  User stories documented (Rohit/Poorna)
  •  User stories reviewed (Nitin)
  •  Design documented (Rohit/Poorna)
  •  Design reviewed (Andreas)
  •  Feature merged (Rohit)
  •  Examples and guides (Rohit)
  •  Integration tests (Rohit) 
  •  Documentation for feature (Rohit)
  •  Blog post 

User Stories: 

 Goal 1 and 2: 

  1. A user should be able to search key-value metadata with any of the following, or with a prefix of it:
    • key-value 
    • key with part of value
    • value
    • Individual words in the value

      Example:
      A user stores key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity.
    • The user should be able to find this entity with the following queries:
      • key-value
        1. Codename: Alpha Tango Charlie
        2. Codename: Alpha Tang*
      • key with part of value
        1. Codename: Alpha
        2. Codename: Tango
        3. Codename: Charlie
        4. Codename: Alp*
      • value
        1. Alpha Tango Charlie
        2. Alpha*
        3. Alpha Tan*
          Note:
          1. We have decided not to support queries that contain only a multi-word part of the value, for example "Tango Charlie". A user can search with the whole value, with a prefix of it, or with a single word (we plan to tokenize on whitespace).
      • parts of value
        1. Alpha
        2. Tango
        3. Charlie
        4. Alph*
        5. Tan*
        6. Ch*
  2. A user should be able to search tags metadata with any of the following, or with a prefix of it:
    • tags key and a tag value
    • a tag value

      Example:
      A user tags an entity with the following tags: "Tag1, Tag22"
    • The user should be able to find this entity with the following queries:
      • tag key and a tag value:
        1. tags: Tag1
        2. tags: Tag*
      • a tag value
        1. tag22 
        2. tag2*
  3. A user should be able to search for entities (datasets, streams, views) through field names in their schema, with any of the following or with a prefix of it:
    1. fieldname
    2. fieldname scoped with schema - this should limit the search to just schema fields and not other metadata
    3. A user should also be able to search for all entities that have a schema

      Example:
      A dataset has the following schema: 

      Nested Schema:
      {
        "EmpName": "String",
        "EmpContact": {
          "EmpTel": "Integer",
          "EmpAddr": "String"
        }
      }

      User should be able to search for this dataset with the following queries:

      • fieldname:
        1. EmpName
        2. EmpContact
        3. EmpTel 
        4. EmpAddr
        5. Emp*
      • fieldname scoped to schema:
        1. schema: EmpName
        2. schema: EmpContact
        3. schema: EmpTel
        4. schema: EmpAddr 
        5. schema: Emp*
      • search for all entities with a schema
        1. schema:* (this returns this dataset and also all other entities that have a schema stored as their metadata)

      Note:
      • We don't plan to support schema searches by field type. If a query is not scoped with schema, it will by default search schema fields in addition to the normal key-value metadata and tags.
        Open questions:
        • What if an entity has multiple schemas (for example, a transform which has an input and an output schema)?
          • Maybe we can index its fields under input schema and output schema, and expect the user to specify whether they are looking for something in the input schema or the output schema.
        • What about entities which have more than one schema?
          • Maybe we can store them as input or output, each with an identifier.
        • How will a user search for a field name across input and output schemas?
          • One way is to index every field as plain schema in addition to indexing it as input or output schema, so that we can perform such queries.
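The search stories for Goals 1 and 2 can be sketched as a small set of index terms plus a prefix-aware matcher. This is only an illustrative sketch, not the proposed storage design: the function names (kv_terms, tag_terms, schema_terms, matches) are hypothetical, and lowercasing all terms is an assumption inferred from the example above where the tag "Tag22" is found with the query "tag22".

```python
def kv_terms(key, value):
    """Index terms for one key-value pair (hypothetical helper).

    Emits the whole value and each whitespace-separated word,
    both unscoped ("alpha") and scoped by key ("codename:alpha").
    """
    key = key.strip().lower()
    words = value.lower().split()          # tokenize on whitespace
    values = {" ".join(words), *words}     # whole value + single words
    return values | {f"{key}:{v}" for v in values}


def tag_terms(tags):
    """Index every tag in the list, not only the first one."""
    terms = set()
    for tag in tags:
        terms |= kv_terms("tags", tag)
    return terms


def schema_terms(schema):
    """Index field names of a (possibly nested) schema; field types are ignored."""
    terms = set()
    for name, field_type in schema.items():
        terms |= kv_terms("schema", name)
        if isinstance(field_type, dict):   # nested record: recurse into its fields
            terms |= schema_terms(field_type)
    return terms


def normalize(query):
    """Lowercase the query and drop whitespace around the optional 'key:' scope."""
    q = " ".join(query.lower().split())
    if ":" in q:
        key, _, value = q.partition(":")
        q = f"{key.strip()}:{value.strip()}"
    return q


def matches(query, terms):
    """Exact match, or prefix match when the query ends with '*'."""
    q = normalize(query)
    if q.endswith("*"):
        prefix = q[:-1]
        return any(term.startswith(prefix) for term in terms)
    return q in terms


terms = (
    kv_terms("Codename", "Alpha Tango Charlie")
    | tag_terms(["Tag1", "Tag22"])
    | schema_terms({"EmpName": "String",
                    "EmpContact": {"EmpTel": "Integer", "EmpAddr": "String"}})
)
for query in ["Codename: Alpha Tang*", "Tan*", "tag2*", "schema:*", "Tango Charlie"]:
    print(query, "->", matches(query, terms))
```

In this sketch every listed query matches except "Tango Charlie", which is neither the whole value, a prefix of it, nor a single word. The query schema:* matches any entity that has at least one schema field indexed.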

...

  1. A user should be able to see all metadata of an entity in the result of a metadata search
  2. A user should be able to see other relevant information about the entity.
    The table in the Search Result (Goal 3) section below shows the information which we will present to the user.
     


 


Design

Metadata Search and Storage (Goal 1 and 2)

Storage:

We will continue to use the IndexedTable, which we already use today. In the new storage design we will have two rows:

...

Another possibility was to store the real key-value in a separate table and the indexes in the IndexedTable. This would avoid the empty column values for a row, but it would lead to 6 tables in total (3 each for system and business metadata), so we have decided against it.

 

Search Result (Goal 3)

Metadata search will return Entities with the following details depending upon the type of the Entity.

...

Entity Type | Search Details | Note
Application | Type; Name; Metadata: Tags and Properties; App Description |
Program | Type; Name; Metadata: Tags and Properties; App it belongs to | If Type=Workflow then also show all programs under the workflow
Artifact | Type; Name |
Dataset | Type; Name |
Stream | Name; Type |
View | Name; Type; Stream Name |

 

 

Design Decision: 

  • In the search result for an entity we will also return all the metadata of that entity.
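As a rough illustration of the result shape described above, the sketch below models one search result carrying the entity's metadata plus type-specific details. The class and function names (Metadata, SearchResult, program_result) and the example entity names are hypothetical and are not part of the CDAP API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Metadata:
    """All metadata of an entity: tags and key-value properties."""
    tags: List[str] = field(default_factory=list)
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class SearchResult:
    """One entity in a search response (hypothetical shape)."""
    entity_type: str                 # Application, Program, Artifact, ...
    name: str
    metadata: Metadata               # returned in full, per the design decision
    details: Dict[str, object] = field(default_factory=dict)


def program_result(name: str, program_type: str, app: str, metadata: Metadata,
                   workflow_programs: Optional[List[str]] = None) -> SearchResult:
    """Build a Program result; a Workflow also lists the programs under it."""
    details: Dict[str, object] = {"type": program_type, "app": app}
    if program_type == "Workflow":
        details["programs"] = list(workflow_programs or [])
    return SearchResult("Program", name, metadata, details)


md = Metadata(tags=["Tag1"], properties={"Codename": "Alpha Tango Charlie"})
result = program_result("ExampleWorkflow", "Workflow", "ExampleApp", md,
                        workflow_programs=["ExampleMapReduce"])
print(result)
```

Keeping the type-specific fields in a separate details map is one possible choice here; it leaves room for the additional fields requested in the open question without changing the common part of the result.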

Open Question: 

  • Please suggest other information that we could add to the search results for the different entity types.

System Metadata (Goal 4)

Here is a list of the System Metadata which we are planning to emit from different entities. If you have any suggestions about what other information could be useful as system metadata, please comment below.

...