Goals:

  1. Improve Metadata Search: This requires a redesign of how we store metadata. The proposed design is below.
    • Make tag search work for every tag in the list
    • Support tokenized search, where the user can search with any word from the value
  2. Schema Search:
    • The CDAP schema for Datasets, Streams and Views should be stored as metadata and be searchable by fieldname, or by fieldname with fieldtype (only for primitive fieldtypes)
  3. Search filtering based on entity type.

...

  •  User stories documented (Rohit/Poorna)
  •  User stories reviewed (Nitin)
  •  Design documented (Rohit/Poorna)
  •  Design reviewed (Andreas)
  •  Feature merged (Rohit)
  •  Examples and guides (Rohit)
  •  Integration tests (Rohit) 
  •  Documentation for feature (Rohit)
  •  Blog post 

...

  1. Key Value Metadata Search
    1. User should be able to search with key-value or its prefix
    2. User should be able to search with key and an individual word in the value, or a prefix of that word
    3. User should be able to search with just the value or its prefix
    4. User should be able to search with individual words in the value
  2. Tag Metadata Search
    1. User should be able to search with tags key and a tag value or its prefix
    2. User should be able to search with just a tag value or its prefix.
  3. Schema Search:
    1. User should be able to search with fieldname or its prefix
    2. User should be able to search with fieldname or its prefix scoped just to schema
    3. User should be able to search with fieldname and fieldtype (only for primitive types)
  4. Search Filtering:
    1. User should be able to filter searches to a particular entity type, for example app, program or dataset
  5. Partial Searching:
    1. User should be able to see results for the individual words in the search query.

Design

Search Query Examples:

  1. User stores a key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity
    1. User can retrieve this entity with the following queries:
      • key-value
        1. Codename: Alpha Tango Charlie
        2. Codename: Alpha Tang*
      • key with part of value
        1. Codename: Alpha
        2. Codename: Tango
        3. Codename: Charlie
        4. Codename: Alp*
      • value
        1. Alpha Tango Charlie
        2. Alpha*
        3. Alpha Tan*
          Note:
          1. We have decided not to support queries that contain only part of the value, for example "Tango Charlie". You can search with the whole value, with a prefix of it, or with a single word (we plan to tokenize on whitespace).
      • individual words in value, or their prefixes
        1. Alpha
        2. Tango
        3. Charlie
        4. Alph*
        5. Tan*
        6. Ch*
    2. Not supported:
      1. A prefix search on the key alone (key*), i.e. Codename*
  2. User tags an entity with the following tags "Tag1, Tag22"
    • User can retrieve this entity with the following queries:
      • tag key and a tag value:
        1. tags: Tag1
        2. tags: Tag*
      • a tag value
        1. tag22 
        2. tag2*
  3. A dataset has the following schema: 

    Nested Schema:
    {
      "EmpName": "String",
      "EmpContact": {
        "EmpTel": "Integer",
        "EmpAddr": "String"
      }
    }

    User can retrieve this dataset entity with the following queries:

    • fieldname:
      1. EmpName
      2. EmpContact
      3. EmpTel 
      4. EmpAddr
      5. Emp*
    • fieldname scoped to schema:
      1. schema: EmpName
      2. schema: EmpContact
      3. schema: EmpTel
      4. schema: EmpAddr 
      5. schema: Emp*
    • fieldname with fieldtype (only for primitive types)
      1. EmpName:String (only for java primitive types)

    Note:
    1. We don't plan to support schema searches with complex fieldtypes. If a query is not scoped to schema, it will by default search schema fields in addition to the normal key-value metadata and tags.
      Open questions:
      • What if an entity has multiple schemas (e.g. a transform that has an input and an output schema)?
        • We will index both schemas (after discussion with Nitin)
      • How will a user search for a fieldname across input and output schemas?
        • We do not support searches limited to the input schema, the output schema, or any single schema (after discussion with Nitin)
  4. Search Filtering:
    1. User wants to search only for 'dataset'
      1. dataset: Codename: Alpha
      2. dataset: tags: Tag1
      3. dataset: schema: EmpName
        Note: if no entity type is specified, we will return all matched entities.
  5. Partial Searching:
    1. User searches for "California USA": the search query is split on whitespace and each word is also searched on its own (the matches are combined with OR)
      Search results will contain, in order:
      1. All entities tagged with "California USA", followed by
      2. All entities tagged with "California", followed by
      3. All entities tagged with "USA"
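
The query examples above can be approximated with a small in-memory sketch. It is not the CDAP implementation: the function names (key_value_terms, tag_terms, schema_terms, parse, matches, search) are hypothetical, Python sets stand in for index rows, and lowercasing is an assumption drawn from the tag example, where "tag22" retrieves "Tag22". Entity-type filtering is left out, and only unscoped queries are split into words for partial searching, which is also an assumption.

```python
def key_value_terms(key, value):
    """Index the whole value and each whitespace token, unscoped and scoped by key."""
    key, value = key.strip().lower(), value.strip().lower()
    terms = set()
    for text in [value] + value.split():
        terms.add((None, text))  # matches "Alpha", "Alpha Tan*"
        terms.add((key, text))   # matches "Codename: Alpha"
    return terms


def tag_terms(tags):
    """Index each tag as a value stored under the 'tags' key."""
    terms = set()
    for tag in tags:
        terms |= key_value_terms("tags", tag)
    return terms


def schema_terms(schema):
    """Index field names (unscoped and under 'schema') and name:type for primitive fields."""
    terms = set()
    for name, field_type in schema.items():
        field = name.lower()
        terms.add((None, field))
        terms.add(("schema", field))
        if isinstance(field_type, dict):
            # Complex (nested) type: index the nested fields, but no name:type term.
            terms |= schema_terms(field_type)
        else:
            terms.add((field, field_type.lower()))
    return terms


def parse(query):
    """Split 'scope: text' at the first colon; unscoped queries get scope None."""
    scope, sep, text = query.partition(":")
    if sep:
        return scope.strip().lower(), text.strip().lower()
    return None, query.strip().lower()


def matches(query, terms):
    """Exact match, or prefix match when the query text ends with '*'."""
    scope, text = parse(query)
    if text.endswith("*"):
        prefix = text[:-1]
        return any(s == scope and t.startswith(prefix) for s, t in terms)
    return (scope, text) in terms


def search(query, index):
    """Whole-query matches first, then matches for each word (unscoped queries only)."""
    queries = [query] + (query.split() if ":" not in query else [])
    results = []
    for q in queries:
        for entity, terms in index.items():
            if matches(q, terms) and entity not in results:
                results.append(entity)
    return results
```

Because unscoped and key-scoped terms are kept apart, a query such as Codename* finds nothing, which mirrors the "not supported" case above, while Codename: Alp* and Tan* both match.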

Storage:

We will continue to use the IndexedTable that we use today. In the new storage design we will have two rows:

...

  1. Metadata Search Results:

    • CDAP-4274 - Metadata search should returns the metadata of matching entities (Open)
    • Also return some other relevant info. Please see details below.

    Search Result 

    Metadata search will return entities with the following details, depending on the type of the entity. The search results will be ordered by entity creation time, descending.

    Entity Type: Search Details
    • Application
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • App Description
      • Entity creation time
    • Program
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • App it belongs to
      • Entity creation time
    • Artifact
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • Dataset
      • Type
      • Name
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • Stream
      • Name
      • Type
      • Matched Metadata (Snippet) with all system metadata
      • Entity creation time
    • View
      • Name
      • Type
      • Matched Metadata (Snippet) with all system metadata
      • Stream Name
      • Entity creation time
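
As an illustration only, a search result carrying these details and the descending creation-time ordering could look like the sketch below. The SearchResult class, its field names and the epoch-millisecond timestamp are all assumptions made for the example, not the actual CDAP response format.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    entity_type: str        # e.g. "application", "program", "dataset" (assumed values)
    name: str
    matched_metadata: dict  # matched snippet plus all system metadata
    created: int            # entity creation time, assumed to be epoch milliseconds
    details: dict = field(default_factory=dict)  # type-specific extras, e.g. app description


def order_results(results):
    """Return results with the most recently created entity first."""
    return sorted(results, key=lambda r: r.created, reverse=True)
```

Type-specific fields such as "App it belongs to" or "Stream Name" are kept in a generic details mapping here so that one shape can serve every entity type.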

    Design Decision: 

      • In each entity's search result we will return the matched metadata together with all the system metadata for that entity.

    Open Question: 

      • Please suggest anything else we could add to the search results for the different entity types.
  2. Emit more metadata from system entities:

...

  1. Treat a query that is just * as invalid
  2. Support pagination of search results in the backend
  3. Use entity creation time for ordering of search results
  4. Support searches with stemming (workflow/workflows): Porter stemming
  5. Support the and (&) operation: example search query - app:appname & program