Goals:

Improve Metadata Search: This requires redesign of how we store metadata. Design proposed below.

...

- Make search for tags work for all the tags in the list
- Support tokenized search where user can search with any word from the value
Schema Search:
- CDAP Schema for Datasets, Streams and Views should be stored as metadata and be searchable through fieldname, or fieldname with fieldtype (only for primitive fieldtypes)
Search filtering based on entity type.

Checklist

User stories documented (Rohit/Poorna)
User stories reviewed (Nitin)
Design documented (Rohit/Poorna)
Design reviewed (Andreas)
Feature merged (Rohit)
Examples and guides (Rohit)
Integration tests (Rohit)
Documentation for feature (Rohit)
Blog post

User Stories:

...

Key Value Metadata Search
1. User should be able to search with key-value or its prefix
2. User should be able to search with key and individual word in value or its prefix
3. User should be able to search with just value or its prefix
4. User should be able to search with individual words in the value

Tag Metadata Search
1. User should be able to search with tags key and a tag value or its prefix
2. User should be able to search with just a tag value or its prefix.
Schema Search:
1. User should be able to search with fieldname or its prefix
2. User should be able to search with fieldname or its prefix scoped just to schema
3. User should be able to search with fieldname and fieldtype (only for primitive types)
Search Filtering:
1. User should be able to filter searches to a particular entity type, for example app, program or dataset
Partial Searching:
1. User should be able to see results for individual words in the search query.

Design

Search Query Examples:

User stores key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity.

User can retrieve this entity with the following queries:
- key-value
  1. Codename: Alpha Tango Charlie
  2. Codename: Alpha Tang*
- key with part of value
  1. Codename: Alpha
  2. Codename: Tango
  3. Codename: Charlie
  4. Codename: Alp*
- value
  1. Alpha Tango Charlie
  2. Alpha*
  3. Alpha Tan*

  Note: We have decided not to support queries that contain only part of a value, for example "Tango Charlie". You can search with the whole value, with a prefix, or with a single word (we plan to tokenize on whitespace).

- Individual word in value
  1. Alpha
  2. Tango
  3. Charlie
  4. Alph*
  5. Tan*
  6. Ch*

Not supported:
1. A wildcard on the key alone: key*, i.e. Codename*
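
As a rough illustration of the key-value query examples above, the following Python sketch generates the strings an entity would be searchable under and checks a query against them. The names `index_entries` and `matches` are hypothetical and not part of CDAP. The sketch assumes whitespace tokenization, as planned in the note above, and treats a trailing `*` as a prefix wildcard.

```python
def index_entries(key, value):
    """Return the strings a key-value pair would be searchable under (sketch)."""
    words = value.split()  # tokenize the value on whitespace
    entries = [f"{key}: {value}", value]        # whole key-value, whole value
    entries += [f"{key}: {w}" for w in words]   # key with a single word
    entries += words                            # single words on their own
    return list(dict.fromkeys(entries))         # drop duplicates, keep order


def matches(query, entries):
    """Exact match, or prefix match when the query ends with '*'."""
    if query.endswith("*"):
        prefix = query[:-1]
        return any(entry.startswith(prefix) for entry in entries)
    return query in entries


entries = index_entries("Codename", "Alpha Tango Charlie")
print(matches("Codename: Alpha Tang*", entries))  # True
print(matches("Tango Charlie", entries))          # False
```

This naive prefix check would also accept `Codename*`, which is listed above as not supported, so a real implementation would have to reject key-only wildcards separately.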

User tags an entity with the following tags: "Tag1, Tag22"
- User can retrieve this entity with the following queries:
  - tag key and a tag value:
    1. tags: Tag1
    2. tags: Tag*
  - a tag value
    1. tag22
    2. tag2*

======================================================================================================================

Use Cases:

Key-Value Metadata: Codename: Alpha Tango Charlie
Use case: User should be able to search with:
Whole Key-Value (complete or prefix):
1. Codename: Alpha Tango Charlie
2. Codename: Alpha Tang*
Key with Part of Value (complete or prefix):
1. Codename: Alpha
2. Codename: Tango
3. Codename: Charlie
4. Codename: Alp*
Whole Value (complete or prefix):
1. Alpha Tango Charlie
2. Alpha*
3. Alpha Tan*
  Design Decision:
  1. We have decided not to support queries that contain only part of a value, for example "Tango Charlie". You can search with the whole value, with a prefix, or with a single word (we plan to tokenize on whitespace).

Parts of value (complete or prefix):
1. Alpha
2. Tango
3. Charlie
4. Alph*
5. Tan*
6. Ch*

Tags Metadata: tags: Tag1, Tag22

Use case: User should be able to search with:

With tags key and a tag value (complete or prefix):
1. tags: Tag1
2. tags: Tag*

With tag value (complete or prefix):
1. tag22
2. tag2*

Schema Metadata: This is just key-value where the key is schema and the value is the schema fields, but it needs special indexing to support searches with fieldName. (Note: we don't plan to support schema searches with complex fieldTypes.)

A dataset has the following schema:

Nested Schema:

{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}

User can retrieve this dataset entity with the following queries:

fieldname:
1. EmpName
2. EmpContact
3. EmpTel
4. EmpAddr
5. Emp*
fieldname scoped to schema:
1. schema: EmpName
2. schema: EmpContact
3. schema: EmpTel
4. schema: EmpAddr
5. schema: Emp*

Searching for everything which has a schema:
1. schema:*

fieldname with fieldtype (only for primitive types):
1. EmpName:String

Note: We don't plan to support schema searches with complex fieldTypes. If a query is not scoped with schema, by default it will search schema fields in addition to the normal key-value and tags metadata.
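
To make the schema queries above concrete, here is a minimal Python sketch that walks the nested schema example and collects the strings a dataset could be indexed under. It assumes the simplified `{fieldName: fieldType}` form used in the example rather than a real CDAP schema object. The function name `schema_index_entries` and the `PRIMITIVE_TYPES` set are hypothetical.

```python
PRIMITIVE_TYPES = {"String", "Integer"}  # assumption: only the types in the example


def schema_index_entries(schema):
    """Walk a nested {fieldName: fieldType} dict and collect index strings."""
    entries = []
    for name, field_type in schema.items():
        entries.append(name)                # plain fieldname
        entries.append(f"schema: {name}")   # fieldname scoped to schema
        if isinstance(field_type, dict):
            # complex type: index the nested fields, but not the type itself
            entries.extend(schema_index_entries(field_type))
        elif field_type in PRIMITIVE_TYPES:
            entries.append(f"{name}:{field_type}")  # fieldname with fieldtype
    return entries


schema = {
    "EmpName": "String",
    "EmpContact": {"EmpTel": "Integer", "EmpAddr": "String"},
}
for entry in schema_index_entries(schema):
    print(entry)
```

Nested fields such as `EmpTel` are indexed at the top level, so a query does not need to know where in the schema a field sits.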
Open questions:
- What if an entity has multiple schemas (ex: a transform which has input and output schema)? We were thinking of storing them as input and output with an identifier, indexing the fields under each, and expecting the user to specify whether they are searching the input schema or the output schema.
  - We will index both schemas (after discussion with Nitin).
- How will a user search for a fieldName across input and output schema? One way is, besides indexing the fields as input and output schema, to also index every field as just schema so that such queries can be performed.
  - We do not support searches limited to input/output or just one schema (after discussion with Nitin).
Search Filtering:
1. User wants to search only for 'dataset'
  1. dataset: Codename: Alpha
  2. dataset: tags: Tag1
  3. dataset: schema: EmpName
    Note: if no entity type is specified we will return all matched entities.
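
The filtering examples above could be handled along these lines. This is only a sketch: the set of entity type names and the functions `split_entity_filter` and `post_filter` are assumptions for illustration. It applies the entity-type filter after the index lookup has produced its matches.

```python
ENTITY_TYPES = {"app", "program", "artifact", "dataset", "stream", "view"}  # assumed names


def split_entity_filter(query):
    """Split 'dataset: Codename: Alpha' into ('dataset', 'Codename: Alpha')."""
    head, sep, rest = query.partition(":")
    if sep and head.strip().lower() in ENTITY_TYPES:
        return head.strip().lower(), rest.strip()
    return None, query.strip()  # no entity type: all matches are returned


def post_filter(results, entity_type):
    """results is a list of (entity_type, entity_id) pairs from the index lookup."""
    if entity_type is None:
        return results
    return [r for r in results if r[0] == entity_type]


entity_type, rest = split_entity_filter("dataset: tags: Tag1")
hits = [("dataset", "purchases"), ("app", "PurchaseApp")]  # made-up lookup result
print(rest, post_filter(hits, entity_type))
```

One consequence of this syntax is that a metadata key named like an entity type (for example a key called `dataset`) would be read as a filter, so the real parser would need a rule for that case.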
Partial Searching:
1. User searches for "California USA": every search query is split on whitespace and each single word is searched for as well (combined with OR)
  Search result will contain:
  1. All entities tagged with "California USA" followed by
  2. All entities tagged with "California" followed by
  3. All entities tagged with "USA"
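
The ordering described above can be sketched as follows. The `index` argument is a plain dict standing in for the real index lookup, and reporting each entity only once is an assumption of this sketch, not something the design states.

```python
def partial_search(query, index):
    """index maps an indexed string to the list of entity ids stored under it."""
    terms = [query] + [word for word in query.split() if word != query]
    seen, ordered = set(), []
    for term in terms:                      # full query first, then each word
        for entity_id in index.get(term, []):
            if entity_id not in seen:       # assumption: report each entity once
                seen.add(entity_id)
                ordered.append(entity_id)
    return ordered


index = {
    "California USA": ["e1"],
    "California": ["e1", "e2"],
    "USA": ["e3"],
}
print(partial_search("California USA", index))  # ['e1', 'e2', 'e3']
```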

Storage:

We are going to use the IndexedTable, which we are already using. In the new storage design we will have two kinds of rows:

Value Row: This row will store the entity id with key in the key column and the value in the value column.
Index Row: This row will store the entity id with key (like above) with the index appended, and the same index is also stored in the index column. The index column will be used for indexing.

Metadata Storage Format:

Key Column	Value Column
<VRPrefix><Entity-Id><Key>	Value
<VRPrefix><Entity-Id><Tags>	Tag1, Tag2, Tag3....
<VRPrefix><Entity-Id><Schema>	{Some Schema}

Index Storage Format:

Key Column	Index Column
<IRPrefix><Entity-Id><Key><Index>	Index
<IRPrefix><Entity-Id><Tags><Index>	Index
<IRPrefix><Entity-Id><Schema><Index>	Index

Sample

The table below stores the above metadata and indexes together. It represents the key-value, tags and schema examples discussed above to show how we plan to store the data. The Index Column contains all the possibilities of search queries.

Key: Entity with key	Value Column: Value of Metadata (Not Indexed)	Index Column: Indexed value (Indexed)
<VRPrefix><Entity-Id><Codename>	Alpha Tango Charlie
<VRPrefix><Entity-Id><Tags>	Tag1, Tag22
<VRPrefix><Entity-Id><Schema>	{EmpName: String, EmpContact: {EmpTel: Integer, EmpAddr: String}}
<IRPrefix><Entity-Id><Codename><Codename: Alpha Tango Charlie>		Codename: Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Codename: Alpha>		Codename: Alpha
<IRPrefix><Entity-Id><Codename><Codename: Tango>		Codename: Tango
<IRPrefix><Entity-Id><Codename><Codename: Charlie>		Codename: Charlie
<IRPrefix><Entity-Id><Codename><Alpha Tango Charlie>		Alpha Tango Charlie
<IRPrefix><Entity-Id><Codename><Alpha>		Alpha
<IRPrefix><Entity-Id><Codename><Tango>		Tango
<IRPrefix><Entity-Id><Codename><Charlie>		Charlie
<IRPrefix><Entity-Id><tags><tags: Tag1>		tags: Tag1
<IRPrefix><Entity-Id><tags><tags: Tag22>		tags: Tag22
<IRPrefix><Entity-Id><tags><Tag1>		Tag1
<IRPrefix><Entity-Id><tags><Tag22>		Tag22
<IRPrefix><Entity-Id><schema><schema: EmpName>		schema: EmpName
<IRPrefix><Entity-Id><schema><schema: EmpContact>		schema: EmpContact
<IRPrefix><Entity-Id><schema><schema: EmpTel>		schema: EmpTel
<IRPrefix><Entity-Id><schema><schema: EmpAddr>		schema: EmpAddr
<IRPrefix><Entity-Id><schema><EmpName>		EmpName
<IRPrefix><Entity-Id><schema><EmpContact>		EmpContact
<IRPrefix><Entity-Id><schema><EmpTel>		EmpTel
<IRPrefix><Entity-Id><schema><EmpAddr>		EmpAddr

We will be using the IndexedTable like before, but now the keys of rows which store values will be prefixed with a special VRPrefix (ValueRowPrefix) and we will store the value in the value column. The indexes will also be stored in the same table, with keys prefixed with IRPrefix (IndexRowPrefix). The value column for such rows will be empty, and the index column will hold the index value which will be indexed for search.

Another possibility was to store the real key-value in a separate table and the indexes in the IndexedTable, which would avoid the empty column values for a row. This would lead to 6 tables in total (3 each for system and business), hence we have decided against it.
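
The single-table layout described above can be sketched in Python as follows. Row keys are modelled as tuples and the prefix values `"v"` and `"i"` are placeholders. The actual byte encoding of keys in the IndexedTable is not specified here, so everything in this block is illustrative only.

```python
VR_PREFIX = "v"  # hypothetical value-row prefix
IR_PREFIX = "i"  # hypothetical index-row prefix


def build_rows(entity_id, key, value):
    """Return one value row plus one index row per searchable string."""
    words = value.split()
    indexes = [f"{key}: {value}", value]
    indexes += [f"{key}: {w}" for w in words] + words
    indexes = list(dict.fromkeys(indexes))  # de-duplicate, keep order

    rows = [{"key": (VR_PREFIX, entity_id, key), "value": value, "index": None}]
    for index in indexes:
        # index rows leave the value column empty and fill the index column
        rows.append({"key": (IR_PREFIX, entity_id, key, index),
                     "value": None, "index": index})
    return rows


for row in build_rows("dataset.employees", "Codename", "Alpha Tango Charlie"):
    print(row)
```

For the Codename example this yields one value row and eight index rows, matching the Codename rows of the sample table.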

Search

...

Filtering: We will perform post filtering if the query is limited to an entity type.

In addition to above goals we also plan to do the following:

Metadata Search Results:
- CDAP-4274 - Metadata search should return the metadata of matching entities (Open)
- Also return some other relevant info. Please see details below.
Search Result
Metadata search will return entities with the following details, depending upon the type of the entity. The search results will be ordered descending by entity creation time.
Entity Type	Search Details
Application	Type
	Name
	Matched Metadata (Snippet) with all system metadata
	App Description
	Entity creation time
Program	Type
	Name
	Matched Metadata (Snippet) with all system metadata
	App it belongs to
	Entity creation time
Artifact	Type
	Name
	Matched Metadata (Snippet) with all system metadata
	Entity creation time
Dataset	Type
	Name
	Matched Metadata (Snippet) with all system metadata
	Entity creation time
Stream	Name
	Type
	Matched Metadata (Snippet) with all system metadata
	Entity creation time
View	Name
	Type
	Matched Metadata (Snippet) with all system metadata
	Stream Name
	Entity creation time
Design Decision:
- In the search result for an entity we will return the matched metadata along with all the system metadata for that entity.
Open Question:
- Please suggest other things which we can add to the search results for the different entity types.

...

Emit more metadata from system entities:

Here is a list of System Metadata which we are planning to emit from different entities. If you have any suggestions as to what other info can be useful as system metadata, please comment below.

Artifacts

- Artifact name
- Version

Applications

- Application name
- ArtifactId
- Plugins
  - Plugin Type
  - Plugin Name
- Schedule
- Programs

Programs

- Program name
- Type: Flow, MapReduce etc
  - Workflow
  - Nodes under this workflow
- Mode: Batch, Realtime

Datasets

- Dataset name
- Schema
- RecordScannable/BatchWritable/RecordWritable/BatchReadable
- Type: KVTable, FileSet etc
- ttl

Streams

- Stream name
- Schema
- ttl

Views

- View name
- Schema

Open Questions:

- Please suggest other things which we can add to different system metadata entries
- Nitin Motgi: Can we call "business metadata" "user metadata" and also the table which stores it userMetadata table rather than business to keep it consistent with other stuff like metrics etc.

Additional Requirement and Notes:

Treat a query that is just * as invalid
Support pagination of search results in the backend
Use entity creation time for ordering of search results
Support searches with stemming (workflow/workflows): Porter Stemming
Support and (&) operation: Example search query - app:appname & program
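
A minimal sketch of the first three requirements, assuming results arrive as `(entity_id, creation_time)` pairs. The function name `page_of_results` and the `offset`/`limit` parameters are hypothetical, since the pagination API is not defined in this document.

```python
def page_of_results(query, results, offset=0, limit=10):
    """results is a list of (entity_id, creation_time) pairs."""
    if query.strip() == "*":
        raise ValueError("a query consisting only of '*' is not allowed")
    # newest entities first, as required for ordering of search results
    ordered = sorted(results, key=lambda r: r[1], reverse=True)
    return ordered[offset:offset + limit]


hits = [("a", 1), ("b", 3), ("c", 2)]  # made-up creation times
print(page_of_results("Alpha", hits, offset=0, limit=2))
```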
