...
- Improve Metadata Search: This requires a redesign of how we store metadata. Design proposed below.
- Fix the bug in Metadata Search where search for tags works only for the first entry and not for the other tags in the list
- Support tokenized search where a user can search with any word from the value
- Schema Search:
- CDAP Schema for Datasets, Streams and Views should be stored as metadata and searchable through fieldname
Checklist
- User stories documented (Rohit/Poorna)
- User stories reviewed (Nitin)
- Design documented (Rohit/Poorna)
- Design reviewed (Andreas)
- Feature merged (Rohit)
- Examples and guides (Rohit)
- Integration tests (Rohit)
- Documentation for feature (Rohit)
- Blog post
User Stories:
- Key Value Metadata Search
- User should be able to search with key-value or its prefix
- User should be able to search with key and part of value or its prefix
- User should be able to search with just value or its prefix
- User should be able to search with individual words in the value
- Tag Metadata Search
- User should be able to search with tags key and a tag value or its prefix
- User should be able to search with just a tag value or its prefix.
- Schema Search:
- User should be able to search with fieldname or its prefix
- User should be able to search with fieldname or its prefix scoped just to schema
- User should be able to search for all entities with a schema
Design
Search Query Examples:
- User stores a key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity
- User can retrieve this entity with the following queries:
- key-value
- Codename: Alpha Tango Charlie
- Codename: Alpha Tang*
- key with part of value
- Codename: Alpha
- Codename: Tango
- Codename: Charlie
- Codename: Alp*
- value
- Alpha Tango Charlie
- Alpha*
- Alpha Tan*
Note: We have decided not to support queries that contain only a multi-word fragment of the value, for example "Tango Charlie". You can search with the whole value, with a prefix, or with a single word (we plan to tokenize on whitespace).
- parts of value
- Alpha
- Tango
- Charlie
- Alph*
- Tan*
- Ch*
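The key-value queries above can be illustrated with a small sketch. This is not the CDAP implementation: the function names `index_terms`, `normalize_query` and `matches` are hypothetical, the terms are kept in an in-memory set rather than an IndexedTable, and case-insensitive matching is an assumption. It only shows that indexing the whole value plus each whitespace-separated word, both with and without the key, is enough to answer the listed queries while leaving "Tango Charlie" unsupported.

```python
def index_terms(key, value):
    """Build the set of index terms for one key-value pair (hypothetical helper)."""
    key = key.strip().lower()
    value = " ".join(value.split()).lower()  # collapse whitespace, lowercase (assumption)
    terms = {value, f"{key}:{value}"}        # whole value, with and without the key
    for word in value.split(" "):            # tokenize on whitespace
        terms.add(word)
        terms.add(f"{key}:{word}")
    return terms


def normalize_query(query):
    """Lowercase the query and drop whitespace around the key separator."""
    query = " ".join(query.split()).lower()
    if ":" in query:
        key, _, rest = query.partition(":")
        query = f"{key.strip()}:{rest.strip()}"
    return query


def matches(query, terms):
    """A query ending in '*' is a prefix match; anything else is an exact match."""
    q = normalize_query(query)
    if q.endswith("*"):
        prefix = q[:-1]
        return any(term.startswith(prefix) for term in terms)
    return q in terms


terms = index_terms("Codename", "Alpha Tango Charlie")
supported = ["Codename: Alpha Tang*", "Codename: Tango", "Alpha Tan*", "Ch*"]
results = [matches(q, terms) for q in supported]
unsupported = matches("Tango Charlie", terms)
```

In this sketch every query in `supported` matches, while `unsupported` is false because "tango charlie" is neither the whole value, a prefix of an indexed term, nor a single word.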
Example:
User tags an entity with the following tags: "Tag1, Tag22". User can retrieve this entity with the following queries:
- tag key and a tag value:
- tags: Tag1
- tags: Tag*
- a tag value
- tag22
- tag2*
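A minimal sketch of the tag queries above, under the same caveats: `tag_index_terms` and `tag_matches` are hypothetical names, the index is an in-memory set, and lowercasing is an assumption suggested by the query `tag22` matching the tag "Tag22". The point it illustrates is that every tag in the list gets its own index terms, not only the first one.

```python
def tag_index_terms(tags):
    """Index every tag in a comma-separated list, with and without the 'tags' key."""
    terms = set()
    for tag in tags.split(","):
        tag = tag.strip().lower()  # case-insensitive matching is an assumption
        if tag:
            terms.add(tag)
            terms.add(f"tags:{tag}")
    return terms


def tag_matches(query, terms):
    """Exact match, or prefix match when the query ends in '*'."""
    key, sep, rest = query.partition(":")
    q = f"{key.strip()}:{rest.strip()}" if sep else query.strip()
    q = q.lower()
    if q.endswith("*"):
        return any(term.startswith(q[:-1]) for term in terms)
    return q in terms


terms = tag_index_terms("Tag1, Tag22")
second_tag_found = tag_matches("tags: Tag22", terms)
```

Here `second_tag_found` is true, which is the behaviour the bug fix listed at the top of the page asks for.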
- User should be able to search for entities (datasets, streams, views) through field names in the schema, or their prefixes, in the following ways:
- fieldname
- fieldname scoped with schema: this should limit the search to just schema fields and not other metadata
- User should be able to search for all entities which have a schema
- fieldname:
- EmpName
- EmpContact
- EmpTel
- EmpAddr
- Emp*
- fieldname scoped to schema:
- schema: EmpName
- schema: EmpContact
- schema: EmpTel
- schema: EmpAddr
- schema: Emp*
- search for all entities with a schema
- schema:* This will return this dataset entity and also all the other entities which have schema stored as their metadata
- We don't plan to support schema searches with fieldType. If a user searches with a query which is not scoped with schema, the search will by default cover schema fields in addition to the normal key-value metadata and tags.
Open questions:
- What if an entity has multiple schemas (for example, a transform which has an input and an output schema)?
- Maybe we can index its fields under the input and the output schema, and expect the user to specify whether they are looking for something in the input schema or the output schema.
- What about entities which have more than one schema?
- Maybe we can store them as either input or output, each with an identifier.
- How will a user search for a fieldName across input and output schemas?
- One way is, besides indexing the fields under the input and output schema, to also index every field under just schema so that we can perform such queries.
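The last suggestion can be sketched as follows. Nothing here is decided: the scope names `input-schema` and `output-schema`, the function `multi_schema_index_terms`, and the sample field `FullAddr` are all hypothetical and serve only to illustrate indexing each field under its specific schema and also under the plain `schema` scope.

```python
def multi_schema_index_terms(schemas):
    """schemas maps a role such as 'input' or 'output' to a list of field names."""
    terms = set()
    for role, fields in schemas.items():
        for field in fields:
            field = field.lower()
            terms.add(field)                             # unscoped search
            terms.add(f"schema:{field}")                 # any schema of the entity
            terms.add(f"{role.lower()}-schema:{field}")  # one specific schema
    return terms


terms = multi_schema_index_terms({
    "input": ["EmpName", "EmpTel"],
    "output": ["EmpName", "FullAddr"],
})
in_any_schema = "schema:emptel" in terms
in_output_only = "output-schema:emptel" in terms
```

With these sample fields, `in_any_schema` is true and `in_output_only` is false, so a user could search across both schemas or narrow the search to one of them. The cost is up to three index entries per field.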
Example:
A dataset has the following schema:
{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}
User can retrieve this dataset entity with the fieldname, schema-scoped, and schema:* queries listed above.
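As a rough illustration of how the schema above could be turned into searchable terms, the sketch below walks the nested structure and indexes each field name both unscoped and under a `schema` scope. The helper names are hypothetical, the simplified JSON form of the schema is taken from the example as written, and field types are deliberately not indexed because fieldType search is not planned.

```python
import json

SCHEMA_JSON = """
{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}
"""


def field_names(schema):
    """Yield every field name, descending into nested records."""
    for name, field_type in schema.items():
        yield name
        if isinstance(field_type, dict):
            yield from field_names(field_type)


def schema_index_terms(schema):
    """Index each field name unscoped and scoped to 'schema'; types are skipped."""
    terms = set()
    for name in field_names(schema):
        name = name.lower()  # case-insensitive matching is an assumption
        terms.add(name)
        terms.add(f"schema:{name}")
    return terms


def schema_matches(query, terms):
    """Exact match, or prefix match when the query ends in '*'."""
    key, sep, rest = query.partition(":")
    q = f"{key.strip()}:{rest.strip()}" if sep else query.strip()
    q = q.lower()
    if q.endswith("*"):
        return any(term.startswith(q[:-1]) for term in terms)
    return q in terms


terms = schema_index_terms(json.loads(SCHEMA_JSON))
has_schema = schema_matches("schema:*", terms)
type_search = schema_matches("Integer", terms)
```

In this sketch `schema:*` matches any entity that has at least one indexed schema field, so `has_schema` is true, and `type_search` is false because only field names are indexed.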
...
Storage:
We are going to continue using the IndexedTable that we use currently. In the new storage design we will have two rows:
...