Goals:
- Improve Metadata Search: This requires a redesign of how we store metadata. Design proposed below.
- Fix the bug in Metadata Search where a search for tags works only for the first entry and not for the other tags in the list. Tag search should work for all tags.
- Support tokenized search where a user can search with any word from the value
- Schema Search:
- CDAP Schema for Datasets, Streams and Views should be stored as metadata and searchable through fieldnames.
- Metadata Search Results: CDAP-4274 (Cask Community Issue Tracker). In addition to the above, we would also like to return some other relevant info. Please see details below.
- Minor: We will also take Bhooshan's work on System Metadata to completion.
- Work done by Bhooshan Mogal:
- Things to do:
- Emit more metadata from system entities
- Merge: https://github.com/caskdata/cdap/pull/4683
- Schema search by fieldname, or by fieldname with fieldtype (only for primitive fieldtype)
- Search filtering based on entity type.
Checklist
- User stories documented (Rohit/Poorna)
- User stories reviewed (Nitin)
- Design documented (Rohit/Poorna)
- Design reviewed (Andreas)
- Feature merged (Rohit)
- Examples and guides (Rohit)
- Integration tests (Rohit)
- Documentation for feature (Rohit)
- Blog post
User Stories:
...
- Key Value Metadata Search
- User should be able to search with key-value or its prefix
- User should be able to search with key and individual word in value or its prefix
- User should be able to search with just value or its prefix
- User should be able to search with individual words in the value
- Tag Metadata Search
- User should be able to search with tags key and a tag value or its prefix
- User should be able to search with just a tag value or its prefix.
- Schema Search:
- User should be able to search with fieldname or its prefix
- User should be able to search with fieldname or its prefix scoped just to schema
- User should be able to search with fieldname and fieldtype (only for primitive types)
- Search Filtering:
- User should be able to filter searches to a particular entity type, for example app, program, dataset
- Partial Searching:
- User should be able to see results for individual words in the search query.
Design
Search Query Examples:
- User stores a key-value metadata with key = "Codename" and value = "Alpha Tango Charlie" for an entity
- User can retrieve this entity with the following queries:
- key-value
- Codename: Alpha Tango Charlie
- Codename: Alpha Tang*
- key with part of value
- Codename: Alpha
- Codename: Tango
- Codename: Charlie
- Codename: Alp*
- value
- Alpha Tango Charlie
- Alpha*
- Alpha Tan*
Note: We have decided not to support queries that contain only part of a multi-word value, for example "Tango Charlie". You can search for the whole value, for a prefix of it, or for a single word (we plan to tokenize on whitespace).
- Individual word in value
- Alpha
- Tango
- Charlie
- Alph*
- Tan*
- Ch*
- Not supported:
- key* i.e. Codename*
- User tags an entity with the following tags "Tag1, Tag22"
- User can retrieve this entity with the following queries:
- tag key and a tag value:
- tags: Tag1
- tags: Tag*
- a tag value
- tag22
- tag2*
- User should be able to search for entities (datasets, streams, views) through field names in their schema, with either of the following or with its prefix:
- fieldname
- fieldname scoped with schema: this should limit the search to just schema fields and not other metadata
- User should be able to search for all entities which have a schema
- fieldname:
- EmpName
- EmpContact
- EmpTel
- EmpAddr
- Emp*
- fieldname scoped to schema:
- schema: EmpName
- schema: EmpContact
- schema: EmpTel
- schema: EmpAddr
- schema: Emp*
- fieldname with fieldtype (only for primitive types)
- EmpName:String (only for java primitive types)
- We don't plan to support schema searches with a complex fieldtype. If a user searches with a query which is not scoped with schema, by default it will search schema fields in addition to the normal key-value metadata and tags.
Open questions:
- What if an entity has multiple schemas (ex: a transform which has input and output schema)?
- We will index both schemas (after discussion with Nitin). An earlier idea was to index the fields with an input or output schema identifier and expect the user to specify whether they are looking for something in the input schema or the output schema.
- How will a user search for a fieldName across input and output schema?
- One way is that, besides indexing the fields as input and output schema, we also index every field as just schema so that we can perform such queries.
Example:
A dataset has the following schema:

```json
{
  "EmpName": "String",
  "EmpContact": {
    "EmpTel": "Integer",
    "EmpAddr": "String"
  }
}
```
User can retrieve this dataset entity with the fieldname, schema-scoped fieldname, and fieldname with fieldtype queries listed above.
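The schema example can be turned into index terms roughly as follows. This is only a minimal sketch: the function names, the set of primitive types, the `schema:` term prefix, and the case-insensitive matching are all assumptions made for illustration, not the CDAP implementation.

```python
# Assumed set of primitive field types; the real list would come from the schema library.
PRIMITIVE_TYPES = {"String", "Integer", "Long", "Float", "Double", "Boolean"}


def schema_index_terms(schema):
    """Flatten a (possibly nested) schema into lower-cased index terms (hypothetical helper)."""
    terms = []
    for name, field_type in schema.items():
        key = name.lower()
        terms.append(key)              # plain fieldname search, e.g. EmpName
        terms.append("schema:" + key)  # schema-scoped search, e.g. schema: EmpName
        if isinstance(field_type, dict):
            # Complex type: index the nested field names, but emit no name:type term.
            terms.extend(schema_index_terms(field_type))
        elif field_type in PRIMITIVE_TYPES:
            terms.append(key + ":" + field_type.lower())  # e.g. EmpName:String
    return terms


def matches(query, terms):
    """Exact match, or prefix match when the query ends with '*'."""
    q = query.replace(" ", "").lower()
    if q.endswith("*"):
        return any(t.startswith(q[:-1]) for t in terms)
    return q in terms


schema = {
    "EmpName": "String",
    "EmpContact": {"EmpTel": "Integer", "EmpAddr": "String"},
}
terms = schema_index_terms(schema)
print(matches("schema: EmpTel", terms))     # nested field, scoped to schema
print(matches("EmpContact:String", terms))  # complex type, no name:type term
```

Nested field names are indexed on their own here, which is why EmpTel and EmpAddr are searchable even though they sit under EmpContact.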
Goal 3:
- User should be able to see all metadata of an entity in search result of a metadata search
- User should be able to see other relevant information of entity.
Goal 4:
- No known user stories at this point.
...
- We do not support searches limited to input/output or just one schema (After discussion with Nitin)
- Search Filtering:
- User wants to search only for 'dataset'
- dataset: Codename: Alpha
- dataset: tags: Tag1
- dataset: schema: EmpName
Note: If no entity type is specified, we will return all matched entities.
- Partial Searching:
- User searches for "California USA": Separate every search query on whitespace and search for every single word (OR).
Search result will contain:
- All entities tagged with "California USA" followed by
- All entities tagged with "California" followed by
- All entities tagged with "USA"
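The ordering described for partial searching could be sketched like this. It is an illustrative sketch only: `partial_search` and the `tagged` mapping (entity name to a single metadata value) are hypothetical names, and matching is assumed to be case-insensitive.

```python
def partial_search(query, tagged):
    """Return entity names: whole-query matches first, then matches per word.

    tagged: mapping of entity name -> metadata value (hypothetical in-memory stand-in).
    """
    full = query.strip().lower()
    # Phrases in priority order: the whole query, then each whitespace-separated word.
    phrases = [full] + [w for w in full.split() if w != full]
    results = []
    for phrase in phrases:
        for entity, value in tagged.items():
            v = value.lower()
            hit = v == phrase or phrase in v.split()
            if hit and entity not in results:
                results.append(entity)
    return results


tagged = {
    "ds1": "USA",
    "ds2": "California USA",
    "ds3": "California",
    "ds4": "Texas",
}
print(partial_search("California USA", tagged))  # ['ds2', 'ds3', 'ds1']
```

An entity that matches the whole query is listed once, at its highest-priority position, even though it also matches the individual words.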
Storage:
We will continue to use the IndexedTable, which we already use today. In the new storage design we will have two rows:
...
Another possibility was to store the real key-value in a separate table and the indexes in the IndexedTable, which would avoid the empty column values for a row. However, this would lead to 6 tables in total (3 each for system and business metadata), so we have decided against it.
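Independent of the row layout, the search query examples imply a set of index entries per key-value pair: the whole value, each whitespace-separated word, and the same entries prefixed with the key. A minimal sketch under those assumptions follows. `index_terms` and `matches` are hypothetical names, matching is assumed to be case-insensitive, and this is not the IndexedTable schema itself.

```python
def index_terms(key, value):
    """Build value-only and key:value index entries for one metadata pair."""
    value = value.lower()
    # Whole value first, then each word (tokenized on whitespace).
    values = [value] + [w for w in value.split() if w != value]
    return {
        "value": values,
        "key_value": [key.lower() + ":" + v for v in values],
    }


def matches(query, terms):
    """Exact match, or prefix match when the query ends with '*'."""
    q = query.strip().lower()
    if ":" in q:
        k, v = q.split(":", 1)
        q = k.strip() + ":" + v.strip()
        candidates = terms["key_value"]
    else:
        # A query without a key only looks at value entries, so "Codename*" finds nothing.
        candidates = terms["value"]
    if q.endswith("*"):
        return any(c.startswith(q[:-1]) for c in candidates)
    return q in candidates


terms = index_terms("Codename", "Alpha Tango Charlie")
print(matches("Codename: Alpha Tang*", terms))
print(matches("Tango Charlie", terms))
```

Tags fit the same shape if each tag is indexed as a value under the key `tags`, for example `index_terms("tags", "Tag22")`.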
Search Filtering: We will perform post filtering if the query is limited to an entity type.
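Post filtering could look roughly like the sketch below: strip a leading entity type from the query, run the remaining query as usual, then drop results of other types. The entity type names, the result record shape, and both function names are assumptions for illustration.

```python
# Assumed entity type names, taken from the entity kinds mentioned on this page.
ENTITY_TYPES = {"app", "program", "dataset", "artifact", "stream", "view"}


def split_entity_filter(query):
    """Split 'dataset: Codename: Alpha' into ('dataset', 'Codename: Alpha')."""
    head, sep, rest = query.partition(":")
    if sep and head.strip().lower() in ENTITY_TYPES:
        return head.strip().lower(), rest.strip()
    return None, query.strip()


def post_filter(results, entity_type):
    """Keep only results of the requested type; keep everything if no type was given."""
    if entity_type is None:
        return list(results)
    return [r for r in results if r["type"] == entity_type]


entity_type, remaining = split_entity_filter("dataset: Codename: Alpha")
results = [
    {"name": "employees", "type": "dataset"},
    {"name": "PurchaseApp", "type": "app"},
]
print(entity_type, remaining)
print(post_filter(results, entity_type))
```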
In addition to the above goals, we also plan to do the following:
Search Result (Goal 3)
Metadata Search Results:
- CDAP-4274: Metadata search should return the metadata of matching entities (Open)
- Also return some other relevant info. Please see details below.
Search Result
Metadata search will return entities with the following details, depending upon the type of the entity. The search results will be ordered by entity creation time, newest first.
| Entity Type | Search Details |
|---|---|
| Application | Type, Name, Matched Metadata (Snippet) with all system metadata, App Description, Entity creation time |
| Program | Type, Name, Matched Metadata (Snippet) with all system metadata, App it belongs to, Entity creation time |
| Artifact | Type, Name, Matched Metadata (Snippet) with all system metadata, Entity creation time |
| Dataset | Type, Name, Matched Metadata (Snippet) with all system metadata, Entity creation time |
| Stream | Name, Type, Matched Metadata (Snippet) with all system metadata, Entity creation time |
| View | Name, Type, Matched Metadata (Snippet) with all system metadata, Stream Name, Entity creation time |

Design Decision:
- In the search result for an entity, we will return the matched metadata along with all the system metadata for that entity.
Open Question:
- Please suggest other things which we can add to the different search result entities.
...
Emit more metadata from system entities:
Here is a list of System Metadata which we are planning to emit from different entities. If you have any suggestions as to what other info could be useful as system metadata, please comment below.
Artifacts
- Artifact name
- Version
Applications
- Application name
- ArtifactId
- Plugins
- Plugin Type
- Plugin Name
- Schedule
- Programs
Programs
- Program name
- Type: Flow, MapReduce etc
- Workflow
- Nodes under this workflow
- Mode: Batch, Realtime
Datasets
- Dataset name
- Schema
- RecordScannable/BatchWritable/RecordWritable/BatchReadable
- Type: KVTable, FileSet etc
- ttl
Streams
- Stream name
- Schema
- ttl
Views
- View name
- Schema
Open Questions:
- Please suggest other things which we can add to different system metadata entries
- Nitin Motgi: Can we call "business metadata" "user metadata" and also the table which stores it userMetadata table rather than business to keep it consistent with other stuff like metrics etc.
Additional Requirement and Notes:
- Treat a query consisting of just * as invalid
- Support pagination of search results in the backend
- Use entity creation time for ordering of search results
- Support searches with stemming (workflow/workflows): Porter stemming
- Support AND (&) operation. Example search query: app:appname & program
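A few of these notes can be sketched together. The sketch assumes results are records with a `created` timestamp, that ordering is newest first, and that an AND query intersects the results of its `&`-separated clauses. All function names and the `search_one` callback are hypothetical, and stemming is left out.

```python
def validate_query(query):
    """Reject a query that is just '*'; return the trimmed query otherwise."""
    q = query.strip()
    if q == "*":
        raise ValueError("a query consisting of just '*' is not allowed")
    return q


def page_of_results(results, offset, limit):
    """Order by creation time (newest first) and return one page."""
    ordered = sorted(results, key=lambda r: r["created"], reverse=True)
    return ordered[offset:offset + limit]


def and_search(query, search_one):
    """Intersect the result sets of every '&'-separated clause.

    search_one: callback that returns matching entity names for a single clause.
    """
    clauses = [c.strip() for c in validate_query(query).split("&")]
    result_sets = [set(search_one(c)) for c in clauses]
    return set.intersection(*result_sets)


results = [
    {"name": "a", "created": 1},
    {"name": "b", "created": 3},
    {"name": "c", "created": 2},
]
print([r["name"] for r in page_of_results(results, 0, 2)])  # ['b', 'c']
```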