Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
In CDAP 6.0.0, all user-annotated metadata property values are stored as String objects, making them ineligible for numeric search, even when users can understand them as numbers. To improve users’ metadata search experience, we can use Elasticsearch to introduce more specific representations for metadata values and allow users to search numerically.
User Stories
- A pipeline developer attaches a “priority” property to their datasets, assigning to that property an integer value from 1-10. They would like to search for all of the datasets with a priority higher than 7.
- A pipeline developer structures their datasets hierarchically and attaches a numeric “depth” property to them that specifies their distance from the first generation of datasets. They would like to search for all datasets before the 3rd generation.
- A pipeline developer interacting with CDAP through the CLI programmatically assigns a numeric value to a “writes” property of their datasets based on the amount of writes each dataset receives. They would like to search for all datasets with at least 5 writes.
UI Impact or Changes
This feature introduces five new comparison operators to the UI search syntax. Each operator is placed directly after the key:value separator:
- A “=” prefix indicates a search for values exactly matching the number given.
- A “>” prefix indicates a search for values greater than the number given.
- A “>=” prefix indicates a search for values greater than or equal to the number given.
- A “<” prefix indicates a search for values less than the number given.
- A “<=” prefix indicates a search for values less than or equal to the number given.
Some examples of this syntax in use are:
- priority:>7
- depth:<3
- writes:>=5
Without a preceding comparison operator, a search containing numeric values will be interpreted as a String-based search. In the event that a search term contains both a preceding comparison operator and alphabetic characters, the search term will be interpreted as a String-based search.
Regardless of the presence or absence of requirement syntax (a "+" prefix), a specified numeric search will be considered a required term.
Discussions
- CDAP metadata search supports both "key:value" syntax and simple "value" syntax. The expected behavior and use case for a key:value search (e.g. "key:>30") is well-defined, but what should one expect when a property or key is not specified (e.g. ">30")? Should this be considered a valid numeric search?
- It may be considered a valid numeric search. Currently, when no key is specified, CDAP looks over all String-based representations of metadata values. If this were to be a valid numeric search, CDAP would have to search over all numeric representations of metadata values, as well.
- It may not be considered a valid numeric search. This would require users to specify a key when attempting to conduct a numeric search. Given that all numeric search terms are also required search terms, this specificity requirement seems the most useful.
- If we are to store numeric values as such, we must handle the limitations of number storage. If we store numbers as integers, how do we respond when a user inputs a number larger than Integer.MAX_VALUE? The automatically-assigned Creation-Time property, for instance, has a value above this.
There exist at least three possible solutions to this problem:
- Store numbers as longs, and throw an exception when met with numeric values exceeding some cap (e.g. Long.MAX_VALUE).
- Store numbers as longs, and interpret any numbers exceeding Long.MAX_VALUE as Strings.
- Store numbers as BigIntegers, which may solve the problem of a maximum value—CDAP enforces a 50-character limit on property values, which is within the scope of what a BigInteger can hold—but may cause memory/performance issues.
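The third option above (store longs, and treat anything that does not fit as a String) can be sketched as follows. This is only an illustration under stated assumptions: the class name NumericValues and the method tryParseLong are hypothetical and not part of CDAP, and the sample timestamp is made up.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not a CDAP class): decides whether a value can be indexed as a long. */
final class NumericValues {
  private NumericValues() { }

  /**
   * Returns the value as a long if it is a base-10 integer that fits in a long.
   * Returns empty otherwise, so the caller keeps only the String representation.
   */
  static OptionalLong tryParseLong(String value) {
    if (value == null || value.isEmpty()) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(value));
    } catch (NumberFormatException e) {
      // Either non-numeric text or a number outside the range of a long.
      return OptionalLong.empty();
    }
  }
}
```

With this sketch, a value such as 1560000000000 (larger than Integer.MAX_VALUE) is accepted as a long, while 9223372036854775808 (one more than Long.MAX_VALUE) falls back to String-only handling instead of raising an error.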
Design
New syntax will be introduced to allow searching metadata for numeric values. Greater than, greater than or equal to, less than, less than or equal to, and equality searches will be available.
The metadata indexing process will be changed to store valid numbers as numeric values in addition to storing them as Strings.
The Elasticsearch implementation of metadata storage will make use of the Elasticsearch Java API’s built-in classes to search metadata for numeric properties.
Implementation
Parsing search queries for numeric syntax
For a given search term string, we have to accurately tell whether it constitutes a numeric search.
Approach #1 - in QueryParser API
We can communicate type information about a search term to ElasticsearchMetadataStorage through the QueryTerm. Much like the existing Qualifier enum, a SearchType enum will hold that information, and the Elasticsearch metadata storage implementation can use it to construct the relevant QueryBuilder objects.
This requires the QueryParser to receive a few changes:
1. Since terms to be parsed can follow property:number syntax, possibly with a comparison operator directly after the colon, the QueryParser would require knowledge of the colon as a key:value separator, possibly by importing MetadataConstants.java.
2. QueryParser must check whether the search term contains the key:value separator.
3. QueryParser must check whether there exists a comparison operator directly after the key:value separator.
4. QueryParser must check whether there exists a valid number after the comparison operator.
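The checks above could be combined roughly as in the sketch below. It is not the actual QueryParser code: NumericTermParser, ParsedTerm, and Comparison are hypothetical names, SearchType is modeled as a simple two-value enum, and the "+" requirement prefix is not handled. Terms without a key (e.g. ">30") are treated as String searches here, which is one of the options still under discussion.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of numeric-term detection; not the actual QueryParser. */
final class NumericTermParser {
  enum SearchType { STRING, NUMERIC }
  enum Comparison { EQUAL, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL }

  /** Assumed result holder; key, comparison and number are null for STRING terms. */
  static final class ParsedTerm {
    final SearchType searchType;
    final String key;
    final Comparison comparison;
    final Long number;

    ParsedTerm(SearchType searchType, String key, Comparison comparison, Long number) {
      this.searchType = searchType;
      this.key = key;
      this.comparison = comparison;
      this.number = number;
    }
  }

  // A key, the ':' separator, a comparison operator, then an optionally signed integer.
  private static final Pattern NUMERIC_TERM =
      Pattern.compile("([^:]+):(>=|<=|>|<|=)(-?\\d+)");

  private NumericTermParser() { }

  static ParsedTerm parse(String term) {
    Matcher m = NUMERIC_TERM.matcher(term);
    if (m.matches()) {
      try {
        long number = Long.parseLong(m.group(3));
        return new ParsedTerm(SearchType.NUMERIC, m.group(1), toComparison(m.group(2)), number);
      } catch (NumberFormatException e) {
        // All digits, but too large for a long: fall through to a String search.
      }
    }
    return new ParsedTerm(SearchType.STRING, null, null, null);
  }

  private static Comparison toComparison(String operator) {
    switch (operator) {
      case ">": return Comparison.GREATER;
      case ">=": return Comparison.GREATER_OR_EQUAL;
      case "<": return Comparison.LESS;
      case "<=": return Comparison.LESS_OR_EQUAL;
      default: return Comparison.EQUAL; // "=" is the only other operator the pattern admits
    }
  }
}
```

Under these assumptions, priority:>7 is classified as a numeric search on the key priority, while priority:7 (no operator) and priority:>7a (alphabetic characters) both fall back to a String search, matching the rules described under UI Impact or Changes.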
Approach #2 - in ElasticsearchMetadataStorage
We can instead extract type information by parsing the search term within ElasticsearchMetadataStorage, requiring the createTermQuery method to conduct the checks listed above. Items 1 and 2 of the above checks are already conducted by the method. A disadvantage is that this keeps the parsing logic out of QueryParser, where possible future implementations of metadata storage could otherwise reuse it.
Indexing numeric values
For a given value entry, we have to accurately tell whether it constitutes a numeric value, much in the way we must do so for a numeric search. We must then store the value such that Elasticsearch can access it.
Approach #1 - Extending the Property class
We can create a NumericProperty class that extends MetadataDocument’s Property class, allowing its objects to store both the String and the Long representations of a numeric value. This would also require adding a new field to the index.mapping.json file, ElasticsearchMetadataStorage, and the MetadataDocument class.
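A minimal sketch of this approach is shown below. The Property class here is a simplified stand-in with assumed name and value fields, not the real MetadataDocument.Property, and the numericValue field and getter names are hypothetical.

```java
/** Simplified stand-in for MetadataDocument's Property class (fields are assumptions). */
class Property {
  private final String name;
  private final String value;

  Property(String name, String value) {
    this.name = name;
    this.value = value;
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }
}

/** Hypothetical subclass that keeps the String form and adds the parsed long. */
final class NumericProperty extends Property {
  private final long numericValue;

  NumericProperty(String name, String value) {
    super(name, value);
    // Throws NumberFormatException if the value is not a valid long,
    // so callers should only construct this for values already known to be numeric.
    this.numericValue = Long.parseLong(value);
  }

  long getNumericValue() {
    return numericValue;
  }
}
```

In this sketch the indexing code would have to decide up front which class to instantiate for each value.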
Approach #2 - Augmenting the Property class
We can instead change the Property class to hold an extra Long field (e.g. numericValue) that may or may not be null.
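As a rough illustration of this alternative, the sketch below uses a single class whose numeric field is null for non-numeric values. AugmentedProperty is a hypothetical stand-in for the modified Property class, and its fields other than numericValue are assumptions.

```java
/** Hypothetical stand-in for a Property class with an optional numeric field. */
final class AugmentedProperty {
  private final String name;
  private final String value;
  private final Long numericValue; // null when the value is not a valid long

  AugmentedProperty(String name, String value) {
    this.name = name;
    this.value = value;
    Long parsed;
    try {
      parsed = Long.valueOf(value);
    } catch (NumberFormatException e) {
      parsed = null; // keep only the String representation
    }
    this.numericValue = parsed;
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }

  /** Returns the numeric form of the value, or null if it has none. */
  Long getNumericValue() {
    return numericValue;
  }
}
```

Compared with a subclass, this keeps one code path for all properties, at the cost of a nullable field that every consumer has to check.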
Searching for numbers with Elasticsearch
Elasticsearch’s RangeQueryBuilder class provides a straightforward way to conduct greater than, greater than or equal to, less than, and less than or equal to searches for numeric values through its gt, gte, lt, and lte methods. An equality search can be expressed as a range whose inclusive lower and upper bounds are the same number. After parsing a numeric search term, we can detect which comparison operator it contains and map that operator to the corresponding RangeQueryBuilder method. This can be executed within ElasticsearchMetadataStorage’s createTermQuery method.
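To keep the example free of library dependencies, the sketch below builds the equivalent Elasticsearch range query as a JSON string rather than calling RangeQueryBuilder. The gt, gte, lt, and lte keys are the query DSL counterparts of the builder methods of the same names. The class name RangeQuerySketch and the field name used in the usage note are hypothetical.

```java
/** Hypothetical sketch: maps a comparison operator to an Elasticsearch range query body. */
final class RangeQuerySketch {
  private RangeQuerySketch() { }

  /**
   * Builds the JSON for a range query on the given field.
   * The field name is assumed to need no JSON escaping.
   */
  static String rangeQuery(String field, String operator, long number) {
    String bounds;
    switch (operator) {
      case ">":
        bounds = "\"gt\":" + number;
        break;
      case ">=":
        bounds = "\"gte\":" + number;
        break;
      case "<":
        bounds = "\"lt\":" + number;
        break;
      case "<=":
        bounds = "\"lte\":" + number;
        break;
      case "=":
        // Equality as a range with identical inclusive bounds.
        bounds = "\"gte\":" + number + ",\"lte\":" + number;
        break;
      default:
        throw new IllegalArgumentException("Unknown comparison operator: " + operator);
    }
    return "{\"range\":{\"" + field + "\":{" + bounds + "}}}";
  }
}
```

For example, rangeQuery("numericValue", ">", 7) returns {"range":{"numericValue":{"gt":7}}}. In the actual implementation the same mapping would select a RangeQueryBuilder method instead of a JSON key.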
API changes
Changes to QueryParser
QueryParser now parses search queries for value type information (e.g. whether the query is looking for a string or a number), and creates the corresponding QueryTerms.
Changes to QueryTerm
QueryTerm now contains a SearchType enum and a searchType field.
Related Jira