Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
In CDAP 6.0.0, all user-annotated metadata property values are stored as String objects. This makes them ineligible for numeric search, even when the values are plainly numbers to the user. To improve the metadata search experience, we can use Elasticsearch to store more specific representations of metadata values and allow users to search numerically.
User Stories
- A pipeline developer attaches a “priority” property to their datasets, assigning to that property an integer value from 1-10. They would like to search for all of the datasets with a priority higher than 7.
- A pipeline developer structures their datasets hierarchically and attaches a numeric “depth” property to them that specifies their distance from the first generation of datasets. They would like to search for all datasets before the 3rd generation.
- A pipeline developer interacting with CDAP through the CLI programmatically assigns a numeric value to a “writes” property of their datasets based on the number of writes each dataset receives. They would like to search for all datasets with at least 5 writes.
UI Impact or Changes
This feature introduces five comparison operators to the search syntax in the UI. An operator is written immediately after the key:value separator, as a prefix to the value:
- A "=" prefix indicates a search for values exactly equal to the number given.
- A ">" prefix indicates a search for values greater than the number given.
- A ">=" prefix indicates a search for values greater than or equal to the number given.
- A "<" prefix indicates a search for values less than the number given.
- A "<=" prefix indicates a search for values less than or equal to the number given.
Some examples of this syntax in use are:
- priority:>7
- depth:<3
- writes:>=5
Without a preceding comparison operator, a search containing numeric values will be interpreted as a String-based search. If a search term has a preceding comparison operator but also contains alphabetic characters (e.g. priority:>7a), it will likewise be interpreted as a String-based search.
A numeric search term is always treated as a required term, whether or not it carries the requirement syntax (a "+" prefix).
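The syntax rules above can be sketched as a small classifier. This is only an illustration, not CDAP's parser: NumericTerm and NumericTermClassifier are hypothetical names, and the number pattern (optional sign, digits, optional fraction) is an assumption about what counts as a number. A term without a key is treated as a String search, in line with the conclusion under Unspecified keys.

```java
import java.util.Optional;

/** Hypothetical holder for a parsed numeric search term; not a CDAP class. */
final class NumericTerm {
  final String key;
  final String operator; // one of "=", ">", ">=", "<", "<="
  final double value;

  NumericTerm(String key, String operator, double value) {
    this.key = key;
    this.operator = operator;
    this.value = value;
  }
}

final class NumericTermClassifier {
  // Two-character operators come first so ">=" is not read as ">" followed by "=5".
  private static final String[] OPERATORS = {">=", "<=", ">", "<", "="};

  /** Returns the numeric term, or empty when the term should stay a String search. */
  static Optional<NumericTerm> classify(String term) {
    // A leading "+" only marks a term as required; numeric terms are required either way.
    String body = term.startsWith("+") ? term.substring(1) : term;
    int separator = body.indexOf(':');
    if (separator <= 0) {
      return Optional.empty(); // no key given
    }
    String key = body.substring(0, separator);
    String rest = body.substring(separator + 1);
    for (String operator : OPERATORS) {
      if (rest.startsWith(operator)) {
        String number = rest.substring(operator.length());
        // Assumed number format: optional sign, digits, optional fraction.
        if (number.matches("[+-]?\\d+(\\.\\d+)?")) {
          return Optional.of(new NumericTerm(key, operator, Double.parseDouble(number)));
        }
        return Optional.empty(); // operator present, but the remainder is not a number
      }
    }
    return Optional.empty(); // no comparison operator
  }
}
```

Under these assumptions, classify("priority:>7") yields a numeric term for key priority, while classify("priority:7"), classify("priority:>7a"), and classify(">30") all come back empty and would fall through to the existing String search.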
Discussions
Unspecified keys
CDAP metadata search supports both "key:value" syntax and simple "value" syntax. The expected behavior and use case for a key:value search (e.g. "key:>30") is well-defined, but what should one expect when a property or key is not specified (e.g. ">30")? Should this be considered a valid numeric search? There exist two options:
- It may be considered a valid numeric search. Currently, when no key is specified, CDAP looks over all String-based representations of metadata values. If this were to be a valid numeric search, CDAP would have to search over all numeric representations of metadata values, as well.
- It may not be considered a valid numeric search. This would require users to specify a key when attempting to conduct a numeric search. Given that all numeric search terms are also required search terms, this specificity requirement seems the most useful.
Conclusion: The second option is desirable, and a search without a specified property will be ineligible for numeric search. Because all numeric search terms are also assumed to be required search terms, requiring a key makes the best use of that assumption.
Number storage limitations
If we are to store numeric values as such, we must handle the limitations of number storage in Java. If we store numbers as integers, how do we respond when a user inputs a number larger than Integer.MAX_VALUE? The automatically-assigned Creation-Time property, for instance, has a value above this, and is stored as a Long. There exist at least three possible solutions to this problem:
- Store numbers as BigIntegers, which may solve the problem of a maximum value—CDAP enforces a 50-character limit on property values, which is within the scope of what a BigInteger can hold—but may cause memory/performance issues.
- Store numbers as Longs or Doubles, and throw an exception when met with numeric values exceeding some cap (e.g. Long.MAX_VALUE).
- Store numbers as Longs or Doubles, and interpret any numbers exceeding the cap as Strings.
Conclusion: The third option is desirable. Numbers will be stored as Doubles, and values that exceed Double.MAX_VALUE will be interpreted as Strings. Storing them as Doubles allows both decimal and integer formats to be accepted (e.g. "2.0" and "2"). Interpreting excessively large numbers as Strings is currently the simplest sufficient solution, since String interpretation is the default; in the future, this behavior may be changed to throw a user-facing exception instead.
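One detail matters when implementing this fallback: Double.parseDouble does not throw for values beyond Double.MAX_VALUE. It returns Infinity, so the overflow case needs an explicit check. A minimal sketch follows; NumericValues and toNumericOrNull are hypothetical names, not existing CDAP code.

```java
final class NumericValues {
  /** Returns the value as a Double, or null when it should be kept as a String only. */
  static Double toNumericOrNull(String raw) {
    if (raw == null) {
      return null;
    }
    try {
      double parsed = Double.parseDouble(raw);
      // Overflow yields Infinity rather than an exception, and the literal
      // strings "NaN" and "Infinity" parse successfully, so reject non-finite results.
      if (!Double.isFinite(parsed)) {
        return null;
      }
      return parsed;
    } catch (NumberFormatException e) {
      return null; // not a number at all: String-only
    }
  }
}
```

With the 50-character limit on property values, a plain digit string stays well within the range of a double, so in practice only exponent notation such as "1e999" reaches the overflow branch. Two further caveats may be worth weighing: doubles represent integers exactly only up to 2^53, and Double.parseDouble also accepts forms such as "1e3", "2d", and hexadecimal floating-point literals, so a stricter pattern check may be wanted before parsing.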
Design
New syntax will be introduced to allow searching metadata for numeric values. Greater than, greater than or equal to, less than, less than or equal to, and equality searches will be available.
The metadata indexing process will be changed to store valid numbers as numeric values in addition to storing them as Strings.
The Elasticsearch implementation of metadata storage will make use of the Elasticsearch Java API’s built-in classes to search metadata for numeric properties.
Implementation
Parsing search queries for numeric syntax
For a given search term string, we have to accurately tell whether it constitutes a numeric search.
Approach #1 - in QueryParser API
We can communicate additional information about a search term to ElasticsearchMetadataStorage through the QueryTerm. Much like the existing Qualifier enum, new SearchType and Comparison enums will record whether the term is a numeric search and which comparison it uses, and the Elasticsearch metadata storage implementation can use those enums to construct the relevant QueryBuilder objects.
Approach #2 - in ElasticsearchMetadataStorage
We can instead extract type information by parsing the search term within ElasticsearchMetadataStorage, requiring the createTermQuery method to conduct the checks listed above. Items 1 and 2 of the above checks are already conducted by the method. The disadvantage is that this places parsing logic in a single storage implementation rather than in QueryParser, where possible future implementations of metadata storage could reuse it.
Conclusion
Approach #1 is desirable because it is consistent with QueryParser's purpose. Although Elasticsearch is currently the only CDAP metadata storage implementation to use numeric search, parsing in QueryParser has the added benefit of keeping much of the conceptual work out of the ElasticsearchMetadataStorage class, which improves its readability.
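To make the chosen approach concrete, the following sketch shows one possible shape for a QueryTerm that carries the extra information. The class name QueryTermSketch, the enum constants, and the factory methods are all assumptions for illustration; the real QueryTerm may differ.

```java
/** Hypothetical shape for a query term carrying search type and comparison. */
final class QueryTermSketch {
  enum Qualifier { REQUIRED, OPTIONAL }

  enum SearchType { STRING, NUMERIC }

  enum Comparison {
    EQUALS("="), GREATER(">"), GREATER_OR_EQUAL(">="), LESS("<"), LESS_OR_EQUAL("<=");

    final String symbol;

    Comparison(String symbol) {
      this.symbol = symbol;
    }
  }

  final String term;           // assumed to be the term text with the operator stripped
  final Qualifier qualifier;
  final SearchType searchType;
  final Comparison comparison; // null for STRING searches

  private QueryTermSketch(String term, Qualifier qualifier,
                          SearchType searchType, Comparison comparison) {
    this.term = term;
    this.qualifier = qualifier;
    this.searchType = searchType;
    this.comparison = comparison;
  }

  /** A plain String search keeps whatever qualifier the parser found. */
  static QueryTermSketch string(String term, Qualifier qualifier) {
    return new QueryTermSketch(term, qualifier, SearchType.STRING, null);
  }

  /** A numeric search is always required, with or without a "+" prefix. */
  static QueryTermSketch numeric(String term, Comparison comparison) {
    return new QueryTermSketch(term, Qualifier.REQUIRED, SearchType.NUMERIC, comparison);
  }
}
```

With a term like this, a storage implementation only needs to branch on searchType and comparison and never has to inspect the raw search string.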
Indexing numeric values
For a given value entry, we have to accurately tell whether it constitutes a numeric value, much in the way we must do so for a numeric search. We must then store the value such that Elasticsearch can access it.
Approach #1 - Extending the Property class
We can create a NumericProperty class that extends MetadataDocument’s Property class, allowing its objects to store both the String and numeric representation of a numeric value.
Approach #2 - Augmenting the Property class
We can instead change the Property class to hold an extra numeric field (e.g. numericValue) that may or may not be null.
Conclusion
Approach #2 is desirable; it is the simpler approach while remaining effective, and it requires only a few straightforward changes to the codebase. The Property class will hold an extra Double field named numericValue, which is set when the String representation can be parsed as a Double (through the Double.parseDouble method) and left null otherwise. Accompanying this change to the Property class, index.mapping.json will include a new nested property of type double, "numericValue". ElasticsearchMetadataStorage will then include a nested numeric value field that corresponds to this change.
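A minimal sketch of the augmented class, assuming a simplified Property with only a name and a value. PropertySketch is a hypothetical stand-in and omits whatever other fields the real MetadataDocument.Property holds.

```java
/** Hypothetical stand-in for the augmented Property class. */
final class PropertySketch {
  final String name;
  final String value;        // always kept, so String-based search is unchanged
  final Double numericValue; // null when the value is not a finite number

  PropertySketch(String name, String value) {
    this.name = name;
    this.value = value;
    this.numericValue = parseOrNull(value);
  }

  private static Double parseOrNull(String value) {
    if (value == null) {
      return null;
    }
    try {
      double parsed = Double.parseDouble(value);
      // Values past Double.MAX_VALUE parse to Infinity; treat them as String-only.
      return Double.isFinite(parsed) ? Double.valueOf(parsed) : null;
    } catch (NumberFormatException e) {
      return null;
    }
  }
}
```

A property created as new PropertySketch("priority", "8") would carry both "8" and 8.0, while new PropertySketch("owner", "alice") would carry only the String and a null numericValue.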
Searching for numbers with Elasticsearch
Elasticsearch’s RangeQueryBuilder class provides a straightforward way to conduct greater than, greater than or equal to, less than, less than or equal to, and equality searches for numeric values. After parsing a numeric search term, we can detect whether a comparison operator is present, determine which one it is, and map that operator to a RangeQueryBuilder method. This mapping can be done within ElasticsearchMetadataStorage’s createTermQuery method.
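RangeQueryBuilder exposes gt, gte, lt, and lte bounds, and an equality search can be expressed by setting gte and lte to the same value. To keep the illustration free of the Elasticsearch dependency, the sketch below renders the equivalent range query JSON with plain string building instead of calling RangeQueryBuilder. RangeQuerySketch is a hypothetical name, the field name passed in is an assumption, and any wrapping needed for a nested field is omitted.

```java
final class RangeQuerySketch {
  /** Renders an Elasticsearch range query body for one comparison operator. */
  static String rangeQueryJson(String field, String operator, double value) {
    String bounds;
    switch (operator) {
      case ">":
        bounds = bound("gt", value);
        break;
      case ">=":
        bounds = bound("gte", value);
        break;
      case "<":
        bounds = bound("lt", value);
        break;
      case "<=":
        bounds = bound("lte", value);
        break;
      case "=":
        // Equality as a closed range with identical lower and upper bounds.
        bounds = bound("gte", value) + "," + bound("lte", value);
        break;
      default:
        throw new IllegalArgumentException("Unknown comparison operator: " + operator);
    }
    return "{\"range\":{\"" + field + "\":{" + bounds + "}}}";
  }

  private static String bound(String name, double value) {
    return "\"" + name + "\":" + value;
  }
}
```

For example, rangeQueryJson("props.numericValue", ">", 7) produces {"range":{"props.numericValue":{"gt":7.0}}}. In the real implementation the same switch would call the matching RangeQueryBuilder methods rather than build a string.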
API changes
Updated QueryParser
Updated QueryTerm
Related Jira
Related Work
Future Work
With the introduction of several new and implementation-specific metadata search features, users will need a friendly way to discover which features are available.