Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

In CDAP 6.0.0, all user-annotated metadata property values are stored as String objects, so they cannot be searched numerically even when they represent numbers. To improve the metadata search experience, we can use Elasticsearch to store more specific representations of metadata values and allow users to search numerically.

User Stories 

  • A pipeline developer attaches a “priority” property to their datasets, assigning to that property an integer value from 1-10. They would like to search for all of the datasets with a priority higher than 7.
  • A pipeline developer structures their datasets hierarchically and attaches a numeric “depth” property to them that specifies their distance from the first generation of datasets. They would like to search for all datasets before the 3rd generation.
  • A pipeline developer interacting with CDAP through the CLI programmatically assigns a numeric value to a “writes” property of their datasets based on the number of writes each dataset receives. They would like to search for all datasets with at least 5 writes.

UI Impact or Changes

This feature introduces five new comparison operators to the search UI. Each is written as a prefix immediately after the key:value separator:

  • A “=” prefix indicates a search for values exactly matching the number given.
  • A “>” prefix indicates a search for values greater than the number given.
  • A “>=” prefix indicates a search for values greater than or equal to the number given.
  • A “<” prefix indicates a search for values less than the number given.
  • A “<=” prefix indicates a search for values less than or equal to the number given.

Some examples of this syntax in use are:

  • priority:>7
  • depth:<3
  • writes:>=5

A search term with no comparison operator is interpreted as a String-based search, even if its value is numeric. A search term that has a comparison operator but whose value also contains alphabetic characters is likewise interpreted as a String-based search.

A numeric search term is always treated as a required term, whether or not it carries the requirement syntax (+).

Discussions

  1. Consider a user who searches for “>30” without naming a property. When no property is explicitly mentioned, CDAP looks through all metadata properties for the specified value. Such a search would therefore return all datasets, because the creation time of any dataset, represented in Unix time, is greater than 30. There exist at least two possible solutions to this problem:
    1. Change the search representation of creation time to be a formatted date, such that an explicit numeric search wouldn’t apply, and a date search would be necessary.
    2. Enforce a search rule that numeric values require a property to be specified.
    Alternatively, neither approach is adopted and no accommodation is made. This may confuse users who do not think of creation time as a long, a problem that may be exacerbated if creation time is later presented to the user as a formatted string (e.g. 2019/01/01).
  2. If we are to store numeric values as such, we must handle the limitations of number storage. If we store numbers as integers, how do we respond when a user inputs a number larger than Integer.MAX_VALUE? The automatically-assigned Creation-Time property, for instance, has a value above this.
    There exist at least three possible solutions to this problem:
    1. Store numbers as BigIntegers, which may solve the problem of a maximum value—CDAP enforces a 50-character limit on property values, which is within the scope of what a BigInteger can hold—but may cause memory/performance issues.
    2. Store numbers as longs, and throw an exception when met with numeric values exceeding some cap (e.g. Long.MAX_VALUE). Currently, the numeric value of Creation-Time is stored as a Long, lending credibility to this option.
    3. Store numbers as longs, and interpret any numbers exceeding Long.MAX_VALUE as Strings.
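
As an illustration of option 3 only (not a decision between the options), a minimal sketch could rely on Long.parseLong rejecting out-of-range input. The NumericValues class and tryParseLong method are hypothetical names introduced here; they are not part of CDAP.

```java
import java.util.OptionalLong;

/** Hypothetical helper (not CDAP code): decides whether a value can be indexed as a long. */
final class NumericValues {
  private NumericValues() { }

  /**
   * Returns the value as a long if it is a whole number that fits in a long.
   * Returns an empty result if the value should be treated as a plain String.
   */
  static OptionalLong tryParseLong(String value) {
    if (value == null || value.isEmpty()) {
      return OptionalLong.empty();
    }
    try {
      return OptionalLong.of(Long.parseLong(value));
    } catch (NumberFormatException e) {
      // Either not a number or outside the range of a long: keep it as a String.
      return OptionalLong.empty();
    }
  }
}
```

Under this sketch a millisecond timestamp such as 1546300800000 is accepted even though it exceeds Integer.MAX_VALUE, while 9223372036854775808 (Long.MAX_VALUE + 1) falls back to String handling. Option 2 would differ only in rethrowing instead of returning an empty result.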

Design

New syntax will be introduced to allow searching metadata for numeric values. Greater than, greater than or equal to, less than, less than or equal to, and equality searches will be available.

The metadata indexing process will be changed to store valid numbers as numeric values in addition to being stored as Strings.

The Elasticsearch implementation of metadata storage will make use of the Elasticsearch Java API’s built-in classes to search metadata for numeric properties.

Implementation

Parsing search queries for numeric syntax

For a given search term string, we have to accurately tell whether it constitutes a numeric search.

Approach #1 - in QueryParser API

We can communicate type information about a search term to ElasticsearchMetadataStorage through the QueryTerm. Much like the existing Qualifier enum, a SearchType enum will hold that information, and the Elasticsearch metadata storage implementation can use it to construct the relevant QueryBuilder objects.

This requires the QueryParser to receive a few changes:

    1. Since terms to be parsed can follow property:number syntax—possibly with a comparison operator directly after the colon—the QueryParser would require knowledge of the colon as a key:value separator, possibly by importing MetadataConstants.java.

    2. QueryParser must check whether the search term contains the key:value separator. 

    3. QueryParser must check whether there exists a comparison operator directly after the key:value separator.

    4. QueryParser must check whether there exists a valid number after the comparison operator.
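
The checks above can be sketched as a single parsing routine. This is only an illustration under stated assumptions: NumericTermParser, Comparison, and NumericTerm are hypothetical names, the separator is hard-coded as ":" rather than read from MetadataConstants, and numbers are assumed to be longs.

```java
import java.util.Optional;

/** Hypothetical sketch of the numeric-syntax checks; not the actual QueryParser. */
final class NumericTermParser {
  /** Comparison operators, with two-character symbols listed first so ">=" is tried before ">". */
  enum Comparison {
    GREATER_OR_EQUAL(">="), LESS_OR_EQUAL("<="), GREATER(">"), LESS("<"), EQUAL("=");

    final String symbol;

    Comparison(String symbol) {
      this.symbol = symbol;
    }
  }

  /** The pieces of a successfully parsed numeric search term. */
  static final class NumericTerm {
    final String key;
    final Comparison comparison;
    final long value;

    NumericTerm(String key, Comparison comparison, long value) {
      this.key = key;
      this.comparison = comparison;
      this.value = value;
    }
  }

  // Assumption: the key:value separator is a colon.
  private static final String KEYVALUE_SEPARATOR = ":";

  /** Returns the parsed numeric term, or empty if the term should be searched as a String. */
  static Optional<NumericTerm> parse(String term) {
    int separatorIndex = term.indexOf(KEYVALUE_SEPARATOR);
    if (separatorIndex <= 0) {
      return Optional.empty(); // no separator, or no key before it
    }
    String key = term.substring(0, separatorIndex);
    String rest = term.substring(separatorIndex + 1);
    for (Comparison comparison : Comparison.values()) {
      if (rest.startsWith(comparison.symbol)) {
        try {
          long value = Long.parseLong(rest.substring(comparison.symbol.length()));
          return Optional.of(new NumericTerm(key, comparison, value));
        } catch (NumberFormatException e) {
          return Optional.empty(); // operator present but no valid number after it
        }
      }
    }
    return Optional.empty(); // no comparison operator directly after the separator
  }
}
```

With this sketch, "priority:>7" and "writes:>=5" parse as numeric terms, while "priority:7", "priority:>7a", and a bare ">30" are left to String-based search. Treating a bare ">30" as non-numeric corresponds to the second option in the first discussion item and is an assumption here, not a settled rule.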

Approach #2 - in ElasticsearchMetadataStorage

We can instead extract type information by parsing the search term within ElasticsearchMetadataStorage, requiring the createTermQuery method to conduct the checks listed above. Items 1 and 2 of the above checks are already conducted by the method. A disadvantage is that the numeric parsing logic would live in the Elasticsearch-specific implementation, so possible future implementations of metadata storage could not reuse it through QueryParser.

Indexing numeric values

For a given value entry, we have to accurately tell whether it constitutes a numeric value, much in the way we must do so for a numeric search. We must then store the value such that Elasticsearch can access it.

Approach #1 - Extending the Property class

We can create a NumericProperty class that extends MetadataDocument’s Property class, allowing its objects to store both the String and Long representation of a numeric value. This would also require adding a new field to the index.mapping.json file, ElasticsearchMetadataStorage, and the MetadataDocument class.

Approach #2 - Augmenting the Property class

We can instead change the Property class to hold an extra Long field (e.g. numericValue) that may or may not be null.
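
A minimal sketch of this approach follows. The class below is a simplified, hypothetical stand-in for MetadataDocument's Property class: it keeps only a name and a value, and the numericValue field name is taken from the example above.

```java
/** Hypothetical, simplified stand-in for MetadataDocument's Property class. */
final class Property {
  private final String name;
  private final String value;
  // Null when the value is not a valid long; the String form is always kept.
  private final Long numericValue;

  Property(String name, String value) {
    this.name = name;
    this.value = value;
    this.numericValue = toLongOrNull(value);
  }

  private static Long toLongOrNull(String value) {
    try {
      return Long.valueOf(value);
    } catch (NumberFormatException e) {
      return null; // not numeric, or too large for a long
    }
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }

  Long getNumericValue() {
    return numericValue;
  }
}
```

Compared with a separate NumericProperty subclass, this keeps a single document shape, at the cost of a field that is null for most properties.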

Searching for numbers with Elasticsearch

Elasticsearch’s RangeQueryBuilder class provides a straightforward way to conduct greater than, greater than or equal to, less than, less than or equal to, and equality searches for numeric values. After parsing a numeric search term, we can detect the presence of a comparison operator—and if there is one, detect which one it is—and map that operator to a RangeQueryBuilder method. This can be executed within ElasticsearchMetadataStorage’s createTermQuery method.
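
RangeQueryBuilder exposes gt, gte, lt, and lte methods, which correspond to the keys of the same names in the Elasticsearch range query body. To stay runnable without the Elasticsearch client on the classpath, the sketch below builds that query body as nested maps instead of calling the builder; in createTermQuery the same switch would call the builder methods. RangeQuerySketch and the field name passed in are hypothetical, and equality is expressed here as gte and lte with the same bound, which is one possible mapping.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: maps a comparison operator to an Elasticsearch range query body. */
final class RangeQuerySketch {
  private RangeQuerySketch() { }

  static Map<String, Object> rangeQuery(String field, String operator, long value) {
    Map<String, Object> bounds = new LinkedHashMap<>();
    switch (operator) {
      case ">":
        bounds.put("gt", value);
        break;
      case ">=":
        bounds.put("gte", value);
        break;
      case "<":
        bounds.put("lt", value);
        break;
      case "<=":
        bounds.put("lte", value);
        break;
      case "=":
        // Equality as a closed range with identical lower and upper bounds.
        bounds.put("gte", value);
        bounds.put("lte", value);
        break;
      default:
        throw new IllegalArgumentException("Unknown comparison operator: " + operator);
    }
    Map<String, Object> fieldClause = new LinkedHashMap<>();
    fieldClause.put(field, bounds);
    Map<String, Object> query = new LinkedHashMap<>();
    query.put("range", fieldClause);
    return query;
  }
}
```

For the search priority:>7, this produces a structure equivalent to the JSON body {"range": {"priority": {"gt": 7}}}. The actual field path depends on how the numeric value is added to index.mapping.json.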

API changes

Changes to QueryParser

QueryParser now parses search queries for value type information (e.g. whether the query is looking for a string or a number), and creates the corresponding QueryTerms.

Changes to QueryTerm

QueryTerm now contains a SearchType enum and a searchType field.
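
One possible shape for the changed class is sketched below. It is a simplified stand-in, not the actual QueryTerm source: the Qualifier and SearchType constants are assumptions, and the constructor applies the rule from the UI section that a numeric search term is always required.

```java
/** Hypothetical, simplified sketch of QueryTerm with the proposed searchType field. */
final class QueryTerm {
  /** Assumed constants for the existing Qualifier enum. */
  enum Qualifier { OPTIONAL, REQUIRED }

  /** Assumed constants for the proposed SearchType enum. */
  enum SearchType { STRING, NUMERIC }

  private final String term;
  private final Qualifier qualifier;
  private final SearchType searchType;

  QueryTerm(String term, Qualifier qualifier, SearchType searchType) {
    this.term = term;
    this.searchType = searchType;
    // A numeric term is required regardless of whether "+" was given.
    this.qualifier = searchType == SearchType.NUMERIC ? Qualifier.REQUIRED : qualifier;
  }

  String getTerm() {
    return term;
  }

  Qualifier getQualifier() {
    return qualifier;
  }

  SearchType getSearchType() {
    return searchType;
  }
}
```

A storage implementation could then branch on getSearchType() to choose between a String-based query and a numeric range query.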

Related Jira

Related Work
