Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction 

In CDAP 6.0.0, all user-annotated metadata property values are stored as String objects, which makes them ineligible for numeric search even when users think of them as numbers. To improve users’ metadata search experience, we can use Elasticsearch to introduce more specific representations for metadata values and allow users to search numerically.

User Stories 

  • A pipeline developer attaches a “priority” property to their datasets, assigning to that property an integer value from 1-10. They would like to search for all of the datasets with a priority higher than 7.
  • A pipeline developer structures their datasets hierarchically and attaches a numeric “depth” property to them that specifies their distance from the first generation of datasets. They would like to search for all datasets before the 3rd generation.
  • A pipeline developer interacting with CDAP through the CLI programmatically assigns a numeric value to a “writes” property of their datasets based on the number of writes each dataset receives. They would like to search for all datasets with at least 5 writes.

Discussions

One can imagine a hypothetical situation in which a user searches for “>30”. When no property is explicitly mentioned, CDAP looks through all metadata properties for the specified value. Because of this, such a search would return all datasets, since the creation time of any dataset, represented in Unix time, will be greater than 30. There exist at least two possible solutions to this problem:

  1. Change the search representation of creation time to be a formatted date, such that an explicit numeric search wouldn’t apply, and a date search would be necessary.
  2. Enforce a search rule that numeric values require a property to be specified. This may place undue limits on user-facing functionality, and require a change to the UI.

Alternatively, neither approach is chosen and no accommodations are made. This may prove confusing to users who do not think of creation time as a long, a problem which may be exacerbated if creation time is later presented to the user as a formatted string (e.g. 2019/01/01).

UI Impact or Changes

This feature introduces five new search operators (particularly, comparison operators) to the UI, which come after the key:value separator:

  • A "==" prefix indicates a search for values exactly matching the number given.
  • A “>” prefix indicates a search for values higher than the number given.
  • A “>=” prefix indicates a search for values higher than or equal to the number given.
  • A “<” prefix indicates a search for values lower than the number given.
  • A "<=" prefix indicates a search for values lower than or equal to the number given.

Some examples of this syntax in use are:

  • priority:>7
  • depth:<3
  • writes:>=5

Without a preceding comparison operator, a search containing numeric values will be interpreted as a String-based search. In the event that a search term contains both a preceding comparison operator and alphabetic characters, the search term will be interpreted as a String-based search.

Regardless of the presence or absence of requirement syntax (a "+" prefix), a specified numeric search will be considered a required term.

Discussions

Unspecified keys

CDAP metadata search supports both "key:value" syntax and simple "value" syntax. The expected behavior and use case for a key:value search (e.g. "key:>30") is well-defined, but what should one expect when a property or key is not specified (e.g. ">30")? Should this be considered a valid numeric search? There exist two options:

  1. It may be considered a valid numeric search. Currently, when no key is specified, CDAP looks over all String-based representations of metadata values. If this were to be a valid numeric search, CDAP would have to search over all numeric representations of metadata values, as well.
  2. It may not be considered a valid numeric search. This would require users to specify a key when attempting to conduct a numeric search. Given that all numeric search terms are also required search terms, this specificity requirement seems the most useful.

Conclusion: Solution #2 is desirable, and a search without a specified property will be ineligible for numeric search. All numeric search terms are also assumed to be required search terms, so this specificity requirement will maximize proper use of that assumption. 

Number storage limitations

If we are to store numeric values as such, we must handle the limitations of number storage in Java. If we store numbers as integers, how do we respond when a user inputs a number larger than Integer.MAX_VALUE? The automatically-assigned Creation-Time property, for instance, has a value above this, and is stored as a Long. There exist at least three possible solutions to this problem:

  1. Store numbers as BigIntegers, which may solve the problem of a maximum value—CDAP enforces a 50-character limit on property values, which is within the scope of what a BigInteger can hold—but may cause memory/performance issues.
  2. Store numbers as Longs or Doubles, enforce a rule in the UI against numeric values exceeding some cap (e.g. Long.MAX_VALUE), and throw an exception when such a value is encountered. Currently, the numeric value of Creation-Time is stored as a Long, lending credibility to this option.
  3. Store numbers as Longs or Doubles, and interpret any numbers exceeding the cap as Strings.

Conclusion: Solution #3 is desirable; numbers will be stored as Doubles, and if they exceed Double.MAX_VALUE, they will be interpreted as Strings. Storing them as Doubles allows both decimal and integer formats to be accepted (e.g. "2.0" and "2"). Interpreting excessively large numbers as Strings is the simplest sufficient solution for now, since String interpretation is already the default; in the future, this behavior may be changed to throw a user-facing exception instead.
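
The fallback rule in this conclusion can be sketched as follows. This is a minimal illustration rather than CDAP code: the class NumericValues and the method tryParseNumeric are hypothetical names. One detail the sketch relies on is that Double.parseDouble does not throw for input beyond Double.MAX_VALUE; it returns an infinite value, so the range check has to be explicit. The sketch also assumes that NaN should fall back to a String.

```java
import java.util.Optional;

final class NumericValues {

  // Hypothetical helper: returns the numeric form of a metadata value,
  // or empty if the value should be treated as a plain String.
  static Optional<Double> tryParseNumeric(String value) {
    if (value == null) {
      return Optional.empty();
    }
    try {
      double parsed = Double.parseDouble(value);
      // Input beyond Double.MAX_VALUE parses to +/-Infinity instead of throwing,
      // so infinities (and NaN) are rejected here and left as Strings.
      if (Double.isNaN(parsed) || Double.isInfinite(parsed)) {
        return Optional.empty();
      }
      return Optional.of(parsed);
    } catch (NumberFormatException e) {
      // Alphabetic or otherwise malformed input stays a String.
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    System.out.println(tryParseNumeric("2"));      // numeric
    System.out.println(tryParseNumeric("2.0"));    // numeric, same value as "2"
    System.out.println(tryParseNumeric("thirty")); // falls back to String
    System.out.println(tryParseNumeric("1e400"));  // too large, falls back to String
  }
}
```

Double.parseDouble also accepts forms such as "1e3" or a trailing "d" suffix; whether those should count as numeric values is a decision this sketch leaves open.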

Design

New syntax will be introduced to allow searching metadata for numeric values. Greater than, greater than or equal to, less than, less than or equal to, and equality searches will be available.

The metadata indexing process will be changed to store valid numbers as numeric values in addition to being stored as Strings.

The Elasticsearch implementation of metadata storage will make use of the Elasticsearch Java API’s built-in classes to search metadata for numeric properties.

Implementation

Parsing search queries for numeric syntax

For a given search term string, we have to accurately tell whether it constitutes a numeric search.

Approach #1 - in QueryParser API

We can communicate additional type information about a search term to ElasticsearchMetadataStorage through the QueryTerm. Much like the existing Qualifier enumerator, SearchType and Comparison enumerators will hold that information, and the Elasticsearch metadata storage implementation can use those enumerators to construct the relevant QueryBuilder objects.

This requires the QueryParser to receive a few changes:

    1. Since terms to be parsed can follow property:number syntax—possibly with a comparison operator directly after the colon—the QueryParser would require knowledge of the colon as a key:value separator, possibly by importing MetadataConstants.java.

    2. QueryParser must check whether the search term contains the key:value separator. 

    3. QueryParser must check whether there exists a comparison operator directly after the key:value separator.

    4. QueryParser must check whether there exists a valid number after the comparison operator.
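
The four checks above can be sketched roughly as follows. This is an illustrative outline, not the actual QueryParser: the class NumericTermSketch, the ParsedTerm holder, and the operator table are hypothetical, the "==" spelling of the equality operator is taken from the UI section of this document, and handling of the "+" requirement prefix is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NumericTermSketch {
  enum Comparison { EQUALS, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL }

  // Hypothetical result holder; comparison is null for String-based terms.
  static final class ParsedTerm {
    final String key;
    final String value;
    final Comparison comparison;

    ParsedTerm(String key, String value, Comparison comparison) {
      this.key = key;
      this.value = value;
      this.comparison = comparison;
    }

    boolean isNumeric() {
      return comparison != null;
    }
  }

  private static final String KEYVALUE_SEPARATOR = ":";
  // Two-character operators are listed first so ">=" is not mistaken for ">".
  private static final Map<String, Comparison> OPERATORS = new LinkedHashMap<>();
  static {
    OPERATORS.put(">=", Comparison.GREATER_OR_EQUAL);
    OPERATORS.put("<=", Comparison.LESS_OR_EQUAL);
    OPERATORS.put("==", Comparison.EQUALS);
    OPERATORS.put(">", Comparison.GREATER);
    OPERATORS.put("<", Comparison.LESS);
  }

  static ParsedTerm parse(String term) {
    // Checks 1 and 2: locate the key:value separator.
    int sep = term.indexOf(KEYVALUE_SEPARATOR);
    if (sep <= 0) {
      return new ParsedTerm(null, term, null); // no key, so a String search
    }
    String key = term.substring(0, sep);
    String rest = term.substring(sep + 1);
    // Check 3: a comparison operator directly after the separator.
    for (Map.Entry<String, Comparison> op : OPERATORS.entrySet()) {
      if (rest.startsWith(op.getKey())) {
        String number = rest.substring(op.getKey().length());
        // Check 4: a valid, finite number after the operator.
        if (isFiniteNumber(number)) {
          return new ParsedTerm(key, number, op.getValue());
        }
        break;
      }
    }
    // Simplification: the String fallback keeps the raw text after the separator.
    return new ParsedTerm(key, rest, null);
  }

  private static boolean isFiniteNumber(String s) {
    try {
      double d = Double.parseDouble(s);
      return !Double.isNaN(d) && !Double.isInfinite(d);
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    ParsedTerm t = parse("writes:>=5");
    System.out.println(t.key + " " + t.comparison + " " + t.value);
    System.out.println(parse(">30").isNumeric());
  }
}
```

In this sketch a term without a key, without an operator, or with a non-numeric value after the operator is left as a String-based term, matching the rules stated in the UI section.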

Because QueryTerm objects do not hold information about comparison operators—they would be unintelligible for String-based QueryTerms—this approach would necessitate a second check for comparison operators within ElasticsearchMetadataStorage, wherein a meaningful interpretation of them could be made.

Approach #2 - in ElasticsearchMetadataStorage

We can instead extract type information by parsing the search term within ElasticsearchMetadataStorage, requiring the createTermQuery method to conduct the checks listed above. Items 1 and 2 of the above checks are already conducted by the method. A disadvantage of this is that it takes some functionality from QueryParser that QueryParser could reasonably have available for possible future implementations of metadata storage.

Conclusion

Approach #1 is desirable. A natural benefit of this approach is that it is consistent with QueryParser's purpose. While Elasticsearch is currently the only CDAP metadata storage implementation to use numeric search, an added benefit of parsing in QueryParser is that it abstracts much of the conceptual work away from the ElasticsearchMetadataStorage class, enhancing its readability.

Indexing numeric values

For a given value entry, we have to accurately tell whether it constitutes a numeric value, much in the way we must do so for a numeric search. We must then store the value such that Elasticsearch can access it.

Approach #1 - Extending the Property class

We can create a NumericProperty class that extends MetadataDocument’s Property class, allowing its objects to store both the String and numeric representation of a numeric value.

This would also require adding a new field to the index.mapping.json file, ElasticsearchMetadataStorage, and the MetadataDocument class.

 

Approach #2 - Augmenting the Property class

We can instead change the Property class to hold an extra numeric field (e.g. numericValue) that may or may not be null.

A possible disadvantage of this is that null values can be unwieldy and dangerous to handle.

Conclusion

Approach #2 is desirable; it is the simpler approach while remaining effective, and requires only a few straightforward changes to the codebase. The Property class will hold an extra Double field named numericValue, which will be assigned only when the string representation can be parsed as a Double (through the Double.parseDouble method) and will otherwise remain null. Accompanying this change to the Property class, index.mapping.json will include a new nested property of type double, "numericValue". ElasticsearchMetadataStorage will then include a nested numeric value field that corresponds to this change.
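
A minimal sketch of the augmented class might look like this. It is not the real MetadataDocument.Property: the class name PropertySketch and every member other than numericValue are assumptions made for illustration, and the real class has members not shown here.

```java
final class PropertySketch {
  private final String name;
  private final String value;
  // Null when the value cannot be read as a finite double.
  private final Double numericValue;

  PropertySketch(String name, String value) {
    this.name = name;
    this.value = value;
    this.numericValue = toNumericOrNull(value);
  }

  private static Double toNumericOrNull(String value) {
    if (value == null) {
      return null;
    }
    try {
      double parsed = Double.parseDouble(value);
      if (Double.isNaN(parsed) || Double.isInfinite(parsed)) {
        return null; // out-of-range or NaN values remain String-only
      }
      return parsed;
    } catch (NumberFormatException e) {
      return null;
    }
  }

  String getName() {
    return name;
  }

  String getValue() {
    return value;
  }

  // Callers must handle null, which is the drawback noted for this approach.
  Double getNumericValue() {
    return numericValue;
  }

  public static void main(String[] args) {
    System.out.println(new PropertySketch("priority", "8").getNumericValue());
    System.out.println(new PropertySketch("owner", "alice").getNumericValue());
  }
}
```

The String value is always kept, so existing String-based search continues to work for numeric properties as well.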

Searching for numbers with Elasticsearch

Elasticsearch’s RangeQueryBuilder class provides a straightforward way to conduct greater than, greater than or equal to, less than, less than or equal to, and equality searches for numeric values. After parsing a numeric search term, we can detect the presence of a comparison operator—and if there is one, detect which one it is—and map that operator to a RangeQueryBuilder method. This can be executed within ElasticsearchMetadataStorage’s createTermQuery method.
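
The operator mapping described above could look roughly like the following. To keep the example free of dependencies, it builds the JSON body of an Elasticsearch range query instead of calling RangeQueryBuilder; the gt, gte, lt and lte keys correspond to the builder methods of the same names. The class RangeQuerySketch and the field name "props.numericValue" are placeholders rather than the actual implementation or index field, and expressing equality as a closed range with identical bounds is an assumption, one of several possible choices.

```java
final class RangeQuerySketch {
  enum Comparison { EQUALS, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL }

  // Builds the JSON body of a range query for the given comparison.
  static String rangeQuery(String field, Comparison comparison, double value) {
    String bounds;
    switch (comparison) {
      case GREATER:
        bounds = bound("gt", value);
        break;
      case GREATER_OR_EQUAL:
        bounds = bound("gte", value);
        break;
      case LESS:
        bounds = bound("lt", value);
        break;
      case LESS_OR_EQUAL:
        bounds = bound("lte", value);
        break;
      case EQUALS:
        // Assumption: equality as a closed range whose bounds coincide.
        bounds = bound("gte", value) + "," + bound("lte", value);
        break;
      default:
        throw new IllegalArgumentException("Unknown comparison: " + comparison);
    }
    return "{\"range\":{\"" + field + "\":{" + bounds + "}}}";
  }

  private static String bound(String operator, double value) {
    return "\"" + operator + "\":" + value;
  }

  public static void main(String[] args) {
    // Placeholder field name, used only for illustration.
    System.out.println(rangeQuery("props.numericValue", Comparison.GREATER, 7));
  }
}
```

Because property values are stored as nested fields, the real query would likely need to be wrapped in a nested query; that wrapping is left out of the sketch.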

API changes

Changes to QueryParser

QueryParser now parses search queries for value type information (e.g. whether the query is looking for a string or a number), and creates the corresponding QueryTerms.

Changes to QueryTerm

QueryTerm now contains SearchType and Comparison enumerators, along with corresponding searchType and comparison fields.

Related Work

Date search

Updated QueryParser

QueryParser.java

import java.util.List;
import java.util.regex.Pattern;

/**
 * A thread-safe class that provides helper methods for metadata search string interpretation,
 * and defines search syntax for various search term properties, i.e. the data stored in {@link QueryTerm} objects.
 */
public final class QueryParser {
  private static final Pattern SPACE_SEPARATOR_PATTERN = Pattern.compile("\\s+");
  private static final String KEYVALUE_SEPARATOR = ":";
  private static final String REQUIRED_OPERATOR = "+";

  // private constructor to prevent instantiation
  private QueryParser() {}

  /**
   * Organizes and separates a raw, space-separated search string
   * into multiple {@link QueryTerm} objects. Spaces are defined by the {@link QueryParser#SPACE_SEPARATOR_PATTERN}
   * field, the semantics of which are documented in Java's {@link Pattern} class.
   * Certain typical separations of terms, such as hyphens and commas, are not considered spaces.
   * This method preserves the original case of the query.
   *
   * QueryTerms are assigned a search type {@link QueryTerm.SearchType} based on their format. For instance,
   * if a string can be parsed as a numeric double, it will be assigned the NUMERIC type, which allows it to be used
   * in a numeric search. Search terms containing alphabetical characters and those exceeding {@link Double#MAX_VALUE}
   * will be assigned the String type.
   *
   * This method supports the use of certain search operators that, when placed before a search term,
   * denote qualifying information about that search term. When translated into a QueryTerm object, search terms
   * containing a qualifying operator have the operator removed from the string representation.
   * The {@link QueryParser#REQUIRED_OPERATOR} character signifies a search term that must receive a match.
   * By default, this method considers search items of {@link QueryTerm.SearchType#STRING}
   * without a qualifying operator to be optional.
   * Search items of {@link QueryTerm.SearchType#NUMERIC} are automatically required.
   *
   * For numeric searches, multiple comparison operators can be used.
   * >, >=, <, <=, or = can be placed before a numeric search field to denote a
   * greater-than, greater-than-or-equal-to, less-than, less-than-or-equal-to search, or equality search, respectively.
   * Search items without a comparison operator are considered string-based searches.
   *
   * @param query the raw search string
   * @return a list of QueryTerms
   */
  public static List<QueryTerm> parse(String query) {
    //...
  }

  /**
   * Extracts the raw value of the input term, given that terms can follow a key:[comparison-operator]value syntax.
   * This method removes any syntactic characters from the input string, including comparison and wildcard operators,
   * as well as the property qualifier, e.g. "key".
   * As an example, extractTermValue("key:>=30") returns "30".
   *
   * Note that this method removes comparison operators from alphabetic strings as well, even though they do not qualify
   * for numeric search.
   * As an example, extractTermValue("+>=thirty") returns "thirty".
   *
   * If the value consists entirely of a single operator (e.g. ">=" or "+"), that operator will be returned.
   * As an example, extractTermValue("key:>=") returns ">=", despite it typically being a comparison operator. In this
   * example, ">=" does not precede anything, and is thus considered its own search term.
   *
   * @param term the search term, with all syntactic operators included
   * @return the raw value of the search term, with all syntactic operators excluded
   */
  public static String extractTermValue(String term) {
    //...
  }
}

Updated QueryTerm

QueryTerm.java

import java.util.Objects;

/**
 * Represents a single item in a search query in terms of its content (i.e. the value being searched for)
 * and any useful properties of the search term, e.g. its qualifier and search type.
 * Is typically constructed in a list via {@link QueryParser#parse(String)}
 */
public class QueryTerm {
  private final String term;
  private final Qualifier qualifier;
  private final SearchType searchType;
  private final Comparison comparison;

  /**
   * Defines the different types of search operators that can be used.
   * A qualifier determines how the search implementation should prioritize the given term, e.g.
   * prioritizing required terms over optional ones.
   */
  public enum Qualifier {
    OPTIONAL, REQUIRED
  }

  /**
   * Defines the different types of search terms that can be used.
   * A search type describes the intuitive object type of the term;
   * for instance, the term may be intuited as a number and parsed as one, though internally represented as a String.
   * Its search type would be considered NUMERIC.
   */
  public enum SearchType {
    STRING, NUMERIC
  }

  /**
   * Defines the different relationships a search term can have to potential matches.
   * For a String or keyword search, only EQUALS is valid.
   */
  public enum Comparison {
    EQUALS, GREATER, GREATER_OR_EQUAL, LESS, LESS_OR_EQUAL
  }

  /**
   * Older constructor that assumes a simple String search. Ineligible for numeric search fields.
   *
   * @param term the search term
   * @param qualifier the qualifying information {@link Qualifier}
   */
  public QueryTerm(String term, Qualifier qualifier) {
    this(term, qualifier, SearchType.STRING, Comparison.EQUALS);
  }

  /**
   * Constructs a QueryTerm using the search term, qualifying information, search type, and comparison type.
   *
   * @param term the search term
   * @param qualifier the qualifying information {@link Qualifier}
   * @param searchType the intuitive object type {@link SearchType}
   * @param comparison the desired relative value of potential matches {@link Comparison}
   */
  public QueryTerm(String term, Qualifier qualifier, SearchType searchType, Comparison comparison) {
    this.term = term;
    this.qualifier = qualifier;
    this.searchType = searchType;
    this.comparison = comparison;
  }

  public String getTerm() {
    return term;
  }

  public Qualifier getQualifier() {
    return qualifier;
  }

  public SearchType getSearchType() {
    return searchType;
  }

  public Comparison getComparison() {
    return comparison;
  }

  @Override
  public boolean equals(Object o) {
    if (o == this) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }

    QueryTerm that = (QueryTerm) o;

    return Objects.equals(term, that.getTerm())
        && Objects.equals(qualifier, that.getQualifier())
        && Objects.equals(searchType, that.getSearchType())
        && Objects.equals(comparison, that.getComparison());
  }

  @Override
  public int hashCode() {
    return Objects.hash(term, qualifier, searchType, comparison);
  }

  @Override
  public String toString() {
    return "term:" + term
        + ", qualifier: " + qualifier
        + ", searchType: " + searchType
        + ", comparison: " + comparison;
  }
}


Related Jira

  • CDAP-15685
  • CDAP-15703

Future Work

With the introduction of several new and implementation-specific metadata search features, a user-friendly way of navigating what features are available should be implemented.