This document captures the design of enhancements to data discovery in 4.0. Its main goal is to serve the Listing Center Home Page of CDAP 4.0.

Requirements

The main requirements influencing these enhancements are:

Support configurable sorting for search results. Preferably, both sortBy and sortOrder should be supported. In addition, it would be nice to support sorting on multiple sortBy and sortOrder combinations in a single query.
Support pagination for search results. The API should accept offset (defines the start position in the search results) and limit (defines the number of results to show at a time) parameters.
Search queries should be able to filter results by one or more entity types.
Metadata for every search result should include (**needs confirmation**):
1. name
2. description
3. creation time
4. version
5. entity type
6. owner
7. any more? Status - composed of statistics, current state, etc. of the entity?
Potential requirement: Ability to annotate (if not filter) an entity by scope (SYSTEM vs USER)

...

The CDAP search backend is currently implemented using an IndexedTable. Adding sorting and pagination to this implementation may be difficult and may introduce performance bottlenecks because of multiple potential HBase scans. Also, an index would have to be stored per sortBy and sortOrder combination. An alternative is to fetch all results for the provided search query and then sort them in memory, but this option is not viable in a big data scenario.
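
To make the cost of the in-memory alternative concrete, here is a minimal sketch in Python. The record layout (`name`, `created_time`) and the function name `sort_and_page` are assumptions chosen for illustration, not part of the CDAP code base.

```python
def sort_and_page(results, sort_key, descending, offset, limit):
    """Sort all matching results in memory, then cut out one page.

    `results` is the full list of matches for the query (hypothetical
    dict records). Every page request pays for materializing and
    sorting the complete result set.
    """
    ordered = sorted(results, key=lambda r: r[sort_key], reverse=descending)
    return ordered[offset:offset + limit]


# Hypothetical search results for illustration only.
matches = [
    {"name": "b", "created_time": 2},
    {"name": "a", "created_time": 3},
    {"name": "c", "created_time": 1},
]
first_page = sort_and_page(matches, "name", False, offset=0, limit=2)
```

Because the whole result set has to be fetched and sorted before any page can be returned, memory use and latency grow with the number of matches rather than with the page size.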

The eventual goal of CDAP is to move from the current IndexedTable-backed search to an external search engine. The major motivations for that are to facilitate richer search queries and full-text search. Some initial investigation of search alternatives is documented in External Search and Indexing Engine Investigation. Summaries of the two most viable alternatives, Apache Solr and Elasticsearch, can be found at these links:

[1] http://solr-vs-elasticsearch.com/

[2] https://thinkbiganalytics.com/solr-vs-elastic-search/

Most research indicates feature parity between the two options, although Elasticsearch seems to have better REST API and JSON support. However, because Apache Solr is more favored in the Hadoop ecosystem (it is supported by more distributions, is the only search engine that Cloudera supports, and has support in Slider to run on YARN), it makes more sense as the first candidate for a search backend. The search backend can nevertheless be made pluggable, so that it could be swapped out for Elasticsearch in the future if users wish.

Support for 4.0 requirements in Apache Solr

Sorting (including multiple sort orderings) is supported in Apache Solr using the sort parameter.

Pagination is supported as a combination of the start and rows parameters.
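
As a rough illustration of how these Solr parameters combine, the sketch below builds the query parameters for a sorted, paginated request. `q`, `sort`, `start` and `rows` are standard Solr query parameters; the function name `solr_params`, the field names and the example query are assumptions made up for this example.

```python
from urllib.parse import urlencode


def solr_params(query, sort_fields, offset=0, limit=10):
    """Build Solr query parameters for a sorted, paginated search.

    `sort_fields` is a list of (field, order) pairs, for example
    [("name", "asc")]. Field names here are hypothetical.
    """
    params = {"q": query, "start": offset, "rows": limit}
    if sort_fields:
        # Solr accepts several "field order" clauses separated by commas.
        params["sort"] = ",".join(f"{field} {order}" for field, order in sort_fields)
    return params


# Third page of 10 results, sorted by name and then by newest first.
params = solr_params(
    "purchase",
    [("name", "asc"), ("created_time", "desc")],
    offset=20,
    limit=10,
)
query_string = urlencode(params)
```

The resulting query string would be appended to the select handler URL of whichever Solr collection holds the metadata index.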

REST API changes

The following changes would be made in the metadata search RESTful API:

a sort parameter that specifies the sort query. It contains a comma-separated list of sort fields and sort order. e.g. sort=name%20asc,created_time%20desc
an offset parameter that specifies the offset into the search results. Defaults to 0.
a size parameter that specifies the number of results to return, starting at the offset. Defaults to Integer.MAX_VALUE.
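
The parameters above could be exercised roughly as follows. This is only a sketch: the base URL, the `query` parameter name and both helper functions are assumptions for illustration, and the default for `size` mirrors Java's Integer.MAX_VALUE.

```python
from urllib.parse import quote, urlencode

INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


def search_url(base_url, query, sort=None, offset=0, size=INT_MAX):
    """Build a search request URL (client side).

    `base_url` and the `query` parameter name are hypothetical.
    """
    params = {"query": query, "offset": offset, "size": size}
    if sort:
        params["sort"] = sort
    # Encode spaces as %20 and keep commas literal, as in the example above.
    return base_url + "?" + urlencode(params, safe=",", quote_via=quote)


def parse_sort(sort):
    """Split a decoded sort value into (field, order) pairs (server side)."""
    clauses = []
    for clause in sort.split(","):
        parts = clause.split()
        if len(parts) != 2 or parts[1].lower() not in ("asc", "desc"):
            raise ValueError(f"invalid sort clause: {clause!r}")
        clauses.append((parts[0], parts[1].lower()))
    return clauses


url = search_url(
    "http://example.com/search",
    "purchase",
    sort="name asc,created_time desc",
    offset=20,
    size=10,
)
clauses = parse_sort("name asc,created_time desc")
```

Rejecting malformed sort clauses early keeps invalid input from reaching the search backend.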

The response would contain 2 fields:

results - Contains a set of search results matching the search query
total - specifies the total number of matched entities. This can be used to calculate the number of pages.
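
A client could derive the page count and the offset of a given page from these two fields along the following lines. The response literal and the helper names are assumptions for illustration; only the `results` and `total` fields come from the design above.

```python
def page_count(total, size):
    """Number of pages needed to show `total` results, `size` per page."""
    if size <= 0:
        raise ValueError("size must be positive")
    return (total + size - 1) // size  # integer ceiling of total / size


def page_offset(page, size):
    """Offset of the first result on a zero-based page number."""
    return page * size


# Hypothetical response body, already decoded from JSON.
response = {"results": ["..."], "total": 45}

pages = page_count(response["total"], size=10)  # 45 results, 10 per page: 5 pages
last_page_offset = page_offset(pages - 1, size=10)  # offset 40
```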
