Overview
This document captures the design of enhancements to data discovery in 4.0.
Requirements
The main requirements influencing these enhancements are:
- Support configurable sorting for search results. Preferably, both sortBy and sortOrder should be supported. In addition, it would be nice to support multiple combinations of sortBy and sortOrder.
- Support pagination for search results. The API should accept offset (defines the start position in the search results) and limit (defines the number of results to show at a time) parameters.
- Search queries should be able to filter results by one or more entity types.
- Metadata for every search result should include (**needs confirmation**):
  - name
  - description
  - creation time
  - version
  - entity type
  - owner
  - any more?
- Potential requirement: Ability to annotate (if not filter) an entity by scope (SYSTEM vs USER).
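To make the sorting, pagination, and filtering requirements concrete, the following is a minimal in-memory sketch in Java. It is not the CDAP API: the class names SearchResult and SearchSketch, the sortBy values "name" and "creation-time", and the sortOrder values "asc" and "desc" are all assumptions made for illustration. An empty set of entity types is assumed to mean "no filter".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;

/** Hypothetical search result carrying a subset of the metadata fields listed above. */
final class SearchResult {
  final String name;
  final String entityType;
  final long creationTime;

  SearchResult(String name, String entityType, long creationTime) {
    this.name = name;
    this.entityType = entityType;
    this.creationTime = creationTime;
  }
}

/** Hypothetical helper illustrating filter, then sort, then paginate. */
final class SearchSketch {
  private SearchSketch() { }

  /** Builds a comparator from a sortBy field and a sortOrder (both value sets are assumed). */
  static Comparator<SearchResult> comparator(String sortBy, String sortOrder) {
    Comparator<SearchResult> comparator;
    switch (sortBy) {
      case "name":
        comparator = Comparator.comparing((SearchResult r) -> r.name);
        break;
      case "creation-time":
        comparator = Comparator.comparingLong((SearchResult r) -> r.creationTime);
        break;
      default:
        throw new IllegalArgumentException("Unsupported sortBy: " + sortBy);
    }
    switch (sortOrder) {
      case "asc":
        return comparator;
      case "desc":
        return comparator.reversed();
      default:
        throw new IllegalArgumentException("Unsupported sortOrder: " + sortOrder);
    }
  }

  /** Filters by entity type, sorts, then returns the page starting at offset with at most limit items. */
  static List<SearchResult> search(List<SearchResult> all, Set<String> entityTypes,
                                   String sortBy, String sortOrder, int offset, int limit) {
    if (offset < 0 || limit < 0) {
      throw new IllegalArgumentException("offset and limit must be non-negative");
    }
    List<SearchResult> filtered = new ArrayList<>();
    for (SearchResult result : all) {
      // An empty type set is treated as "match every entity type" (assumption).
      if (entityTypes.isEmpty() || entityTypes.contains(result.entityType)) {
        filtered.add(result);
      }
    }
    filtered.sort(comparator(sortBy, sortOrder));
    // Clamp the window so an offset past the end yields an empty page instead of an exception.
    int from = Math.min(offset, filtered.size());
    int to = (int) Math.min((long) from + limit, filtered.size());
    return new ArrayList<>(filtered.subList(from, to));
  }
}
```

Supporting multiple sortBy and sortOrder combinations could be layered on top of this by chaining the per-pair comparators with Comparator.thenComparing. Note that this sketch sorts the full filtered set before slicing a page, which is only practical when all results fit in memory.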
User Stories
- As a CDAP user, I should be able to search all entities (artifacts, applications, programs, datasets, streams, views) sorted by name and/or creation time.
- As a CDAP user, I should be able to paginate search results by specifying a page size. In addition, I should be able to specify the offset from where to return search results.
- As a CDAP user, I should be able to filter search results by a given entity type
Design
Alternatives
The CDAP search backend is currently implemented using an IndexedTable. Adding sorting and pagination to this implementation may be difficult and may introduce performance bottlenecks, due to multiple HBase scans. Also, the eventual goal of CDAP is to move from the current IndexedTable-backed search to an external search engine. More details about alternatives for search are at External Search and Indexing Engine Investigation.