Overview

This document captures the design of enhancements to data discovery in CDAP 4.0. Its main goal is to serve the Listing Center Home Page.

Requirements

The main requirements influencing these enhancements are:

  1. Support configurable sorting for search results. Preferably, both sortBy and sortOrder should be supported. Ideally, multiple combinations of sortBy and sortOrder should also be supported.
  2. Support pagination for search results. The API should accept offset (defines the start position in the search results) and limit (defines the number of results to show at a time) parameters.
  3. Search queries should be able to filter results by one or more entity types.
  4. Metadata for every search result should include (**needs confirmation**):
    1. name
    2. description
    3. creation time
    4. version
    5. entity type
    6. owner
    7. status (composed of statistics, current state, etc. of the entity?)
  5. Potential requirement: Ability to annotate (if not filter) an entity by scope (SYSTEM vs USER)

User Stories

  1. As a CDAP user, I should be able to search all entities (artifacts, applications, programs, datasets, streams, views) sorted by name and/or creation time.
  2. As a CDAP user, I should be able to paginate search results by specifying a page size. In addition, I should be able to specify the offset from where to return search results.
  3. As a CDAP user, I should be able to filter search results by a given entity type.

Design

Alternatives

The CDAP search backend is currently implemented using an IndexedTable. Adding sorting and pagination to this implementation may be difficult and may introduce performance bottlenecks, because a single search could require multiple HBase scans. In addition, a separate index would have to be stored for each sortBy and sortOrder combination. An alternative is to fetch all results for the search query and then sort them in memory. In a big data scenario, however, this option is not viable, because the result set may be too large to fetch and sort on every request.
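
The in-memory alternative can be sketched roughly as follows. This is only an illustration of the approach, not CDAP code: the function name `search_in_memory`, the field names `name` and `creation_time`, and the sample hits are all assumptions made for the example.

```python
from operator import itemgetter


def search_in_memory(results, sort_by="name", sort_order="asc", offset=0, limit=10):
    """Sort the complete result set in memory, then slice out one page."""
    # The whole list is sorted on every request, even though only
    # `limit` entries are returned. This is the cost that grows with data size.
    ordered = sorted(results, key=itemgetter(sort_by), reverse=(sort_order == "desc"))
    return ordered[offset:offset + limit]


# Hypothetical search hits; the field names are illustrative only.
hits = [
    {"name": "PurchaseHistory", "creation_time": 300},
    {"name": "history", "creation_time": 100},
    {"name": "PurchaseFlow", "creation_time": 200},
]

page = search_in_memory(hits, sort_by="creation_time", sort_order="desc", offset=1, limit=2)
```

Each call has to materialize and sort every match before it can return a single page, which is why this approach does not scale to large result sets.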

...

Solr can either be run as a separate Twill Runnable, using logic like https://github.com/lucidworks/yarn-proto/blob/master/src/main/java/org/apache/solr/cloud/yarn/SolrMaster.java, or be housed inside the DatasetOpExecutorTwillRunnable. This decision depends on some prototyping.

Support for 4.0 requirements in Apache Solr

Sorting (including multiple sort orderings) is supported in Apache Solr using the sort parameter.

Pagination is supported as a combination of the start and rows parameters.
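
As a rough sketch of how the requirements could map onto these Solr parameters, the helper below turns a list of (sortBy, sortOrder) pairs plus offset and limit into sort, start and rows. The helper name `to_solr_params`, the field names `name` and `creation_time`, and the query string `purchase*` are assumptions for illustration; the actual Solr schema is not defined in this document.

```python
from urllib.parse import urlencode


def to_solr_params(query, sorts=(), offset=0, limit=10):
    """Build Solr query parameters from search, sort and pagination inputs."""
    if offset < 0 or limit < 0:
        raise ValueError("offset and limit must be non-negative")
    # `start` is the position of the first result, `rows` is the page size.
    params = {"q": query, "start": offset, "rows": limit}
    if sorts:
        # Solr accepts several comma-separated "<field> <direction>" clauses.
        params["sort"] = ",".join(f"{field} {order}" for field, order in sorts)
    return params


params = to_solr_params(
    "purchase*",
    sorts=[("name", "asc"), ("creation_time", "desc")],  # hypothetical fields
    offset=50,
    limit=2,
)
query_string = urlencode(params)  # ready to append to a Solr select URL
```

With this mapping, the offset and limit of the search API translate directly to start and rows, and each additional sort combination is just another clause in the sort parameter rather than another stored index.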

REST API changes

The following changes would be made in the metadata search RESTful API:

...

Code Block
$ curl "http://localhost:11015/v3/namespaces/default/metadata/search?offset=50&size=2"
{
  "total": 142,
  "results": [
    {
      "entityId":{
         "id":{
            "applicationId":"PurchaseHistory",
            "namespace":{
               "id":"default"
            }
         },
         "type":"application"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "Flow:PurchaseFlow":"PurchaseFlow",
               "MapReduce:PurchaseHistoryBuilder":"PurchaseHistoryBuilder"
            },
            "tags":[
               "Purchase",
               "PurchaseHistory"
            ]
         }
      }
    },
    {
      "entityId":{
         "id":{
            "instanceId":"history",
            "namespace":{
               "id":"default"
            }
         },
         "type":"datasetinstance"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "type":"co.cask.cdap.examples.purchase.PurchaseHistoryStore"
            },
            "tags":[
               "history",
               "explore",
               "batch"
            ]
         }
      }
    }
  ]
}
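
A client could use the total field to decide whether another page is available. The sketch below assumes only the response shape shown above (a total count plus a results list); the helper name `next_offset` and the trimmed sample payload are illustrative.

```python
import json


def next_offset(response, offset):
    """Return the offset of the next page, or None when there are no more results."""
    returned = len(response["results"])
    following = offset + returned
    # Stop if the server returned nothing or we have reached the total.
    if returned == 0 or following >= response["total"]:
        return None
    return following


# Trimmed-down version of the sample response, keeping only the fields used here.
sample = json.loads("""
{
  "total": 142,
  "results": [
    {"entityId": {"type": "application"}},
    {"entityId": {"type": "datasetinstance"}}
  ]
}
""")

entity_types = [result["entityId"]["type"] for result in sample["results"]]
following = next_offset(sample, offset=50)
```

For the request above (offset 50, two results out of 142), the next page would start at offset 52, and the last page is reached once the offset plus the number of returned results equals the total.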

Status of an entity

Along with showing the metadata of an entity (name, description, tags, properties, etc.), the home page must also show a brief 'status' for every entity, summarizing its statistics and metrics. For each entity type, status should surface:

...