User stories documented (Bhooshan)
User stories reviewed (Nitin)
User stories reviewed (Todd)
Requirements documented (Bhooshan)
Requirements Reviewed (Nitin/Todd)
Design Documented (Bhooshan)
Design Reviewed (Andreas/Terence/Poorna)
Implementation
Documentation

Requirements

The main requirements influencing these enhancements are:

...

As a CDAP user, I should be able to search all entities (artifacts, applications, programs, datasets, streams, views) sorted by name and/or creation time.
As a CDAP user, I should be able to paginate search results by specifying a page size. In addition, I should be able to specify the offset from where to return search results.
As a CDAP user, I should be able to filter search results by a given entity type

Design

Alternatives

The CDAP search backend is currently implemented using an IndexedTable. Adding sorting and pagination on top of this implementation may be difficult and may introduce performance bottlenecks because of multiple potential HBase scans. Also, an index would have to be stored for each sortBy and sortOrder combination. An alternative is to fetch all results for the provided search query and then sort them in memory. However, this option is not viable in a big data scenario, where the full result set may be large.
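To make the in-memory alternative concrete, here is a minimal sketch of it. The function name, the record fields (`name`, `created`), and the sample data are hypothetical and are not part of CDAP. The point it illustrates is that every page request must first load and sort all matches, even when only a few rows are returned.

```python
from operator import itemgetter


def search_in_memory(results, sort_by="name", descending=False, offset=0, size=10):
    """Sort every match in memory, then slice out a single page."""
    # The whole result set is sorted even though only `size` rows are returned.
    ordered = sorted(results, key=itemgetter(sort_by), reverse=descending)
    return ordered[offset:offset + size]


# Hypothetical search matches; in practice these would come from the index scan.
matches = [
    {"name": "PurchaseHistory", "created": 3},
    {"name": "history", "created": 1},
    {"name": "PurchaseFlow", "created": 2},
]

page = search_in_memory(matches, sort_by="created", offset=1, size=2)
page_names = [record["name"] for record in page]
```

With three matches the cost is negligible, but the sort is repeated for every page and every sort order, which is what makes this approach unattractive for large result sets.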

...

Most research indicates feature parity between the two options, although Elasticsearch seems to have better REST API and JSON support. However, because Apache Solr is more favored in the Hadoop ecosystem (it is supported by more distributions, is the only search engine that Cloudera supports, and has support in Slider to run on YARN), it makes more sense as the first candidate for a search backend. The search backend can be made pluggable, however, so it could be swapped out for Elasticsearch if users wish to do so in the future.
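One way to picture a pluggable backend is a small interface that each engine implements, with the concrete class chosen by a configured name. The sketch below is only an illustration: `SearchBackend`, `ListBackend`, `BACKENDS`, and `create_backend` are hypothetical names, and the toy backend stands in for a real Solr or Elasticsearch client.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """Hypothetical contract that every pluggable search backend would satisfy."""

    @abstractmethod
    def search(self, query, offset=0, size=10):
        """Return a tuple of (total number of matches, one page of results)."""


class ListBackend(SearchBackend):
    """Toy stand-in for a real engine; matches on a substring of the name."""

    def __init__(self, names):
        self._names = sorted(names)

    def search(self, query, offset=0, size=10):
        hits = [name for name in self._names if query.lower() in name.lower()]
        return len(hits), hits[offset:offset + size]


# Backends are looked up by a configured name, so swapping engines
# would be a configuration change rather than a code change.
BACKENDS = {"list": ListBackend}


def create_backend(kind, *args):
    """Instantiate the backend registered under `kind`."""
    return BACKENDS[kind](*args)


backend = create_backend("list", ["PurchaseHistory", "history", "PurchaseFlow"])
total, page = backend.search("purchase", offset=0, size=1)
```

Callers depend only on `SearchBackend.search`, so a Solr-backed class and an Elasticsearch-backed class could be registered side by side under different names.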

4.0 Requirements in Apache Solr

Sorting (including multiple sort orderings) is supported in Apache Solr using the sort parameter.

Pagination is supported as a combination of the start and rows parameters.
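As a rough illustration of how these parameters combine, the sketch below builds a Solr select URL. The `q`, `sort`, `start`, `rows`, and `fq` parameters are standard Solr query parameters; the core name `metadata`, the field names (`name`, `creation_time`, `entity_type`), and the helper `solr_select_url` are assumptions made for the example and do not describe the actual CDAP schema.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def solr_select_url(base_url, query, sort, offset, size, entity_type=None):
    """Build a Solr select URL with sorting, pagination, and an optional type filter."""
    params = [
        ("q", query),
        # Multiple sort orderings are comma separated, e.g. "name asc, creation_time desc".
        ("sort", ", ".join(f"{field} {order}" for field, order in sort)),
        ("start", offset),  # zero-based offset of the first result to return
        ("rows", size),     # page size
    ]
    if entity_type is not None:
        # A filter query restricts the results without affecting how they are sorted.
        params.append(("fq", f"entity_type:{entity_type}"))
    return f"{base_url}/select?{urlencode(params)}"


url = solr_select_url(
    "http://localhost:8983/solr/metadata",  # hypothetical core name
    "purchase*",
    sort=[("name", "asc"), ("creation_time", "desc")],
    offset=50,
    size=2,
    entity_type="application",
)

# Decode the query string again to inspect the parameters that would be sent.
params = parse_qs(urlsplit(url).query)
```

Here `offset` and `size` from the CDAP REST API map directly onto Solr's `start` and `rows`, and filtering by entity type maps onto a filter query.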

Running Apache Solr

Distributed Mode

Solr can be run either as a separate Twill Runnable, using logic like https://github.com/lucidworks/yarn-proto/blob/master/src/main/java/org/apache/solr/cloud/yarn/SolrMaster.java, or housed inside the DatasetOpExecutorTwillRunnable. This decision depends on some prototyping.

...

Standalone Mode

Solr supports a standalone mode, which starts up a separate Solr process. However, we prefer to use EmbeddedSolrServer, running in the same process as standalone CDAP.

InMemory Mode

CDAP will use EmbeddedSolrServer in in-memory mode.

REST API changes

The following changes would be made in the metadata search RESTful API:

...

$ curl 'http://localhost:11015/v3/namespaces/default/metadata/search?offset=50&size=2'
{
  "total": 142,
  "results": [
    {
      "entityId":{
         "id":{
            "applicationId":"PurchaseHistory",
            "namespace":{
               "id":"default"
            }
         },
         "type":"application"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "Flow:PurchaseFlow":"PurchaseFlow",
               "MapReduce:PurchaseHistoryBuilder":"PurchaseHistoryBuilder"
            },
            "tags":[
               "Purchase",
               "PurchaseHistory"
            ]
         }
      }
    },
    {
      "entityId":{
         "id":{
            "instanceId":"history",
            "namespace":{
               "id":"default"
            }
         },
         "type":"datasetinstance"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "type":"co.cask.cdap.examples.purchase.PurchaseHistoryStore"
            },
            "tags":[
               "history",
               "explore",
               "batch"
            ]
         }
      }
    }
  ]
}
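A client could use the `total` field together with the number of returned results to decide whether another page exists. The sketch below assumes the response shape shown above; the helper `next_offset` and the trimmed sample payload are hypothetical.

```python
import json


def next_offset(response, offset):
    """Return the offset of the next page, or None when this page is the last one."""
    following = offset + len(response["results"])
    return following if following < response["total"] else None


# Trimmed-down version of the example response, keeping only the fields used here.
sample = json.loads(
    '{"total": 142, "results": ['
    '{"entityId": {"type": "application"}},'
    '{"entityId": {"type": "datasetinstance"}}]}'
)

entity_types = [result["entityId"]["type"] for result in sample["results"]]
following = next_offset(sample, offset=50)
last_page = next_offset(sample, offset=140)
```

Starting from `offset=50` with two results, the next request would use `offset=52`; once the offset plus the page length reaches `total`, the helper returns `None` and the client stops paging.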

Status of an Entity

Along with showing the metadata of an entity (name, description, tags, properties, etc.), one of the requirements for the home page is to show a brief 'status' for every entity, which is a summary of its statistics and metrics. For each entity type, the status should surface:

...
