...

Most research indicates feature parity between the two options, although Elasticsearch seems to have better REST API and JSON support. However, because Apache Solr is more favored in Hadoop-land (it is supported by more distributions, is the only search engine that Cloudera supports, and has support in Slider to run on YARN), it makes more sense as the first candidate for a search backend. The search backend can be made pluggable, though, so it could be swapped out for Elasticsearch in the future if users wish.
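As a rough illustration of what a pluggable search backend could look like, here is a minimal sketch. The names SearchBackend and InMemoryBackend, and the index/search method signatures, are hypothetical and not part of any existing API; a Solr or Elasticsearch implementation would sit behind the same interface. The return shape mirrors the "total" and "results" fields of the search response.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SearchBackend(ABC):
    """Hypothetical interface that a Solr or Elasticsearch backend would implement."""

    @abstractmethod
    def index(self, entity_id: str, text: str) -> None:
        """Make the given text searchable under entity_id."""

    @abstractmethod
    def search(self, query: str, offset: int = 0, size: int = 10) -> Dict[str, Any]:
        """Return {"total": <all matches>, "results": <one page of matches>}."""


class InMemoryBackend(SearchBackend):
    """Toy stand-in used only to show that callers depend on the interface."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}

    def index(self, entity_id: str, text: str) -> None:
        self._docs[entity_id] = text.lower()

    def search(self, query: str, offset: int = 0, size: int = 10) -> Dict[str, Any]:
        needle = query.lower()
        # Case-insensitive substring match, sorted for a stable page order.
        matches = sorted(eid for eid, text in self._docs.items() if needle in text)
        return {"total": len(matches), "results": matches[offset:offset + size]}


backend: SearchBackend = InMemoryBackend()
backend.index("PurchaseHistory", "Purchase PurchaseHistory")
backend.index("history", "history explore batch")
page = backend.search("history", offset=1, size=1)
```

Swapping the backend then only means constructing a different SearchBackend implementation; the calling code stays the same.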

Solr can either be run as a separate Twill Runnable, using logic like https://github.com/lucidworks/yarn-proto/blob/master/src/main/java/org/apache/solr/cloud/yarn/SolrMaster.java, or be housed inside the DatasetOpExecutorTwillRunnable. This decision depends on some prototyping.

Support for 4.0 requirements in Apache Solr

...

  1. results - Contains a set of search results matching the search query
  2. total - Specifies the total number of matched entities. This can be used to calculate the number of pages.
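The page calculation mentioned for total can be sketched as follows. The function names page_count and current_page are illustrative only; the sample values (total 142, offset 50, size 2) are taken from the example request and response on this page.

```python
import math


def page_count(total: int, size: int) -> int:
    """Number of pages needed to show `total` results at `size` per page."""
    if size <= 0:
        raise ValueError("size must be positive")
    return math.ceil(total / size)


def current_page(offset: int, size: int) -> int:
    """1-based number of the page that starts at (or contains) `offset`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return offset // size + 1


pages = page_count(total=142, size=2)   # 71 pages in total
page = current_page(offset=50, size=2)  # the request is for page 26
```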

 TODO: Given the format of the entityId object in the search response, figure out if sorting can be applied on the entity name.

$ curl 'http://localhost:11015/v3/namespaces/default/metadata/search?offset=50&size=2'
{
  "total": 142,
  "results": [
    {
      "entityId":{
         "id":{
            "applicationId":"PurchaseHistory",
            "namespace":{
               "id":"default"
            }
         },
         "type":"application"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "Flow:PurchaseFlow":"PurchaseFlow",
               "MapReduce:PurchaseHistoryBuilder":"PurchaseHistoryBuilder"
            },
            "tags":[
               "Purchase",
               "PurchaseHistory"
            ]
         }
      }
    },
    {
      "entityId":{
         "id":{
            "instanceId":"history",
            "namespace":{
               "id":"default"
            }
         },
         "type":"datasetinstance"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "type":"co.cask.cdap.examples.purchase.PurchaseHistoryStore"
            },
            "tags":[
               "history",
               "explore",
               "batch"
            ]
         }
      }
    }
  ]
}
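Regarding the TODO on sorting by entity name: the name lives under a different key depending on the entity type ("applicationId" for an application, "instanceId" for a dataset instance in the response above). One possible client-side reading is sketched below. It assumes, without confirmation from the API, that the name is always the single string-valued field inside "id"; the helper names entity_name and sort_by_name are hypothetical.

```python
from typing import Any, Dict, List


def entity_name(entity_id: Dict[str, Any]) -> str:
    """Return the name-like value of an entityId object.

    Assumption: exactly one field inside "id" holds a string (for example
    "applicationId" or "instanceId"); nested objects such as "namespace"
    are skipped.
    """
    names = [value for value in entity_id["id"].values() if isinstance(value, str)]
    if len(names) != 1:
        raise ValueError("cannot determine a unique entity name")
    return names[0]


def sort_by_name(results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sort search results case-insensitively by entity name."""
    return sorted(results, key=lambda result: entity_name(result["entityId"]).lower())


# Trimmed copies of the two results shown in the response above.
results = [
    {"entityId": {"id": {"applicationId": "PurchaseHistory",
                         "namespace": {"id": "default"}},
                  "type": "application"}},
    {"entityId": {"id": {"instanceId": "history",
                         "namespace": {"id": "default"}},
                  "type": "datasetinstance"}},
]
ordered_names = [entity_name(r["entityId"]) for r in sort_by_name(results)]
```

Sorting inside the search backend itself would most likely need the name indexed as its own field, which this sketch does not address.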

Status of an entity

Along with showing the metadata of an entity (name, description, tags, properties, etc.), one of the requirements for the home page is to show a brief 'status' for every entity, which is a summary of statistics and metrics. For each entity type, the status should surface:

Artifact: # apps, # extensions, # plugins

Application: Total # programs, # Running, # Stopped

Program

Dataset: Read Rate, Write Rate, # apps using it

Stream: Read Rate, Write Rate, # apps connected to it, # stream views created

Stream View: Read Rate, Write Rate, # apps connected to it

This information will not be surfaced from the metadata system. The UI will have to make separate calls potentially for:

  1. Metrics APIs for getting Read Rate and Write Rate
  2. Usage Registry for apps using datasets, streams and stream views
  3. App Fabric APIs for the remaining information.

For items 2 and 3, an alternative could be to provide a UI-only (undocumented) batch endpoint.
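To illustrate how the separate calls could be combined into one status object, here is a minimal sketch. It contains no real CDAP client code: build_status, fake_metrics and fake_usage are hypothetical names, and the stub return values are placeholders rather than real data.

```python
from typing import Any, Callable, Dict, Iterable

# A fetcher takes an entity name and returns one part of its status.
Fetcher = Callable[[str], Dict[str, Any]]


def build_status(entity: str, fetchers: Iterable[Fetcher]) -> Dict[str, Any]:
    """Merge the partial results of each source into a single status dict."""
    status: Dict[str, Any] = {}
    for fetch in fetchers:
        status.update(fetch(entity))
    return status


# Hypothetical stand-ins for the Metrics APIs and the Usage Registry.
def fake_metrics(entity: str) -> Dict[str, Any]:
    return {"readRate": 0.0, "writeRate": 0.0}


def fake_usage(entity: str) -> Dict[str, Any]:
    return {"apps": ["PurchaseHistory"]}


status = build_status("history", [fake_metrics, fake_usage])
```

A batch endpoint would move this merging to the server side, so the UI makes one request per page of entities instead of several requests per entity.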