Overview

This document captures the design of enhancements to data discovery in 4.0. Its main goal is to serve the Listing Center Home Page of CDAP 4.0.

...

  •  User stories documented (Bhooshan)
  •  User stories reviewed (Nitin)
  •  User stories reviewed (Todd)
  •  Requirements documented (Bhooshan)
  •  Requirements Reviewed (Nitin/Todd)
  •  Design Documented (Bhooshan)
  •  Design Reviewed (Andreas/Terence/Poorna)
  •  Implementation
  •  Documentation

...

Requirements

The main requirements influencing these enhancements are:

...

Most research indicates feature parity between the two options, although Elasticsearch seems to have better REST API and JSON support. However, because Apache Solr is more favored in the Hadoop ecosystem (it is supported by more distributions, is the only search engine that Cloudera supports, and has support in Slider to run on YARN), it makes more sense as the first candidate for a search backend. The search backend can nonetheless be made pluggable (as an extension loaded in its own classloader through an SPI), so it could be swapped out for Elasticsearch in the future if users wish.

...

Solr can be run either as a separate Twill Runnable, using logic like https://github.com/lucidworks/yarn-proto/blob/master/src/main/java/org/apache/solr/cloud/yarn/SolrMaster.java, or housed inside the DatasetOpExecutorTwillRunnable. This decision depends on some prototyping. Solr will be configured to use HDFS for persistence.

Standalone Mode

Solr supports a standalone mode, which starts up a separate Solr process. However, we prefer to use EmbeddedSolrServer, running in the same process as standalone CDAP.

...

CDAP will use EmbeddedSolrServer in in-memory mode.

Data Flow


As in 3.5, there would be a call to update the index every time the metadata of an entity is updated. Unlike in 3.5, though, this call would be an HTTP call to the Search Service (running Solr in 4.0).

Note: Since this call is now an HTTP call:

  1. Should it be asynchronous?
  2. It will happen outside of the transaction that updates the Metadata Dataset.
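
If the answer to the first question is yes, one possible shape is a background worker that drains a queue of index updates after the metadata transaction has committed. The following is only a minimal sketch of that idea, not the planned implementation: `AsyncIndexer` and `index_fn` are hypothetical names, and `index_fn` stands in for whatever HTTP call the Search Service ends up exposing.

```python
import queue
import threading


class AsyncIndexer:
    """Hypothetical helper: forwards index updates on a background thread."""

    def __init__(self, index_fn):
        # index_fn(entity_id, metadata) is an assumed stand-in for the HTTP
        # call to the Search Service; it is not an existing CDAP API.
        self._index_fn = index_fn
        self._queue = queue.Queue()
        self.failures = []  # updates that could not be indexed
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, entity_id, metadata):
        """Enqueue an update; call this only after the metadata commit."""
        self._queue.put((entity_id, metadata))

    def _run(self):
        while True:
            item = self._queue.get()
            try:
                if item is None:  # sentinel: stop the worker
                    return
                self._index_fn(*item)
            except Exception as exc:
                # The caller is never blocked or failed by an indexing error;
                # the failure is recorded so a later sync can repair it.
                self.failures.append((item, exc))
            finally:
                self._queue.task_done()

    def close(self):
        """Process everything already queued, then stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()
```

Because the update runs outside the transaction, a failed call leaves the index stale rather than rolling back the metadata change, which is one reason a separate sync mechanism is still needed.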

Index Sync

Since the persistence stores for metadata and the search index will be different, we will need a utility to keep them in sync. This could be a service/thread that runs periodically (preferred), or a tool that is invoked manually.
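
The core of such a utility is a comparison between the two stores. As a rough sketch only, assuming both stores can be read as mappings from an entity id to its metadata document (the function name `plan_sync` and the dict-based representation are illustrative assumptions, not part of the design), the reconciliation step could look like this:

```python
def plan_sync(metadata_store, search_index):
    """Return (to_upsert, to_delete) needed to make the index match the store.

    Both arguments are assumed to be dicts keyed by entity id; the metadata
    store is treated as the source of truth.
    """
    # Entities missing from the index, or indexed with outdated content.
    to_upsert = {
        entity_id: doc
        for entity_id, doc in metadata_store.items()
        if entity_id not in search_index or search_index[entity_id] != doc
    }
    # Entities still indexed although their metadata no longer exists.
    to_delete = sorted(
        entity_id for entity_id in search_index
        if entity_id not in metadata_store
    )
    return to_upsert, to_delete
```

A periodic service would call this on a schedule and apply the resulting plan to the search backend, while a manual tool would run it once on demand.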

Upgrade

There should be a way to upgrade existing indexes so that they are stored in the new search backend. The index sync tool should be developed so that it can be run via the Upgrade Tool to load existing metadata into the new search backend.

Indexes (TBD)

  1. What is the schema of data to be indexed in the new search backend?

REST API changes

The following changes would be made in the metadata search RESTful API:

...

In addition to the above input parameters, the response would contain two fields:

  1. results - Contains the set of search results matching the search query.
  2. total - Specifies the total number of matched entities. This can be used to calculate the number of pages.
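
The page calculation mentioned for `total` is a ceiling division by the page size. A minimal sketch follows, assuming `offset` is zero-based and pages are numbered from 1 (neither is stated above); `page_count` and `page_offset` are hypothetical client-side helpers.

```python
def page_count(total, size):
    """Number of pages needed to show `total` results at `size` per page."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Integer ceiling division: a partly filled last page still counts.
    return (total + size - 1) // size


def page_offset(page, size):
    """Offset of the first result on a 1-based page (assumes 0-based offsets)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return (page - 1) * size
```

For example, with `total` = 142 and `size` = 10, `page_count` gives 15 pages, and the last page starts at offset 140.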

...

$ curl "http://localhost:11015/v3/namespaces/default/metadata/search?offset=50&size=2"
{
  "sort": "name asc,created_time desc",
  "offset": 50,
  "size": 2,
  "total": 142,
  "results": [
    {
      "entityId":{
         "id":{
            "applicationId":"PurchaseHistory",
            "namespace":{
               "id":"default"
            }
         },
         "type":"application"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "Flow:PurchaseFlow":"PurchaseFlow",
               "MapReduce:PurchaseHistoryBuilder":"PurchaseHistoryBuilder"
            },
            "tags":[
               "Purchase",
               "PurchaseHistory"
            ]
         }
      }
    },
    {
      "entityId":{
         "id":{
            "instanceId":"history",
            "namespace":{
               "id":"default"
            }
         },
         "type":"datasetinstance"
      },
      "metadata":{
         "SYSTEM":{
            "properties":{
               "type":"co.cask.cdap.examples.purchase.PurchaseHistoryStore"
            },
            "tags":[
               "history",
               "explore",
               "batch"
            ]
         }
      }
    }
  ]
}
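
To illustrate how a client such as the UI might consume this response, here is a small sketch that extracts the total and, for each result, the entity type and its tags. It relies only on the fields visible in the sample above; the function name `summarize` is hypothetical, and the handling of missing fields is an assumption.

```python
import json


def summarize(response_text):
    """Return (total, rows) where each row is (entity_type, tags)."""
    body = json.loads(response_text)
    rows = []
    for result in body.get("results", []):
        entity_type = result["entityId"]["type"]
        tags = []
        # Metadata is grouped by scope (the sample shows only "SYSTEM");
        # collect the tags from every scope that is present.
        for scope in result.get("metadata", {}).values():
            tags.extend(scope.get("tags", []))
        rows.append((entity_type, tags))
    return body["total"], rows
```

Applied to the sample response, this would yield the total of 142 together with one row for the application and one for the dataset instance.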

...

For 2 and 3, an alternative would be to provide a UI-only (undocumented) batch endpoint.

...

Dataset Types in Metadata System

Currently, the Metadata System only supports artifacts, applications, programs, datasets, streams and stream views as entities. Is support for dataset types and modules necessary for 4.0?