External Search and Indexing Engine Investigation

Proposed External Search and Indexing Engine

To support full-text search over the metadata, CDAP needs an inverted index that maps keywords to the objects whose metadata contains them.

To provide this, we will use an external search and indexing engine to build the index over all the metadata tables and return the matching CDAP object information.
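To make the keyword-to-object mapping concrete, here is a minimal in-memory sketch of such an index using only the Java standard library. The class name MetadataInvertedIndex, the object ids and the simple tokenization rule are all assumptions made for illustration; a real engine such as Solr or Lucene adds analysis, ranking and persistence on top of the same idea.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not an existing CDAP class.
final class MetadataInvertedIndex {
  // keyword -> ids of the objects whose metadata contains that keyword
  private final Map<String, Set<String>> postings = new HashMap<>();

  // Splits the metadata text into lowercase keywords and records the object under each one.
  void add(String objectId, String metadataText) {
    for (String token : metadataText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty()) {
        postings.computeIfAbsent(token, k -> new TreeSet<>()).add(objectId);
      }
    }
  }

  // Returns the ids of all objects whose metadata contains the keyword (case-insensitive).
  Set<String> search(String keyword) {
    Set<String> hits = postings.get(keyword.toLowerCase(Locale.ROOT));
    return hits == null ? Collections.<String>emptySet() : Collections.unmodifiableSet(hits);
  }

  public static void main(String[] args) {
    MetadataInvertedIndex index = new MetadataInvertedIndex();
    index.add("stream:purchases", "raw purchase events, owner=finance");
    index.add("dataset:history", "purchase history per customer");
    System.out.println(index.search("purchase")); // both objects
    System.out.println(index.search("finance"));  // only the stream
  }
}
```

The lookup goes from keyword to objects, which is the reverse of how the metadata tables are keyed, and that is why a separate index is needed.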

There are several potential search and indexing engines we could integrate with CDAP:

  1. Apache Solr (https://cwiki.apache.org/confluence/display/solr)

  2. Apache Lucene (https://lucene.apache.org/core)

  3. Lily project from NGData (https://github.com/henry-cask/lilyproject)

All three projects are licensed under the Apache License 2.0, so any of them could be integrated with CDAP.

Apache Solr

https://wiki.apache.org/solr/HowToReindex

We could deploy a single-node Search System Service (see next section) that runs Solr inside a container managed by YARN.

Apache Solr would run as a separate web application (https://cwiki.apache.org/confluence/display/solr/Running+Solr) in the YARN container where the CDAP search service runnable runs. The CDAP search service would communicate with Solr over HTTP.
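As a rough sketch of what the HTTP communication could look like, the helper below builds a request URL for Solr's standard select handler. The base URL, the collection name cdap_metadata and the field name tags are assumptions for illustration and are not defined anywhere in this design; only the /select path and the q, rows and wt parameters come from Solr itself.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper; not part of CDAP or Solr.
final class SolrQueryUrl {
  // baseUrl and collection are deployment-specific assumptions,
  // e.g. "http://localhost:8983/solr" and "cdap_metadata".
  static String selectUrl(String baseUrl, String collection, String query, int rows) {
    try {
      return baseUrl + "/" + collection + "/select"
          + "?q=" + URLEncoder.encode(query, "UTF-8") // URL-encode the query string
          + "&rows=" + rows                            // maximum number of hits to return
          + "&wt=json";                                // ask for a JSON response
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  public static void main(String[] args) {
    // prints http://localhost:8983/solr/cdap_metadata/select?q=tags%3Afinance&rows=10&wt=json
    System.out.println(selectUrl("http://localhost:8983/solr", "cdap_metadata", "tags:finance", 10));
  }
}
```

The search service would issue a GET against this URL with whatever HTTP client it already uses and parse the JSON response into CDAP object references.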

We will store the index in HDFS (https://cwiki.apache.org/confluence/display/solr/Running+Solr+on+HDFS) to support fault-tolerant, distributed search over the index.

Apache Lucene

We could also use Apache Lucene directly to manage the index and search (see http://www.lucenetutorial.com/lucene-vs-solr.html).

CDAP would interact with Lucene through its Java APIs, because the Lucene engine would be embedded in one of the YARN containers run by the CDAP search service. The CDAP search service would also be responsible for re-indexing the documents whenever the metadata tables change.
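The re-indexing responsibility amounts to replacing an object's old index entries whenever its metadata changes. The sketch below shows that replace-on-change step with plain Java collections standing in for the Lucene index; the class and method names are hypothetical. With Lucene itself, the same delete-then-add behavior is what IndexWriter.updateDocument offers, though how CDAP would key its documents is an open assumption here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the embedded index; not a CDAP or Lucene class.
final class ReindexingSketch {
  private final Map<String, Set<String>> postings = new HashMap<>();      // keyword -> object ids
  private final Map<String, Set<String>> termsByObject = new HashMap<>(); // object id -> its keywords

  // Called whenever an object's metadata changes: drop stale entries, then index the new text.
  void onMetadataChanged(String objectId, String newMetadataText) {
    remove(objectId);
    Set<String> terms = new HashSet<>();
    for (String term : newMetadataText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!term.isEmpty()) {
        terms.add(term);
        postings.computeIfAbsent(term, k -> new HashSet<>()).add(objectId);
      }
    }
    termsByObject.put(objectId, terms);
  }

  // Removes every index entry that points at the object, e.g. when the object is deleted.
  void remove(String objectId) {
    Set<String> oldTerms = termsByObject.remove(objectId);
    if (oldTerms == null) {
      return; // the object was never indexed
    }
    for (String term : oldTerms) {
      Set<String> ids = postings.get(term);
      ids.remove(objectId);
      if (ids.isEmpty()) {
        postings.remove(term);
      }
    }
  }

  // True if the object is currently indexed under the keyword (case-insensitive).
  boolean matches(String keyword, String objectId) {
    Set<String> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
    return ids != null && ids.contains(objectId);
  }
}
```

Without the removal step, a search for a value that was later changed or deleted would keep returning the object, so the delete has to be part of every update.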

Lily Project

Lily also uses Solr internally for the actual indexing and search (http://docs.ngdata.com/lily-docs-current/408-lily.html).

The main advantage of Lily is that the project maintains records in HBase and provides an indexer that automatically updates the Solr indexes.

The drawback of Lily is that new components would need to be deployed and managed when running CDAP in distributed mode. This is probably too much for CDAP.