Solr Search or Cloudera Search

Introduction

The Solr Search sink plugin writes data to a single Solr node, a Solr cluster, or SolrCloud. This plugin allows users to build pipelines that include transformations and write the results to Solr. The incoming fields from the previous pipeline stage are mapped to Solr fields.

Use-case

  • User is able to select the Solr sink in batch and realtime mode
  • User is able to specify whether they are connecting to a single-node Solr instance, a Solr cluster, or SolrCloud
  • User is able to specify a URI when connecting to a single-node instance, and a ZooKeeper connection string when connecting to a Solr cluster or SolrCloud
  • User is able to specify the collection name that the data needs to be written to
  • User is able to map incoming fields to the output schema

Design

  1. We should work on data type mapping between the possible input data types and the Solr data types.
    For the following CDAP data types, Solr provides the built-in data types mentioned below:


    Note: Currently, the SolrSearch sink plugin supports only the above-mentioned data types (CDAP primitive data types) for writing data into Solr. The plugin will validate the input schema types during the pipeline configuration stage (a sketch of this check follows below).
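
    A minimal sketch of this configure-time check (as a method of the sink class), assuming the CDAP primitives BOOLEAN, INT, LONG, FLOAT, DOUBLE and STRING map to Solr's built-in boolean/int/long/float/double/string field types; the import path and the exact set of supported types are assumptions of this design, not the final implementation:

      import java.util.EnumSet;
      import co.cask.cdap.api.data.schema.Schema;

      // Assumed set of supported CDAP types, per the mapping above.
      private static final EnumSet<Schema.Type> SUPPORTED_TYPES = EnumSet.of(
        Schema.Type.NULL, Schema.Type.BOOLEAN, Schema.Type.INT, Schema.Type.LONG,
        Schema.Type.FLOAT, Schema.Type.DOUBLE, Schema.Type.STRING);

      // Called during pipeline configuration; fails fast on unsupported field types.
      private void validateInputSchema(Schema inputSchema) {
        for (Schema.Field field : inputSchema.getFields()) {
          Schema fieldSchema = field.getSchema();
          // Nullable fields are unions of [type, null]; validate the non-null part.
          Schema.Type type = fieldSchema.isNullable()
            ? fieldSchema.getNonNullable().getType() : fieldSchema.getType();
          if (!SUPPORTED_TYPES.contains(type)) {
            throw new IllegalArgumentException(String.format(
              "Field '%s' is of unsupported type '%s'.", field.getName(), type));
          }
        }
      }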

  2. User will get a drop-down to select the Solr mode to connect to. The Solr modes will be:
      SingleNode
      SolrCloud
    Depending upon the Solr mode, the user will specify the URL (or ZooKeeper connection string) for establishing the connection, through the 'csv widget'; a sketch of the resulting client construction follows below.
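
    A minimal SolrJ sketch of the client construction per mode, assuming SolrJ 6.x-style builders (exact builder methods differ across SolrJ versions):

      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.client.solrj.impl.CloudSolrClient;
      import org.apache.solr.client.solrj.impl.HttpSolrClient;

      // solrHost is "host:port" for SingleNode, or a ZooKeeper connection string for SolrCloud.
      private SolrClient createSolrClient(String solrMode, String solrHost, String collectionName) {
        if ("SingleNode".equals(solrMode)) {
          // Point the HTTP client directly at the target collection.
          return new HttpSolrClient.Builder("http://" + solrHost + "/solr/" + collectionName).build();
        }
        CloudSolrClient cloudClient = new CloudSolrClient.Builder().withZkHost(solrHost).build();
        cloudClient.setDefaultCollection(collectionName);
        return cloudClient;
      }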

  3. Requirement: User is able to map incoming fields to the output schema.
    Approach: A 'keyvalue' widget will be used to map incoming fields to the output field names. The key specifies the name of the input field to map, and its corresponding value specifies the new name under which that field's data will be indexed in Solr (see the parsing sketch below).
    Note: All the final output fields that the user wants to index should be properly declared and defined in Solr's schema.xml.
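
    A minimal sketch of parsing the key-value mapping, assuming the widget serializes the pairs as 'inputField:outputField' separated by commas (as in the outputFieldMappings examples further below):

      import java.util.HashMap;
      import java.util.Map;

      // Parses e.g. "office address:address" into {office address=address}.
      private Map<String, String> parseFieldMappings(String outputFieldMappings) {
        Map<String, String> mappings = new HashMap<>();
        if (outputFieldMappings == null || outputFieldMappings.isEmpty()) {
          return mappings;
        }
        for (String pair : outputFieldMappings.split(",")) {
          String[] parts = pair.split(":", 2);
          if (parts.length == 2) {
            mappings.put(parts[0].trim(), parts[1].trim());
          }
        }
        return mappings;
      }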

  4. User will provide one 'Id Field' as input that determines the unique key for the document being indexed in Solr. This Id Field should match a field name in the structured record of the input (see the conversion sketch below).
    Note:
    The Id field should be declared and defined in Solr's schema.xml, and it should also be defined as the <uniqueKey> there.
    If the Id field value is null in an input record, then that particular record will be filtered out and not written to the Solr index.
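
    A minimal sketch of converting an input StructuredRecord into a SolrInputDocument, using the field mappings from the previous step and skipping records whose Id field is null; the CDAP import paths are assumptions:

      import java.util.Map;
      import co.cask.cdap.api.data.format.StructuredRecord;
      import co.cask.cdap.api.data.schema.Schema;
      import org.apache.solr.common.SolrInputDocument;

      // Returns null when the Id field is missing or null, so the caller can filter the record out.
      private SolrInputDocument toSolrDocument(StructuredRecord record, String keyField,
                                               Map<String, String> fieldMappings) {
        if (record.get(keyField) == null) {
          return null;
        }
        SolrInputDocument document = new SolrInputDocument();
        for (Schema.Field field : record.getSchema().getFields()) {
          String inputName = field.getName();
          // Rename the field if a mapping is configured, otherwise keep the input name.
          String outputName = fieldMappings.getOrDefault(inputName, inputName);
          document.addField(outputName, record.get(inputName));
        }
        return document;
      }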

  5. In batch mode, the SolrSearch sink lets the user configure the 'Batch Size'. This parameter indicates the number of documents to collect into a batch before sending them to Solr for indexing; a commit is triggered after each batch. The default batch size is 10000.
    Note:
    We tested the pipeline performance with batch sizes of 1000 and 10000, and observed that performance is better with a batch size of 10000.
    Therefore, the default batch size is kept at 10000 for bulk data, as this seems feasible. However, users always have control to change the batch size as per their requirements.

    Also, indexing and a commit will be done at the end of each mapper, to ensure that all the data is properly indexed in Solr, even if the last batch does not reach the batch size (see the sketch below).
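
    A minimal sketch of the batching and commit logic, using the SolrJ client built earlier; error handling and retries are omitted:

      import java.util.ArrayList;
      import java.util.List;
      import org.apache.solr.client.solrj.SolrClient;
      import org.apache.solr.common.SolrInputDocument;

      private SolrClient solrClient;   // built as in the mode-selection sketch above
      private final List<SolrInputDocument> buffer = new ArrayList<>();

      // Called once per incoming document; flushes when the configured batch size is reached.
      private void write(SolrInputDocument document, int batchSize) throws Exception {
        buffer.add(document);
        if (buffer.size() >= batchSize) {
          flush();
        }
      }

      // Also called at the end of each mapper, so partial batches are indexed and committed.
      private void flush() throws Exception {
        if (!buffer.isEmpty()) {
          solrClient.add(buffer);
          solrClient.commit();
          buffer.clear();
        }
      }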
     

Assumptions:

The user will already have all the fields (names and types) declared and defined in Solr's schema.xml.
The plugin will not validate the Solr schema during the pipeline configuration stage. Any mismatch in field names or types will result in a runtime exception.

More details will be added based on the findings.

Example

For Batch mode SolrSearch sink:

Properties:

  • referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.
  • solrMode: Solr mode to connect to. For example, Single Node Solr or SolrCloud.
  • solrHost: The hostname and port for the Solr server. For example, localhost:8983 if Single Node Solr or zkHost1:2181,zkHost2:2181,zkHost3:2181 for SolrCloud.
  • collectionName: Name of the collection where data will be indexed and stored in Solr.
  • keyField: Field that will determine the unique id for the document to be indexed. It should match a field name in the structured record of the input.
  • batchSize: Number of documents to create a batch and send it to Solr for indexing. After each batch, commit will be triggered. Default batch size is 10000.
  • outputFieldMappings: List of the input fields to map to the output Solr fields. The key specifies the name of the field to rename, with its corresponding value specifying the new name for that field.

Example:

This example connects to a 'Single Node Solr' server running locally (at the default port 8983) and writes the data to the specified collection (test_collection). The data is indexed using the key field coming in the input record. Also, the field name 'office address' is mapped to the 'address' field in Solr's index.

{
  "name": "SolrSearch",
  "type": "batchsink",
  "properties": {
    "solrMode": "SingleNode",
    "solrHost": "localhost:8983",
    "collectionName": "test_collection",
    "keyField": "id",
    "batchSize": "10000",
    "outputFieldMappings": "office address:address"
  }
}

For example, suppose the SolrSearch sink receives the input record:

+=============+====================+===================+=========================+===============+
| id : STRING | firstname : STRING | lastname : STRING | Office Address : STRING | pincode : INT |
+=============+====================+===================+=========================+===============+
| 100A        | John               | Wagh              | NE Lakeside             | 480001        |
| 100B        | Brett              | Lee               | SE Lakeside             | 480001        |
+=============+====================+===================+=========================+===============+

Once the SolrSearch sink plugin execution is completed, all the rows from the input data will be indexed in test_collection with the fields id, firstname, lastname, address, and pincode.

 

For Realtime mode SolrSearch sink:

Properties:

  • referenceName: This will be used to uniquely identify this sink for lineage, annotating metadata, etc.
  • solrMode: Solr mode to connect to. For example, Single Node Solr or SolrCloud.
  • solrHost: The hostname and port for the Solr server. For example, localhost:8983 if Single Node Solr or zkHost1:2181,zkHost2:2181,zkHost3:2181 for SolrCloud.
  • collectionName: Name of the collection where data will be indexed and stored in Solr.
  • keyField: Field that will determine the unique id for the document to be indexed. It should match a field name in the structured record of the input.
  • outputFieldMappings: List of the input fields to map to the output Solr fields. The key specifies the name of the field to rename, with its corresponding value specifying the new name for that field.

Example:

This example connects to a 'Single Node Solr' server running locally (at the default port 8983) and writes the data to the specified collection (test_collection). The data is indexed using the key field coming in the input record. Also, the field name 'office address' is mapped to the 'address' field in Solr's index.

{
  "name": "SolrSearch",
  "type": "realtimesink",
  "properties": {
    "solrMode": "SingleNode",
    "solrHost": "localhost:8983",
    "collectionName": "test_collection",
    "keyField": "id",
    "outputFieldMappings": "office address:address"
  }
}

For example, suppose the SolrSearch sink receives the input record:

+=============+====================+===================+=========================+===============+
| id : STRING | firstname : STRING | lastname : STRING | Office Address : STRING | pincode : INT |
+=============+====================+===================+=========================+===============+
| 100A        | John               | Wagh              | NE Lakeside             | 480001        |
| 100B        | Brett              | Lee               | SE Lakeside             | 480001        |
+=============+====================+===================+=========================+===============+

Once the SolrSearch sink plugin execution is completed, all the rows from the input data will be indexed in test_collection with the fields id, firstname, lastname, address, and pincode.

Questions/Clarifications

  1. How would we test this plugin? Shall we get a cluster with SolrCloud set up on it?
    For a standalone instance, we will store the plugin output in memory and test it using the Solr API.
    To replicate the actual environment, a SolrCloud/cluster will be set up and we will test against it using the Solr APIs.
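
    For illustration, a hedged SolrJ snippet of the kind of verification we would run against the test collection (the query is a placeholder and the client is built as in the design section):

      import org.apache.solr.client.solrj.SolrQuery;
      import org.apache.solr.client.solrj.response.QueryResponse;

      // Query everything in the collection and check how many documents were indexed.
      SolrQuery query = new SolrQuery("*:*");
      QueryResponse response = solrClient.query(query);
      long indexedDocuments = response.getResults().getNumFound();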

  2. For the CDAP schema data types other than those mentioned in the design section above, such as BYTES, ENUM, ARRAY, MAP, RECORD, and UNION, we did not find Solr built-in types.
    a. Do we need to handle these data types, or are we limiting the user to primitive types only?


 


Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature