Overview

Elasticsearch is designed as an alternative to relational databases that allows for scalability and flexible querying. Elasticsearch stores objects in documents (similar to rows in traditional databases), which contain fields (analogous to columns). Unlike relational databases, however, these fields can be searched by full-text search, where results are returned when the search terms appear anywhere in the body of the field rather than only when they match the field exactly. Data can be processed and accessed quickly through a combination of filtering on exact-value fields (structured data) and searching full-text fields, commonly referred to as unstructured data.

Motivation

By incorporating Elasticsearch as a sink for ETLBatch adapters, users will have more flexibility in analyzing their data; moreover, through products like Kibana, they can then visualize and interpret their Elasticsearch data. Finally, implementing an Elasticsearch sink will give CDAP users greater flexibility in choosing the system through which they store and manage their data.

...

Specific use cases may include analysis of access logs or any information processing that includes full-text analysis (bodies of tweets, for example).

Requirements

Elasticsearch provides the elasticsearch-hadoop library, which includes a Hadoop OutputFormat for writing to Elasticsearch from Hadoop. This method requires the Elasticsearch server and port, as well as the index (analogous to a database in SQL) and the type (equivalent to a table, discussed in more detail below).
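A sketch of the job properties such a sink might set, expressed here as a plain dictionary: the key names (`es.nodes`, `es.resource`, `es.mapping.id`) are standard elasticsearch-hadoop settings, while the values are illustrative only.

```python
# Illustrative elasticsearch-hadoop job properties (values are examples only).
es_hadoop_conf = {
    "es.nodes": "localhost:9200",   # Elasticsearch server and port
    "es.resource": "index/type",    # "<index>/<type>": index ~ database, type ~ table
    "es.mapping.id": "ts",          # source field whose value becomes the document ID
}

# The resource string combines the index and the type.
index, doc_type = es_hadoop_conf["es.resource"].split("/")
```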

...

In summary, to create an Elasticsearch sink, the user would need to specify the name of an index (which may or may not already exist), the type name, the hostname and port of the Elasticsearch server, and the field from which to derive the document ID.

Specifications

  • The data will be written in batch, upon receipt from the source.
  • Data should be processed without any data loss, and exactly one document should exist for each entry or event in the source.
  • The user should be able to query Elasticsearch while the adapter is running.
  • The document ID (equivalent to the row key) can be derived from a user-specified field.
  • CDAP will create a connection to Elasticsearch, write the data, then close the connection without creating unnecessary nodes.
  • CDAP will only write data to Elasticsearch, not query or read data stored in Elasticsearch.
  • CDAP can create the type from the user-supplied name or use a pre-existing type.
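The ID-derivation behavior above can be sketched as follows (a hypothetical helper for illustration, not the actual CDAP implementation):

```python
def derive_doc_id(record, id_field):
    """Derive the Elasticsearch document ID from a user-specified field.

    Raises if the field is absent: exactly one document must exist per
    source record, and a missing ID would risk data loss or duplicates.
    """
    if id_field not in record:
        raise KeyError("record is missing the configured id field: %s" % id_field)
    return str(record[id_field])

# With the sample configuration below, the "ts" field becomes the document ID:
derive_doc_id({"ts": 1436476800, "body": "GET /index.html"}, "ts")
```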

Configuration Json

Sample configuration:

 

{
    "template": "ETLBatch",
    "description": "Elasticsearch Configuration",
    "config": {
        "schedule": "*/1 * * * *",
        "source":{
            "name":"Stream",
            "properties":{
                "name":"myStream",
                "duration":"1d"
            }
        },
        "sink": {
            "name": "Elasticsearch",
            "properties": {
                "es.host": "localhost:9200",
                "es.index": "index",
                "es.type": "type",
                "es.idField": "ts"
            }
        },
        "transforms": []
    }
}