HTTPToHDFS Action

Introduction

Many datasets are published at web URLs. Examples include government data such as http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data and the datasets at https://www.data.gov/. Today, setting up a repeatable process to pull this information into your cluster requires first downloading the data and writing it to a file before it can be used in a batch workflow. Ideally, a single pipeline could be configured to pull the data in, store it in a temporary file on HDFS, and then kick off a Spark or MapReduce workflow to process and load it into the cluster. This would also make demos and example pipelines from the marketplace much easier to use, since in most cases no local configuration would be needed.

Use case(s)

  • I would like to store consumer complaints data from http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data in my cluster. The data is updated nightly and is about 260 MB, so I would like to build a pipeline that runs every 24 hours and refreshes the data from the site. Using this plugin, I configure it to pull data from the URL in CSV format and store it in HDFS. Then I configure a File source, a CSV parser, and a Table sink to process the data in Hydrator.
  • The inventory team publishes a product inventory XML feed that contains all current products and the quantity of each item. I would like to configure a pipeline to query that service every hour and write the data into a table. Using this plugin, I configure it to request information from the URL, provide my credentials and expected format as request headers, and write the results to the /tmp/ directory in HDFS so that the rest of the pipeline can process the data.
  • I have partnered with a third-party company that is sending me a large amount of data in a gzip file. The file is stored on a web server and I am given a URL to download it. Using this plugin, I configure it to pull the binary file, store the .gz file on HDFS, and use the File source to natively read that data for processing.

User Stories

  • As a pipeline developer, I would like to fetch data from an external web service or URL by providing the request method (GET, POST), URL, payload (if POST), request headers, charset (if text, otherwise binary), timeouts, and a file path in HDFS.
  • As a pipeline developer, I would like to have the option to flatten multi-line JSON and XML responses into a single line by removing newlines and additional spaces.
  • As a pipeline developer, I would like to download files in excess of 1 GB without failing.
  • As a pipeline developer, I would like the plugin to retry a configurable number of times before failing the pipeline.
  • As a pipeline developer, I would like to be able to download text or binary data and store it in HDFS for further processing.
  • As a pipeline developer, I would like to be able to send basic auth credentials by providing a username and password in the config.
  • As a pipeline developer, I would like to be able to read from HTTP and HTTPS endpoints.
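The flattening story above could be approached roughly as follows. This is a minimal sketch, not the plugin's actual code: the class name ResponseFlattener and the method flatten are hypothetical, and it assumes that "removing newlines and additional spaces" means collapsing every run of whitespace into a single space.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the plugin's real API.
final class ResponseFlattener {
    // Matches one or more whitespace characters, including \r and \n.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    // Collapses each whitespace run to one space and trims both ends.
    static String flatten(String body) {
        return WHITESPACE_RUN.matcher(body).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String json = "{\n  \"id\": 1,\n  \"name\": \"widget\"\n}";
        System.out.println(flatten(json)); // prints { "id": 1, "name": "widget" }
    }
}
```

Note that this simple approach also collapses whitespace inside string values and XML text nodes. If that content must be preserved exactly, a format-aware parser would be needed instead.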

Plugin Type

  • Action

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name | Type | Description | Constraints | Macro Enabled? |
|---|---|---|---|---|
| HDFS File Path | String | The location to write the data in HDFS. | | yes |
| URL | String | Required. The URL to fetch data from. | | yes |
| Request Method | Select | The HTTP request method. | GET, POST | |
| Request Body | String | Optional request body. | | yes |
| Request Headers | KeyValue | An optional string of header values to send in each request, where the keys and values are delimited by a colon (":") and each pair is delimited by a newline ("\n"). | | yes |
| Text or Binary? | Select | Should the data be written as text (JSON, XML, txt files) or binary (zip, gzip, images)? | Text, Binary | |
| Charset | Select | If text data is selected, this should be the charset of the text being returned. Defaults to UTF-8. | "ISO-8859-1", "US-ASCII", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE" | |
| Should Follow Redirects? | Select | Whether to automatically follow redirects. Defaults to true. | true, false | |
| Number of Retries | Select | The number of times the request should be retried if the request fails. Defaults to 3. | 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | |
| Connect Timeout | String | The time in milliseconds to wait for a connection. Set to 0 for infinite. Defaults to 60000 (1 minute). | | |
| Read Timeout | String | The time in milliseconds to wait for a read. Set to 0 for infinite. Defaults to 60000 (1 minute). | | |

Design / Implementation Tips

  • Please use HTTPPoller and HTTPCallback in Hydrator plugins as a reference.
  • The workflow token should contain the file path of the data that was written, so that the File source can read from it.

Design

{
    "name": "HTTPToHDFSAction",
    "plugin": {
        "name": "HTTPToHDFSAction",
        "type": "action",
        "label": "HTTPToHDFSAction",
        "artifact": {
            "name": "HTTPToHDFSActionPlugin",
            "version": "1.6.0",
            "scope": "SYSTEM"
        },
        "properties": {
            "hdfsFilePath": "file://tmp/data.csv",
            "url": "http://example.com/data",
            "method": "GET",
            "outputFormat": "Text",
            "charset": "UTF-8",
            "requestHeaders": "acb:test",
            "followRedirects": "true",
            "disableSSLValidation": "true",
            "numRetries": 1,
            "connectTimeout": 60000,
            "readTimeout": 60000
        }
    }
}

 

Approach(s)

Implementation Approach:

1. The java.net.HttpURLConnection APIs would be used to execute the HTTP request.

2. The GET and POST methods would be supported. GET would be the default method.

3. Text and Binary output formats would be supported. Text would be the default format.

4. Retries would be performed according to the configured "numRetries" value.
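The approach above can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: the class RetryingFetch and its methods are hypothetical names, and it assumes that "numRetries" counts additional attempts after the first one, with only IOException treated as retryable.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.concurrent.Callable;

// Hypothetical sketch; names and retry semantics are assumptions.
final class RetryingFetch {

    // Builds a configured connection. openConnection() does not touch the
    // network; the request is only sent once the response is read.
    static HttpURLConnection configure(String url, String method, int connectTimeoutMs,
                                       int readTimeoutMs, boolean followRedirects) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod(method);             // "GET" or "POST"
        conn.setConnectTimeout(connectTimeoutMs);  // 0 means wait indefinitely
        conn.setReadTimeout(readTimeoutMs);        // 0 means wait indefinitely
        conn.setInstanceFollowRedirects(followRedirects);
        return conn;
    }

    // Runs the task once, then up to numRetries more times on IOException.
    // Any other exception propagates immediately without a retry.
    static <T> T callWithRetries(Callable<T> task, int numRetries) throws Exception {
        if (numRetries < 0) {
            throw new IllegalArgumentException("numRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= numRetries; attempt++) {
            try {
                return task.call();
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed
    }

    public static void main(String[] args) throws Exception {
        // Simulated request that fails twice before succeeding, so the
        // example runs without network access.
        int[] calls = {0};
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated failure " + calls[0]);
            }
            return "ok";
        }, 3);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real action, the task passed to callWithRetries would open the connection with configure, read the response, and write it to HDFS. A delay between attempts may also be worth adding, although the configuration above does not define one.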

Properties

1. hdfsFilePath: The location to write the data in HDFS. If the file already exists, it will be overwritten.
2. url: The URL to fetch data from.
3. method: The HTTP request method.
4. body: Optional request body.
5. outputFormat: Whether the output data should be written as Text (JSON, XML, txt files) or Binary (zip, gzip, images). Defaults to Text.
6. charset: If Text data is selected, this should be the charset of the text being returned. Defaults to UTF-8.
7. requestHeaders: An optional string of header values to send in each request, where the keys and values are delimited by a colon (":") and each pair is delimited by a newline ("\n").
8. followRedirects: Whether to automatically follow redirects. Defaults to true.
9. numRetries: The number of times the request should be retried if the request fails. Defaults to 3.
10. connectTimeout: The time in milliseconds to wait for a connection. Set to 0 for infinite. Defaults to 60000 (1 minute).
11. readTimeout: The time in milliseconds to wait for a read. Set to 0 for infinite. Defaults to 60000 (1 minute).
12. disableSSLValidation: If false (SSL validation is enabled), the certificate must be added to the truststore of each machine. Defaults to true.
13. outputPath: The key under which the file path of the written data is stored, so that the File source can read from it. Plugins that run at later stages in the pipeline can retrieve the file path through macro substitution: ${filePath}, where "filePath" is the key specified. Defaults to "filePath".
14. responseHeaders: The key under which the response headers are stored, so that they are available to plugins further down the pipeline. Plugins that run at later stages can retrieve the response headers through macro substitution: ${responseHeaders}, where "responseHeaders" is the key specified. Defaults to "responseHeaders".
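The requestHeaders format described above (colon between key and value, newline between pairs) could be parsed along these lines. This is a minimal sketch: the class RequestHeaderParser is a hypothetical name, and splitting on the first colon only, trimming whitespace, and rejecting lines without a key are assumptions that the property description does not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the plugin's actual parser.
final class RequestHeaderParser {

    // Parses "key:value" pairs separated by newlines, keeping input order.
    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) {
            return headers; // the property is optional
        }
        for (String line : raw.split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // ignore blank lines
            }
            // Split on the first colon only, so values such as URLs survive.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed header line: " + line);
            }
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers = parse("acb:test\nAccept:application/xml");
        headers.forEach((name, value) -> System.out.println(name + " -> " + value));
    }
}
```

Each parsed entry could then be applied to the connection with HttpURLConnection.setRequestProperty(name, value) before the request is sent.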

NFR

1. This plugin should be able to successfully call a service that returns a file larger than 1 GB.

2. Only performance measurement is in scope for the NFRs.

3. If the user enables SSL validation, they will be expected to add the certificate to the truststore of each machine.
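One common way to meet the large-file requirement is to stream the response through a fixed-size buffer instead of holding the whole body in memory. The sketch below illustrates that technique only; the class StreamCopy and the 8 KB buffer size are assumptions, and in-memory streams stand in for the HTTP response and the HDFS file so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper for illustration; not the plugin's actual code.
final class StreamCopy {
    private static final int BUFFER_SIZE = 8192; // assumed buffer size

    // Copies all bytes from in to out and returns the number copied.
    // Memory use stays at BUFFER_SIZE regardless of the payload size.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[20_000]; // stand-in for a response body
        try (InputStream in = new ByteArrayInputStream(payload);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            System.out.println("copied " + copy(in, out) + " bytes"); // copied 20000 bytes
        }
    }
}
```

In the plugin, the input would presumably come from HttpURLConnection.getInputStream() and the output from a stream opened on the configured HDFS path. Copying raw bytes this way also leaves binary payloads such as gzip files unchanged.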

Limitation(s)

1. Only the GET and POST HTTP methods will be supported.

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Download a CSV file from an HTTPS URL with SSL validation disabled.
  • Download a CSV file from an HTTPS URL with SSL validation enabled.
  • Download a CSV file from an HTTP URL.
  • Download a binary tar file (more than 500 MB) from an HTTPS URL.
  • Send a POST request with a body/payload.

Sample Pipeline

HTTPTOHDFS_BIGTEXTFILE_HTTPS_SSL_DISABLED_copy-cdap-data-pipeline.json

HTTPTOHDFS_BIGTEXTFILE_HTTPS_SSL_ENABLED_copy-cdap-data-pipeline.json

HTTPTOHDFS_TEXTFILE_HTTP_copy-cdap-data-pipeline.json

HTTPTOHDFS_TAR-cdap-data-pipeline.json

HTTPTOHDFS_POST_copy-cdap-data-pipeline.json

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature