Amazon Kinesis source and sink
Overview
Amazon Kinesis Streams (similar to Kafka) provides infrastructure to collect and process large streams of data in real time. It is built on the concept of shards; each stream can have up to 25 shards by default (the exact limit is region-specific). Streams are capable of rapid and continuous data ingestion and aggregation, and the ingested data is available for processing in real time.
Motivation
By adding the capability to send data to and receive data from Kinesis streams, users will be able to integrate CDAP with applications already running on AWS infrastructure. We will create a Hydrator source and sink for Kinesis, allowing users to integrate Kinesis streams into Hydrator pipelines. Users will be able to consume existing Kinesis stream events as a source and send data processed in a pipeline back to a Kinesis stream, providing more options for data processing. Users with a heavy dependency on AWS infrastructure will have a single point of data exchange between CDAP and AWS and can create access points within Kinesis for their various AWS applications.
Integrating Kinesis will also help Hydrator store and process formats it does not currently support, such as processing images as blob objects.
Requirements
Kinesis provides a stream producer library (KPL) and a stream consumer library (KCL). We will explore both options, with and without the native library, in order to keep the plugins lightweight. Users will be able to supply their AWS application credentials and add Kinesis streams to Hydrator.
Data is ingested into Kinesis in the form of blobs. The sink plugin should accept various formats, convert them into blobs, and push them to Kinesis. Similarly, data received from Kinesis arrives as key-value pairs with the value as a blob; the source plugin should convert it back to string format.
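As a rough illustration of the sink-side conversion, the sketch below uses the plain AWS SDK for Java (not the KPL) to wrap a string value into a ByteBuffer blob and write it to a stream with PutRecord. The stream name, partition key, and credentials are placeholders standing in for values that would come from the plugin configuration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class KinesisSinkSketch {
  public static void main(String[] args) {
    // Credentials would come from the plugin configuration (Access Key ID / Secret Access Key).
    BasicAWSCredentials credentials = new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY");
    AmazonKinesis client = AmazonKinesisClientBuilder.standard()
        .withCredentials(new AWSStaticCredentialsProvider(credentials))
        .withRegion("us-east-1")
        .build();

    // Kinesis stores record data as an opaque blob, so the sink converts the
    // pipeline record (here a plain string) into a ByteBuffer before writing.
    String value = "record payload from the pipeline";
    PutRecordRequest request = new PutRecordRequest();
    request.setStreamName("myStream");   // stream name from plugin config (placeholder)
    request.setPartitionKey("key");      // data key; determines which shard receives the record
    request.setData(ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8)));

    PutRecordResult result = client.putRecord(request);
    System.out.println("Wrote record to shard " + result.getShardId());
  }
}
```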
Inputs
Input | Required | Default |
---|---|---|
Access Key ID | Yes | |
Secret Access Key | Yes | |
Stream name | Yes | |
Stream start point | No | SHARD.LATEST |
Key (data key) | Yes | key |
Value (data to be stored) | Yes | |
Shard name | No | Next available shard |
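A minimal sketch of how these inputs could map onto a CDAP PluginConfig class is shown below; the field names and annotations are illustrative only and do not represent a final plugin API.

```java
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

// Illustrative configuration class holding the inputs listed above.
public class KinesisSinkConfig extends PluginConfig {

  @Name("accessKeyId")
  @Description("AWS Access Key ID")
  private String accessKeyId;

  @Name("secretAccessKey")
  @Description("AWS Secret Access Key")
  private String secretAccessKey;

  @Name("streamName")
  @Description("Name of the Kinesis stream")
  private String streamName;

  @Name("key")
  @Description("Data key used as the partition key; defaults to 'key'")
  @Nullable
  private String key;

  @Name("shardName")
  @Description("Shard to write to; defaults to the next available shard")
  @Nullable
  private String shardName;
}
```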
Limitations:
By default a shard can ingest data at a rate of only 1 MB/sec, and a stream instance can have at most 25 shards.
One plugin instance will be needed per shard.
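To illustrate the one-instance-per-shard model on the source side, the sketch below (again using the plain AWS SDK for Java rather than the KCL) obtains a shard iterator for a single shard starting at LATEST, fetches a batch of records, and decodes each blob back into a string. The stream name and shard ID are placeholder values.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;

public class KinesisShardReaderSketch {
  public static void main(String[] args) {
    // Uses the default credential chain; the plugin would build the client from its config.
    AmazonKinesis client = AmazonKinesisClientBuilder.defaultClient();

    // One reader (plugin instance) is tied to a single shard.
    GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest()
        .withStreamName("myStream")                // placeholder stream name
        .withShardId("shardId-000000000000")       // placeholder shard ID
        .withShardIteratorType("LATEST");          // stream start point from plugin config
    String shardIterator = client.getShardIterator(iteratorRequest).getShardIterator();

    // Fetch a batch of records and convert each blob payload back to a string.
    GetRecordsRequest recordsRequest = new GetRecordsRequest()
        .withShardIterator(shardIterator)
        .withLimit(100);
    GetRecordsResult result = client.getRecords(recordsRequest);
    List<Record> records = result.getRecords();
    for (Record record : records) {
      String value = StandardCharsets.UTF_8.decode(record.getData()).toString();
      System.out.println(record.getPartitionKey() + " -> " + value);
    }
  }
}
```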