S3ToRedshift Action

Introduction

One of the most efficient ways to load data into Amazon Redshift is to push it to Amazon S3 first and then issue a COPY command to load that data into Redshift. AWS provides a tutorial on this approach: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
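
At its core, the load is a single COPY statement issued against the cluster over JDBC. The sketch below is illustrative only (the bucket, table, and credential values are placeholders, and it assumes the Amazon Redshift JDBC driver is on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CopyFromS3Example {
      public static void main(String[] args) throws Exception {
        // Connect to the Redshift cluster over JDBC (placeholder URL and credentials).
        String url = "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "master-user", "master-password");
             Statement stmt = conn.createStatement()) {
          // Load every object under the 'test' key prefix into the 'redshifttest' table.
          stmt.execute("COPY redshifttest "
              + "FROM 's3://<bucket-name>/test' "
              + "CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>' "
              + "FORMAT AS AVRO 'auto'");
        }
      }
    }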

Use case(s)

  • A financial customer would like to quickly load financial reports into Redshift that have been generated by a Hydrator pipeline and synced to S3. The pipeline would have an S3 sink followed by an S3ToRedshift action, which would load the data into the Redshift tables.

User Story(s)

  • As a user, I would like to load data written to S3 into Redshift using the COPY command.
  • I would like to authenticate with IAM credentials as well as with access and secret key pairs.
  • I would like to load multiple files into the same Redshift table.
  • I would like to load data created in S3 by CDAP into Redshift.

Plugin Type

  • Action

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name        | Type   | Description                                                                          | Constraints |
|-------------------------|--------|--------------------------------------------------------------------------------------|-------------|
| Access Key              | String | AWS access key for S3                                                                 | Optional    |
| Secret Access Key       | String | AWS secret access key for S3                                                          | Optional    |
| AWS IAM Role            | String | IAM role                                                                              | Optional    |
| S3 Data Location        | String | Location of the data files in the S3 bucket (including key prefix or manifest file)   | Required    |
| S3 Region               | String | Region for S3                                                                         | Optional    |
| Redshift Cluster DB Url | String | JDBC Redshift DB URL for connecting to the Redshift cluster                           | Required    |
| Master User             | String | Master user for Redshift                                                              | Required    |
| Master User Password    | String | Master user password                                                                  | Required    |
| Redshift Table Name     | String | Redshift table name into which data will be loaded                                    | Required    |
| List of Columns         | String | List of Redshift table columns                                                        | Optional    |

Design / Implementation Tips

Design

    {
        "name": "S3ToRedshift",
        "type": "action",
        "properties": {
            "accessKey": "access-key",
            "secretAccessKey": "secret-access-key",
            "s3DataPath": "s3://<bucket-name>/test",
            "clusterDbUrl": "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev",
            "masterUser": "master-user",
            "masterPassword": "master-password",
            "tableName": "redshifttest"
        }
    }

Approach(s)

  1. The user can connect to the S3 buckets either using an 'access and secret access key' pair or using an 'IAM role' (see the COPY assembly sketch after this list).
  2. By default, the COPY command assumes that Redshift and S3 are in the same region. If the S3 bucket is not in the same region as the Redshift cluster, the user can provide the region using 'S3 Region'.
  3. The existing S3 sink plugins support only those regions that have signature version 2 and version 4 support. Hence, the S3ToRedshift plugin will support only these regions: "us-east-1", "us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "eu-west-1" and "sa-east-1".
  4. Currently, the existing S3 sink plugins use the server-side encryption technique SSE-S3 (without customer-provided keys) for encrypting and writing the data into S3 buckets. Hence, the S3ToRedshift action plugin will load data encrypted with SSE-S3 from the S3 buckets into Redshift. No additional parameter is required in the COPY command.
  5. The existing S3 sink plugins write data into S3 buckets in only two formats: 'Avro' and 'Parquet'. The COPY command currently does not support the Parquet format (http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html). For now, the S3ToRedshift action plugin will support only the Avro format.
  6. The COPY command uses 'FORMAT AS AVRO' with the 'auto' option while loading the Avro-formatted data from S3 into Redshift.
    1. COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table. The matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase, so matching field names must also be lowercase.
  7. The user can load the data from a specific directory in the S3 bucket, for example 's3://<bucket-name>/test/2017-02-24/', or load the files in the bucket whose names start with, say, '2017-02-24', through 's3://<bucket-name>/test/2017-02-24'. In the latter case, '2017-02-24' is treated as a key prefix.
  8. The schema of the Redshift table must match the schema of the S3 data for a successful load.
    1. If a target column is not present in the S3 data, COPY attempts to load NULL.
    2. If COPY attempts to assign NULL to a column that is defined as NOT NULL in the Redshift schema, the COPY command will fail.
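
To make the credential, region, and format handling above concrete, here is a minimal sketch of how the plugin might assemble the COPY statement from its configured properties. This is illustrative only; the method name and argument handling are assumptions, not the actual implementation.

    // Illustrative sketch only: assembles a Redshift COPY statement from the plugin's
    // configured properties. Names and structure are assumptions for this document.
    static String buildCopyStatement(String tableName, String listOfColumns, String s3DataPath,
                                     String accessKey, String secretAccessKey, String iamRole,
                                     String s3Region) {
      StringBuilder copy = new StringBuilder("COPY ").append(tableName);
      if (listOfColumns != null && !listOfColumns.isEmpty()) {
        // Load only the listed columns from the S3 data.
        copy.append(" (").append(listOfColumns).append(")");
      }
      copy.append(" FROM '").append(s3DataPath).append("'");
      if (iamRole != null && !iamRole.isEmpty()) {
        // IAM-role-based authorization.
        copy.append(" IAM_ROLE '").append(iamRole).append("'");
      } else {
        // Key-based authorization.
        copy.append(" CREDENTIALS 'aws_access_key_id=").append(accessKey)
            .append(";aws_secret_access_key=").append(secretAccessKey).append("'");
      }
      if (s3Region != null && !s3Region.isEmpty()) {
        // Only needed when the S3 bucket and the Redshift cluster are in different regions.
        copy.append(" REGION '").append(s3Region).append("'");
      }
      // Avro field names are matched to (lowercase) Redshift column names.
      copy.append(" FORMAT AS AVRO 'auto'");
      return copy.toString();
    }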

Properties

**accessKey:** Access key for AWS S3 to connect to. (Macro-enabled)

**secretAccessKey:** Secret access key for AWS S3 to connect to. (Macro-enabled)

**iamRole:** IAM Role for AWS S3 to connect to. (Macro-enabled)

**s3Region:** The region for AWS S3 to connect to. If not specified, the plugin assumes that the S3 bucket is in
the same region as the Redshift cluster. (Macro-enabled)

**s3DataPath:** The S3 path of the bucket where the data is stored and from which it will be loaded into the Redshift
table. For example, 's3://<bucket-name>/test/2017-02-22/' (loads the files present in the ``test/2017-02-22``
directory) or 's3://<bucket-name>/test/2017-02-22' (loads the files from the ``test`` directory having the prefix
``2017-02-22``), for cases where the user wants to load multiple files. (Macro-enabled)

**clusterDbUrl:** The JDBC Redshift database URL for the Redshift cluster where the table is present. For example,
'jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev'. (Macro-enabled)

**masterUser:** Master user for the Redshift cluster to connect to. (Macro-enabled)

**masterPassword:** Master password for the Redshift cluster to connect to. (Macro-enabled)

**tableName:** The Redshift table name into which the data from the S3 bucket will be loaded. (Macro-enabled)

**listOfColumns:** Comma-separated list of the Redshift table column names, used to load only specific columns from
the S3 data. If not provided, all the columns from S3 will be loaded into the Redshift table. (Macro-enabled)
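
For example, a configuration that supplies the optional 'iamRole', 's3Region', and 'listOfColumns' properties would result in a COPY statement along these lines (all values below are placeholders):

    // Hypothetical COPY statement for a configuration that sets the optional properties;
    // the role ARN, region, bucket, and column names are placeholders.
    String copy = "COPY redshifttest (id, name, amount) "
        + "FROM 's3://<bucket-name>/test/2017-02-22/' "
        + "IAM_ROLE '<iam-role-arn>' "
        + "REGION 'us-east-1' "
        + "FORMAT AS AVRO 'auto'";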

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

S3ToRedshift action - Using keys for connection

S3ToRedshift action - Using IAM Role for connection

S3ToRedshift action - With Snappy Compression

S3ToRedshift action - With Deflate Compression

S3ToRedshift action - With a list of selected Redshift target columns to be loaded

Sample Pipeline

S3ToRedshift_Action_With_Keys-cdap-data-pipeline.json

S3ToRedshift_Action_With_SnappyCompression-cdap-data-pipeline.json

S3ToRedshift_Action_With_DeflateCompression_copy-cdap-data-pipeline.json

S3ToRedshift_Action_With_ListOfRedshiftColumns_copy-cdap-data-pipeline.json

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature