S3ToRedshift Action

Introduction

One of the most efficient ways to load data into Amazon Redshift is to push it to Amazon S3 first and then issue a COPY command to load that data into Redshift. AWS provides a tutorial on this approach: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
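
At its core, the load is a single COPY statement issued against the cluster over JDBC. The sketch below is illustrative only (the bucket, table, and credential values are placeholders, and it assumes the Amazon Redshift JDBC driver is on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CopyFromS3Example {
      public static void main(String[] args) throws Exception {
        // Connect to the Redshift cluster over JDBC (placeholder URL and credentials).
        String url = "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "master-user", "master-password");
             Statement stmt = conn.createStatement()) {
          // Load every object under the 'test' key prefix into the 'redshifttest' table.
          stmt.execute("COPY redshifttest "
              + "FROM 's3://<bucket-name>/test' "
              + "CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>' "
              + "FORMAT AS AVRO 'auto'");
        }
      }
    }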

Use case(s)

  • A financial customer would like to quickly load financial reports into Redshift that have been generated by a Hydrator pipeline and synced to S3. The pipeline would have an S3 sink followed by an S3ToRedshift action, which would load the data into the Redshift tables.

User Story(s)

  • As a user, I would like to load data written to S3 into Redshift using the COPY command.
  • I would like to authenticate with IAM credentials as well as with access and secret key pairs.
  • I would like to load multiple files into the same Redshift table.
  • I would like to load data created in S3 by CDAP into Redshift.

Plugin Type

  • Action

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name        | Type   | Description                                                                          | Constraints |
|-------------------------|--------|--------------------------------------------------------------------------------------|-------------|
| Access Key              | String | AWS access key for S3                                                                 | Optional    |
| Secret Access Key       | String | AWS secret access key for S3                                                          | Optional    |
| AWS IAM Role            | String | IAM role                                                                              | Optional    |
| S3 Data Location        | String | Location of the data files in the S3 bucket (including key prefix or manifest file)   | Required    |
| S3 Region               | String | Region for S3                                                                         | Optional    |
| Redshift Cluster DB Url | String | JDBC Redshift DB URL for connecting to the Redshift cluster                           | Required    |
| Master User             | String | Master user for Redshift                                                              | Required    |
| Master User Password    | String | Master user password                                                                  | Required    |
| Redshift Table Name     | String | Redshift table name into which data will be loaded                                    | Required    |
| List of Columns         | String | List of Redshift table columns                                                        | Optional    |

Design / Implementation Tips

Design

    {
        "name": "S3ToRedshift",
        "type": "action",
        "properties": {
            "accessKey": "access-key",
            "secretAccessKey": "secret-access-key",
            "s3DataPath": "s3://<bucket-name>/test",
            "clusterDbUrl": "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev",
            "masterUser": "master-user",
            "masterPassword": "master-password",
            "tableName": "redshifttest"
        }
    }

Approach(s)

  1. The user can connect to the S3 buckets either using an 'access and secret access key' pair or using an 'IAM role' (see the COPY assembly sketch after this list).
  2. By default, the COPY command assumes that Redshift and S3 are in the same region. If the S3 bucket is not in the same region as the Redshift cluster, the user can provide the region using 'S3 Region'.
  3. The existing S3 sink plugins support only those regions that have signature version 2 and version 4 support. Hence, the S3ToRedshift plugin will support only these regions: "us-east-1", "us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "eu-west-1" and "sa-east-1".
  4. Currently, the existing S3 sink plugins use the server-side encryption technique SSE-S3 (without customer-provided keys) for encrypting and writing the data into S3 buckets. Hence, the S3ToRedshift action plugin will load data encrypted with SSE-S3 from the S3 buckets into Redshift. No additional parameter is required in the COPY command.
  5. The existing S3 sink plugins write data into S3 buckets in only two formats: 'Avro' and 'Parquet'. The COPY command currently does not support the Parquet format (http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html). For now, the S3ToRedshift action plugin will support only the Avro format.
  6. The COPY command uses 'FORMAT AS AVRO' with the 'auto' option while loading the Avro-formatted data from S3 into Redshift.
    1. COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table. The matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase, so matching field names must also be lowercase.
  7. The user can load the data from a specific directory in the S3 bucket, for example 's3://<bucket-name>/test/2017-02-24/', or load the files in the bucket whose names start with, say, '2017-02-24', through 's3://<bucket-name>/test/2017-02-24'. In the latter case, '2017-02-24' is treated as a key prefix.
  8. The schema of the Redshift table must match the schema of the S3 data for a successful load.
    1. If a target column is not present in the S3 data, COPY attempts to load NULL.
    2. If COPY attempts to assign NULL to a column that is defined as NOT NULL in the Redshift schema, the COPY command will fail.
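
To make the credential, region, and format handling above concrete, here is a minimal sketch of how the plugin might assemble the COPY statement from its configured properties. This is illustrative only; the method name and argument handling are assumptions, not the actual implementation.

    // Illustrative sketch only: assembles a Redshift COPY statement from the plugin's
    // configured properties. Names and structure are assumptions for this document.
    static String buildCopyStatement(String tableName, String listOfColumns, String s3DataPath,
                                     String accessKey, String secretAccessKey, String iamRole,
                                     String s3Region) {
      StringBuilder copy = new StringBuilder("COPY ").append(tableName);
      if (listOfColumns != null && !listOfColumns.isEmpty()) {
        // Load only the listed columns from the S3 data.
        copy.append(" (").append(listOfColumns).append(")");
      }
      copy.append(" FROM '").append(s3DataPath).append("'");
      if (iamRole != null && !iamRole.isEmpty()) {
        // IAM-role-based authorization.
        copy.append(" IAM_ROLE '").append(iamRole).append("'");
      } else {
        // Key-based authorization.
        copy.append(" CREDENTIALS 'aws_access_key_id=").append(accessKey)
            .append(";aws_secret_access_key=").append(secretAccessKey).append("'");
      }
      if (s3Region != null && !s3Region.isEmpty()) {
        // Only needed when the S3 bucket and the Redshift cluster are in different regions.
        copy.append(" REGION '").append(s3Region).append("'");
      }
      // Avro field names are matched to (lowercase) Redshift column names.
      copy.append(" FORMAT AS AVRO 'auto'");
      return copy.toString();
    }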

Properties

**accessKey:** Access key for AWS S3 to connect to. (Macro-enabled)

**secretAccessKey:** Secret access key for AWS S3 to connect to. (Macro-enabled)

**iamRole:** IAM Role for AWS S3 to connect to. (Macro-enabled)

**s3Region:** The region for AWS S3 to connect to. If not specified, the plugin assumes that the S3 bucket is in
the same region as the Redshift cluster. (Macro-enabled)

**s3DataPath:** The S3 path of the bucket where the data is stored and from which it will be loaded into the Redshift
table. For example, 's3://<bucket-name>/test/2017-02-22/' (loads the files present in the ``test/2017-02-22``
directory) or 's3://<bucket-name>/test/2017-02-22' (loads the files from the ``test`` directory having the prefix
``2017-02-22``), for cases where the user wants to load multiple files. (Macro-enabled)

**clusterDbUrl:** The JDBC Redshift database URL for the Redshift cluster where the table is present. For example,
'jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev'. (Macro-enabled)

**masterUser:** Master user for the Redshift cluster to connect to. (Macro-enabled)

**masterPassword:** Master password for the Redshift cluster to connect to. (Macro-enabled)

**tableName:** The Redshift table name into which the data from the S3 bucket will be loaded. (Macro-enabled)

**listOfColumns:** Comma-separated list of the Redshift table column names, used to load only specific columns from
the S3 data. If not provided, all the columns from S3 will be loaded into the Redshift table. (Macro-enabled)
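
For example, a configuration that supplies the optional 'iamRole', 's3Region', and 'listOfColumns' properties would result in a COPY statement along these lines (all values below are placeholders):

    // Hypothetical COPY statement for a configuration that sets the optional properties;
    // the role ARN, region, bucket, and column names are placeholders.
    String copy = "COPY redshifttest (id, name, amount) "
        + "FROM 's3://<bucket-name>/test/2017-02-22/' "
        + "IAM_ROLE '<iam-role-arn>' "
        + "REGION 'us-east-1' "
        + "FORMAT AS AVRO 'auto'";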

Security

Limitation(s)

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

S3ToRedshift action - Using keys for connection

S3ToRedshift action - Using IAM Role for connection

S3ToRedshift action - With Snappy Compression

S3ToRedshift action - With Deflate Compression

S3ToRedshift action - With a list of selected Redshift target columns to be loaded

Sample Pipeline

S3ToRedshift_Action_With_Keys-cdap-data-pipeline.json

S3ToRedshift_Action_With_SnappyCompression-cdap-data-pipeline.json

S3ToRedshift_Action_With_DeflateCompression_copy-cdap-data-pipeline.json

S3ToRedshift_Action_With_ListOfRedshiftColumns_copy-cdap-data-pipeline.json

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature