S3toRedshift Action
- Russ Savage
- Rashi Gandhi
Introduction
One of the most efficient ways to load data into Amazon Redshift is to push it to S3 first and then issue a COPY command to load that data into Redshift. A tutorial is available here: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
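As a rough illustration of this pattern (not the plugin implementation), the sketch below connects to a Redshift cluster over JDBC and issues a COPY command against an S3 location. The cluster URL, credentials, bucket, and table name are placeholders, and it assumes the Redshift JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class S3ToRedshiftCopySketch {
  public static void main(String[] args) throws Exception {
    // Placeholder JDBC URL and master credentials for the Redshift cluster.
    String clusterDbUrl = "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev";
    try (Connection conn = DriverManager.getConnection(clusterDbUrl, "master-user", "master-password");
         Statement stmt = conn.createStatement()) {
      // COPY loads every object under the given S3 key prefix into the target table.
      stmt.execute("COPY redshifttest FROM 's3://<bucket-name>/test' "
          + "CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>' "
          + "FORMAT AS AVRO 'auto'");
    }
  }
}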
Use case(s)
- A financial customer would like to quickly load financial reports into Redshift that have been generated by a Hydrator pipeline and synced to S3. The pipeline would have an S3 sink followed by an S3ToRedshift action, which would load the data into the Redshift tables.
User Story(s)
- As a user, I would like to load data written to S3 into Redshift using the COPY command.
- I would like to authenticate with an IAM role as well as with access key and secret access key pairs.
- I would like to load multiple files into the same Redshift table.
- I would like to load data created by CDAP in S3 into Redshift.
Plugin Type
- Action
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints
---|---|---|---
Access Key | String | AWS access key for S3 | Optional
Secret Access Key | String | AWS secret access key for S3 | Optional
AWS IAM Role | String | IAM role for connecting to S3 | Optional
S3 Data Location | String | Location of the data files in the S3 bucket (including key prefix or manifest file) | Required
S3 Region | String | Region of the S3 bucket | Optional
Redshift Cluster DB Url | String | JDBC URL for connecting to the Redshift cluster | Required
Master User | String | Master user for the Redshift cluster | Required
Master User Password | String | Master user password | Required
Redshift Table Name | String | Redshift table into which the data will be loaded | Required
List of Columns | String | List of Redshift table columns to load | Optional
Design / Implementation Tips
- Follow the tutorial here to get a better understanding of what might be needed: http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
Design
{
    "name": "S3ToRedshift",
    "type": "action",
    "properties": {
        "accessKey": "access-key",
        "secretAccessKey": "secret-access-key",
        "s3DataPath": "s3://<bucket-name>/test",
        "clusterDbUrl": "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev",
        "masterUser": "master-user",
        "masterPassword": "master-password",
        "tableName": "redshifttest"
    }
}
Approach(s)
- The user can connect to the S3 bucket either using an 'access key and secret access key' pair or using an 'IAM role'.
- By default, the COPY command assumes that Redshift and S3 are in the same region. If the S3 bucket is not in the same region as the Redshift cluster, the user can provide the region using 'S3 Region'.
The existing S3 sink plugins support only those regions that support signature versions 2 and 4. Hence, the S3ToRedshift plugin will support only these regions: "us-east-1", "us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "eu-west-1" and "sa-east-1".
- Currently, the existing S3 sink plugins use the server-side encryption technique SSE-S3 (without customer-provided keys) when writing data into S3 buckets. Hence, the S3ToRedshift action plugin will load data encrypted in S3 with SSE-S3 into Redshift; no additional parameter is required in the COPY command.
- The existing S3 sink plugins write data into S3 buckets in only two formats: Avro and Parquet. Currently, the COPY command does not support the Parquet format (http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html). For now, the S3ToRedshift action plugin will support only the Avro format.
- The COPY command uses the FORMAT AS AVRO 'auto' option while loading the Avro-formatted data from S3 into Redshift (a sketch of how the command could be assembled follows this list).
- COPY automatically maps the data elements in the Avro source data to the columns in the target table by matching field names in the Avro schema to column names in the target table. The matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase, so matching field names must also be lowercase.
- The user can load data from a specific directory in the S3 bucket, for example 's3://<bucket-name>/test/2017-02-24/', or load the files whose names start with, say, '2017-02-24', through 's3://<bucket-name>/test/2017-02-24'. In the latter case, '2017-02-24' is treated as a key prefix.
- The schema of the Redshift table must match the schema of the S3 data for a successful load.
If a target column is not present in the S3 data, then COPY attempts to load NULL.
If COPY attempts to assign NULL to a column that is defined as NOT NULL in the Redshift schema, the COPY command will fail.
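The following is a rough sketch, not the actual plugin code, of how the COPY statement could be assembled from the configurables above; the class, method, and variable names are illustrative assumptions. It shows the two authentication options (key pair vs. IAM role), the optional REGION clause, the optional column list, and the Avro 'auto' format option.

// Illustrative sketch only: assembles a Redshift COPY statement from the plugin
// configurables. The structure is an assumption, not the actual plugin implementation.
public class CopyCommandSketch {
  static String buildCopyCommand(String tableName, String listOfColumns, String s3DataPath,
                                 String accessKey, String secretAccessKey, String iamRole,
                                 String s3Region) {
    StringBuilder copy = new StringBuilder("COPY ").append(tableName);
    if (listOfColumns != null && !listOfColumns.isEmpty()) {
      copy.append(" (").append(listOfColumns).append(")");            // load only the listed columns
    }
    copy.append(" FROM '").append(s3DataPath).append("'");
    if (iamRole != null && !iamRole.isEmpty()) {
      copy.append(" IAM_ROLE '").append(iamRole).append("'");          // IAM-role authentication
    } else {
      copy.append(" ACCESS_KEY_ID '").append(accessKey).append("'")    // key-pair authentication
          .append(" SECRET_ACCESS_KEY '").append(secretAccessKey).append("'");
    }
    if (s3Region != null && !s3Region.isEmpty()) {
      copy.append(" REGION '").append(s3Region).append("'");           // bucket in a different region
    }
    // Only Avro is supported for now; 'auto' maps Avro field names to (lowercase) column names.
    copy.append(" FORMAT AS AVRO 'auto'");
    return copy.toString();
  }
}

Each optional clause is appended only when the corresponding configurable is set, so the default behaviour (same region, all columns) falls back to the plain COPY ... FORMAT AS AVRO 'auto' form.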
Properties
**accessKey:** AWS access key used to connect to S3. (Macro-enabled)
**secretAccessKey:** AWS secret access key used to connect to S3. (Macro-enabled)
**iamRole:** IAM role used to connect to S3. (Macro-enabled)
**s3Region:** The region of the S3 bucket to connect to. If not specified, the plugin assumes that the S3 bucket is in
the same region as the Redshift cluster. (Macro-enabled)
**s3DataPath:** The S3 path of the bucket where the data is stored and from which it will be loaded into the Redshift
table. For example, 's3://<bucket-name>/test/2017-02-22/' (loads the files present in the ``test/2017-02-22`` directory) or
's3://<bucket-name>/test/2017-02-22' (loads the files from the ``test`` directory having the prefix ``2017-02-22``), for cases
when the user wants to load multiple files. (Macro-enabled)
**clusterDbUrl:** The JDBC URL of the Redshift cluster where the target table is present. For example,
'jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev'. (Macro-enabled)
**masterUser:** Master user for connecting to the Redshift cluster. (Macro-enabled)
**masterPassword:** Master user password for connecting to the Redshift cluster. (Macro-enabled)
**tableName:** The Redshift table into which the data from the S3 bucket will be loaded. (Macro-enabled)
**listOfColumns:** Comma-separated list of Redshift table column names, used to load only specific columns from the S3
data. If not provided, all the columns from S3 will be loaded into the Redshift table (see the end-to-end example below). (Macro-enabled)
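As a hypothetical end-to-end illustration (not the plugin's actual implementation), the sketch below shows how the connection properties ('clusterDbUrl', 'masterUser', 'masterPassword') open the JDBC connection while 'tableName', 'listOfColumns', 's3DataPath', 'iamRole', and 's3Region' feed the COPY statement; every value shown is a placeholder.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class S3ToRedshiftEndToEndSketch {
  public static void main(String[] args) throws Exception {
    // clusterDbUrl, masterUser, and masterPassword are used to open the JDBC connection.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:redshift://x.y.us-west-2.redshift.amazonaws.com:5439/dev",
             "master-user", "master-password");
         Statement stmt = conn.createStatement()) {
      // tableName, listOfColumns, s3DataPath, iamRole, and s3Region map to the COPY clauses below.
      stmt.execute("COPY redshifttest (id, name, price) "
          + "FROM 's3://<bucket-name>/test/2017-02-24/' "
          + "IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>' "
          + "REGION 'us-east-1' "
          + "FORMAT AS AVRO 'auto'");
    }
  }
}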
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
S3ToRedshift action - Using keys for connection
S3ToRedshift action - Using IAM Role for connection
S3ToRedshift action - With Snappy Compression
S3ToRedshift action - With Deflate Compression
S3ToRedshift action - With a list of selected Redshift target columns to be loaded
Sample Pipeline
S3ToRedshift_Action_With_Keys-cdap-data-pipeline.json
S3ToRedshift_Action_With_SnappyCompression-cdap-data-pipeline.json
S3ToRedshift_Action_With_DeflateCompression_copy-cdap-data-pipeline.json
S3ToRedshift_Action_With_ListOfRedshiftColumns_copy-cdap-data-pipeline.json
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature