Snowflake Data Loading Action
- Russ Savage
- Shashank
Introduction
Data loading in Snowflake (https://docs.snowflake.net/manuals/user-guide/data-load-overview.html) is performed in two separate steps. The first step is to upload the files to a staging area, and the second step is to load the staged files into the target table with a COPY INTO command. This action plugin executes both steps using the Snowflake JDBC driver.
Use case(s)
- As a financial services company, I would like to build an analytics pipeline that pushes data into Snowflake. I would leverage Hydrator and use this plugin as an action to create files in the staging area and trigger the loading of the data.
User Story(s)
- I would like to create files in the staging area and trigger the loading of the data.
- I would like the capability to load files from both internal and external (AWS S3) locations.
Plugin Type
- Action
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Account Name | String | Snowflake account name | |
User Name | String | Username | |
Password | String | Password | |
Warehouse Name | String | Snowflake warehouse | |
Database Name | String | Database | |
Schema | String | DB schema | |
Table Name | String | Table into which the data is to be loaded | |
Stage Name | String | Staging area where the data is loaded as an intermediate step | |
Data Source | Select | Source of the data to be loaded. Defaults to Internal. | |
File Path/S3 Bucket URL | String | Path (internal) or AWS S3 URL (external) of the data to be loaded | |
File Format Type | Select | File format of the data to be loaded. Defaults to CSV. | Possible values are: CSV, AVRO, PARQUET, JSON, XML |
Authentication Method | Select | Authentication method. Defaults to Access Credentials. | For IAM authentication, the cluster must be hosted on AWS |
S3 Access Key | String | Access key for the AWS S3 bucket to connect to | Mandatory if Data Source is external (AWS S3) |
S3 Secret Access Key | String | Secret access key for the AWS S3 bucket to connect to | Mandatory if Data Source is external (AWS S3) |
Design / Implementation Tips
- See this document for an overview: https://docs.snowflake.net/manuals/user-guide/data-load-overview.html
Design
Approach(s)
1. For loading files from an internal location, the following high-level commands will be executed in Snowflake via JDBC (both paths are sketched in the code below):
   a. CREATE OR REPLACE STAGE
   b. PUT files onto the staging location from the specified path; globbing patterns (i.e. wildcards) are accepted.
   c. COPY INTO the specified Snowflake table.
2. For loading files from an external location, i.e. AWS S3, the following high-level commands will be executed in Snowflake via JDBC:
   a. CREATE OR REPLACE STAGE using the URL option
   b. COPY INTO the specified Snowflake table having a single VARIANT column.
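A minimal end-to-end sketch of both paths over JDBC follows. The connection settings, stage/table names, S3 URL, and file format choices are placeholders for illustration, not the plugin's final implementation:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Properties;

public class SnowflakeLoadSketch {
  public static void main(String[] args) throws Exception {
    // Connection settings; the plugin would populate these from its config.
    Properties props = new Properties();
    props.put("user", "<userName>");
    props.put("password", "<password>");
    props.put("warehouse", "<warehouse>");
    props.put("db", "<dbName>");
    props.put("schema", "<schema>");
    String url = "jdbc:snowflake://<accountName>.snowflakecomputing.com";

    try (Connection conn = DriverManager.getConnection(url, props);
         Statement stmt = conn.createStatement()) {
      // 1a/1b/1c: internal location -- create the stage, PUT local files
      // (wildcards accepted), then COPY INTO the target table.
      stmt.execute("CREATE OR REPLACE STAGE my_internal_stage");
      stmt.execute("PUT file:///tmp/data/*.csv @my_internal_stage");
      stmt.execute("COPY INTO my_table FROM @my_internal_stage"
          + " FILE_FORMAT = (TYPE = 'CSV')");

      // 2a/2b: external location (S3) -- create the stage with the URL
      // option and credentials, then COPY INTO the target table.
      stmt.execute("CREATE OR REPLACE STAGE my_external_stage"
          + " URL = 's3://my-bucket/path/'"
          + " CREDENTIALS = (AWS_KEY_ID = '<accessKey>'"
          + " AWS_SECRET_KEY = '<secretAccessKey>')");
      stmt.execute("COPY INTO my_variant_table FROM @my_external_stage"
          + " FILE_FORMAT = (TYPE = 'AVRO')");
    }
  }
}
```

The Snowflake JDBC driver (net.snowflake.client.jdbc.SnowflakeDriver) must be on the classpath; recent driver versions register themselves with DriverManager automatically.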
Properties
accountName: Snowflake account name
userName: Username
password: Password
warehouse: Snowflake warehouse
dbName: Database Name
schema: DB Schema
tableName: The Snowflake table name where the data will be loaded
stageName: The staging location where the data from the data source will be loaded
dataSource: Source of the data to be loaded. Defaults to Internal
path: Path (internal) or AWS S3 URL (external) of the data to be loaded
fileFormatType: File format of the data to be loaded. Defaults to CSV
s3AuthenticationMethod: Authentication method. Defaults to Access Credentials. For IAM authentication, the cluster must be hosted on AWS
accessKey: Access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3
secretAccessKey: Secret access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3
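As a sketch, these properties could map onto a CDAP PluginConfig class along the following lines; the class name, the field subset shown, and the use of @Macro and @Nullable are assumptions rather than the final design:

```java
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Macro;
import co.cask.cdap.api.plugin.PluginConfig;

import javax.annotation.Nullable;

/** Hypothetical config class mirroring the properties listed above. */
public class SnowflakeLoadConfig extends PluginConfig {

  @Description("Snowflake account name")
  @Macro
  private String accountName;

  @Description("Username")
  @Macro
  private String userName;

  @Description("Password")
  @Macro
  private String password;

  // warehouse, dbName, schema, tableName, stageName, dataSource, path,
  // fileFormatType and s3AuthenticationMethod would follow the same pattern.

  @Description("Access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3.")
  @Macro
  @Nullable
  private String accessKey;

  @Description("Secret access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3.")
  @Macro
  @Nullable
  private String secretAccessKey;
}
```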
Limitation(s)
1. The COPY INTO command ignores staged data files that were already loaded into the table. The FORCE parameter is not used, since reloading the same files is not required. https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
2. The JSON/XML/AVRO/PARQUET file formats can produce one and only one column, of type VARIANT, OBJECT, or ARRAY. Use the CSV file format to load more than one column (see the sketch below).
3. Default values for format options are used. https://docs.snowflake.net/manuals/sql-reference/sql/create-stage.html
4. The CREATE OR REPLACE STAGE command is used to create the stage, so an existing stage with a different set of files will be replaced.
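To illustrate limitation 2, a short sketch, assuming a hypothetical single-VARIANT-column table my_json_table and an already-populated stage @my_stage:

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class VariantLoadSketch {
  // Semi-structured formats (JSON/XML/AVRO/PARQUET) load into exactly one
  // column of type VARIANT, OBJECT, or ARRAY.
  static void copyJson(Connection conn) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS my_json_table (v VARIANT)");
      stmt.execute("COPY INTO my_json_table FROM @my_stage"
          + " FILE_FORMAT = (TYPE = 'JSON')");
    }
  }
}
```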
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Load Avro data from S3 to Snowflake using S3 keys
- Load Parquet data from S3 to Snowflake using S3 keys
Sample Pipeline
S3SinkAvroCredentials1_Snow_v2-cdap-data-pipeline.json
S3SinkParquetCredentials1_Snow_v2-cdap-data-pipeline.json
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature