Snowflake Data Loading Action

Introduction

Data loading in Snowflake (https://docs.snowflake.net/manuals/user-guide/data-load-overview.html) is performed in two separate steps. The first step loads the files into a staging area, and the second step loads the staged data into the target table with a COPY INTO command. This action plugin executes both steps using the Snowflake JDBC driver.

Use case(s)

  • As a financial services company, I would like to build an analytics pipeline to push data into Snowflake. I would leverage Hydrator and use this plugin as an action to create files in the staging area and trigger the loading of the data.

User Story(s)

  • I would like to create files in the staging area and trigger loading of the data.
  • I would like the capability to load files from both internal and external (S3) locations.

Plugin Type

  • Action

Configurables

This section defines properties that are configurable for this plugin. 

User Facing Name | Type | Description | Constraints
Account Name | String | Snowflake account name |
User Name | String | Username |
Password | String | Password |
Warehouse Name | String | Snowflake warehouse |
Database Name | String | Database |
Schema | String | DB schema |
Table Name | String | Table where data is to be loaded |
Stage Name | String | Staging area where data would be loaded intermediately |
Data Source | Select | Source of the data to be loaded. Defaults to Internal. |
File Path/S3 Bucket URL | String | Path (internal) or URL (AWS S3) of the data to be loaded |
File Format Type | Select | File format to specify the type of file. Defaults to CSV. Possible values are: CSV, AVRO, PARQUET, JSON, XML |
Authentication Method | Select | Authentication method. Defaults to Access Credentials. | For IAM authentication, the cluster should be hosted on AWS servers
S3 Access Key | String | Access key for AWS S3 to connect to. | Mandatory if Data Source is external (AWS S3).
S3 Secret Access Key | String | Secret access key for AWS S3 to connect to. | Mandatory if Data Source is external (AWS S3).
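
The account, user, password, warehouse, database, and schema configurables map onto a standard Snowflake JDBC connection. A minimal sketch of how the plugin might open that connection (class and parameter names here are illustrative, not the plugin's actual implementation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class SnowflakeConnectionSketch {

  // Builds a Snowflake JDBC connection from the plugin configurables.
  public static Connection connect(String accountName, String userName, String password,
                                   String warehouse, String dbName, String schema) throws Exception {
    // The Snowflake JDBC URL is derived from the account name.
    String url = "jdbc:snowflake://" + accountName + ".snowflakecomputing.com";

    Properties props = new Properties();
    props.put("user", userName);         // User Name
    props.put("password", password);     // Password
    props.put("warehouse", warehouse);   // Warehouse Name
    props.put("db", dbName);             // Database Name
    props.put("schema", schema);         // Schema

    return DriverManager.getConnection(url, props);
  }
}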

Design / Implementation Tips

Design

Approach(s)

1. For loading files from an internal location, the following high-level commands would be executed in Snowflake using JDBC (both flows are sketched after this list):
     a. CREATE OR REPLACE STAGE
     b. PUT files onto the staging location from the specified path; globbing patterns (i.e. wildcards) are accepted.
     c. COPY INTO the specified Snowflake table.
2. For loading files from an external location, i.e. S3, the following high-level commands would be executed in Snowflake using JDBC:
     a. CREATE OR REPLACE STAGE using the URL option
     b. COPY INTO the specified Snowflake table, which has a single VARIANT column.
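
A minimal sketch of both flows over JDBC, assuming a Connection obtained as in the earlier sketch. Stage, table, and path values are placeholders that would come from the plugin configuration; a real implementation should validate and escape them rather than concatenate strings:

import java.sql.Connection;
import java.sql.Statement;

public class SnowflakeLoadSketch {

  // Internal location: create the stage, PUT the local files onto it, then COPY INTO the table.
  public static void loadFromInternal(Connection conn, String stageName, String tableName,
                                      String localPath) throws Exception {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE OR REPLACE STAGE " + stageName);
      // Globbing patterns (wildcards) are accepted in the local path, e.g. /data/part-*.csv
      stmt.execute("PUT file://" + localPath + " @" + stageName);
      stmt.execute("COPY INTO " + tableName + " FROM @" + stageName
          + " FILE_FORMAT=(TYPE='CSV')");
    }
  }

  // External location (S3): create the stage with the URL option and S3 credentials,
  // then COPY INTO the table.
  public static void loadFromS3(Connection conn, String stageName, String tableName,
                                String s3Url, String accessKey, String secretAccessKey) throws Exception {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE OR REPLACE STAGE " + stageName
          + " URL='" + s3Url + "'"
          + " CREDENTIALS=(AWS_KEY_ID='" + accessKey + "' AWS_SECRET_KEY='" + secretAccessKey + "')");
      stmt.execute("COPY INTO " + tableName + " FROM @" + stageName
          + " FILE_FORMAT=(TYPE='JSON')");
    }
  }
}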

Properties

accountName: Snowflake account name
userName: Username
password: Password
warehouse: Snowflake warehouse
dbName: Database Name
schema: DB Schema
tableName: The Snowflake table name where the data will be loaded
stageName: The staging location where the data from the data source will be loaded
dataSource: Source of the data to be loaded. Defaults to Internal
path: Path (internal) or URL (AWS S3) of the data to be loaded
fileFormatType: File format to specify the type of file. Defaults to CSV
s3AuthenticationMethod: Authentication method. Defaults to Access Credentials. For IAM authentication, the cluster should be hosted on AWS servers
accessKey: Access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3
secretAccessKey: Secret access key for AWS S3 to connect to. Mandatory if dataSource is AWS S3

Limitation(s)

1. The COPY INTO command ignores staged data files that were already loaded into the table. The FORCE parameter is not used, since reloading the same files is not required. https://docs.snowflake.net/manuals/sql-reference/sql/copy-into-table.html
2. The JSON/XML/AVRO/PARQUET file formats can produce one and only one column, of type VARIANT, OBJECT, or ARRAY. Use the CSV file format if you want to load more than one column (see the sketch after this list).
3. Default values for the format options are used. https://docs.snowflake.net/manuals/sql-reference/sql/create-stage.html
4. The CREATE OR REPLACE STAGE command is used to create the stage, so that if a stage already exists with a different set of files, it will be replaced.
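
As an illustration of limitation 2, a semi-structured load targets a table with a single VARIANT column. The table and stage names below are hypothetical:

import java.sql.Connection;
import java.sql.Statement;

public class VariantLoadSketch {

  // JSON/XML/AVRO/PARQUET data lands in one VARIANT column; CSV is needed for multi-column loads.
  public static void copyIntoVariantTable(Connection conn) throws Exception {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS events_raw (v VARIANT)");
      stmt.execute("COPY INTO events_raw FROM @events_stage FILE_FORMAT=(TYPE='JSON')");
    }
  }
}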

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Load Avro data from S3 to Snowflake using S3 keys
  • Load Parquet data from S3 to Snowflake using S3 keys

Sample Pipeline

S3SinkAvroCredentials1_Snow_v2-cdap-data-pipeline.json

S3SinkParquetCredentials1_Snow_v2-cdap-data-pipeline.json

Pipeline #1

Pipeline #2

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature