Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The suggestion is to create a new maven project in it's own repositorysub-module in the database-plugins repo under the data-integrations organization.

Snowflake bulk API for the Sink

Using JDBC for loading data has performance limitations. Snowflake provides bulk APIs for loading data

COPY INTO <table> command loads data from staged files to an existing table. The files must already be staged in one of the following locations: https://docs.snowflake.net/manuals/user-guide-data-load.html

  • Named internal stage (or table/user stage). Files can be staged using the PUT command. 
  • Named external stage that references an external location (AWS S3, Google Cloud Storage, or Microsoft Azure). 
  • External location (AWS S3, Google Cloud Storage, or Microsoft Azure).

Example:

Code Block
-- Stages
copy into mytable from '@mystage/path 1/file 1.csv';
copy into mytable from '@%mytable/path 1/file 1.csv';
copy into mytable from '@~/path 1/file 1.csv';

-- S3 bucket
copy into mytable from 's3://mybucket 1/prefix 1/file 1.csv';

-- Azure container
copy into mytable from 'azure://myaccount.blob.core.windows.net/mycontainer/encrypted_files/file 1.csv';

Also, there is an option to use Snowpipe to load data continuously.


Although it is possible to create a Batch Sink Plugin that will accept files' locations and utilize COPY INTO <table> command to load data files into Snowflake, it seems that it's not a good idea:

  • COPY INTO <table> command uses locations of the raw data files, so it won't be possible to use this sink in the same way as other Database Sinks.
  • No transformations can be done on actual data on the CDAP since we operate on file locations and not on the actual data. It is still possible to perform a transformation on the Snowflake side using Transformation Parameters.


Thus, the suggestion is to create a Snowflake Copy Action Plugin that will utilize COPY INTO <table> command to load data files into Snowflake and COPY INTO <location> command to unload data from a table (or query) into one or more files.

Source Properties

See: 

Section

User Configuration LabelLabel DescriptionOptionsDefaultVariableUser Widget
GeneralLabelLabel for UI.


textbox

Reference NameUniquely identified name for lineage.

referenceNametextbox

Account NameFull name of Snowflake account.

accountNametextbox

Database

Database name to connect to.



databasetextbox

Import Query

Query for import data.



importQuerytextarea
CredentialsUsernameUser identity for connecting to the specified database.

usernametextbox

...


PasswordPassword to use to connect to the specified database.

passwordpassword
Key Pair AuthenticationKey Pair Authentication EnabledIf true, plugin will perform Key Pair authentication.
  • True
  • False
FalsekeyPairEnabledtoggle

Key File PathPath to the private key file.

pathtextbox

UserSnowflake login name.

usertextbox
OAuth2OAuth2 EnabledIf true, plugin will perform OAuth2 authentication.
  • True
  • False
Falseoauth2Enabledtoggle

Auth URLEndpoint for the authorization server used to retrieve the authorization code.

authUrltextbox

Token URLEndpoint for the resource server, which exchanges the authorization code for an access token.

tokenUrltextbox

Client IDClient identifier obtained during the Application registration process.

clientIdtextbox

Client SecretClient secret obtained during the Application registration process.

clientSecretpassword

ScopesScope of the access request, which might have multiple space-separated values.

scopestextbox

Refresh TokenToken used to receive accessToken, which is end product of OAuth2.

refreshTokentextbox
AdvancedBounding QueryBounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one.

boundingQuerytextarea

Split-By Field NameField Name which will be used to generate splits. Not required if numSplits is set to one.

splitBytextbox

Number of Splits to GenerateNumber of splits to generate.

numSplitstextbox

Connection ArgumentsA list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/

...

...

Notes:


Source Data Types Mapping

Snowflake Data TypesCDAP Schema Data TypeComment
NUMBER
decimalDefault precision and scale are (38,0).
DECIMALdecimalSynonymous with NUMBER.
NUMERICdecimalSynonymous with NUMBER.
INT, INTEGER, BIGINT, SMALLINTdecimalSynonymous with NUMBER, except that precision and scale cannot be specified (i.e. always defaults to NUMBER(38, 0)).
FLOAT, FLOAT4, FLOAT8 doubleSnowflake uses double-precision (64 bit) IEEE 754 floating point numbers.
DOUBLEdoubleSynonymous with FLOAT.
DOUBLE PRECISIONdoubleSynonymous with FLOAT.
REALdoubleSynonymous with FLOAT.
VARCHARstringDefault (and maximum) is 16,777,216 bytes.
CHAR, CHARACTERstringSynonymous with VARCHAR except default length is VARCHAR(1).
STRINGstringSynonymous with VARCHAR.
TEXTstringSynonymous with VARCHAR.
BINARYbytes
VARBINARYbytesSynonymous with BINARY.
BOOLEANboolean
DATEdate
DATETIMEtimestampAlias for TIMESTAMP_NTZ
TIMEtime
TIMESTAMPtimestamp/stringAlias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default).
TIMESTAMP_LTZtimestampTIMESTAMP with local time zone; time zone, if provided, is not stored.
TIMESTAMP_NTZtimestampTIMESTAMP with no time zone; time zone, if provided, is not stored.
TIMESTAMP_TZstringTIMESTAMP with time zone.
VARIANTstringA tagged universal type, which can store values of any other type, including OBJECT and ARRAY, up to a maximum size of 16 MB compressed.
OBJECTrecord
ARRAYarray

See: 

Sink Properties

SectionUser Configuration LabelLabel DescriptionOptionsDefaultVariableUser Widget
GeneralLabelLabel for UI.


textbox

Reference NameUniquely identified name for lineage.

referenceNametextbox

Account NameFull name of Snowflake account.




Database

Database name to connect to connect to.






Table Name

Name of a database table to write to.





CredentialsUsernameUser identity for connecting to the specified database.

usernametextbox

PasswordPassword to use to connect to the specified database.

passwordpassword
Key Pair AuthenticationKey Pair Authentication EnabledIf true, plugin will perform Key Pair authentication.
  • True
  • False
FalsekeyPairEnabledtoggle

Key File PathPath to the private key file.

pathtextbox

UserSnowflake login name.

usertextbox
OAuth2OAuth2 EnabledIf true, plugin will perform OAuth2 authentication.
  • True
  • False
Falseoauth2Enabledtoggle

Auth URLEndpoint for the authorization server used to retrieve the authorization code.

authUrltextbox

Token URLEndpoint for the resource server, which exchanges the authorization code for an access token.

tokenUrltextbox

Client IDClient identifier obtained during the Application registration process.

clientIdtextbox

Client SecretClient secret obtained during the Application registration process.

clientSecretpassword

ScopesScope of the access request, which might have multiple space-separated values.

scopestextbox

Refresh TokenToken used to receive accessToken, which is end product of OAuth2.

refreshTokentextbox
AdvancedBounding QueryBounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one.

boundingQuerytextarea

Split-By Field NameField Name which will be used to generate splits. Not required if numSplits is set to one.

splitBytextbox

Number of Splits to GenerateNumber of splits to generate.

numSplitstextbox

Connection ArgumentsA list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#jdbc-driver-connection-string

connectionArgumentskeyvalue

Sink Data Types Mapping

CDAP Schema Data TypeSnowflake Data TypesComment
booleanBOOLEAN
bytesBINARY
dateDATE
doubleFLOATSnowflake uses double-precision (64 bit) IEEE 754 floating point numbers.
decimalNUMBER(s, p)
floatFLOAT
intNUMBER(s, p)

Where p >= 10.


It's safe to write primitives as values of decimal logical type in the case of valid precision.

longNUMBER(s, p)

Where p >= 19.


It's safe to write primitives as values of decimal logical type in the case of valid precision.

stringVARCHAR
timeTIME
timestampTIMESTAMP_NTZ
arrayARRAY
recordOBJECT
enumVARCHAR
mapOBJECT
unionVARIANT

Action Properties


User Configuration Label

Label Description

OptionsDefault

Variable

User Widget
General

Label

Label for UI




textbox

Account NameFull name of Snowflake account.




Database

Database name to connect to connect to.






Query

SQL query to run.



querytextarea
CredentialsUsernameUser identity for connecting to the specified database.

usernametextbox

PasswordPassword to use to connect to the specified database.

passwordpassword
Key Pair AuthenticationKey Pair Authentication EnabledIf true, plugin will perform Key Pair authentication.
  • True
  • False
FalsekeyPairEnabledtoggle

Key File PathPath to the private key file.

pathtextbox

UserSnowflake login name.

usertextbox
OAuth2OAuth2 EnabledIf true, plugin will perform OAuth2 authentication.
  • True
  • False
Falseoauth2Enabledtoggle

Auth URLEndpoint for the authorization server used to retrieve the authorization code.

authUrltextbox

Token URLEndpoint for the resource server, which exchanges the authorization code for an access token.

tokenUrltextbox

Client IDClient identifier obtained during the Application registration process.

clientIdtextbox

Client SecretClient secret obtained during the Application registration process.

clientSecretpassword

ScopesScope of the access request, which might have multiple space-separated values.

scopestextbox

Refresh TokenToken used to receive accessToken, which is end product of OAuth2.

refreshTokentextbox
AdvancedBounding QueryBounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one.

boundingQuerytextarea

Split-By Field NameField Name which will be used to generate splits. Not required if numSplits is set to one.

splitBytextbox

Number of Splits to GenerateNumber of splits to generate.

numSplitstextbox

Connection ArgumentsA list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#jdbc-driver-connection-string

connectionArgumentskeyvalue

Snowflake Copy Action Plugin Properties


User Configuration Label

Label Description

OptionsDefault

Variable

User Widget
General

Label

Label for UI




textbox

...