...
The suggestion is to create a new maven project in it's own repositorysub-module in the database-plugins repo under the data-integrations organization.
Snowflake bulk API for the Sink
Using JDBC for loading data has performance limitations. Snowflake provides bulk APIs for loading data.
COPY INTO <table> command loads data from staged files to an existing table. The files must already be staged in one of the following locations: https://docs.snowflake.net/manuals/user-guide-data-load.html
- Named internal stage (or table/user stage). Files can be staged using the PUT command.
- Named external stage that references an external location (AWS S3, Google Cloud Storage, or Microsoft Azure).
- External location (AWS S3, Google Cloud Storage, or Microsoft Azure).
Example:
Code Block |
---|
-- Stages
copy into mytable from '@mystage/path 1/file 1.csv';
copy into mytable from '@%mytable/path 1/file 1.csv';
copy into mytable from '@~/path 1/file 1.csv';
-- S3 bucket
copy into mytable from 's3://mybucket 1/prefix 1/file 1.csv';
-- Azure container
copy into mytable from 'azure://myaccount.blob.core.windows.net/mycontainer/encrypted_files/file 1.csv'; |
Also, there is an option to use Snowpipe to load data continuously.
Although it is possible to create a Batch Sink Plugin that will accept files' locations and utilize COPY INTO <table> command to load data files into Snowflake, it seems that it's not a good idea:
- COPY INTO <table> command uses locations of the raw data files, so it won't be possible to use this sink in the same way as other Database Sinks.
- No transformations can be done on actual data on the CDAP since we operate on file locations and not on the actual data. It is still possible to perform a transformation on the Snowflake side using Transformation Parameters.
Thus, the suggestion is to create a Snowflake Copy Action Plugin that will utilize COPY INTO <table> command to load data files into Snowflake and COPY INTO <location> command to unload data from a table (or query) into one or more files.
Source Properties
See:
Section | User Configuration Label | Label Description | Options | Default | Variable | User Widget |
---|---|---|---|---|---|---|
General | Label | Label for UI. | textbox | |||
Reference Name | Uniquely identified name for lineage. | referenceName | textbox | |||
Account Name | Full name of Snowflake account. | accountName | textbox | |||
Database | Database name to connect to. | database | textbox | |||
Import Query | Query for import data. | importQuery | textarea | |||
Credentials | Username | User identity for connecting to the specified database. | username | textbox |
...
Password | Password to use to connect to the specified database. | password | password | |||
Key Pair Authentication | Key Pair Authentication Enabled | If true, plugin will perform Key Pair authentication. |
| False | keyPairEnabled | toggle |
Key File Path | Path to the private key file. | path | textbox | |||
User | Snowflake login name. | user | textbox | |||
OAuth2 | OAuth2 Enabled | If true, plugin will perform OAuth2 authentication. |
| False | oauth2Enabled | toggle |
Auth URL | Endpoint for the authorization server used to retrieve the authorization code. | authUrl | textbox | |||
Token URL | Endpoint for the resource server, which exchanges the authorization code for an access token. | tokenUrl | textbox | |||
Client ID | Client identifier obtained during the Application registration process. | clientId | textbox | |||
Client Secret | Client secret obtained during the Application registration process. | clientSecret | password | |||
Scopes | Scope of the access request, which might have multiple space-separated values. | scopes | textbox | |||
Refresh Token | Token used to receive accessToken, which is end product of OAuth2. | refreshToken | textbox | |||
Advanced | Bounding Query | Bounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one. | boundingQuery | textarea | ||
Split-By Field Name | Field Name which will be used to generate splits. Not required if numSplits is set to one. | splitBy | textbox | |||
Number of Splits to Generate | Number of splits to generate. | numSplits | textbox | |||
Connection Arguments | A list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/ |
...
...
guide/jdbc-configure.html#jdbc-driver-connection-string | connectionArguments | keyvalue |
Notes:
- Please, refer Plugin OAuth2 Common Module for OAuth2 common module design information
Source Data Types Mapping
Snowflake Data Types | CDAP Schema Data Type | Comment |
---|---|---|
NUMBER | decimal | Default precision and scale are (38,0). |
DECIMAL | decimal | Synonymous with NUMBER. |
NUMERIC | decimal | Synonymous with NUMBER. |
INT, INTEGER, BIGINT, SMALLINT | decimal | Synonymous with NUMBER, except that precision and scale cannot be specified (i.e. always defaults to NUMBER(38, 0)). |
FLOAT, FLOAT4, FLOAT8 | double | Snowflake uses double-precision (64 bit) IEEE 754 floating point numbers. |
DOUBLE | double | Synonymous with FLOAT. |
DOUBLE PRECISION | double | Synonymous with FLOAT. |
REAL | double | Synonymous with FLOAT. |
VARCHAR | string | Default (and maximum) is 16,777,216 bytes. |
CHAR, CHARACTER | string | Synonymous with VARCHAR except default length is VARCHAR(1). |
STRING | string | Synonymous with VARCHAR. |
TEXT | string | Synonymous with VARCHAR. |
BINARY | bytes | |
VARBINARY | bytes | Synonymous with BINARY. |
BOOLEAN | boolean | |
DATE | date | |
DATETIME | timestamp | Alias for TIMESTAMP_NTZ |
TIME | time | |
TIMESTAMP | timestamp/string | Alias for one of the TIMESTAMP variations (TIMESTAMP_NTZ by default). |
TIMESTAMP_LTZ | timestamp | TIMESTAMP with local time zone; time zone, if provided, is not stored. |
TIMESTAMP_NTZ | timestamp | TIMESTAMP with no time zone; time zone, if provided, is not stored. |
TIMESTAMP_TZ | string | TIMESTAMP with time zone. |
VARIANT | string | A tagged universal type, which can store values of any other type, including OBJECT and ARRAY, up to a maximum size of 16 MB compressed. |
OBJECT | record | |
ARRAY | array |
See:
Sink Properties
Section | User Configuration Label | Label Description | Options | Default | Variable | User Widget |
---|---|---|---|---|---|---|
General | Label | Label for UI. | textbox | |||
Reference Name | Uniquely identified name for lineage. | referenceName | textbox | |||
Account Name | Full name of Snowflake account. | |||||
Database | Database name to connect to connect to. | |||||
Table Name | Name of a database table to write to. | |||||
Credentials | Username | User identity for connecting to the specified database. | username | textbox | ||
Password | Password to use to connect to the specified database. | password | password | |||
Key Pair Authentication | Key Pair Authentication Enabled | If true, plugin will perform Key Pair authentication. |
| False | keyPairEnabled | toggle |
Key File Path | Path to the private key file. | path | textbox | |||
User | Snowflake login name. | user | textbox | |||
OAuth2 | OAuth2 Enabled | If true, plugin will perform OAuth2 authentication. |
| False | oauth2Enabled | toggle |
Auth URL | Endpoint for the authorization server used to retrieve the authorization code. | authUrl | textbox | |||
Token URL | Endpoint for the resource server, which exchanges the authorization code for an access token. | tokenUrl | textbox | |||
Client ID | Client identifier obtained during the Application registration process. | clientId | textbox | |||
Client Secret | Client secret obtained during the Application registration process. | clientSecret | password | |||
Scopes | Scope of the access request, which might have multiple space-separated values. | scopes | textbox | |||
Refresh Token | Token used to receive accessToken, which is end product of OAuth2. | refreshToken | textbox | |||
Advanced | Bounding Query | Bounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one. | boundingQuery | textarea | ||
Split-By Field Name | Field Name which will be used to generate splits. Not required if numSplits is set to one. | splitBy | textbox | |||
Number of Splits to Generate | Number of splits to generate. | numSplits | textbox | |||
Connection Arguments | A list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#jdbc-driver-connection-string | connectionArguments | keyvalue |
Sink Data Types Mapping
CDAP Schema Data Type | Snowflake Data Types | Comment |
---|---|---|
boolean | BOOLEAN | |
bytes | BINARY | |
date | DATE | |
double | FLOAT | Snowflake uses double-precision (64 bit) IEEE 754 floating point numbers. |
decimal | NUMBER(s, p) | |
float | FLOAT | |
int | NUMBER(s, p) | Where p >= 10. It's safe to write primitives as values of decimal logical type in the case of valid precision. |
long | NUMBER(s, p) | Where p >= 19. It's safe to write primitives as values of decimal logical type in the case of valid precision. |
string | VARCHAR | |
time | TIME | |
timestamp | TIMESTAMP_NTZ | |
array | ARRAY | |
record | OBJECT | |
enum | VARCHAR | |
map | OBJECT | |
union | VARIANT |
Action Properties
User Configuration Label | Label Description | Options | Default | Variable | User Widget | |
---|---|---|---|---|---|---|
General | Label | Label for UI | textbox | |||
Account Name | Full name of Snowflake account. | |||||
Database | Database name to connect to connect to. | |||||
Query | SQL query to run. | query | textarea | |||
Credentials | Username | User identity for connecting to the specified database. | username | textbox | ||
Password | Password to use to connect to the specified database. | password | password | |||
Key Pair Authentication | Key Pair Authentication Enabled | If true, plugin will perform Key Pair authentication. |
| False | keyPairEnabled | toggle |
Key File Path | Path to the private key file. | path | textbox | |||
User | Snowflake login name. | user | textbox | |||
OAuth2 | OAuth2 Enabled | If true, plugin will perform OAuth2 authentication. |
| False | oauth2Enabled | toggle |
Auth URL | Endpoint for the authorization server used to retrieve the authorization code. | authUrl | textbox | |||
Token URL | Endpoint for the resource server, which exchanges the authorization code for an access token. | tokenUrl | textbox | |||
Client ID | Client identifier obtained during the Application registration process. | clientId | textbox | |||
Client Secret | Client secret obtained during the Application registration process. | clientSecret | password | |||
Scopes | Scope of the access request, which might have multiple space-separated values. | scopes | textbox | |||
Refresh Token | Token used to receive accessToken, which is end product of OAuth2. | refreshToken | textbox | |||
Advanced | Bounding Query | Bounding Query should return the min and max of the values of the 'splitBy' field. For example, 'SELECT MIN(id),MAX(id) FROM table'. Not required if numSplits is set to one. | boundingQuery | textarea | ||
Split-By Field Name | Field Name which will be used to generate splits. Not required if numSplits is set to one. | splitBy | textbox | |||
Number of Splits to Generate | Number of splits to generate. | numSplits | textbox | |||
Connection Arguments | A list of arbitrary string tag/value pairs as connection arguments. See: https://docs.snowflake.net/manuals/user-guide/jdbc-configure.html#jdbc-driver-connection-string | connectionArguments | keyvalue |
Snowflake Copy Action Plugin Properties
User Configuration Label | Label Description | Options | Default | Variable | User Widget | |
---|---|---|---|---|---|---|
General | Label | Label for UI | textbox |
...