Microsoft SQL Server database plugin

Introduction

A separate database plugin to support MSSQL-specific features and configurations.

Use-Case

  • Users can choose and install MSSQL source and sink plugins.
  • Users should see the MSSQL logo on the plugin configuration page for a better experience.
  • Users should get relevant information from the tool tips:
    • The tool tip for the connection string should be customized specifically for the MSSQL database.
    • Each tool tip should accurately describe what its field is used for.
  • Users should not have to specify any redundant configuration (e.g., JDBC type in the source plugin, columns in the sink plugin).
  • Users should get field-level lineage for the source and sink being used.
  • Reference documentation should be updated to account for the changes.
  • The source code for the MSSQL database plugin should be placed in a repo under the data-integrations org.
  • Integration tests for the MSSQL database plugin should be added in the test repo.
  • Data pipelines using the source and sink plugins should run on both the MapReduce and Spark engines.

User Stories

  • Users should be able to install MSSQL-specific database source and sink plugins from the Hub.
  • Users should have each tool tip accurately describe what each field does.
  • Users should get field-level lineage information for the MSSQL source and sink.
  • Users should be able to set up a pipeline without specifying redundant information.
  • Users should get an updated reference document for the MSSQL source and sink.
  • Users should be able to read all the DB types.

Plugin Type

  • Batch Source
  • Batch Sink 
  • Real-time Source
  • Real-time Sink
  • Action
  • Post-Run Action
  • Aggregate
  • Join
  • Spark Model
  • Spark Compute

Design Tips


MSSQL supports connections using Azure Active Directory (AD). To connect using AD, the library at https://github.com/AzureAD/azure-activedirectory-library-for-java needs to be on the classpath.

Information about types of AD connections: https://docs.microsoft.com/en-us/sql/connect/jdbc/connecting-using-azure-active-directory-authentication?view=sql-server-2017

MSSQL connector reference: https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-2017

Existing database plugins: https://github.com/cdapio/hydrator-plugins/tree/develop/database-plugins

MSSQL datatypes mappings and conversions: https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017


Design

The suggestion is to create a Maven submodule for MSSQL under the database-plugins repo.


Sink Properties

| User Facing Name | Type | Description | Constraints |
| --- | --- | --- | --- |
| Label | String | Label for UI | |
| Reference Name | String | Uniquely identified name for lineage | |
| Host | String | MSSQL host (serverName) | Required (defaults to localhost on UI) |
| Port | Number | The port where SQL Server is listening. If the port number is specified in the connection string, no request to SQLbrowser is made. When the port and instanceName are both specified, the connection is made to the specified port. However, the instanceName is validated and an error is thrown if it does not match the port. Important: We recommend that the port number is always specified, as this is more secure than using SQLbrowser. | Optional (default 1433) |
| Database | String | Database name to connect | Required |
| Authentication Type | Select | Indicates which SQL authentication method will be used for the connection. Use 'SQL Login' to connect to a SQL Server using username and password properties. 'Active Directory Password' can be used to connect to an Azure SQL Database/Data Warehouse using an Azure AD principal name and password. | |
| Username | String | DB username | Required |
| Password | Password | User password | Required |
| Transaction Isolation Level | Select | Transaction isolation level for queries run by this sink | |
| Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs as connection arguments. List of properties: https://docs.microsoft.com/en-us/sql/connect/jdbc/setting-the-connection-properties?view=sql-server-2017 | |
| Table Name | String | Name of a database table to write to | |
| Instance Name | String | The SQL Server instance name to connect to. When it is not specified, a connection is made to the default instance. For the case where both the instanceName and port are specified, see the notes for port. If you specify a Virtual Network Name in the Server connection property, you cannot use instanceName connection property. | Optional |
| Query Timeout | Number | The number of seconds to wait before a timeout has occurred on a query. The default value is -1, which means infinite timeout. Setting this to 0 also implies to wait indefinitely. | Optional |
| Connect Timeout | Number | Time in seconds to wait for a connection to the server before terminating the attempt and generating an error. | Optional |
| Column Encryption | Select | Default column encryption setting for all the commands on the connection. When enabled the JDBC driver will transparently encrypt and decrypt sensitive data stored in encrypted database columns in the SQL Server. Possible values are: 'Enabled' and 'Disabled'. Default: 'Disabled'. | |
| Encrypt | Select | When set to 'Yes', SQL Server uses SSL encryption for all data sent between the client and server if the server has a certificate installed. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Trust Server Certificate | Select | When set to 'Yes' (and encryption enabled), SQL Server uses SSL encryption for all data sent between the client and server without validating the server certificate. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Workstation ID | String | Used to identify the specific workstation in various SQL Server profiling and logging tools. | Optional |
| Failover Partner | String | The name or network address of the instance of SQL Server that acts as failover partner. | Optional |
| Packet Size | Number | The network packet size used to communicate with SQL Server, specified in bytes. It's not recommended to specify packet size property when the encryption is enabled. Otherwise, the driver might raise a connection error. | Optional |
| Current Language | String | Must correspond to the SQL Server language record name and specifies the language environment for the session. The session language determines the datetime formats and system messages. | Optional |
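
As a rough illustration of how the named fields above could be turned into a connection string, the sketch below assembles a URL in the Microsoft JDBC driver's documented form `jdbc:sqlserver://serverName[\instanceName][:portNumber][;property=value]`. It is written in Python for brevity and is not the plugin's implementation. The function name `build_jdbc_url` and the pairing of the UI select labels with driver values (`SqlPassword`, `ActiveDirectoryPassword`, `true`/`false`) are assumptions made for this example.

```python
# Assumed mapping from the UI select labels in this design to values accepted
# by the Microsoft JDBC driver. The labels come from the property table; the
# pairing with these driver values is an assumption of this sketch.
AUTHENTICATION = {
    "SQL Login": "SqlPassword",
    "Active Directory Password": "ActiveDirectoryPassword",
}
YES_NO = {"Yes": "true", "No": "false"}


def build_jdbc_url(host="localhost", port=1433, database=None,
                   instance_name=None, properties=None):
    """Assemble jdbc:sqlserver://host[\\instance][:port][;property=value...].

    Hypothetical helper: values are not escaped, so values containing ';'
    or braces are not handled here.
    """
    server = host
    if instance_name:
        server += "\\" + instance_name
    if port is not None:
        server += ":" + str(port)
    parts = ["jdbc:sqlserver://" + server]
    if database:
        parts.append("databaseName=" + database)
    # Remaining driver properties are appended in insertion order.
    for key, value in (properties or {}).items():
        parts.append(f"{key}={value}")
    return ";".join(parts)


example_url = build_jdbc_url(
    host="localhost",
    port=1433,
    database="sales",
    properties={
        "authentication": AUTHENTICATION["SQL Login"],
        "encrypt": YES_NO["Yes"],
        "trustServerCertificate": YES_NO["No"],
    },
)
```

Username and password are deliberately left out of the URL in this sketch; they would normally be passed to the driver separately so that they do not end up in logs.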

Source Properties


| User Facing Name | Type | Description | Constraints |
| --- | --- | --- | --- |
| Label | String | Label for UI | |
| Reference Name | String | Uniquely identified name for lineage | |
| Host | String | MSSQL host (serverName) | Required (defaults to localhost on UI) |
| Port | Number | The port where SQL Server is listening. If the port number is specified in the connection string, no request to SQLbrowser is made. When the port and instanceName are both specified, the connection is made to the specified port. However, the instanceName is validated and an error is thrown if it does not match the port. Important: We recommend that the port number is always specified, as this is more secure than using SQLbrowser. | Optional (default 1433) |
| Database | String | Database name to connect | Required |
| Import Query | String | Query for importing data | Valid SQL query |
| Authentication Type | Select | Indicates which SQL authentication method will be used for the connection. Use 'SQL Login' to connect to a SQL Server using username and password properties. 'Active Directory Password' can be used to connect to an Azure SQL Database/Data Warehouse using an Azure AD principal name and password. | |
| Username | String | DB username | Required |
| Password | String | User password | Required |
| Bounding Query | String | Returns the max and min of the Split-By Field | Valid SQL query |
| Split-By Field Name | String | Field name which will be used to generate splits | |
| Number of Splits to Generate | Number | Number of splits to generate | |
| Transaction Isolation Level | Select | Transaction isolation level for queries run by this source | |
| Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs as connection arguments. List of properties: https://docs.microsoft.com/en-us/sql/connect/jdbc/setting-the-connection-properties?view=sql-server-2017 | |
| Instance Name | String | The SQL Server instance name to connect to. When it is not specified, a connection is made to the default instance. For the case where both the instanceName and port are specified, see the notes for port. If you specify a Virtual Network Name in the Server connection property, you cannot use instanceName connection property. | Optional |
| Query Timeout | Number | The number of seconds to wait before a timeout has occurred on a query. The default value is -1, which means infinite timeout. Setting this to 0 also implies to wait indefinitely. | Optional |
| Connect Timeout | Number | Time in seconds to wait for a connection to the server before terminating the attempt and generating an error. | Optional |
| Column Encryption | Select | Default column encryption setting for all the commands on the connection. When enabled the JDBC driver will transparently encrypt and decrypt sensitive data stored in encrypted database columns in the SQL Server. Possible values are: 'Enabled' and 'Disabled'. Default: 'Disabled'. | |
| Encrypt | Select | When set to 'Yes', SQL Server uses SSL encryption for all data sent between the client and server if the server has a certificate installed. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Trust Server Certificate | Select | When set to 'Yes' (and encryption enabled), SQL Server uses SSL encryption for all data sent between the client and server without validating the server certificate. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Workstation ID | String | Used to identify the specific workstation in various SQL Server profiling and logging tools. | Optional |
| Failover Partner | String | The name or network address of the instance of SQL Server that acts as failover partner. | Optional |
| Packet Size | Number | The network packet size used to communicate with SQL Server, specified in bytes. It's not recommended to specify packet size property when the encryption is enabled. Otherwise, the driver might raise a connection error. | Optional |
| Current Language | String | Must correspond to the SQL Server language record name and specifies the language environment for the session. The session language determines the datetime formats and system messages. | Optional |
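
To make the relationship between Bounding Query, Split-By Field Name and Number of Splits to Generate more concrete, here is a minimal Python sketch of one way an integer split-by column could be divided into ranges. It assumes the bounding query has already returned an integer minimum and maximum. The helper names `split_ranges` and `split_conditions` and the example column `id` are hypothetical, and the actual plugin may partition the range differently.

```python
def split_ranges(min_value, max_value, num_splits):
    """Divide the inclusive integer range [min_value, max_value] into at most
    num_splits contiguous, non-overlapping ranges of near-equal size."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_value < min_value:
        raise ValueError("max_value must not be smaller than min_value")
    total = max_value - min_value + 1
    # Never create more splits than there are distinct values.
    num_splits = min(num_splits, total)
    base, extra = divmod(total, num_splits)
    ranges = []
    start = min_value
    for i in range(num_splits):
        # The first `extra` splits take one additional value each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def split_conditions(field, ranges):
    """Render each range as a SQL predicate on the split-by field."""
    return [f"{field} >= {lo} AND {field} <= {hi}" for lo, hi in ranges]


# Suppose the bounding query returned MIN(id) = 1 and MAX(id) = 10.
conditions = split_conditions("id", split_ranges(1, 10, 3))
```

Each resulting predicate would restrict one split's copy of the import query, so that the splits together cover every row exactly once.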


Action Properties


| User Facing Name | Type | Description | Constraints |
| --- | --- | --- | --- |
| Label | String | Label for UI | |
| Host | String | MSSQL host (serverName) | Required (defaults to localhost on UI) |
| Port | Number | The port where SQL Server is listening. If the port number is specified in the connection string, no request to SQLbrowser is made. When the port and instanceName are both specified, the connection is made to the specified port. However, the instanceName is validated and an error is thrown if it does not match the port. Important: We recommend that the port number is always specified, as this is more secure than using SQLbrowser. | Optional (default 1433) |
| Database | String | Database name to connect | Required |
| Authentication Type | Select | Indicates which SQL authentication method will be used for the connection. Use 'SQL Login' to connect to a SQL Server using username and password properties. 'Active Directory Password' can be used to connect to an Azure SQL Database/Data Warehouse using an Azure AD principal name and password. | |
| Username | String | DB username | Required |
| Password | String | User password | Required |
| Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs as connection arguments. List of properties: https://docs.microsoft.com/en-us/sql/connect/jdbc/setting-the-connection-properties?view=sql-server-2017 | |
| Database Command | String | Database command to run | Valid SQL query |
| Instance Name | String | The SQL Server instance name to connect to. When it is not specified, a connection is made to the default instance. For the case where both the instanceName and port are specified, see the notes for port. If you specify a Virtual Network Name in the Server connection property, you cannot use instanceName connection property. | Optional |
| Query Timeout | Number | The number of seconds to wait before a timeout has occurred on a query. The default value is -1, which means infinite timeout. Setting this to 0 also implies to wait indefinitely. | Optional |
| Application Intent | Select | Declares the application workload type when connecting to a server. Possible values: 'ReadWrite' and 'ReadOnly'. Default: 'ReadWrite'. | |
| Connect Timeout | Number | Time in seconds to wait for a connection to the server before terminating the attempt and generating an error. | Optional |
| Column Encryption | Select | Default column encryption setting for all the commands on the connection. When enabled the JDBC driver will transparently encrypt and decrypt sensitive data stored in encrypted database columns in the SQL Server. Possible values are: 'Enabled' and 'Disabled'. Default: 'Disabled'. | |
| Encrypt | Select | When set to 'Yes', SQL Server uses SSL encryption for all data sent between the client and server if the server has a certificate installed. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Trust Server Certificate | Select | When set to 'Yes' (and encryption enabled), SQL Server uses SSL encryption for all data sent between the client and server without validating the server certificate. Possible values are: 'Yes' and 'No'. Default: 'No'. | |
| Workstation ID | String | Used to identify the specific workstation in various SQL Server profiling and logging tools. | Optional |
| Failover Partner | String | The name or network address of the instance of SQL Server that acts as failover partner. | Optional |
| Packet Size | Number | The network packet size used to communicate with SQL Server, specified in bytes. It's not recommended to specify packet size property when the encryption is enabled. Otherwise, the driver might raise a connection error. | Optional |
| Current Language | String | Must correspond to the SQL Server language record name and specifies the language environment for the session. The session language determines the datetime formats and system messages. | Optional |
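
Several of the properties above interact: Query Timeout treats both -1 and 0 as "wait indefinitely", Packet Size is discouraged when encryption is enabled, and Trust Server Certificate only matters when Encrypt is 'Yes'. The Python sketch below shows how such rules might be checked before a connection is opened. The function names and the configuration keys (`encrypt`, `packetSize`, `trustServerCertificate`) are hypothetical and chosen only for this example; the plugin's real validation may differ.

```python
def effective_query_timeout(seconds):
    """Return the timeout in seconds, or None when the query may wait
    indefinitely (the property is unset, -1, or 0)."""
    if seconds is None or seconds in (-1, 0):
        return None
    if seconds < -1:
        raise ValueError("query timeout must be -1, 0, or a positive number")
    return seconds


def config_warnings(config):
    """Collect human-readable warnings for questionable property combinations.

    `config` is a plain dict keyed by hypothetical property names.
    """
    warnings = []
    encrypt = config.get("encrypt", "No") == "Yes"
    if encrypt and config.get("packetSize") is not None:
        warnings.append(
            "packetSize is set while encryption is enabled; "
            "the driver might raise a connection error"
        )
    if not encrypt and config.get("trustServerCertificate", "No") == "Yes":
        warnings.append(
            "trustServerCertificate has no effect unless encrypt is 'Yes'"
        )
    return warnings
```

Returning warnings instead of raising errors keeps the sketch close to the wording of the table, which recommends against these combinations rather than forbidding them.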


Data Types Mapping

| MS SQL Data Type | CDAP Schema Data Type | Support | Comment |
| --- | --- | --- | --- |
| BIGINT | Schema.Type.LONG | + | |
| BINARY | Schema.Type.BYTES | + | |
| BIT | Schema.Type.BOOLEAN | + | |
| CHAR | Schema.Type.STRING | + | |
| DATE | Schema.LogicalType.DATE | + | |
| DATETIME | Schema.LogicalType.TIMESTAMP_MICROS | + | |
| DATETIME2 | Schema.LogicalType.TIMESTAMP_MICROS | + | |
| DATETIMEOFFSET | Schema.Type.STRING | * | |
| DECIMAL | Schema.LogicalType.DECIMAL | + | |
| FLOAT | Schema.Type.DOUBLE | + | |
| IMAGE | Schema.Type.BYTES | + | |
| INT | Schema.Type.INT | + | |
| MONEY | Schema.LogicalType.DECIMAL | + | |
| NCHAR | Schema.Type.STRING | + | |
| NTEXT | Schema.Type.STRING | + | |
| NUMERIC | Schema.LogicalType.DECIMAL | + | |
| NVARCHAR | Schema.Type.STRING | + | |
| NVARCHAR(MAX) | Schema.Type.STRING | + | |
| REAL | Schema.Type.FLOAT | + | |
| SMALLDATETIME | Schema.LogicalType.TIMESTAMP_MICROS | + | |
| SMALLINT | Schema.Type.INT | + | |
| SMALLMONEY | Schema.LogicalType.DECIMAL | + | |
| TEXT | Schema.Type.STRING | + | |
| TIME | Schema.LogicalType.TIME_MICROS | * | TIME data type has the accuracy of 100 nanoseconds which can not be supported using TIME_MICROS logical type. |
| TIMESTAMP | Schema.Type.BYTES | * | TIMESTAMP is the synonym for the ROWVERSION data type, values of which are automatically generated. Thus TIMESTAMP can not be supported by Sink plugin. |
| TINYINT | Schema.Type.INT | + | |
| UDT | Depends on basic type | + | UDT will be mapped as basic type if it's an alias of this type. CLR UDT mapped to Schema.Type.BYTES. |
| UNIQUEIDENTIFIER | Schema.Type.STRING | + | |
| VARBINARY | Schema.Type.BYTES | + | |
| VARBINARY(MAX) | Schema.Type.BYTES | + | |
| VARCHAR | Schema.Type.STRING | + | |
| VARCHAR(MAX) | Schema.Type.STRING | + | |
| XML | Schema.Type.STRING | + | |
| SQLVARIANT | Schema.Type.STRING | * | |
| GEOMETRY | Schema.Type.BYTES | + | |
| GEOGRAPHY | Schema.Type.BYTES | + | |
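
The comment on TIME can be illustrated with a little arithmetic. A TIME value with 100-nanosecond accuracy has seven fractional digits, while TIME_MICROS keeps six, so the last digit is lost on conversion. The Python sketch below shows a lookup over a few rows of the mapping table and that truncation. The names `TYPE_MAPPING` and `ticks_to_time_micros`, and the choice to truncate rather than round, are assumptions made for this example only.

```python
# A few rows copied from the mapping table, expressed as a lookup.
TYPE_MAPPING = {
    "BIGINT": "Schema.Type.LONG",
    "BIT": "Schema.Type.BOOLEAN",
    "DATETIME2": "Schema.LogicalType.TIMESTAMP_MICROS",
    "TIME": "Schema.LogicalType.TIME_MICROS",
    "UNIQUEIDENTIFIER": "Schema.Type.STRING",
}

TICKS_PER_SECOND = 10_000_000  # one tick = 100 nanoseconds


def ticks_to_time_micros(ticks):
    """Convert a time of day given in 100 ns ticks since midnight to
    microseconds, returning (micros, nanoseconds_lost_by_truncation)."""
    if ticks < 0:
        raise ValueError("ticks must be non-negative")
    micros, remainder = divmod(ticks, 10)
    return micros, remainder * 100


# 12:34:56.1234567 expressed in 100 ns ticks since midnight.
seconds_of_day = 12 * 3600 + 34 * 60 + 56
ticks = seconds_of_day * TICKS_PER_SECOND + 1_234_567
micros, lost_ns = ticks_to_time_micros(ticks)
# micros corresponds to 12:34:56.123456; the trailing 700 ns are dropped.
```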


Approach

Create a module mssql-plugin in the database-plugins project, reusing existing database-plugins code where possible. Add MSSQL-specific properties to the configuration and add support for MSSQL-specific data types. Update the UI widget JSON definitions.

Pipeline Samples


API changes

Deprecated Programmatic APIs

database-plugins is moved to Data Integrations

UI Impact or Changes

Configurable database properties are presented as named text fields instead of arbitrary key-value pairs. The MSSQL source and sink are separate entries with the MSSQL logo in the source and sink lists.

Test Scenarios

TODO

Releases

Release X.Y.Z

Related Work

Database plugin enhancements

Future work

Auroradb database plugin