Introduction
A separate database plugin to support PostgreSQL-specific features and configurations.
Use-Case
- Users can choose and install PostgreSQL source and sink plugins.
- Users should see the PostgreSQL logo on the plugin configuration page for a better experience.
- Users should get relevant information from the tool tip:
- The tool tip for the connection string should be customized specifically to the PostgreSQL database.
- The tool tip should describe accurately what each field is used for.
- Users should get performance comparable to Sqoop by utilizing Sqoop libraries for data ingestion and egress.
- Users should not have to specify any redundant configuration (e.g. JDBC type in the source plugin, columns in the sink plugin).
- Users should get field level lineage for the source and sink that is being used.
- Reference documentation should be updated to account for the changes.
- The source code for the PostgreSQL database plugin should be placed in a repo under the data-integrations org.
- Integration tests for the PostgreSQL database plugin should be added in the test repo.
- The data pipeline using source and sink plugins should run on both mapreduce and spark engines.
User Stories
- Users should be able to install PostgreSQL-specific database source and sink plugins from the Hub
- Users should have each tool tip accurately describe what each field does
- Users should know the format for the PostgreSQL connection string by hovering over the tool tip for the connection string
- Users should get field level lineage information for the PostgreSQL source and sink
- Users should get performance comparable to Sqoop when ingesting data from PostgreSQL and while writing data to PostgreSQL (within ~15% of the time taken by Sqoop)
- Users should be able to set up a pipeline without specifying redundant information
- Users should get an updated reference document for the PostgreSQL source and sink
- Users should be able to read all the DB types
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Design Tips
PostgreSQL connector reference: https://jdbc.postgresql.org/download/postgresql-9.4.1211.jar
Existing database plugins: https://github.com/cdapio/hydrator-plugins/tree/develop/database-plugins
PostgreSQL data type mappings and conversions: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-type-conversions.html
Design
Currently the two major MySQL versions supported are 5 and 8. We suggest using MySQL Connector/J 8.0 since it is backward compatible with older versions of MySQL and supports all the new features of recent releases.
The suggestion is to create a Maven submodule for PostgreSQL under the database-plugins repo in the data-integrations organization, as described in Plugins Repo Split. There is existing code in database-plugins that may be reused for the PostgreSQL plugin. We suggest a multi-module Maven project where the existing `database-plugins` will be a common functionality module for all subsequent DB plugins, and each plugin for a specific database (in this case PostgreSQL) will depend on it. Having each DB plugin in a dedicated module allows us to create separately deliverable artifacts, so users can upload only the plugins they need.
Sink Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Label | String | Label for UI | |
Reference Name | String | Uniquely identified name for lineage | |
Host | String | PostgreSQL host | Required (defaults to localhost on UI) |
Port | Number | Specific port PostgreSQL is running on | Optional (default 5432) |
Database | String | Database name to connect | Required |
Username | String | DB username | Required |
Password | Password | User password | Required |
Transaction Isolation Level | Select | Transaction isolation level for queries run by this sink | |
Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs to pass as connection arguments | |
Table Name | String | Name of a database table to write to | |
Connect Timeout | Number | The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled |
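
As an illustration of how the Host, Port, Database, Connection Arguments and Connect Timeout properties above could be combined, here is a minimal sketch. The class `PostgresConnectionString` and its `build` method are hypothetical names, not existing database-plugins code. The sketch assumes the standard `jdbc:postgresql://host:port/database` URL form and assumes the timeout is passed as the PostgreSQL JDBC driver's `connectTimeout` parameter, whose seconds/zero-disables semantics match the description in the table.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper for illustration only; not part of database-plugins. */
final class PostgresConnectionString {
  // Default taken from the Port row of the properties table.
  static final int DEFAULT_PORT = 5432;

  static String build(String host, Integer port, String database,
                      Map<String, String> connectionArguments,
                      Integer connectTimeoutSeconds) {
    if (host == null || host.isEmpty()) {
      throw new IllegalArgumentException("Host is required");
    }
    if (database == null || database.isEmpty()) {
      throw new IllegalArgumentException("Database is required");
    }
    int effectivePort = (port == null) ? DEFAULT_PORT : port;

    // Keep insertion order so the resulting URL is deterministic.
    Map<String, String> params = new LinkedHashMap<>();
    if (connectionArguments != null) {
      params.putAll(connectionArguments);
    }
    // Assumption: the timeout maps to the driver's "connectTimeout" parameter.
    if (connectTimeoutSeconds != null) {
      params.put("connectTimeout", String.valueOf(connectTimeoutSeconds));
    }

    StringBuilder url = new StringBuilder("jdbc:postgresql://")
        .append(host).append(':').append(effectivePort)
        .append('/').append(database);
    if (!params.isEmpty()) {
      StringJoiner query = new StringJoiner("&", "?", "");
      for (Map.Entry<String, String> e : params.entrySet()) {
        query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
      }
      url.append(query);
    }
    return url.toString();
  }

  private static String encode(String value) {
    try {
      return URLEncoder.encode(value, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always available", e);
    }
  }
}
```

Building the URL inside the plugin is one way to meet the goal of not asking users for redundant configuration such as the JDBC type; the same helper could presumably be shared by the source, sink and action.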
Source Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Label | String | Label for UI | |
Reference Name | String | Uniquely identified name for lineage | |
Host | String | PostgreSQL host | Required (defaults to localhost on UI) |
Port | Number | Specific port PostgreSQL is running on | Optional (default 5432) |
Database | String | Database name to connect | Required |
Import Query | String | Query used to import data | Valid SQL query |
Username | String | DB username | Required |
Password | String | User password | Required |
Bounding Query | String | Returns the min and max of the Split-By field | Valid SQL query |
Split-By Field Name | String | Field name which will be used to generate splits | |
Number of Splits to Generate | Number | Number of splits to generate | |
Transaction Isolation Level | Select | Transaction isolation level for queries run by this source | |
Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs as connection arguments; for the list of properties see https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters | |
Connect Timeout | Number | The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled |
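
To make the relationship between Bounding Query, Split-By Field Name and Number of Splits to Generate more concrete, the following is a rough sketch of range-based splitting over an integer field. `SplitPlanner` and `conditions` are hypothetical names; the real plugin presumably delegates splitting to the underlying input format, so this shows only the idea, assuming the bounding query has already returned `min` and `max` and that the values are far from the limits of `long`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of range splitting; not the plugin's actual code. */
final class SplitPlanner {

  /**
   * Returns WHERE-clause fragments that together cover [min, max]
   * in at most numSplits contiguous, non-overlapping ranges.
   */
  static List<String> conditions(String splitBy, long min, long max, int numSplits) {
    if (numSplits < 1) {
      throw new IllegalArgumentException("numSplits must be at least 1");
    }
    if (min > max) {
      throw new IllegalArgumentException("min must not exceed max");
    }
    List<String> result = new ArrayList<>();
    long span = max - min + 1;
    // Ceiling division, so every value in the span lands in some range.
    long size = Math.max(1, (span + numSplits - 1) / numSplits);
    for (long lo = min; lo <= max; lo += size) {
      long hi = Math.min(max, lo + size - 1);
      result.add(splitBy + " >= " + lo + " AND " + splitBy + " <= " + hi);
    }
    return result;
  }
}
```

For example, `SplitPlanner.conditions("id", 1, 10, 3)` yields three ranges covering ids 1 to 4, 5 to 8 and 9 to 10. Each fragment would be combined with the import query so that splits can be read in parallel.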
Action Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Label | String | Label for UI | |
Host | String | PostgreSQL host | Required (defaults to localhost on UI) |
Port | Number | Specific port PostgreSQL is running on | Optional (default 5432) |
Database | String | Database name to connect | Required |
Username | String | DB username | Required |
Password | String | User password | Required |
Connection Arguments | Keyvalue | A list of arbitrary string tag/value pairs as connection arguments; for the list of properties see https://jdbc.postgresql.org/documentation/head/connect.html#connection-parameters | |
Database Command | String | Database command to run | Valid SQL query |
Connect Timeout | Number | The timeout value used for socket connect operations. If connecting to the server takes longer than this value, the connection is broken. The timeout is specified in seconds and a value of zero means that it is disabled |
Data Types Mapping
Postgres Data Type | CDAP Schema Data Type | Support | Comment |
---|---|---|---|
BIGINT | Schema.Type.LONG | + | |
BIGSERIAL | Schema.Type.LONG | + | Serial is autoincremented |
BIT(N) | Schema.Type.STRING | + | Bit strings are strings of 1's and 0's |
BIT VARYING(N) | Schema.Type.STRING | + | Bit strings are strings of 1's and 0's |
BOOLEAN | Schema.Type.BOOLEAN | + | |
BYTEA | Schema.Type.BYTES | + | |
CHARACTER | Schema.Type.STRING | + | |
CHARACTER VARYING | Schema.Type.STRING | + | |
DOUBLE PRECISION | Schema.Type.DOUBLE | + | |
INTEGER | Schema.Type.INT | + | |
NUMERIC(p, s)/DECIMAL(p, s) | Schema.LogicalType.DECIMAL | + | |
REAL | Schema.Type.FLOAT | + | |
SMALLINT | Schema.Type.INT | + | |
SMALLSERIAL | Schema.Type.INT | + | Serial is autoincremented |
SERIAL | Schema.Type.INT | + | Serial is autoincremented |
TEXT | Schema.Type.STRING | + | |
DATE | Schema.LogicalType.DATE | + | |
TIME [ (P) ] [ WITHOUT TIME ZONE ] | Schema.LogicalType.TIME_MICROS | + | |
TIME [ (P) ] WITH TIME ZONE | Schema.Type.STRING | + | |
TIMESTAMP [ (P) ] [ WITHOUT TIME ZONE ] | Schema.LogicalType.TIMESTAMP_MICROS | + | |
TIMESTAMP [ (P) ] WITH TIME ZONE | Schema.LogicalType.TIMESTAMP_MICROS | + | PostgreSQL converts it to UTC (see the "Time Stamps" section) |
XML | Schema.Type.STRING | + | |
TSQUERY | Schema.Type.STRING | + | |
TSVECTOR | Schema.Type.STRING | + | |
TXID_SNAPSHOT | | - | PostgreSQL-specific, see documentation |
UUID | Schema.Type.STRING | + | |
BOX | Schema.Type.STRING | + | |
CIDR | Schema.Type.STRING | + | |
CIRCLE | Schema.Type.STRING | + | |
INET | Schema.Type.STRING | + | |
INTERVAL | Schema.Type.STRING | + | |
JSON | Schema.Type.STRING | + | |
JSONB | Schema.Type.STRING | + | |
LINE | Schema.Type.STRING | + | |
LSEG | Schema.Type.STRING | + | |
MACADDR | Schema.Type.STRING | + | |
MACADDR8 | Schema.Type.STRING | + | |
MONEY | Schema.Type.STRING | + | |
PATH | Schema.Type.STRING | + | |
PG_LSN | | - | PostgreSQL-specific, see documentation |
POINT | Schema.Type.STRING | + | |
POLYGON | Schema.Type.STRING | + | |
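
The table above could be captured in code as a simple lookup. The sketch below is only an illustration: `PostgresTypeMapping` and `cdapTypeFor` are hypothetical names, the CDAP schema types are represented as plain strings to keep the example free of CDAP dependencies, and only a subset of rows is included. Unsupported types such as TXID_SNAPSHOT and PG_LSN are deliberately left out of the map, so they resolve to an empty result.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup mirroring part of the mapping table; illustration only. */
final class PostgresTypeMapping {
  private static final Map<String, String> MAPPING = new HashMap<>();

  static {
    MAPPING.put("BIGINT", "Schema.Type.LONG");
    MAPPING.put("BIGSERIAL", "Schema.Type.LONG");
    MAPPING.put("BOOLEAN", "Schema.Type.BOOLEAN");
    MAPPING.put("BYTEA", "Schema.Type.BYTES");
    MAPPING.put("DOUBLE PRECISION", "Schema.Type.DOUBLE");
    MAPPING.put("INTEGER", "Schema.Type.INT");
    MAPPING.put("SMALLINT", "Schema.Type.INT");
    MAPPING.put("REAL", "Schema.Type.FLOAT");
    MAPPING.put("TEXT", "Schema.Type.STRING");
    MAPPING.put("JSONB", "Schema.Type.STRING");
    MAPPING.put("UUID", "Schema.Type.STRING");
    MAPPING.put("DATE", "Schema.LogicalType.DATE");
    // TXID_SNAPSHOT and PG_LSN are intentionally absent (marked unsupported).
  }

  /** Returns the CDAP type name for a PostgreSQL type, or empty if unsupported or unknown. */
  static Optional<String> cdapTypeFor(String postgresType) {
    String key = postgresType.trim().toUpperCase(Locale.ROOT);
    return Optional.ofNullable(MAPPING.get(key));
  }
}
```

Returning an empty `Optional` rather than throwing leaves the caller free to decide whether an unsupported column should fail validation or be skipped.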
Approach
Create a module postgresql-plugin in the database-plugins project, reusing existing database-plugins code where possible. Add PostgreSQL-specific properties to the configuration and add support for PostgreSQL-specific data types. Update the UI widget JSON definitions.
Pipeline Samples
...
API changes
Deprecated Programmatic APIs
database-plugins is moved to Data Integrations
UI Impact or Changes
Configurable database properties are presented as named text fields instead of arbitrary key value pairs. PostgreSQL source and sink are separate entries with the PostgreSQL logo in the source and sink lists.
Test Scenarios
TODO
Releases
Release X.Y.Z
Related Work
Future work
Oracle database plugin