CDC Solution Enhancements
- Sree Raman
- Maksym Lozbin
Introduction
CDAP offers change data capture (CDC) via three different approaches:
- GoldenGate for Oracle
- LogMiner for Oracle
- Change Tracking for SQL Server
All these CDC mechanisms are supported via real-time data pipelines, and the plugins are available from the Hub. The CDC solution currently runs on Spark 1.x and has experimental support for BigTable.
Use case(s)
The scope of work involves making the CDC solution work with Spark 2.x and enabling writes to BigTable. Performance numbers (throughput and latency) should be published for these destinations with all three CDC approaches.
- ETL developers should be able to set up real-time pipelines that write data to BigTable/BigQuery
- Users should get field-level lineage for the source and sink being used
- Reference documentation should be updated for the CDC plugins to account for the changes
- The solution should run with all versions of Spark 2.x
- Integration tests for the specific plugins should be added in the test repos
Deliverables
- Source code in the cask-solutions/cdc repo
- Performance tests for the three approaches with BigTable
- Integration test code
- Relevant documentation in the source repo and the reference documentation section in the plugin
Relevant links
- Existing CDC plugin code: https://github.com/cask-solutions/cdc
- Experimental CDC for BigTable in a branch: https://github.com/cask-solutions/cdc/tree/bigtable-cdc-sink
- Field level lineage: https://docs.cdap.io/cdap/5.1.0-SNAPSHOT/en/developer-manual/metadata/field-lineage.html
Plugin Type
- Batch Source
- Batch Sink
- Real-time Source
- Real-time Sink
- Action
- Post-Run Action
- Aggregate
- Join
- Spark Model
- Spark Compute
Configurables
CDCBigTable Sink Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Reference Name | String | Name used to uniquely identify this sink for lineage and metadata | Required |
Instance Id | String | BigTable instance id. | Required |
Project Id | String | Google Cloud Project ID, which uniquely identifies a project. If not specified, Project ID will be automatically read from the cluster environment. (Macro-enabled) | Optional |
Service Account File Path | String | Path on the local file system of the service account key used for authorization. If the plugin is run on a Google Cloud Dataproc cluster, the service account key does not need to be provided and can be set to 'auto-detect'. When running on other clusters, the file must be present on every node in the cluster. See Google's documentation on Service account credentials for details. (Macro-enabled) | Optional (default: null) |
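The 'auto-detect' behavior described above can be sketched as follows. This is an illustration only, not the plugin's actual code: the function name and the dict-based config are assumptions, and the real sink delegates credential discovery to the Google Cloud client library.

```python
from typing import Optional

def resolve_service_account(path: Optional[str]) -> Optional[str]:
    """Return a key file path, or None to signal that credentials
    should be auto-detected from the cluster environment
    (e.g. on a Google Cloud Dataproc cluster)."""
    if path is None or path == "auto-detect":
        return None  # let the client library pick up ambient credentials
    return path
```

Returning None rather than a sentinel string keeps the downstream client construction simple: a None path means "use Application Default Credentials".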
CDCOracleLogMiner Source Properties
User Facing Name | Type | Description | Constraints |
---|---|---|---|
Reference Name | String | Name used to uniquely identify this source for lineage | Required |
Host | String | Oracle DB host | Optional (default: localhost) |
Port | Number | Port where Oracle is running | Optional (default: 1521) |
Database | String | Database name to connect | Required |
Username | String | DB username | Required |
Password | String | User password | Required |
Connection Arguments | Key-value | A list of arbitrary string tag/value pairs passed as connection arguments; see https://docs.oracle.com/cd/B28359_01/win.111/b28375/featConnecting.htm for the list of supported properties | Optional |
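The Host, Port, and Database properties above would typically be assembled into an Oracle thin-driver JDBC URL. A minimal sketch, assuming the SID-style URL form (a service-name form, `jdbc:oracle:thin:@//host:port/service`, also exists); the helper itself is illustrative, not the plugin's actual code:

```python
def oracle_jdbc_url(host: str = "localhost", port: int = 1521,
                    database: str = None) -> str:
    """Build an Oracle thin-driver JDBC URL (SID form) from the
    source properties, applying the documented defaults."""
    if not database:
        raise ValueError("Database is a required property")
    return f"jdbc:oracle:thin:@{host}:{port}:{database}"
```

The defaults mirror the constraints column: Host and Port fall back to `localhost:1521`, while a missing Database is rejected.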
Design
Approach(es)
Iteration 1: Support Spark 2, update documentation, add lineage.
- Make credentials optional for BigTable Sink.
- Add config, schema, and third-party application health validation to the "configurePipeline" step for all CDC plugins. To validate schemas properly, source and sink plugins should use a single schema for messages.
- Create a new core library for Spark 2 plugins, an analogue of "cdap-etl-api-spark".
- Migrate CDC plugins to CDAP core v6.x.x.
- Update documentation to be consistent with the other GCP plugins.
- Write documentation for CDCKudu Sink, CDCDatabase Source, CTSQLServer Source.
- Update UI widgets for all CDC plugins to match properties and schema defined in documentation.
- Try to support both Spark 1 and Spark 2 but drop Spark 1 support if necessary.
Iteration 2: Implement LogMiner Source Plugin.
- Implement the LogMiner Source Plugin for Oracle (CDCOracleLogMiner Source). The plugin should use JDBC to retrieve updates from LogMiner.
Iteration 3: Integration tests.
- Implement integration tests for all CDC plugins. Tests should pass on Spark 2 and use a local environment where possible; tests against paid services should run only with a dedicated Maven profile. All tests should run in both local and distributed environments. Provide a docker-compose file for local environment setup.
Iteration 4: Performance tests.
- Implement performance tests for the three approaches with BigTable.
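The performance tests above should report throughput and latency, as called out in the use cases. A minimal sketch of the two metrics a run could publish; the harness itself (pipeline setup, load generation) is out of scope here and the function names are assumptions:

```python
from typing import List

def throughput(records: int, seconds: float) -> float:
    """Records processed per second over a measurement window."""
    return records / seconds

def latency_percentile(latencies_ms: List[float], p: float) -> float:
    """Nearest-rank percentile of per-record latency samples (ms)."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]
```

Publishing a percentile (e.g. p95) alongside mean throughput avoids hiding tail latency behind an average.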
Future iterations (not in scope of current changes):
- Add lineage support for real-time plugins. Requires CDAP core modules modification (co.cask.cdap.etl.spark.streaming.DefaultStreamingContext).
- Register datasets of source CDC plugins for lineage view.
- Register fields of all CDC plugins for lineage view.
Message Schema
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "op_type", "type": [ "string", "null" ] },
    { "name": "table", "type": "string" },
    { "name": "primary_keys", "type": [ { "type": "array", "items": "string" }, "null" ] },
    { "name": "schema", "type": [ "string", "null" ] },
    { "name": "change", "type": [ { "type": "map", "keys": "string", "values": [ "null", "boolean", "int", "long", "float", "double", "bytes", "string" ] }, "null" ] }
  ]
}
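An illustrative construction of a change message matching the record schema above. The field names come from the schema; the plain dict is a stand-in for CDAP's StructuredRecord, and the helper is a sketch, not part of the plugins:

```python
# Fields of the "etlSchemaBody" record; only "table" is non-nullable.
REQUIRED_FIELDS = {"op_type", "table", "primary_keys", "schema", "change"}

def make_change(op_type, table, primary_keys, schema, change):
    """Build a change message dict mirroring the shared CDC schema."""
    if table is None:
        raise ValueError("'table' is non-nullable in the schema")
    return {"op_type": op_type, "table": table,
            "primary_keys": primary_keys, "schema": schema, "change": change}

# An UPDATE event: "change" maps column names to new scalar values.
msg = make_change("UPDATE", "EMPLOYEES", ["ID"], None, {"ID": 7, "NAME": "Ada"})
```

Using a single shared schema like this is what allows the "configurePipeline" validation described in Iteration 1 to check source and sink compatibility.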
Limitation(s)
- CDC plugins may not be compatible with Spark 2.
Future Work
- Oracle LogMiner Source Plugin
- Integration tests for CDC plugins
- Performance tests for CDC plugins
Test Case(s)
- Track changes from GoldenGate for Oracle (create table, insert row, update row, delete row).
- Track changes from LogMiner for Oracle (create table, insert row, update row, delete row).
- Track changes from SQL Server (create table, insert row, update row, delete row).
- Write changes to Apache Kudu (create table, insert row, update row, delete row).
- Write changes to HBase (create table, insert row, update row, delete row).
- Write changes to BigTable (create table, insert row, update row, delete row).
Sample Pipeline
Pipeline #1 (SQL Server → BigTable)
TODO
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature