Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
The capacity of Enterprise Data Warehouses (EDW) is being exhausted by the tremendous growth in generated data. Traditional ETL processes can be used to offload infrequently used data to a Hadoop cluster. These processes run periodically (weekly, daily) and do a bulk transfer of the data from the source to the destination. However, since these processes run periodically, it takes time for the data to become available in the Hadoop cluster. Also, because they do bulk transfers, they put a heavy load on the source production systems.
Change Data Capture (CDC) can be used instead of traditional ETL for EDW offloads. CDC identifies, captures, and delivers only the changes that are made to the data systems. By processing only changes, CDC makes extracting data from the source systems efficient without putting much load on them. Also, since the changes are streamed continuously, the latency between when a change occurs in the source system and when the corresponding change is available in the target system is greatly reduced.
Goals
- Ability to keep CDAP datasets in sync with the source relational tables. Changes to the data and schema of a source table configured for CDC should be applied to the corresponding CDAP datasets (HBase, Kudu, Hive, etc.).
- Document the setup for CDC.
User Stories
- Breakdown of User-Stories
- User Story #1
- User Story #2
- User Story #3
Design
Here we need to design for the following aspects.
- Configurations required to set up and integrate Oracle GoldenGate (OGG) CDC with the source database. OGG for Big Data can be set up to stream the change capture data to HDFS, HBase, Flume, and Kafka (include format and number of topics for Kafka).
- How to perform the initial load when we configure GoldenGate.
- Transform plugin required in Hydrator to read the change capture data and transform it into a StructuredRecord (a sketch follows this list).
- CDC sinks (Kudu, HBase, Hive) which perform the operations on the destination CDAP dataset (see the sink sketch after this list).
- Propagating schema changes from the source table to the destination CDAP dataset. (How do we keep the Hive schema in sync? See the schema sync sketch below.)
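
For the transform plugin, a minimal sketch is shown below. It assumes the CDAP 4.x `co.cask.cdap.etl.api.Transform` interface, a Kafka source that emits the raw payload in a `message` field, and a GoldenGate-style JSON change record with `op_type`, `table`, and `after` fields; the field names and record layout are assumptions to be confirmed against the actual OGG output format.

```java
// Sketch of a Hydrator Transform plugin that parses a GoldenGate-style JSON
// change record into a normalized StructuredRecord. The input field name
// ("message") and the JSON layout are assumptions, not the actual OGG contract.
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.Transform;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.charset.StandardCharsets;

@Plugin(type = Transform.PLUGIN_TYPE)
@Name("CDCNormalizer")
@Description("Parses change capture records into a normalized StructuredRecord.")
public class CDCNormalizer extends Transform<StructuredRecord, StructuredRecord> {

  // Normalized output: operation type, source table, and the row image as JSON text.
  private static final Schema OUTPUT_SCHEMA = Schema.recordOf(
      "change",
      Schema.Field.of("op_type", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("table", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("row", Schema.nullableOf(Schema.of(Schema.Type.STRING))));

  @Override
  public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) {
    // Assumes the upstream Kafka source emits the raw payload in a "message" field.
    byte[] payload = input.get("message");
    JsonObject change = new JsonParser()
        .parse(new String(payload, StandardCharsets.UTF_8)).getAsJsonObject();

    emitter.emit(StructuredRecord.builder(OUTPUT_SCHEMA)
        .set("op_type", change.get("op_type").getAsString())   // e.g. I / U / D
        .set("table", change.get("table").getAsString())
        .set("row", change.get("after") == null ? null : change.get("after").toString())
        .build());
  }
}
```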
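
For the CDC sinks, the following sketch illustrates how an HBase sink could apply insert, update, and delete operations using the standard HBase client API. The `I`/`U`/`D` operation codes, the column family name, and the row-key handling are assumptions for illustration; the real sink would derive them from the normalized change record and the source table's primary key.

```java
// Sketch of applying a single CDC operation to an HBase table.
// Op codes, column family, and row key derivation are illustrative assumptions.
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCDCWriter {
  private static final byte[] FAMILY = Bytes.toBytes("cdc");

  /** Applies one change record to the destination HBase table. */
  public void applyChange(Table table, String opType, byte[] rowKey,
                          Map<String, byte[]> columns) throws IOException {
    switch (opType) {
      case "I":   // insert
      case "U":   // update: an HBase Put is an upsert, so inserts and updates look the same
        Put put = new Put(rowKey);
        for (Map.Entry<String, byte[]> col : columns.entrySet()) {
          put.addColumn(FAMILY, Bytes.toBytes(col.getKey()), col.getValue());
        }
        table.put(put);
        break;
      case "D":   // delete: remove the entire row for the primary key
        table.delete(new Delete(rowKey));
        break;
      default:
        throw new IllegalArgumentException("Unknown op_type: " + opType);
    }
  }
}
```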
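
For keeping the Hive schema in sync, one possible approach is to issue DDL against HiveServer2 over JDBC whenever a schema change record arrives. The sketch below is a minimal illustration with a placeholder connection URL, table, and column; a real implementation would diff the source schema against the current Hive table schema before generating the `ALTER TABLE` statement.

```java
// Sketch of propagating an added source column to the corresponding Hive table
// via HiveServer2 JDBC. The URL, table, and column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSchemaSync {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint and destination table.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement()) {
      // Apply a column that was added on the source side to the Hive table.
      stmt.execute("ALTER TABLE customers ADD COLUMNS (loyalty_tier STRING)");
    }
  }
}
```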
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success, 404 - When application is not available, 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on Authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures such as YARN or HBase) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3