Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post
Introduction
The capacity of Enterprise Data Warehouses (EDW) is being exhausted by the tremendous growth in generated data. Traditional ETL processes can be used to offload infrequently used data to a Hadoop cluster. These processes run periodically (weekly, daily) and do a bulk transfer of the data from the source to the destination. However, since these processes run periodically, it takes time for the data to become available in the Hadoop cluster. Also, because they do bulk transfers, they put a heavy load on the source production systems.
Change Data Capture (CDC) can be used instead of traditional ETL for EDW offloads. CDC identifies, captures, and delivers only the changes that are made to the data systems. By processing only changes, CDC makes extracting data from the source systems efficient without putting much load on them. Also, since the changes are streamed continuously, the latency between when a change occurs in the source system and when the corresponding change is available in the target system is greatly reduced.
Goals
- Ability to keep CDAP datasets in sync with the source relational tables. Changes to the data and schema of a source table configured for CDC should be applied to the corresponding CDAP datasets (HBase, Kudu, Hive, etc.).
- Document the setup for CDC.
User Stories
- Breakdown of User-Stories
- User Story #1
- User Story #2
- User Story #3
Design
Here we need to design for the following aspects.
- Configurations required to set up and integrate Oracle GoldenGate (OGG) CDC with the source database. OGG for Big Data can be set up to stream the change capture data to HDFS, HBase, Flume, and Kafka (include format and number of topics for Kafka).
- How to perform the initial load when we configure GoldenGate.
- Transform plugin required in Hydrator to read the change capture data and transform it into a StructuredRecord (a sketch follows this list).
- CDC sinks (Kudu, HBase, Hive) which perform the operations on the destination CDAP dataset (see the sink sketch after this list).
- Propagating schema changes from the source table to the destination CDAP dataset. (How do we keep the Hive schema in sync? See the schema sync sketch below.)
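
For the transform plugin, a minimal sketch is shown below. It assumes the CDAP 4.x `co.cask.cdap.etl.api.Transform` interface, a Kafka source that emits the raw payload in a `message` field, and a GoldenGate-style JSON change record with `op_type`, `table`, and `after` fields; the field names and record layout are assumptions to be confirmed against the actual OGG output format.

```java
// Sketch of a Hydrator Transform plugin that parses a GoldenGate-style JSON
// change record into a normalized StructuredRecord. The input field name
// ("message") and the JSON layout are assumptions, not the actual OGG contract.
import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.annotation.Name;
import co.cask.cdap.api.annotation.Plugin;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.data.schema.Schema;
import co.cask.cdap.etl.api.Emitter;
import co.cask.cdap.etl.api.Transform;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.nio.charset.StandardCharsets;

@Plugin(type = Transform.PLUGIN_TYPE)
@Name("CDCNormalizer")
@Description("Parses change capture records into a normalized StructuredRecord.")
public class CDCNormalizer extends Transform<StructuredRecord, StructuredRecord> {

  // Normalized output: operation type, source table, and the row image as JSON text.
  private static final Schema OUTPUT_SCHEMA = Schema.recordOf(
      "change",
      Schema.Field.of("op_type", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("table", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("row", Schema.nullableOf(Schema.of(Schema.Type.STRING))));

  @Override
  public void transform(StructuredRecord input, Emitter<StructuredRecord> emitter) {
    // Assumes the upstream Kafka source emits the raw payload in a "message" field.
    byte[] payload = input.get("message");
    JsonObject change = new JsonParser()
        .parse(new String(payload, StandardCharsets.UTF_8)).getAsJsonObject();

    emitter.emit(StructuredRecord.builder(OUTPUT_SCHEMA)
        .set("op_type", change.get("op_type").getAsString())   // e.g. I / U / D
        .set("table", change.get("table").getAsString())
        .set("row", change.get("after") == null ? null : change.get("after").toString())
        .build());
  }
}
```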
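
For the CDC sinks, the following sketch illustrates how an HBase sink could apply insert, update, and delete operations using the standard HBase client API. The `I`/`U`/`D` operation codes, the column family name, and the row-key handling are assumptions for illustration; the real sink would derive them from the normalized change record and the source table's primary key.

```java
// Sketch of applying a single CDC operation to an HBase table.
// Op codes, column family, and row key derivation are illustrative assumptions.
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCDCWriter {
  private static final byte[] FAMILY = Bytes.toBytes("cdc");

  /** Applies one change record to the destination HBase table. */
  public void applyChange(Table table, String opType, byte[] rowKey,
                          Map<String, byte[]> columns) throws IOException {
    switch (opType) {
      case "I":   // insert
      case "U":   // update: an HBase Put is an upsert, so inserts and updates look the same
        Put put = new Put(rowKey);
        for (Map.Entry<String, byte[]> col : columns.entrySet()) {
          put.addColumn(FAMILY, Bytes.toBytes(col.getKey()), col.getValue());
        }
        table.put(put);
        break;
      case "D":   // delete: remove the entire row for the primary key
        table.delete(new Delete(rowKey));
        break;
      default:
        throw new IllegalArgumentException("Unknown op_type: " + opType);
    }
  }
}
```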
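
For keeping the Hive schema in sync, one possible approach is to issue DDL against HiveServer2 over JDBC whenever a schema change record arrives. The sketch below is a minimal illustration with a placeholder connection URL, table, and column; a real implementation would diff the source schema against the current Hive table schema before generating the `ALTER TABLE` statement.

```java
// Sketch of propagating an added source column to the corresponding Hive table
// via HiveServer2 JDBC. The URL, table, and column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveSchemaSync {
  public static void main(String[] args) throws Exception {
    // Hypothetical HiveServer2 endpoint and destination table.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement()) {
      // Apply a column that was added on the source side to the Hive table.
      stmt.execute("ALTER TABLE customers ADD COLUMNS (loyalty_tier STRING)");
    }
  }
}
```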
Approach
Approach #1
Approach #2
API changes
New Programmatic APIs
New Java APIs introduced (both user facing and internal)
Deprecated Programmatic APIs
New REST APIs
Path | Method | Description | Response Code | Response |
---|---|---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application | 200 - On success, 404 - When application is not available, 500 - Any internal errors | |
Deprecated REST API
Path | Method | Description |
---|---|---|
/v3/apps/<app-id> | GET | Returns the application spec for a given application |
CLI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
UI Impact or Changes
- Impact #1
- Impact #2
- Impact #3
Security Impact
What is the impact on Authorization, and how does the design take care of this aspect?
Impact on Infrastructure Outages
System behavior (if applicable, document the impact of downstream component failures such as YARN or HBase) and how the design takes care of these aspects.
Test Scenarios
Test ID | Test Description | Expected Results |
---|---|---|
Releases
Release X.Y.Z
Release X.Y.Z
Related Work
- Work #1
- Work #2
- Work #3