Checklist

User Stories Documented
User Stories Reviewed
Design Reviewed
APIs reviewed
Release priorities assigned
Test cases reviewed
Blog post

Introduction

The capacity of Enterprise Data Warehouses (EDW) is being exhausted by the tremendous growth in generated data. Traditional ETL processes can be used to offload infrequently used data to a Hadoop cluster. These processes run periodically (weekly or daily) and transfer data in bulk from the source to the destination. This has two drawbacks. Because the processes run periodically, it takes time for the data to become available in the Hadoop cluster. And because they transfer data in bulk, they put a heavy load on the source production systems.

Change Data Capture (CDC) can be used instead of traditional ETL for EDW offloads. CDC identifies, captures, and delivers only the changes that are made to the data systems. By processing only changes, CDC extracts data from the source systems efficiently without putting much load on them. Because the changes are streamed continuously, the latency between a change occurring in the source system and the corresponding change becoming available in the target system is also greatly reduced.

Goals

Keep CDAP datasets in sync with the source relational tables. Changes to the data and schema of a source table configured for CDC should be applied to the CDAP datasets (HBase, Kudu, Hive, etc.).
Document the setup for CDC.

User Stories

Breakdown of User-Stories
User Story #1
User Story #2
User Story #3

Design

The design needs to cover the following aspects:

Configuration required to set up and integrate Oracle Golden Gate (OGG) CDC with the source database. OGG for Big Data can be set up to stream the captured change data to HDFS, HBase, Flume, and Kafka.
Performing the initial load when Golden Gate is configured for existing tables.
Hydrator plugins required.
Propagating schema changes from the source table to the destination CDAP dataset. (How to keep the Hive Schema in sync?)

Approach

Extracting Change Data from Source Database

OGG will be configured to stream the change data to a Kafka topic, say "oggdata".
Both schema changes and actual data changes will be sent to the same topic.
The Kafka topic "oggdata" will be created with a single partition. This ensures that changes from OGG are received in order, since Kafka does not guarantee ordering across partitions.
DML changes will be streamed in the wrapped Avro binary format, using the following wrapper schema:
```
{
  "type" : "record",
  "name" : "generic_wrapper",
  "namespace" : "oracle.goldengate",
  "fields" : [ {
    "name" : "table_name",
    "type" : "string"
  }, {
    "name" : "schema_hash",
    "type" : "int"
  }, {
    "name" : "payload",
    "type" : "bytes"
  } ]
}
```
- table_name: the name of the source database table to which the current change record belongs.
- schema_hash: the fingerprint of the Avro schema with which the message was generated.
- payload: the wrapped Avro message.
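
As an illustration of how a consumer could unpack this wrapper, here is a minimal Python sketch (Python is used only for brevity; it is not the plugin code). It relies on the Avro binary encoding rules: a record is the concatenation of its fields in schema order, an int is a zigzag-encoded varint, and string and bytes values are a length followed by the raw bytes. The function names are hypothetical, and the sketch assumes the Kafka message body is the bare wrapper record with no extra framing.

```python
import io


def read_long(buf: io.BytesIO) -> int:
    """Read one Avro int/long: a little-endian base-128 varint, zigzag-encoded."""
    raw = 0
    shift = 0
    while True:
        chunk = buf.read(1)
        if not chunk:
            raise EOFError("truncated Avro varint")
        raw |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte of the varint
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(buf: io.BytesIO) -> bytes:
    """Read an Avro bytes/string value: a length followed by that many bytes."""
    length = read_long(buf)
    if length < 0:
        raise ValueError("negative length in Avro bytes/string")
    data = buf.read(length)
    if len(data) != length:
        raise EOFError("truncated Avro bytes/string")
    return data


def decode_generic_wrapper(message: bytes) -> dict:
    """Decode the three generic_wrapper fields in schema order."""
    buf = io.BytesIO(message)
    table_name = read_bytes(buf).decode("utf-8")
    schema_hash = read_long(buf)
    payload = read_bytes(buf)  # still Avro-encoded with the table's own schema
    return {"table_name": table_name, "schema_hash": schema_hash, "payload": payload}


# Hand-built example: "GG.CUSTORD" (10 bytes), schema_hash 1234, 2-byte payload.
example = b"\x14GG.CUSTORD\xa4\x13\x04\x02\x03"
wrapper = decode_generic_wrapper(example)
```

The payload stays opaque at this stage; it can only be decoded once the table schema identified by table_name and schema_hash has been looked up.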

The schema will be propagated in JSON-encoded Avro format. The following is a sample schema:

```
{
  "type" : "record",
  "name" : "CUSTORD",
  "namespace" : "GG",
  "fields" : [ {
    "name" : "table",
    "type" : "string"
  }, {
    "name" : "op_type",
    "type" : "string"
  }, {
    "name" : "op_ts",
    "type" : "string"
  }, {
    "name" : "current_ts",
    "type" : "string"
  }, {
    "name" : "pos",
    "type" : "string"
  }, {
    "name" : "primary_keys",
    "type" : {
      "type" : "array",
      "items" : "string"
    }
  }, {
    "name" : "tokens",
    "type" : {
      "type" : "map",
      "values" : "string"
    },
    "default" : { }
  }, {
    "name" : "before",
    "type" : [ "null", {
      "type" : "record",
      "name" : "columns",
      "fields" : [ {
        "name" : "CUST_CODE",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "CUST_CODE_isMissing",
        "type" : "boolean"
      }, {
        "name" : "ORDER_DATE",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "ORDER_DATE_isMissing",
        "type" : "boolean"
      }, {
        "name" : "PRODUCT_CODE",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "PRODUCT_CODE_isMissing",
        "type" : "boolean"
      }, {
        "name" : "ORDER_ID",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "ORDER_ID_isMissing",
        "type" : "boolean"
      }, {
        "name" : "PRODUCT_PRICE",
        "type" : [ "null", "double" ],
        "default" : null
      }, {
        "name" : "PRODUCT_PRICE_isMissing",
        "type" : "boolean"
      }, {
        "name" : "PRODUCT_AMOUNT",
        "type" : [ "null", "double" ],
        "default" : null
      }, {
        "name" : "PRODUCT_AMOUNT_isMissing",
        "type" : "boolean"
      }, {
        "name" : "TRANSACTION_ID",
        "type" : [ "null", "string" ],
        "default" : null
      }, {
        "name" : "TRANSACTION_ID_isMissing",
        "type" : "boolean"
      } ]
    } ],
    "default" : null
  }, {
    "name" : "after",
    "type" : [ "null", "columns" ],
    "default" : null
  } ]
}
```
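
Since the destination dataset schema has to follow the source table, one useful step is deriving the source column list from a schema message like the one above. The following Python sketch does this with the standard json module only. It assumes the layout shown in the sample (a "before" field whose type is a union of "null" and a "columns" record, with a companion <column>_isMissing flag per column); the function name and the trimmed two-column schema are made up for illustration.

```python
import json

# Trimmed-down version of the sample schema, keeping only two columns.
SAMPLE_SCHEMA = """
{
  "type": "record", "name": "CUSTORD", "namespace": "GG",
  "fields": [
    {"name": "op_type", "type": "string"},
    {"name": "before", "type": ["null", {
      "type": "record", "name": "columns",
      "fields": [
        {"name": "CUST_CODE", "type": ["null", "string"], "default": null},
        {"name": "CUST_CODE_isMissing", "type": "boolean"},
        {"name": "PRODUCT_PRICE", "type": ["null", "double"], "default": null},
        {"name": "PRODUCT_PRICE_isMissing", "type": "boolean"}
      ]}], "default": null},
    {"name": "after", "type": ["null", "columns"], "default": null}
  ]
}
"""


def source_columns(schema_json: str) -> dict:
    """Map each source column name to its non-null Avro type."""
    schema = json.loads(schema_json)
    before = next(f for f in schema["fields"] if f["name"] == "before")
    # The "before" type is a union; pick the record branch out of it.
    columns_record = next(t for t in before["type"] if isinstance(t, dict))
    columns = {}
    for field in columns_record["fields"]:
        if field["name"].endswith("_isMissing"):
            continue  # bookkeeping flag, not a source column
        avro_type = field["type"]
        if isinstance(avro_type, list):  # nullable column: ["null", "<type>"]
            avro_type = next(t for t in avro_type if t != "null")
        columns[field["name"]] = avro_type
    return columns


columns = source_columns(SAMPLE_SCHEMA)
```

Comparing the result for two successive schema versions of the same table would show which columns were added, dropped, or retyped.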

Hydrator Plugins

A CDAP streaming pipeline will be responsible for reading the change data from Kafka and updating the CDAP datasets (Kudu, HBase, Hive, etc.). The pipeline consists of the following stages:
Golden Gate Source is a simple realtime Kafka streaming source. It forwards the messages to the CDC Normalizer as byte arrays.
CDC Normalizer is responsible for converting the byte arrays into StructuredRecords. A received message contains either an Avro schema (DDL operation) or a wrapped Avro message (DML operation). The output schema of the CDC Normalizer is a union of String (representing the schema) and StructuredRecord (representing the wrapped Avro message).
CDC Sink accepts, as configuration, the name of the table to be used as the schema registry. The schema registry table can be shared between multiple pipelines.
The row key for the schema registry table is <source_table_name_with_namespace>:<schema_fingerprint>, and the value is the JSON-formatted Avro schema. The schema fingerprint is computed using the SchemaNormalization.fingerprint64 method.
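
To make the row key scheme concrete, here is a small Python sketch of a schema registry keyed as described. The fingerprint function follows the CRC-64-AVRO algorithm published in the Avro specification, which is what SchemaNormalization.fingerprint64 computes over a byte array. Several things are assumptions rather than part of the design: the registry is a plain dict standing in for the table, the fingerprint is taken over the UTF-8 bytes of the schema text as received (no canonicalization), and the fingerprint is written into the key as an unsigned decimal number, whereas the Java method returns a signed long.

```python
EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed from the Avro specification


def _build_fp_table() -> list:
    """Precompute the 256-entry lookup table used by CRC-64-AVRO."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
        table.append(fp)
    return table


FP_TABLE = _build_fp_table()


def fingerprint64(data: bytes) -> int:
    """CRC-64-AVRO fingerprint of data, as an unsigned 64-bit integer."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def registry_key(table_name: str, schema_json: str) -> str:
    """Build <source_table_name_with_namespace>:<schema_fingerprint>."""
    return f"{table_name}:{fingerprint64(schema_json.encode('utf-8'))}"


def register_schema(registry: dict, table_name: str, schema_json: str) -> str:
    """Store the schema under its row key and return the key."""
    key = registry_key(table_name, schema_json)
    registry[key] = schema_json
    return key


registry = {}  # hypothetical stand-in for the schema registry table
schema_v1 = '{"type":"record","name":"CUSTORD","namespace":"GG","fields":[]}'
key_v1 = register_schema(registry, "GG.CUSTORD", schema_v1)
```

Because the fingerprint is part of the key, a new version of a table's schema lands in a new row and older versions remain available for messages produced before the change.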
CDC Sink -

- Receives either a JSON-formatted Avro schema or a wrapped Avro message containing the change data.
- If the received message is an Avro schema, inserts it into the schema registry table.
- If the received message is a wrapped Avro message, its schema will be:
```
{
  "type" : "record",
  "name" : "generic_wrapper",
  "namespace" : "oracle.goldengate",
  "fields" : [ {
    "name" : "table_name",
    "type" : "string"
  }, {
    "name" : "schema_hash",
    "type" : "int"
  }, {
    "name" : "payload",
    "type" : "bytes"
  } ]
}
```
  Use <namespace>:<table_name>:<schema_hash> as the row key to look up the associated Avro schema in the schema registry, and use that schema to deserialize the payload.
- The deserialized Avro message contains metadata about the operation, such as the operation type (op_type: INSERT, UPDATE, DELETE) and the operation timestamp (op_ts), along with the actual change (columns and their values), which can be used to perform the corresponding operation on the CDAP dataset.
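
The last step above can be sketched as follows in Python. This is only an illustration under stated assumptions: the dataset is an in-memory dict keyed by the primary key values, the change record is already deserialized into a dict with the field names of the sample schema, op_type carries the literal values INSERT, UPDATE and DELETE, a true <column>_isMissing flag means the column value was not supplied and should be left untouched, and primary key columns are always present and never updated. The function names are hypothetical.

```python
def present_columns(image: dict) -> dict:
    """Drop the _isMissing flags and any column flagged as missing."""
    return {
        name: value
        for name, value in image.items()
        if not name.endswith("_isMissing")
        and not image.get(name + "_isMissing", False)
    }


def row_key(image: dict, primary_keys: list) -> tuple:
    """Build the dataset key from the primary key column values."""
    return tuple(image[name] for name in primary_keys)


def apply_change(dataset: dict, change: dict) -> None:
    """Apply one deserialized change record to the in-memory dataset."""
    op_type = change["op_type"]
    primary_keys = change["primary_keys"]
    if op_type == "INSERT":
        after = present_columns(change["after"])
        dataset[row_key(after, primary_keys)] = after
    elif op_type == "UPDATE":
        after = present_columns(change["after"])
        key = row_key(after, primary_keys)
        merged = dict(dataset.get(key, {}))
        merged.update(after)  # columns flagged missing keep their old value
        dataset[key] = merged
    elif op_type == "DELETE":
        before = present_columns(change["before"])
        dataset.pop(row_key(before, primary_keys), None)
    else:
        raise ValueError(f"unsupported op_type: {op_type}")


dataset = {}
apply_change(dataset, {
    "op_type": "INSERT", "primary_keys": ["ORDER_ID"], "before": None,
    "after": {"ORDER_ID": "100", "ORDER_ID_isMissing": False,
              "PRODUCT_PRICE": 9.5, "PRODUCT_PRICE_isMissing": False},
})
apply_change(dataset, {
    "op_type": "UPDATE", "primary_keys": ["ORDER_ID"], "before": None,
    "after": {"ORDER_ID": "100", "ORDER_ID_isMissing": False,
              "PRODUCT_PRICE": None, "PRODUCT_PRICE_isMissing": True,
              "PRODUCT_AMOUNT": 3.0, "PRODUCT_AMOUNT_isMissing": False},
})
after_update = dict(dataset[("100",)])
apply_change(dataset, {
    "op_type": "DELETE", "primary_keys": ["ORDER_ID"], "after": None,
    "before": {"ORDER_ID": "100", "ORDER_ID_isMissing": False},
})
```

Truncate is deliberately not handled here, since its behavior is still listed as an open question.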

Updating the CDAP dataset schema (TBD)

Open Questions

How should the initial load be performed when Golden Gate is configured for an existing table?
How should errors in propagating changes from Golden Gate to Kafka be handled?
How does truncate work?

API changes

New Programmatic APIs

New Java APIs introduced (both user facing and internal)

Deprecated Programmatic APIs

New REST APIs

Path	Method	Description	Response Code	Response
/v3/apps/<app-id>	GET	Returns the application spec for a given application	200 - On success; 404 - When application is not available; 500 - Any internal errors	

Deprecated REST API

Path	Method	Description
/v3/apps/<app-id>	GET	Returns the application spec for a given application

CLI Impact or Changes

Impact #1
Impact #2
Impact #3

UI Impact or Changes

Impact #1
Impact #2
Impact #3

Security Impact

What is the impact on authorization, and how does the design take care of this aspect?

Impact on Infrastructure Outages

System behavior (if applicable, document the impact of downstream component failures [YARN, HBase, etc.]) and how the design takes care of these aspects.

Test Scenarios

Test ID	Test Description	Expected Results

Releases

Release X.Y.Z

Release X.Y.Z

Related Work

Future work

Change Data Capture (WIP)
