Data Replication

Goals

 

Reference

 

Requirements

  • Support Active-Active and Active-Passive configurations
  • Provide a tool or status indicator showing whether replication is complete or in a safe state
  • Support replication of HBase DDL to the remote cluster (dynamic creation of tables)
  • Handle Kafka offset management across multiple clusters (a shortcoming of MirrorMaker)
  • Support replication of routing configuration stored in ZooKeeper to the remote cluster

Replication by component:

  1. HDFS:

    1. Hadoop DistCp is a tool for large inter-/intra-cluster copying. It uses MapReduce for its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
    2. Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html

    3. Cloudera Distcp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html

    4. HortonWorks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html

    5. How do we copy data iteratively? What is the unit of data (the quantum) for each iterative copy? How do we define replication as complete? (See the sketch at the end of this HDFS subsection.)

      - DistCp has an option to copy an explicit list of files; could we copy individual files at fixed time boundaries, e.g. at the end of each day?
      - DistCp also has an -append option that appends to the destination file when the source file is larger than the destination, so only the new bytes are sent.
      - There is also a -diff option that copies only the differences between two HDFS snapshots.
      - DistCp performance analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/

    6. Cloudera Manager Replication is built on a hardened version of distcp. It uses the scalability and availability of MapReduce and YARN to copy files in parallel, using a specialized MapReduce job or YARN application that runs diffs and transfers only changed files from each mapper to the replica side. Files are selected for copying based on their size and checksums. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_bdr_about.html

    7. Cloudera Manager provides additional capabilities on top of DistCp: https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/

      Cloudera has introduced quite a few enhancements on top of Hadoop’s native tooling (such as DistCp):

      • Consistency guarantees via HDFS snapshots

      • Scheduled execution support

      • Support for multiple Kerberos realms

      • The following three can also be achieved with DistCp directly:

        • Constant-space parallelized copy listing (DistCp has a similar enhancement but does not optimize for disk space or memory consumption)

        • Replication between clusters running different Hadoop versions (including different major versions, e.g. CDH4 and CDH5)

        • Selective path exclusion
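
    A minimal sketch of the iterative-copy idea from item 5, built on HDFS snapshots and distcp -update -diff. It assumes snapshots are enabled on the source directory, that the target was seeded with a full copy at the previous baseline snapshot (same snapshot name, no local changes since), and that the namenode hostnames and paths below are placeholders:

import subprocess
from datetime import datetime, timezone

SRC = "hdfs://prod-nn:8020/data/events"   # placeholder source path
DST = "hdfs://dr-nn:8020/data/events"     # placeholder target path

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def incremental_copy(prev_snapshot):
    """Snapshot the source and ship only the delta since prev_snapshot."""
    new_snapshot = "s" + datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    # Freeze a consistent view of the source directory.
    run(["hdfs", "dfs", "-createSnapshot", SRC, new_snapshot])
    # Copy only what changed between the two snapshots; -update skips files
    # already present and identical on the target. Requires that the target
    # also holds the prev_snapshot baseline and has not been modified since.
    run(["hadoop", "distcp", "-update", "-diff", prev_snapshot, new_snapshot, SRC, DST])
    # Record the same baseline on the target for the next round.
    run(["hdfs", "dfs", "-createSnapshot", DST, new_snapshot])
    # "Replication complete" for this interval == distcp exited 0.
    return new_snapshot

if __name__ == "__main__":
    incremental_copy("s20240101000000")   # previous baseline snapshot name

    For append-only data (e.g. logs), distcp -update -append is an alternative that ships only the new bytes of files that have grown since the last run.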

  2. HBase:
    1. HBase supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication

      How do we check that replication is complete when the customer is ready to switch over to the remote cluster?

      - Check whether this replication metric can be used to determine the above (see the sketch below):
        - source.sizeOfLogQueue: the number of WALs still to be processed at the replication source (excludes the WAL currently being processed).
    2. HBase’s CopyTable utility provides an efficient and scalable copy to a destination HBase cluster while keeping the current table online (see the sketch below).
      - CopyTable usage steps from Pivotal: https://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
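
      A sketch of driving CopyTable for a one-off or windowed copy to the remote cluster. It assumes the destination table already exists there with the same column families; the table name, ZooKeeper quorum, and time window below are placeholders:

import subprocess
import time

TABLE = "usertable"                              # placeholder table name
PEER_ZK = "zk1.dr,zk2.dr,zk3.dr:2181:/hbase"     # destination cluster ZK quorum:port:znode

def copy_table(start_ms, end_ms):
    """Copy cells with timestamps in [start_ms, end_ms) to the peer cluster."""
    cmd = [
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        "--peer.adr=" + PEER_ZK,          # where to write the copy
        "--starttime=" + str(start_ms),   # optional time window: lets us run
        "--endtime=" + str(end_ms),       # repeated catch-up passes
        TABLE,
    ]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    now_ms = int(time.time() * 1000)
    copy_table(now_ms - 24 * 3600 * 1000, now_ms)   # e.g. re-copy the last 24 hours

      Using --starttime/--endtime makes it possible to run repeated catch-up passes instead of one large copy.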
  3. Kafka:
  4. FileSets

 

 

Challenges

 

Open Questions