Data Replication
Goals
Reference
Requirements
- Support Active-Active and Active-Passive configurations
- Provide a tool or status indicator that shows whether replication is complete or in a safe state
- Support replication of HBase DDL to the remote cluster, i.e. dynamic creation of tables
- Handle Kafka offset management across multiple clusters (a shortcoming of MirrorMaker)
- Support replication of the routing configuration stored in ZooKeeper to the remote cluster
Replication by component:
HDFS:
- Hadoop Distcp is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Hadoop DistCp (distributed copy) documentation: http://hadoop.apache.org/docs/r1.2.1/distcp2.html
Cloudera DistCp guide: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html
Hortonworks DistCp guide: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html
Open questions: how do we copy data iteratively? What is the data quantum for an iterative copy? How do we define "replication complete"?
- distcp can take an explicit list of source files (-f), so we could copy individual files at fixed time boundaries, e.g. at the end of each day.
- distcp also offers an -append option (used with -update) that appends to an existing destination file when the source file has grown, transferring only the new bytes.
- There is also a -diff option (used with -update) that copies only the differences between two HDFS snapshots of the source (see the sketch after this list).
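The sketch below is one way to answer the iterative-copy questions above: it drives DistCp programmatically through its Tool interface (equivalent to the hadoop distcp CLI) in the two incremental modes mentioned. The cluster URIs, paths and snapshot names are placeholders, and the -diff run assumes the target already matches the source's previous snapshot (the usual precondition for snapshot-diff copies); treat it as a minimal sketch, not a finished job.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

/** Sketch of a daily incremental HDFS copy driven through the DistCp Tool API. */
public class DailyHdfsSync {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String src = "hdfs://primary-nn:8020/data/events";   // placeholder cluster URIs and paths
    String dst = "hdfs://dr-nn:8020/data/events";

    // Variant 1: append-only data. -update skips files that are already identical,
    // -append ships only the new tail of files that have grown since the last run.
    int rc = ToolRunner.run(conf, new DistCp(conf, null),
        new String[] {"-update", "-append", src, dst});

    // Variant 2: snapshot diff. Take a snapshot per run (the source directory must be
    // snapshottable) and copy only the delta between the previous and current snapshot.
    // "s-prev" stands for the snapshot name recorded by the previous run (placeholder);
    // the target must still match that snapshot for -diff to be valid.
    FileSystem srcFs = new Path(src).getFileSystem(conf);
    String today = "s-" + java.time.LocalDate.now();      // e.g. s-2016-09-01
    srcFs.createSnapshot(new Path(src), today);
    rc = ToolRunner.run(conf, new DistCp(conf, null),
        new String[] {"-update", "-diff", "s-prev", today, src, dst});

    // Exit code 0 means the DistCp job finished successfully, i.e. this quantum
    // (one day's delta) is fully replicated; that is one workable definition of
    // "replication complete" for HDFS.
    System.exit(rc);
  }
}
```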
- distcp performance analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/
Cloudera Manager Replication is built on a hardened version of distcp. It uses the scalability and availability of MapReduce and YARN to copy files in parallel, using a specialized MapReduce job or YARN application that runs diffs and transfers only changed files from each mapper to the replica side. Files are selected for copying based on their size and checksums. http://www.cloudera.com/documentation/enterprise/latest/topics/cm_bdr_about.html
Cloudera Manager provides some additional capabilities on top of distcp; they are described at: https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/
Cloudera has introduced quite a few enhancements on top of Hadoop's native tooling (such as distcp):
Consistency guarantees via HDFS snapshots
Scheduled execution support
Support for multiple Kerberos realms
The following three can also be achieved with plain distcp:
Constant-space parallelized copy listing (distcp has a similar enhancement but does not optimize for disk space or memory consumption)
Replication between clusters on different Hadoop versions (including different major versions, e.g. CDH4 to CDH5)
Selective path exclusion (see the sketch after this list)
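As a rough illustration of the last two items, the sketch below drives plain distcp with an exclusion-filter file and reads the source over webhdfs:// so the copy is not tied to one RPC version. The -filters flag exists only in newer distcp releases, and the hosts, paths and patterns here are placeholders; verify against the distcp version actually deployed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Sketch: selective path exclusion and cross-version copy with plain distcp. */
public class FilteredCrossVersionCopy {
  public static void main(String[] args) throws Exception {
    // -filters takes a local file of regexes, one per line; matching source
    // paths are skipped (flag available in newer distcp releases).
    java.nio.file.Path filters = Paths.get("/tmp/distcp-filters.txt");
    Files.write(filters, Arrays.asList(".*\\.tmp$", ".*/_temporary/.*"));

    // For clusters on different major versions, run distcp on the newer
    // (destination) cluster and read the source over webhdfs:// so the copy
    // does not depend on matching HDFS RPC versions. Hosts are placeholders.
    String src = "webhdfs://old-nn:50070/data/events";
    String dst = "hdfs://new-nn:8020/data/events";

    Configuration conf = new Configuration();
    int rc = ToolRunner.run(conf, new DistCp(conf, null),
        new String[] {"-update", "-filters", filters.toString(), src, dst});
    System.exit(rc);
  }
}
```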
HBase:
- HBase supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
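For reference, a minimal sketch of wiring up cluster replication, assuming the HBase 2.x client Admin API (older releases expose the same operations through ReplicationAdmin and the add_peer shell command). The peer id, ZooKeeper cluster key, table and column family names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;
import org.apache.hadoop.hbase.util.Bytes;

/** Sketch: register a replication peer and mark a column family for replication. */
public class SetupHBaseReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {

      // Peer id and cluster key (remote ZK quorum:port:znode) are placeholders.
      ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
          .setClusterKey("dr-zk1,dr-zk2,dr-zk3:2181:/hbase")
          .build();
      admin.addReplicationPeer("dr", peer);

      // Only column families with REPLICATION_SCOPE_GLOBAL are shipped to peers.
      TableDescriptorBuilder table =
          TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
              .setColumnFamily(ColumnFamilyDescriptorBuilder
                  .newBuilder(Bytes.toBytes("d"))
                  .setScope(HConstants.REPLICATION_SCOPE_GLOBAL)
                  .build());
      admin.createTable(table.build());
    }
  }
}
```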
How do we check that replication is complete when the customer is ready to switch over to the other cluster? Check whether this replication metric can be used to determine that:
- source.sizeOfLogQueue: the number of WALs still to be processed at the replication source (excluding the one currently being processed). The sketch below polls this metric.
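One rough way to turn that metric into a switch-over check is to poll it on every region server and require the summed backlog to stay at zero. The sketch below reads it from the region server /jmx endpoint, assuming the default info port (16030 on HBase 1.0+; older releases used 60030) and the usual metrics bean name; the hostnames are placeholders and the exact metric keys should be verified against the deployed version.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: decide "replication caught up" by polling sizeOfLogQueue on each region server. */
public class ReplicationDrainCheck {
  // Matches JSON keys ending in sizeOfLogQueue, e.g. "source.sizeOfLogQueue" : 0
  private static final Pattern QUEUE =
      Pattern.compile("[sS]izeOfLogQueue\"\\s*:\\s*(\\d+)");

  public static void main(String[] args) throws Exception {
    List<String> regionServers = List.of("rs1.example.com", "rs2.example.com"); // placeholders
    HttpClient http = HttpClient.newHttpClient();
    long backlog = 0;

    for (String rs : regionServers) {
      // The qry parameter narrows the response to the replication metrics bean
      // (bean name assumed; confirm against the running version).
      String url = "http://" + rs
          + ":16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=Replication";
      String body = http.send(HttpRequest.newBuilder(URI.create(url)).build(),
          HttpResponse.BodyHandlers.ofString()).body();
      Matcher m = QUEUE.matcher(body);
      while (m.find()) {
        backlog += Long.parseLong(m.group(1));
      }
    }
    // Zero queued WALs across all region servers, sustained over a few polls
    // with writes quiesced, is a reasonable "safe to switch over" signal.
    System.out.println(backlog == 0 ? "caught up" : "backlog WALs: " + backlog);
  }
}
```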
- CopyTable Usage steps from Pivotal: https://blog.pivotal.io/pivotal/products/migrating-an-apache-hbase-table-between-different-clusters
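Alongside the Pivotal write-up, a minimal sketch of the same idea: launching the documented CopyTable MapReduce job through the hbase CLI for a one-shot (or time-bounded) table copy to the remote cluster. The peer address, table name and timestamps are placeholders; run it from a node that has the HBase client configuration.

```java
/** Sketch: invoke the CopyTable MapReduce job via the hbase CLI. */
public class RunCopyTable {
  public static void main(String[] args) throws Exception {
    ProcessBuilder pb = new ProcessBuilder(
        "hbase", "org.apache.hadoop.hbase.mapreduce.CopyTable",
        // remote cluster given as zk-quorum:zk-port:znode (placeholder)
        "--peer.adr=dr-zk1,dr-zk2,dr-zk3:2181:/hbase",
        // optionally bound the copy to a time window to catch up a backlog
        "--starttime=1472688000000", "--endtime=1472774400000",
        "events");
    pb.inheritIO();                       // stream job output to this console
    int rc = pb.start().waitFor();        // non-zero means the MR job failed
    System.exit(rc);
  }
}
```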
- Kafka:
- FileSets
Challenges