Goals
Reference
Requirements
- Support Active-Active and Active-Passive configurations
- Provide a tool or status indicator that shows whether replication is complete or in a safe state
- Support replicating HBase DDL to the remote cluster, including dynamic creation of tables
- Handle Kafka offset management across multiple clusters (a shortcoming of MirrorMaker)
- Support replicating the routing configuration stored in ZooKeeper to the remote cluster
Replications:
- HDFS:
- Hadoop DistCp is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html
Cloudera Distcp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html
HortonWorks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html
How can data be copied iteratively? What is the data quantum for each iterative copy? How do we define "replication complete"?
- distcp has an option to copy specific files. Could we copy individual files at fixed time boundaries, for example at the end of each day?
- distcp also has an -append option, which appends to a destination file when the source file is larger than the destination file, so only the difference is sent.
- There is also a -diff option, which copies the differences between two snapshots.
- distcp performance analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/
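The daily-boundary and snapshot-diff ideas above can be sketched as a small command builder. This is only an illustration, not a decided design: the `daily-YYYY-MM-DD` snapshot naming scheme, the helper function names, and the cluster addresses are all hypothetical. It relies on the distcp rule that both -diff and -append are only valid together with -update. For -diff, the target must also already hold the older snapshot and must not have been modified since it was taken.

```python
from datetime import date, timedelta


def snapshot_name(day):
    """Hypothetical naming scheme: one snapshot per calendar day."""
    return "daily-" + day.isoformat()


def create_snapshot_command(path, day):
    """Command that snapshots `path` (the directory must be snapshottable)."""
    return ["hdfs", "dfs", "-createSnapshot", path, snapshot_name(day)]


def incremental_distcp_command(src, dst, day):
    """Command that copies only the changes between yesterday's and today's snapshot.

    -diff is only valid together with -update.
    """
    previous = snapshot_name(day - timedelta(days=1))
    current = snapshot_name(day)
    return ["hadoop", "distcp", "-update", "-diff", previous, current, src, dst]


def append_distcp_command(src, dst):
    """Command that appends new bytes to files that grew; -append also needs -update."""
    return ["hadoop", "distcp", "-update", "-append", src, dst]


if __name__ == "__main__":
    today = date(2016, 2, 5)
    src = "hdfs://source-nn:8020/data"  # hypothetical cluster addresses
    dst = "hdfs://target-nn:8020/data"
    for cmd in (create_snapshot_command("/data", today),
                incremental_distcp_command(src, dst, today),
                append_distcp_command(src, dst)):
        print(" ".join(cmd))
```

With this scheme the data quantum is one day of changes, and a day could be considered replicated once the distcp job for that day's snapshot pair exits successfully.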
- HBase:
a. HBase supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
b. How to check that replication is complete when the customer is ready to switch over to the other cluster: check whether the following replication metric can be used to determine this:
- source.sizeOfLogQueue: number of WALs to process (excluding the one being processed) at the replication source
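One way to evaluate the source.sizeOfLogQueue idea is a check over the metrics of every region server in the source cluster. The sketch below assumes the metrics have already been fetched and parsed from each region server's JMX JSON output, and that the metric appears under a bean named `Hadoop:service=HBase,name=RegionServer,sub=Replication`. That bean name and the helper function names are assumptions and should be verified against the HBase version in use.

```python
# Assumed bean name; verify against the JMX output of the deployed HBase version.
REPLICATION_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Replication"
QUEUE_METRIC = "source.sizeOfLogQueue"


def log_queue_size(jmx_document):
    """Return source.sizeOfLogQueue from one parsed JMX document, or None if absent."""
    for bean in jmx_document.get("beans", []):
        if bean.get("name") == REPLICATION_BEAN and QUEUE_METRIC in bean:
            return bean[QUEUE_METRIC]
    return None


def replication_backlog(jmx_by_server):
    """Map each region server to its queued WAL count (None if the metric is missing)."""
    return {server: log_queue_size(doc) for server, doc in jmx_by_server.items()}


def queues_drained(jmx_by_server):
    """True only if every server reports the metric and every queue is empty."""
    backlog = replication_backlog(jmx_by_server)
    return bool(backlog) and all(size == 0 for size in backlog.values())


if __name__ == "__main__":
    sample = {
        "rs1": {"beans": [{"name": REPLICATION_BEAN, QUEUE_METRIC: 0}]},
        "rs2": {"beans": [{"name": REPLICATION_BEAN, QUEUE_METRIC: 3}]},
    }
    print(replication_backlog(sample))
    print(queues_drained(sample))
```

Because the metric excludes the WAL currently being processed, an empty queue on every server is a necessary condition for a safe switchover but may not be sufficient on its own. A server with a missing metric is treated as not drained.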
- Kafka:
- FileSets
Challenges