Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. HDFS:
    1. Hadoop Distcp is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list.
    2. Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html

    3. Cloudera Distcp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html

    4. HortonWorks: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html

    5. How to iteratively copy data? What is data quantum  to copy data iteratively? How to define replication complete ?

      - distcp allows an option to copy files, could we copy individual files at certain time boundaries ? End of each day ?
      - distcp also allows -append option which can append to a destination file if the source file is bigger than the destination file. [only sending the diff.]
      - There is also another -diff snapshot option to copy differences of two snapshots. 
      - distcp performace analysis: https://developer.ibm.com/hadoop/2016/02/05/fast-can-data-transferred-hadoop-clusters-using-distcp/

    6. Cloudera Manager provides some additional capabilities on top of distcp, they are:  https://blog.cloudera.com/blog/2016/08/considerations-for-production-environments-running-cloudera-backup-and-disaster-recovery-for-apache-hive-and-hdfs/

      Cloudera has introduced quite a few enhancements on top of Hadoop’s native tooling (such as Distcp) :

      • Consistency guarantees via HDFS snapshots
      • Scheduled execution support
      • Support for multiple Kerberos realm
      • These three can actually be achieved from distcp also:
        • Constant-space parallelized copy listing (Distcp has a similar enhancement but does not optimize for disk space or memory consumption.)
        • Replication between clusters on different Hadoop versions (including different major versions i.e. CDH4/CDH5)
        • Selective path exclusion

    7.  

    8.  

  2. HBase:
    a. HBase Supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
    b. How to check Replication is complete when customer is ready to switch over the cluster: 
    1. Check if this replication metric can be used to determine the above: 
      1. source.sizeOfLogQueue

        number of WALs to process (excludes the one which is being processed) at the Replication source



  3. Kafka:
  4. FileSets

...