...

  1. HDFS:
    1. Hadoop DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copying. It uses MapReduce for distribution, error handling and recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files specified in the source list.
    2. Hadoop Distributed Copy Command: http://hadoop.apache.org/docs/r1.2.1/distcp2.html
    3. Cloudera DistCp page: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_admin_distcp_data_cluster_migrate.html
    4. Hortonworks DistCp page: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_Sys_Admin_Guides/content/using_distcp.html

    5. How can data be copied iteratively?
    6. What is the data quantum for each iterative copy?
    7. DistCp provides an option to copy files. Could we copy individual files at certain time boundaries, for example at the end of each day?
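
One possible way to approach the iterative-copy questions above is sketched below. It only builds the `hadoop distcp -update` command line for one day's worth of data and does not run it. The date-partitioned directory layout (`<root>/YYYY-MM-DD`), the NameNode addresses, and the helper names `daily_distcp_command` and `commands_for_range` are assumptions made for illustration; they do not come from these notes. `-update` and `-m` are standard DistCp options.

```python
import shlex
from datetime import date, timedelta


def daily_distcp_command(src_root, dst_root, day, max_maps=20):
    """Build (but do not run) a DistCp command for one daily partition.

    Assumes a hypothetical layout in which each day's files live under
    <root>/YYYY-MM-DD on both the source and the target cluster.
    """
    partition = day.strftime("%Y-%m-%d")
    return [
        "hadoop", "distcp",
        "-update",              # copy only files missing or changed at the target
        "-m", str(max_maps),    # upper bound on simultaneous map tasks
        f"{src_root}/{partition}",
        f"{dst_root}/{partition}",
    ]


def commands_for_range(src_root, dst_root, first_day, last_day):
    """Return one DistCp command per day in the inclusive date range."""
    commands = []
    day = first_day
    while day <= last_day:
        commands.append(daily_distcp_command(src_root, dst_root, day))
        day += timedelta(days=1)
    return commands


if __name__ == "__main__":
    # Hypothetical NameNode addresses; replace with the real clusters.
    for cmd in commands_for_range(
        "hdfs://nn1:8020/data", "hdfs://nn2:8020/data",
        date(2016, 1, 30), date(2016, 2, 1),
    ):
        print(shlex.join(cmd))
```

Under this assumption the data quantum is one daily directory, and a scheduler could run the generated command after each day closes. Because `-update` skips files that already match at the target, re-running a day is a cheap way to pick up late-arriving files.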
  2. HBase:
    a. HBase supports replication to multiple clusters in multiple topologies. Documentation: http://hbase.apache.org/book.html#_cluster_replication
    b. How to check that replication is complete when the customer is ready to switch over the cluster:
      1. Check whether this replication metric can be used to determine the above:
        1. source.sizeOfLogQueue: the number of WALs to process (excluding the one currently being processed) at the replication source.
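
A minimal sketch of how the metric check could be automated follows. It assumes the metrics have already been fetched as JSON from each RegionServer's JMX endpoint and parsed into a dictionary with a top-level `beans` list. The bean name and the exact key layout in `SAMPLE` are assumptions to verify against the running HBase version, and the helper names `log_queue_sizes` and `queues_drained` are hypothetical. Since sizeOfLogQueue excludes the WAL currently being processed, a zero value suggests, but does not by itself prove, that replication has caught up.

```python
def log_queue_sizes(jmx_payload):
    """Collect every metric whose key ends with 'sizeOfLogQueue'.

    Returns a dict mapping '<bean name>:<metric key>' to the queue size.
    """
    sizes = {}
    for bean in jmx_payload.get("beans", []):
        bean_name = bean.get("name", "?")
        for key, value in bean.items():
            if key.endswith("sizeOfLogQueue") and isinstance(value, (int, float)):
                sizes[f"{bean_name}:{key}"] = int(value)
    return sizes


def queues_drained(jmx_payload):
    """True only if at least one queue metric was found and all are zero."""
    sizes = log_queue_sizes(jmx_payload)
    return bool(sizes) and all(size == 0 for size in sizes.values())


# Hypothetical payload shape, modelled on a JMX JSON response.
SAMPLE = {
    "beans": [
        {
            "name": "Hadoop:service=HBase,name=RegionServer,sub=Replication",
            "source.sizeOfLogQueue": 2,
            "source.ageOfLastShippedOp": 1500,
        }
    ]
}

if __name__ == "__main__":
    print(log_queue_sizes(SAMPLE))
    print(queues_drained(SAMPLE))
```

The check would need to pass on every RegionServer of the source cluster, ideally after writes have been stopped, before the switch-over is treated as safe.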



  3. Kafka:
  4. FileSets

...