Replication Status Tool

This document describes the design and usage of the CDAP Replication Status Tool. This tool can be used by Cluster admins to check if replication is complete and if it is safe to start CDAP on the cold [slave] cluster.

Essentially, this tool runs these two checks:

The HBase Data Replication from the Master to the Slave cluster has been completed for all CDAP tables.
Transaction Snapshots [and possibly other HDFS files] copied over manually by the admin to the slave cluster are consistent with the Master cluster.

Syntax:

The ReplicationStatusTool needs to be run on both the clusters. When the tool is run on the Master cluster it dumps the state in an output file. The admin needs to copy this file to the Slave cluster and provide it as an argument to the tool.

On Master Cluster, the tool will generate an output file containing status information from the Master.

#cdap run co.cask.cdap.data.tools.ReplicationStatusTool -m -o <file path where the tool will write the status of the Master cluster>

On Slave Cluster, the tool will verify HBase replication is complete and HDFS files are consistent.

#cdap run co.cask.cdap.data.tools.ReplicationStatusTool -i <path of the status file generated by the tool on the master cluster>

Other Options:

-d => This option dumps status of all HBase Regions and HDFS Files on the Cluster.

-f <arg> => The tool by default runs consistency check for all transaction snapshot files on HDFS. This option can be used to override the HDFS paths that the tool will verify by providing a file containing a list of hdfs paths.

-s <arg> => On Master Cluster, this option can be used to override the cdap-master shutdown time.

-h => Shows usage of the tool.

Verification:

The replication tool will guarantee the following:

All edits made to all CDAP HBase tables on the Master Cluster have been successfully replicated to the Slave Cluster.
All HDFS files on the Master cluster are present on the Slave cluster.

Design Details:

HBase Verification:

The Replication Status tool uses HBase Coprocessors to track timestamps from all WAL Entries written on the Master cluster and replicated on the Slave Cluster. To check for completeness the tool compares the last write time of all the regions from the Master against the last replicate time from the Slave. The coprocessors write to an internal HBase table called "Replication-State" to track the timestamps on both the clusters.

HBase Coprocessors:

WALObserver coprocessor on the Master Cluster adds the writeTime from the logKey entry to the Replication State table for every region using this API.

void postWALWrite(ObserverContext<? extends WALCoprocessorEnvironment> ctx, HRegionInfo info, WALKey logKey, WALEdit logEdit) throws IOException;

BaseRegionServerObserver coprocessor on the Slave Cluster adds the writeTime from logKey entry using this API. This is the last WAL entry replicated on the Slave cluster.

void postReplicateLogEntries(final ObserverContext<RegionServerCoprocessorEnvironment> ctx, List<WALEntry> entries, CellScanner cells) throws IOException;

Details:

postWALWrite() gets called on the Master Cluster with the WALKeys carrying the writeTime of the entries written to the WAL.
postReplicateLogEntries() gets called on the Slave Cluster with the WALKeys carrying the writeTime corresponding to the writes on the Master. This timestamp is used as the last replicate time and reflects the current position of replication.
Both the Master and the Slave Cluster use their own Replication-State table. The replication-state table is not replicated between the clusters.

Completion Logic:

The tool runs the following tests for each Region Entry in the Replication State table.

For each regions, replication is complete iff

Replicate Time from the Slave Cluster > Shutdown time on the Master Cluster, or
Write time from the Master Cluster >= Replicate Time on the Slave Cluster

Optimization:

The coprocessor tracks the timestamps for WAL Entries in memory and updates the HBase Table using a repetitive timer. Default time period of 20s is used and can be overridden with configuration.
Writing to the internal HBase Replication State table is executed in a separate thread to keep these updates out of the critical path of the Coprocessor.

Error Scenarios:

In case of RegionServer restarts, some in-memory state might be lost and not persisted to Replication State Table.
1. If replication time could not be written for the last entry for a certain region, tool could get stuck in Not Complete state.
2. If write time did not get written for the last entry for a region, replication could be incorrectly reported complete.
  The Write Time and the Replicate Time for each RS are written to a special HBase Table Replication-State. Any failures to these writes need to be handled.
The above errors could be seen if any write to the Replication State table failed.

Since the above are all extremely corner cases and only happen when other failures occur in the system, they are not handled to avoid complexity in the tool.

HDFS Verification:

The tool utilizes dfs checksum [MD5-of-0MD5-of-512CRC32C] to verify the consistency of HDFS files on the two clusters.

On Master Cluster, the tool generates checksums for all HDFS files specified and writes this status to the output file.

On the Slave Cluster, the tool compares the checksum of HDFS each file in the master status file against the checksums of those files on the local cluster.

CDAP Shutdown Time:

cdap-master shutdown time is required on the Master Cluster to ignore HBase replication for non CDAP tables. When cdap-master exits on the Master cluster it saves this shutdown time for the tool to use.

Deployment:

For replication deployment steps for CDAP, please see here.

In addition to the above, these status tool specific configuration can be used to override the defaults.

1. "hbase.replicationtable.name": The namespace used for the Replication State Table. "default" namespace used by default.
2. "hbase.replicationtable.namespace": The name used for the Replication State Table. "replicationstate" is used by default.
3. "hbase.replicationtable.updatedelay": The delay in ms with which the tool starts updating the Replication State Table. 30s by default.
4. "hbase.replicationtable.updateperiod": The period in ms used for the repetitive timer at which the table is updated. 20s by default.

Incomplete Replication:

In case the tool shows replication incomplete:

For incomplete HBase replication, the tool lists which region was behind or not replicated to the Slave Cluster. Check Hbase RegionServer logs for this region to debug for any errors.
For incomplete HDFS replication, the tool lists the files which show inconsistent checksums. Run manual checksums on these files to verify and copy those files over again.
The replication tool can be run with debug option [-d] to dump the current state of the cluster.