Based on the discussion in this week’s community sync, I want to add some
more information.
Code
For those interested in checking out the code, these are some of the major
classes to start with:
- ReconcileContainerTask: This is the command on the datanode that is
received from SCM to reconcile a container with a datanode’s peers. It
passes through the ReplicationSupervisor just like replication and
reconstruction commands.
- ContainerProtos.ContainerChecksumInfo: This is the proto format of the
new file that is written into the containers with the merkle tree and list
of deleted blocks.
- ContainerMerkleTreeWriter: This class is used to build merkle trees
chunk by chunk and generate a protobuf representation of the tree.
- ContainerChecksumTreeManager: This class coordinates reads and writes
of ContainerChecksumInfo for containers. The diff method determines
which repairs should be done on a container based on a peer’s merkle tree.
- KeyValueContainerCheck#scanData: This is the existing method called by
the background and on-demand container data scanners to scan a container.
It has been updated to build the merkle tree as it runs.
- KeyValueHandler#reconcileContainer: This method updates the container
based on the peer’s replica.
- Major tests for reconciliation have been added to
TestContainerCommandReconciliation (integration test) and
TestContainerReconciliationWithMockDatanodes (unit test with mocked
clients).
- There are more tasks under the reconciliation Jira to expand the
types of faults being tested.
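To make the tree-building and diff steps above concrete, here is a minimal
sketch. It is not the Ozone implementation: MerkleTreeSketch, its methods,
and the use of CRC32 to combine child checksums are all assumptions made for
illustration. It also treats the peer as the healthy copy and ignores the
deleted block list, which the real ContainerChecksumInfo carries.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

/** Hypothetical sketch; not the ContainerMerkleTreeWriter or ContainerChecksumTreeManager API. */
final class MerkleTreeSketch {
  // block ID -> (chunk offset -> chunk checksum), sorted so hashing is deterministic.
  private final SortedMap<Long, SortedMap<Long, Long>> blocks = new TreeMap<>();

  /** Adds one chunk, mirroring the chunk-by-chunk build described above. */
  void addChunk(long blockId, long offset, long chunkChecksum) {
    blocks.computeIfAbsent(blockId, id -> new TreeMap<>()).put(offset, chunkChecksum);
  }

  /** Assumed combining rule: CRC32 over the child checksums in order. */
  private static long combine(Iterable<Long> childChecksums) {
    CRC32 crc = new CRC32();
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES);
    for (long child : childChecksums) {
      buf.clear();
      buf.putLong(child);
      crc.update(buf.array(), 0, Long.BYTES);
    }
    return crc.getValue();
  }

  long blockChecksum(long blockId) {
    return combine(blocks.get(blockId).values());
  }

  /** Root of the tree: one checksum for the whole container. */
  long dataChecksum() {
    List<Long> blockSums = new ArrayList<>();
    for (long blockId : blocks.keySet()) {
      blockSums.add(blockChecksum(blockId));
    }
    return combine(blockSums);
  }

  /** Repairs this replica would need in order to match the peer. */
  static final class Diff {
    final List<Long> missingBlocks = new ArrayList<>();
    final List<String> missingChunks = new ArrayList<>();
    final List<String> corruptChunks = new ArrayList<>();
  }

  Diff diff(MerkleTreeSketch peer) {
    Diff d = new Diff();
    if (dataChecksum() == peer.dataChecksum()) {
      return d; // roots match, nothing to repair
    }
    for (Map.Entry<Long, SortedMap<Long, Long>> e : peer.blocks.entrySet()) {
      long blockId = e.getKey();
      SortedMap<Long, Long> ours = blocks.get(blockId);
      if (ours == null) {
        d.missingBlocks.add(blockId);
        continue;
      }
      if (blockChecksum(blockId) == peer.blockChecksum(blockId)) {
        continue; // block subtree matches, skip its chunks
      }
      for (Map.Entry<Long, Long> c : e.getValue().entrySet()) {
        Long ourSum = ours.get(c.getKey());
        String chunkId = blockId + "@" + c.getKey();
        if (ourSum == null) {
          d.missingChunks.add(chunkId);
        } else if (!ourSum.equals(c.getValue())) {
          d.corruptChunks.add(chunkId);
        }
      }
    }
    return d;
  }
}
```

The three lists in Diff line up with the "missing blocks", "missing
chunks", and "corrupt chunks" categories that show up in the datanode log.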
Logging
Logging was added on the datanodes to track reconciliation as it
happens. The datanode application log prints summary messages like
these:
2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1595)) - Beginning
reconciliation for container 100 with peer
bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
checksum is dcce847d
2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1681)) - Container 100
reconciled with peer
bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
updated from dcce847d to 16189e0b.
Missing blocks repaired: 5/5
Missing chunks repaired: 0/0
Corrupt chunks repaired: 10/10
Time taken: 19 ms
2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
(KeyValueHandler.java:reconcileContainer(1704)) - Completed
reconciliation for container 100 with 1/1 peers. 15 blocks were
updated. Data checksum updated from dcce847d to 16189e0b
This shows:
- Reconciliation started between this datanode and one other peer for
container 100.
- After reconciliation with the peer completed, the data checksum of our
container was updated.
- Compared to this peer, we needed to ingest 5 missing blocks and repair
10 corrupt chunks. All operations were successful.
- At the end we get a summary of how many changes were made to this
container after consulting all the peers in the reconcile request. In this
case there was only one peer.
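For anyone who wants to check these summaries programmatically, the sketch
below extracts the repair counts from log text. It assumes the message
wording matches the sample above exactly and reads each x/y pair as
repaired/needed. ReconcileLogSketch is a hypothetical helper, not part of
Ozone, and the log format may change between versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls repair counts out of datanode log text. */
final class ReconcileLogSketch {
  // Matches lines such as "Missing blocks repaired: 5/5" from the sample log.
  private static final Pattern REPAIR_LINE = Pattern.compile(
      "(Missing blocks|Missing chunks|Corrupt chunks) repaired: (\\d+)/(\\d+)");

  /** Returns category -> {repaired, needed}, summed over every match in the text. */
  static Map<String, int[]> parse(String logText) {
    Map<String, int[]> counts = new LinkedHashMap<>();
    Matcher m = REPAIR_LINE.matcher(logText);
    while (m.find()) {
      int[] c = counts.computeIfAbsent(m.group(1), k -> new int[2]);
      c[0] += Integer.parseInt(m.group(2));
      c[1] += Integer.parseInt(m.group(3));
    }
    return counts;
  }

  /** True when every category repaired as many items as it needed. */
  static boolean fullyRepaired(Map<String, int[]> counts) {
    for (int[] c : counts.values()) {
      if (c[0] != c[1]) {
        return false;
      }
    }
    return true;
  }
}
```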
Enabling debug logging also shows the individual blocks and chunks that
were repaired.
In the dn-container.log file, dataChecksum is now included for every log
line. We also get one new line in this log every time the checksum for a
container is updated.
In case logs roll off, a debug tool to inspect a container’s checksum
information locally on a datanode will be implemented in HDDS-13239
<https://issues.apache.org/jira/browse/HDDS-13239>.
Metrics
Metrics for reconciliation tasks are available as part of the
ReplicationSupervisor class, which includes:
- numRequestedContainerReconciliations - Number of requested tasks
- numQueuedContainerReconciliations - Number of queued tasks
- numTimeoutContainerReconciliations - Number of timed-out tasks
- numSuccessContainerReconciliations - Number of successful tasks
- numFailureContainerReconciliations - Number of failed tasks
- numSkippedContainerReconciliations - Number of skipped tasks
Latency and count metrics for the tasks are exposed by
CommandHandlerMetrics for ReconcileContainerCommandHandler:
- TotalRunTimeMs - The total runtime of the command handler in
milliseconds
- AvgRunTimeMs - Average run time of the command handler in milliseconds
- QueueWaitingTaskCount - The number of queued tasks waiting for
execution
- InvocationCount - The number of times the command handler has been
invoked
- CommandReceivedCount - The number of received SCM commands for each
command type
Other container reconciliation-related metrics are encapsulated in
ContainerMerkleTreeMetrics:
- numMerkleTreeWriteFailure - Number of Merkle tree write failures
- numMerkleTreeReadFailure - Number of Merkle tree read failures
- numMerkleTreeDiffFailure - Number of Merkle tree diff failures
- numNoRepairContainerDiff - Number of container diffs that don’t
require repair
- numRepairContainerDiff - Number of container diffs that require repair
- merkleTreeWriteLatencyNS - Merkle tree write latency
- merkleTreeReadLatencyNS - Merkle tree read latency
- merkleTreeCreateLatencyNS - Merkle tree creation latency
- merkleTreeDiffLatencyNS - Merkle tree diff latency
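As one way to use the ReplicationSupervisor counters, the sketch below
derives a success ratio from a snapshot of their values, however that
snapshot was collected. ReconcileMetricsSketch is a hypothetical helper. It
assumes that success, failure, timeout, and skipped are mutually exclusive
outcomes of a task, which is not confirmed above.

```java
import java.util.Map;

/** Hypothetical helper: derives a success ratio from a snapshot of the counters. */
final class ReconcileMetricsSketch {
  private static long get(Map<String, Long> snapshot, String name) {
    return snapshot.getOrDefault(name, 0L);
  }

  /**
   * Fraction of finished tasks that succeeded, or -1 if none have finished.
   * Assumption: the four outcome counters never count the same task twice.
   */
  static double successRatio(Map<String, Long> snapshot) {
    long success = get(snapshot, "numSuccessContainerReconciliations");
    long finished = success
        + get(snapshot, "numFailureContainerReconciliations")
        + get(snapshot, "numTimeoutContainerReconciliations")
        + get(snapshot, "numSkippedContainerReconciliations");
    return finished == 0 ? -1.0 : (double) success / finished;
  }
}
```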
Thanks for reviewing
Ethan