+1
Thanks for the great work! The feature is going to be very useful. Some
questions/comments/suggestions below.
"Healthy" related:
- What does ChunkMerkleTree.isHealthy == false means? Checksum
mismatched? Missing?
- Rename ScanResult.isHealthy() to hasErrors() since all the (non-test)
calls are !isHealthy().
* We should avoid the term "healthy". It is not clear what it means when
isHealthy==false.
It is better to have "boolean isChecksumMatched", "boolean hasError"
or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING, IO_EXCEPTION}
Others:
- Why is ContainerChecksumTreeManager created in both BlockDeletingService
and OzoneContainer? Should they share it?
- Use ContainerID instead of Long.
- Add javadoc for all the maps using Long. Describe what it is for (e.g.
block id).
- When converting a checksum (long) to a string, use hexadecimal.
- Avoid Optional, which is slow and generates garbage.
- Fix conflicts with the master.
- Use proto 3 for new protos (but it may be hard to do.)
Tsz-Wo
On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <[email protected]> wrote:
> Based on discussion in the community sync this week I want to add some more
> information.
> Code
>
> For those interested in checking out the code, these are some of the major
> classes to start with:
>
> - ReconcileContainerTask: This is the command on the datanode that is
> received from SCM to reconcile a container with a datanode’s peers. It
> passes through the ReplicationSupervisor just like replication and
> reconstruction commands.
> - ContainerProtos.ContainerChecksumInfo: This is the proto format of the
> new file that is written into the containers with the merkle tree and
> list
> of deleted blocks.
> - ContainerMerkleTreeWriter: This class is used to build merkle trees
> chunk by chunk and generate a protobuf representation of the tree.
> - ContainerChecksumTreeManager: This class coordinates reads and writes
> of ContainerChecksumInfo for containers. The diff method determines
> which repairs should be done on a container based on a peer’s merkle
> tree.
> - KeyValueContainerCheck#scanData: This is the existing method called by
> the background and on-demand container data scanners to scan a
> container.
> It has been updated to build the merkle tree as it runs.
> - KeyValueHandler#reconcileContainer: This method updates the container
> based on the peer’s replica.
> - Major tests for reconciliation have been added to
> TestContainerCommandReconciliation (integration test) and
> TestContainerReconciliationWithMockDatanodes (unit test with mocked
> clients).
> - There are more tasks under the reconciliation jira to expand the
> types of faults being tested.
>
> Logging
>
> Logging was added on the datanodes to track reconciliation as it is
> happening. The datanode application log will print a summary of messages
> like this:
>
> 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler
> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning
> reconciliation for container 100 with peer
> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
> checksum is dcce847d
> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100
> reconciled with peer
> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
> updated from dcce847d to 16189e0b.
> Missing blocks repaired: 5/5
> Missing chunks repaired: 0/0
> Corrupt chunks repaired: 10/10
> Time taken: 19 ms
> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
> (KeyValueHandler.java:reconcileContainer(1704)) - Completed
> reconciliation for container 100 with 1/1 peers. 15 blocks were
> updated. Data checksum updated from dcce847d to 16189e0b
>
> This shows:
>
> - Reconciliation started between this datanode and one other peer for
> container 100
> - After reconciliation with the peer completed, the data checksum of our
> container was updated
> - Compared to this peer, we needed to ingest 5 missing blocks and repair
> 10 corrupt chunks. All operations were successful
> - At the end we get a summary of how many changes were done to this
> container after consulting all the peers in the reconcile request. In
> this
> case there was only one peer.
> By enabling debug logging we can see the individual blocks and chunks
> that were repaired as well.
>
> In the dn-container.log file, dataChecksum is now included for every log
> line. We also get one new line in this log every time the checksum for a
> container is updated.
>
> In case logs roll off, a debug tool to inspect container’s checksum
> information locally on a datanode will be implemented in HDDS-13239
> <https://issues.apache.org/jira/browse/HDDS-13239>.
> Metrics
>
> The metrics for reconciliation tasks are available as a part of
> ReplicationSupervisor class which includes:
>
> - numRequestedContainerReconciliations - Number of reconciliation tasks
> - numQueuedContainerReconciliations - Number of queued tasks
> - numTimeoutContainerReconciliations - Number of timed-out tasks
> - numSuccessContainerReconciliations- Number of Success
> - numFailureContainerReconciliations - Number of Failures
> - numSkippedContainerReconciliations - Number of Skipped Tasks
>
> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for
> ReconcileContainerCommandHandler:
>
> - TotalRunTimeMs - The total runtime of the command handler in
> milliseconds
> - AvgRunTimeMs - Average run time of the command handler in milliseconds
> - QueueWaitingTaskCount - The number of queued tasks waiting for
> execution
> - InvocationCount - The number of times the command handler has been
> invoked
> - CommandReceivedCount - The number of received SCM commands for each
> command type
>
> Other container reconciliation-related tasks are encapsulated in
> ContainerMerkleTreeMetrics:
>
> - numMerkleTreeWriteFailure - Number of Merkle tree write failure
> - numMerkleTreeReadFailure - Number of Merkle tree read failure
> - numMerkleTreeDiffFailure - Number of Merkle tree diff failure
> - numNoRepairContainerDiff - Number of container diff that doesn’t
> require repair
> - numRepairContainerDiff - Number of container diff that require repair
> - merkleTreeWriteLatencyNS- Merkle tree write latency
> - merkleTreeReadLatencyNS - Merkle tree read latency
> - merkleTreeCreateLatencyNS - Merkle tree creation latency
> - merkleTreeDiffLatencyNS - Merkle tree diff latency
>
>
> Thanks for reviewing
>
> Ethan
>