+1 for the merge of this feature. Thank you to all who contributed to this feature that enhances the robustness and manageability of Apache Ozone.
- Sid

On Mon, Jun 16, 2025 at 10:35 AM Ritesh Shukla <rit...@cloudera.com.invalid> wrote:

> +1
> Disclaimer: I am a sponsor and contributor for this work
>
> On Tue, Jun 10, 2025 at 5:40 PM Ethan Rose <er...@apache.org> wrote:
>
> > Based on discussion in the community sync this week I want to add some
> > more information.
> >
> > Code
> >
> > For those interested in checking out the code, these are some of the
> > major classes to start with:
> >
> > - ReconcileContainerTask: This is the command on the datanode that is
> >   received from SCM to reconcile a container with a datanode’s peers. It
> >   passes through the ReplicationSupervisor just like replication and
> >   reconstruction commands.
> > - ContainerProtos.ContainerChecksumInfo: This is the proto format of the
> >   new file that is written into the containers with the merkle tree and
> >   list of deleted blocks.
> > - ContainerMerkleTreeWriter: This class is used to build merkle trees
> >   chunk by chunk and generate a protobuf representation of the tree.
> > - ContainerChecksumTreeManager: This class coordinates reads and writes
> >   of ContainerChecksumInfo for containers. The diff method determines
> >   which repairs should be done on a container based on a peer’s merkle
> >   tree.
> > - KeyValueContainerCheck#scanData: This is the existing method called by
> >   the background and on-demand container data scanners to scan a
> >   container. It has been updated to build the merkle tree as it runs.
> > - KeyValueHandler#reconcileContainer: This method updates the container
> >   based on the peer’s replica.
> > - Major tests for reconciliation have been added to
> >   TestContainerCommandReconciliation (integration test) and
> >   TestContainerReconciliationWithMockDatanodes (unit test with mocked
> >   clients).
> > - There are more tasks under the reconciliation jira to expand the
> >   types of faults being tested.
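To make the class list above more concrete, here is a minimal sketch of a container merkle tree built chunk by chunk, plus a diff against a peer's tree. It is written in Python for brevity and is not the Ozone implementation: the names `ContainerTree`, `BlockTree`, `Diff` and `diff`, the use of CRC32, and the way checksums are combined are all assumptions made for illustration. The deleted-block list stored in ContainerChecksumInfo is left out.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class BlockTree:
    # chunk offset -> checksum of that chunk's data (hypothetical layout)
    chunks: dict = field(default_factory=dict)

    def checksum(self) -> int:
        # Fold the chunk checksums, in offset order, into one block checksum.
        acc = 0
        for offset in sorted(self.chunks):
            acc = zlib.crc32(f"{offset}:{self.chunks[offset]}".encode(), acc)
        return acc


@dataclass
class ContainerTree:
    # block id -> BlockTree
    blocks: dict = field(default_factory=dict)

    def add_chunk(self, block_id: int, offset: int, data: bytes) -> None:
        # Called once per chunk, e.g. while a scanner walks the container.
        block = self.blocks.setdefault(block_id, BlockTree())
        block.chunks[offset] = zlib.crc32(data)

    def data_checksum(self) -> int:
        # Fold the block checksums, in block id order, into the root checksum.
        acc = 0
        for block_id in sorted(self.blocks):
            entry = f"{block_id}:{self.blocks[block_id].checksum()}"
            acc = zlib.crc32(entry.encode(), acc)
        return acc


@dataclass
class Diff:
    missing_blocks: list
    missing_chunks: list
    corrupt_chunks: list


def diff(ours: ContainerTree, peer: ContainerTree) -> Diff:
    """List what the peer has that we lack or that differs from our copy."""
    result = Diff([], [], [])
    if ours.data_checksum() == peer.data_checksum():
        return result  # roots match, so no subtree needs to be visited
    for block_id in sorted(peer.blocks):
        peer_block = peer.blocks[block_id]
        our_block = ours.blocks.get(block_id)
        if our_block is None:
            result.missing_blocks.append(block_id)
            continue
        if our_block.checksum() == peer_block.checksum():
            continue  # identical block, skip its chunks
        for offset in sorted(peer_block.chunks):
            if offset not in our_block.chunks:
                result.missing_chunks.append((block_id, offset))
            elif our_block.chunks[offset] != peer_block.chunks[offset]:
                result.corrupt_chunks.append((block_id, offset))
    return result


# Usage: the peer holds two blocks; our replica lacks block 2 and has a
# chunk in block 1 whose contents differ from the peer's.
peer = ContainerTree()
peer.add_chunk(1, 0, b"aaaa")
peer.add_chunk(1, 4, b"bbbb")
peer.add_chunk(2, 0, b"cccc")

ours = ContainerTree()
ours.add_chunk(1, 0, b"aaaa")
ours.add_chunk(1, 4, b"XXXX")

print(diff(ours, peer))
```

The sketch only reports chunks that differ from the peer. Deciding which side actually holds the good copy is a separate step that it does not model.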
> >
> > Logging
> >
> > Logging was added on the datanodes to track reconciliation as it is
> > happening. The datanode application log will print a summary of messages
> > like this:
> >
> > 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler
> > (KeyValueHandler.java:reconcileContainer(1595)) - Beginning
> > reconciliation for container 100 with peer
> > bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
> > checksum is dcce847d
> > 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
> > (KeyValueHandler.java:reconcileContainer(1681)) - Container 100
> > reconciled with peer
> > bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
> > updated from dcce847d to 16189e0b.
> > Missing blocks repaired: 5/5
> > Missing chunks repaired: 0/0
> > Corrupt chunks repaired: 10/10
> > Time taken: 19 ms
> > 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler
> > (KeyValueHandler.java:reconcileContainer(1704)) - Completed
> > reconciliation for container 100 with 1/1 peers. 15 blocks were
> > updated. Data checksum updated from dcce847d to 16189e0b
> >
> > This shows:
> >
> > - Reconciliation started between this datanode and one other peer for
> >   container 100
> > - After reconciliation with the peer completed, the data checksum of our
> >   container was updated
> > - Compared to this peer, we needed to ingest 5 missing blocks and repair
> >   10 corrupt chunks. All operations were successful
> > - At the end we get a summary of how many changes were done to this
> >   container after consulting all the peers in the reconcile request. In
> >   this case there was only one peer.
> >
> > By enabling debug logging we can see the individual blocks and chunks
> > that were repaired as well.
> >
> > In the dn-container.log file, dataChecksum is now included for every log
> > line. We also get one new line in this log every time the checksum for a
> > container is updated.
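For anyone who wants to pull these numbers out of a datanode log, here is a minimal sketch of a parser for the per-peer summary. It assumes the message keeps exactly the wording of the sample above, which may change between versions. The function name `parse_summary` and the dictionary keys are hypothetical, and the sketch is Python rather than anything shipped with Ozone.

```python
import re

# Patterns copied from the wording of the sample log message. \s+ tolerates
# the message being wrapped across lines.
COUNTER = re.compile(
    r"(Missing blocks|Missing chunks|Corrupt chunks) repaired: (\d+)/(\d+)")
CHECKSUM = re.compile(
    r"Data checksum\s+updated from ([0-9a-f]+) to ([0-9a-f]+)")
TIME_TAKEN = re.compile(r"Time taken: (\d+) ms")

SAMPLE = """\
Container 100 reconciled with peer
bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
updated from dcce847d to 16189e0b.
Missing blocks repaired: 5/5
Missing chunks repaired: 0/0
Corrupt chunks repaired: 10/10
Time taken: 19 ms
"""


def parse_summary(text: str) -> dict:
    """Extract repair counters, checksum change and duration from one summary."""
    summary = {}
    for name, repaired, total in COUNTER.findall(text):
        # e.g. "missing blocks" -> (repaired, total)
        summary[name.lower()] = (int(repaired), int(total))
    checksum = CHECKSUM.search(text)
    if checksum:
        summary["checksum"] = (checksum.group(1), checksum.group(2))
    duration = TIME_TAKEN.search(text)
    if duration:
        summary["time_ms"] = int(duration.group(1))
    return summary


summary = parse_summary(SAMPLE)
# Flag any category where fewer repairs succeeded than were attempted.
incomplete = [name for name, value in summary.items()
              if name.endswith(("blocks", "chunks")) and value[0] != value[1]]
print(summary, incomplete)
```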
> >
> > In case logs roll off, a debug tool to inspect a container’s checksum
> > information locally on a datanode will be implemented in HDDS-13239
> > <https://issues.apache.org/jira/browse/HDDS-13239>.
> >
> > Metrics
> >
> > The metrics for reconciliation tasks are available as part of the
> > ReplicationSupervisor class, which includes:
> >
> > - numRequestedContainerReconciliations - Number of requested
> >   reconciliation tasks
> > - numQueuedContainerReconciliations - Number of queued tasks
> > - numTimeoutContainerReconciliations - Number of timed-out tasks
> > - numSuccessContainerReconciliations - Number of successful tasks
> > - numFailureContainerReconciliations - Number of failed tasks
> > - numSkippedContainerReconciliations - Number of skipped tasks
> >
> > Latency and count metrics for the tasks are exposed by
> > CommandHandlerMetrics for ReconcileContainerCommandHandler:
> >
> > - TotalRunTimeMs - The total run time of the command handler in
> >   milliseconds
> > - AvgRunTimeMs - Average run time of the command handler in milliseconds
> > - QueueWaitingTaskCount - The number of queued tasks waiting for
> >   execution
> > - InvocationCount - The number of times the command handler has been
> >   invoked
> > - CommandReceivedCount - The number of received SCM commands for each
> >   command type
> >
> > Other container reconciliation-related metrics are encapsulated in
> > ContainerMerkleTreeMetrics:
> >
> > - numMerkleTreeWriteFailure - Number of Merkle tree write failures
> > - numMerkleTreeReadFailure - Number of Merkle tree read failures
> > - numMerkleTreeDiffFailure - Number of Merkle tree diff failures
> > - numNoRepairContainerDiff - Number of container diffs that do not
> >   require repair
> > - numRepairContainerDiff - Number of container diffs that require repair
> > - merkleTreeWriteLatencyNS - Merkle tree write latency
> > - merkleTreeReadLatencyNS - Merkle tree read latency
> > - merkleTreeCreateLatencyNS - Merkle tree creation latency
> > - merkleTreeDiffLatencyNS - Merkle tree diff latency
> >
> > Thanks for reviewing
> >
> > Ethan