Hi Ethan, Thanks for addressing my comments!
> ... Datanode code consistently identified container IDs as long values. We can shift to using ContainerID in the datanode as well, but that would be a 1. change outside of reconciliation. We have an umbrella JIRA HDDS-12769 <https://issues.apache.org/jira/browse/HDDS-12769>"Use ContainerID instead of Long in datanode". For the new code/new data structures, please use ContainerID unless it is hard to do. Look forward to merging the branch. Tsz-Wo On Wed, Jun 18, 2025 at 7:47 AM Ethan Rose <er...@apache.org> wrote: > Thanks for reviewing the code Nicholas. > > For the “healthy” term, I agree it is not specific enough. In this PR we > are updating it to checksumMatches in the merkle tree proto. We can update > the term in the ScanResult in a follow-up since it is not touching the > proto. Cases where the block is missing or unreadable will not be able to > generate a checksum and chunk will be considered missing from the merkle > tree and need repair anyways, so other states like IO_EXCEPTION shouldn’t > be necessary. > > The BlockDeletingService is supposed to share the same > ContainerChecksumTreeManager as the rest of the code and it does as far as > I can tell. If you are referring to this line > < > https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java#L76 > >, > that is a test-only constructor. > > ContainerID and its corresponding proto are only used in SCM code right > now. Datanode code consistently identified container IDs as long values. We > can shift to using ContainerID in the datanode as well, but that would be a > change outside of reconciliation. > > For maps with Long keys, I’ve identified a few instances where we can > document it better: > > - Return values from ContainerDiffReport > - this field > < > https://github.com/apache/ozone/blob/a355664093c634c3d04d0601b5c0302260a44c6c/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java#L46 > > > in ContainerMerkleTreeWriter > - Did you have other instances in mind? > > Checksum longs should be converted to hex strings in all log and json > output using HddsUtils#checksumToString > < > https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java#L865 > >. > If there’s spots we missed let us know and we can fix it. When we start > looking at EC reconciliation and a metadata checksum is also added to > containers, it is likely we will end up with one object wrapping all of a > container’s checksums and its toString could handle this. > > Most optional usage that I’m aware of is being removed in this PR > <https://github.com/apache/ozone/pull/8565>. > ContainerChecksumTreeManager#readChecksumInfo still has it after that > change, but we could remove it as part of HDDS-12824 > <https://issues.apache.org/jira/browse/HDDS-12824> which is the next > change > that will touch this method. I’m not sure what the preferred replacement > here is. @Nullable? > > We will do a reverse merge of master into the branch before merging the > branch back to master to resolve conflicts, instead of having the conflict > resolution as part of the final merge commit to master. Our last merge from > master was Jun 9, but things change fast. > > Proto 3 migration is a bigger effort outside of reconciliation. Currently > all proto files use syntax = "proto2" , and I guess our options are to > either increase that to “proto3” (which affects all protos, not just new > ones), or put new protos in a different file with "proto3" syntax (for > which there is currently no existing convention for). > > For tracking the phases of work, the initial plan was that phase 1 was for > ratis reconciliation, phase 2 was for EC reconciliation, and phase 3 was > for replication manager integration. If people think it is easier we can > move the remaining lower priority items after the merge from Ratis > reconciliation to a phase 2 jira and explicitly create new jiras for the > later phases of work. > > Thanks, > Ethan > > On Tue, Jun 17, 2025 at 5:23 PM Tsz-Wo Nicholas Sze <szets...@gmail.com> > wrote: > > > One more thing: how about we call all the completed work Phase 1 and the > > work after merge Phase 2? Then, we should update HDDS-10239 to describe > > what was done in Phase 1 and move the unresolved JIRAs to a new umbrella > > Phase 2 JIRA. > > > > Tsz-Wo > > > > > > On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze <szets...@gmail.com > > > > wrote: > > > > > +1 > > > Thanks for the great work! The feature is going to be very useful. > Some > > > questions/comments/suggestions below. > > > > > > "Healthy" related: > > > - What does ChunkMerkleTree.isHealthy == false means? Checksum > > > mismatched? Missing? > > > - Rename ScanResult.isHealthy() to hasErrors() since all the (non-test) > > > calls are !isHealthy(). > > > * We should avoid the term "healthy". It is not clear what it means > > when > > > isHealthy==false. > > > It is better to have "boolean isChecksumMatched", "boolean hasError" > > > or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING, > > IO_EXCEPTION} > > > > > > Others: > > > - Why is ContainerChecksumTreeManager created in both > > BlockDeletingService > > > and OzoneContainer? Should they share it? > > > - Use ContainerID instead of Long. > > > - Add javadoc for all the maps using Long. Describe what it is for > (e.g. > > > block id). > > > - When converting a checksum (long) to a string, use hexadecimal. > > > - Avoid Optional, which is slow and generates garbage. > > > - Fix conflicts with the master. > > > - Use proto 3 for new protos (but it may be hard to do.) > > > > > > Tsz-Wo > > > > > > > > > On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <er...@apache.org> wrote: > > > > > >> Based on discussion in the community sync this week I want to add some > > >> more > > >> information. > > >> Code > > >> > > >> For those interested in checking out the code, these are some of the > > major > > >> classes to start with: > > >> > > >> - ReconcileContainerTask: This is the command on the datanode that > is > > >> received from SCM to reconcile a container with a datanode’s peers. > > It > > >> passes through the ReplicationSupervisor just like replication and > > >> reconstruction commands. > > >> - ContainerProtos.ContainerChecksumInfo: This is the proto format > of > > >> the > > >> new file that is written into the containers with the merkle tree > and > > >> list > > >> of deleted blocks. > > >> - ContainerMerkleTreeWriter: This class is used to build merkle > trees > > >> chunk by chunk and generate a protobuf representation of the tree. > > >> - ContainerChecksumTreeManager: This class coordinates reads and > > writes > > >> of ContainerChecksumInfo for containers. The diff method determines > > >> which repairs should be done on a container based on a peer’s > merkle > > >> tree. > > >> - KeyValueContainerCheck#scanData: This is the existing method > called > > >> by > > >> the background and on-demand container data scanners to scan a > > >> container. > > >> It has been updated to build the merkle tree as it runs. > > >> - KeyValueHandler#reconcileContainer: This method updates the > > container > > >> based on the peer’s replica. > > >> - Major tests for reconciliation have been added to > > >> TestContainerCommandReconciliation (integration test) and > > >> TestContainerReconciliationWithMockDatanodes (unit test with mocked > > >> clients). > > >> - There are more tasks under the reconciliation jira to expand > the > > >> types of faults being tested. > > >> > > >> Logging > > >> > > >> Logging was added on the datanodes to track reconciliation as it is > > >> happening. The datanode application log will print a summary of > messages > > >> like this: > > >> > > >> 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler > > >> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning > > >> reconciliation for container 100 with peer > > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data > > >> checksum is dcce847d > > >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler > > >> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100 > > >> reconciled with peer > > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum > > >> updated from dcce847d to 16189e0b. > > >> Missing blocks repaired: 5/5 > > >> Missing chunks repaired: 0/0 > > >> Corrupt chunks repaired: 10/10 > > >> Time taken: 19 ms > > >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler > > >> (KeyValueHandler.java:reconcileContainer(1704)) - Completed > > >> reconciliation for container 100 with 1/1 peers. 15 blocks were > > >> updated. Data checksum updated from dcce847d to 16189e0b > > >> > > >> This shows: > > >> > > >> - Reconciliation started between this datanode and one other peer > for > > >> container 100 > > >> - After reconciliation with the peer completed, the data checksum > of > > >> our > > >> container was updated > > >> - Compared to this peer, we needed to ingest 5 missing blocks and > > >> repair > > >> 10 corrupt chunks. All operations were successful > > >> - At the end we get a summary of how many changes were done to this > > >> container after consulting all the peers in the reconcile request. > In > > >> this > > >> case there was only one peer. > > >> By enabling debug logging we can see the individual blocks and > chunks > > >> that were repaired as well. > > >> > > >> In the dn-container.log file, dataChecksum is now included for every > log > > >> line. We also get one new line in this log every time the checksum > for a > > >> container is updated. > > >> > > >> In case logs roll off, a debug tool to inspect container’s checksum > > >> information locally on a datanode will be implemented in HDDS-13239 > > >> <https://issues.apache.org/jira/browse/HDDS-13239>. > > >> Metrics > > >> > > >> The metrics for reconciliation tasks are available as a part of > > >> ReplicationSupervisor class which includes: > > >> > > >> - numRequestedContainerReconciliations - Number of reconciliation > > tasks > > >> - numQueuedContainerReconciliations - Number of queued tasks > > >> - numTimeoutContainerReconciliations - Number of timed-out tasks > > >> - numSuccessContainerReconciliations- Number of Success > > >> - numFailureContainerReconciliations - Number of Failures > > >> - numSkippedContainerReconciliations - Number of Skipped Tasks > > >> > > >> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics > for > > >> ReconcileContainerCommandHandler: > > >> > > >> - TotalRunTimeMs - The total runtime of the command handler in > > >> milliseconds > > >> - AvgRunTimeMs - Average run time of the command handler in > > >> milliseconds > > >> - QueueWaitingTaskCount - The number of queued tasks waiting for > > >> execution > > >> - InvocationCount - The number of times the command handler has > been > > >> invoked > > >> - CommandReceivedCount - The number of received SCM commands for > each > > >> command type > > >> > > >> Other container reconciliation-related tasks are encapsulated in > > >> ContainerMerkleTreeMetrics: > > >> > > >> - numMerkleTreeWriteFailure - Number of Merkle tree write failure > > >> - numMerkleTreeReadFailure - Number of Merkle tree read failure > > >> - numMerkleTreeDiffFailure - Number of Merkle tree diff failure > > >> - numNoRepairContainerDiff - Number of container diff that doesn’t > > >> require repair > > >> - numRepairContainerDiff - Number of container diff that require > > repair > > >> - merkleTreeWriteLatencyNS- Merkle tree write latency > > >> - merkleTreeReadLatencyNS - Merkle tree read latency > > >> - merkleTreeCreateLatencyNS - Merkle tree creation latency > > >> - merkleTreeDiffLatencyNS - Merkle tree diff latency > > >> > > >> > > >> Thanks for reviewing > > >> > > >> Ethan > > >> > > > > > >