Thanks for the votes everyone. Based on the results (9 +1s, no -1s, no 0s) the reconciliation feature branch has been merged into master. To address minor items that were suggested to fix after the merge:
- HDDS-13304 <https://issues.apache.org/jira/browse/HDDS-13304> has been filed to check how the size of the tree file changes with the layout of data within the container (many small blocks vs fewer large blocks) - HDDS-13305 <https://issues.apache.org/jira/browse/HDDS-13305> has been filed to create an object to wrap container checksums - Among other things will give us a toString method to make sure all checksums are rendered as hex in logs and output so we don’t need to manually invoke checksumToString. - HDDS-13340 <https://issues.apache.org/jira/browse/HDDS-13340> / pull/8705 <https://github.com/apache/ozone/pull/8705> Is addressing the following concerns: - Add javadoc to methods in ContainerDiffReport - Implement metrics for types of repairs done. This was left in an old TODO comment but never implemented in the corresponding Jira. - In the test-only constructor of BlockDeletingService, have each test pass their own ChecksumTreeManager so there is no confusion about which instance the class is using. - For using ContainerID in datanodes, we have HDDS-12769 <https://issues.apache.org/jira/browse/HDDS-12769> tracking this. - Based on the current progress I have not identified any issues where the reconciliation branch merge has undone any work already under this epic. We can file more Jiras under this task if this is not the case though. - HDDS-13341 <https://issues.apache.org/jira/browse/HDDS-13341> has been filed to rename ScanResult#isHealthy to ScanResult#hasErrors. Thanks, Ethan On Wed, Jun 18, 2025 at 1:37 PM Tsz-Wo Nicholas Sze <szets...@gmail.com> wrote: > Hi Ethan, > > Thanks for addressing my comments! > > > ... Datanode code consistently identified container IDs as long values. > We can shift to using ContainerID in the datanode as well, but that would > be a > > 1. change outside of reconciliation. > > We have an umbrella JIRA HDDS-12769 > <https://issues.apache.org/jira/browse/HDDS-12769>"Use ContainerID > instead of Long in datanode". For the new code/new data structures, > please > use ContainerID unless it is hard to do. > > > Look forward to merging the branch. > > Tsz-Wo > > On Wed, Jun 18, 2025 at 7:47 AM Ethan Rose <er...@apache.org> wrote: > > > Thanks for reviewing the code Nicholas. > > > > For the “healthy” term, I agree it is not specific enough. In this PR we > > are updating it to checksumMatches in the merkle tree proto. We can > update > > the term in the ScanResult in a follow-up since it is not touching the > > proto. Cases where the block is missing or unreadable will not be able to > > generate a checksum and chunk will be considered missing from the merkle > > tree and need repair anyways, so other states like IO_EXCEPTION shouldn’t > > be necessary. > > > > The BlockDeletingService is supposed to share the same > > ContainerChecksumTreeManager as the rest of the code and it does as far > as > > I can tell. If you are referring to this line > > < > > > https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java#L76 > > >, > > that is a test-only constructor. > > > > ContainerID and its corresponding proto are only used in SCM code right > > now. Datanode code consistently identified container IDs as long values. > We > > can shift to using ContainerID in the datanode as well, but that would > be a > > change outside of reconciliation. > > > > For maps with Long keys, I’ve identified a few instances where we can > > document it better: > > > > - Return values from ContainerDiffReport > > - this field > > < > > > https://github.com/apache/ozone/blob/a355664093c634c3d04d0601b5c0302260a44c6c/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java#L46 > > > > > in ContainerMerkleTreeWriter > > - Did you have other instances in mind? > > > > Checksum longs should be converted to hex strings in all log and json > > output using HddsUtils#checksumToString > > < > > > https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java#L865 > > >. > > If there’s spots we missed let us know and we can fix it. When we start > > looking at EC reconciliation and a metadata checksum is also added to > > containers, it is likely we will end up with one object wrapping all of a > > container’s checksums and its toString could handle this. > > > > Most optional usage that I’m aware of is being removed in this PR > > <https://github.com/apache/ozone/pull/8565>. > > ContainerChecksumTreeManager#readChecksumInfo still has it after that > > change, but we could remove it as part of HDDS-12824 > > <https://issues.apache.org/jira/browse/HDDS-12824> which is the next > > change > > that will touch this method. I’m not sure what the preferred replacement > > here is. @Nullable? > > > > We will do a reverse merge of master into the branch before merging the > > branch back to master to resolve conflicts, instead of having the > conflict > > resolution as part of the final merge commit to master. Our last merge > from > > master was Jun 9, but things change fast. > > > > Proto 3 migration is a bigger effort outside of reconciliation. Currently > > all proto files use syntax = "proto2" , and I guess our options are to > > either increase that to “proto3” (which affects all protos, not just new > > ones), or put new protos in a different file with "proto3" syntax (for > > which there is currently no existing convention for). > > > > For tracking the phases of work, the initial plan was that phase 1 was > for > > ratis reconciliation, phase 2 was for EC reconciliation, and phase 3 was > > for replication manager integration. If people think it is easier we can > > move the remaining lower priority items after the merge from Ratis > > reconciliation to a phase 2 jira and explicitly create new jiras for the > > later phases of work. > > > > Thanks, > > Ethan > > > > On Tue, Jun 17, 2025 at 5:23 PM Tsz-Wo Nicholas Sze <szets...@gmail.com> > > wrote: > > > > > One more thing: how about we call all the completed work Phase 1 and > the > > > work after merge Phase 2? Then, we should update HDDS-10239 to > describe > > > what was done in Phase 1 and move the unresolved JIRAs to a new > umbrella > > > Phase 2 JIRA. > > > > > > Tsz-Wo > > > > > > > > > On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze < > szets...@gmail.com > > > > > > wrote: > > > > > > > +1 > > > > Thanks for the great work! The feature is going to be very useful. > > Some > > > > questions/comments/suggestions below. > > > > > > > > "Healthy" related: > > > > - What does ChunkMerkleTree.isHealthy == false means? Checksum > > > > mismatched? Missing? > > > > - Rename ScanResult.isHealthy() to hasErrors() since all the > (non-test) > > > > calls are !isHealthy(). > > > > * We should avoid the term "healthy". It is not clear what it means > > > when > > > > isHealthy==false. > > > > It is better to have "boolean isChecksumMatched", "boolean > hasError" > > > > or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING, > > > IO_EXCEPTION} > > > > > > > > Others: > > > > - Why is ContainerChecksumTreeManager created in both > > > BlockDeletingService > > > > and OzoneContainer? Should they share it? > > > > - Use ContainerID instead of Long. > > > > - Add javadoc for all the maps using Long. Describe what it is for > > (e.g. > > > > block id). > > > > - When converting a checksum (long) to a string, use hexadecimal. > > > > - Avoid Optional, which is slow and generates garbage. > > > > - Fix conflicts with the master. > > > > - Use proto 3 for new protos (but it may be hard to do.) > > > > > > > > Tsz-Wo > > > > > > > > > > > > On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <er...@apache.org> wrote: > > > > > > > >> Based on discussion in the community sync this week I want to add > some > > > >> more > > > >> information. > > > >> Code > > > >> > > > >> For those interested in checking out the code, these are some of the > > > major > > > >> classes to start with: > > > >> > > > >> - ReconcileContainerTask: This is the command on the datanode > that > > is > > > >> received from SCM to reconcile a container with a datanode’s > peers. > > > It > > > >> passes through the ReplicationSupervisor just like replication > and > > > >> reconstruction commands. > > > >> - ContainerProtos.ContainerChecksumInfo: This is the proto format > > of > > > >> the > > > >> new file that is written into the containers with the merkle tree > > and > > > >> list > > > >> of deleted blocks. > > > >> - ContainerMerkleTreeWriter: This class is used to build merkle > > trees > > > >> chunk by chunk and generate a protobuf representation of the > tree. > > > >> - ContainerChecksumTreeManager: This class coordinates reads and > > > writes > > > >> of ContainerChecksumInfo for containers. The diff method > determines > > > >> which repairs should be done on a container based on a peer’s > > merkle > > > >> tree. > > > >> - KeyValueContainerCheck#scanData: This is the existing method > > called > > > >> by > > > >> the background and on-demand container data scanners to scan a > > > >> container. > > > >> It has been updated to build the merkle tree as it runs. > > > >> - KeyValueHandler#reconcileContainer: This method updates the > > > container > > > >> based on the peer’s replica. > > > >> - Major tests for reconciliation have been added to > > > >> TestContainerCommandReconciliation (integration test) and > > > >> TestContainerReconciliationWithMockDatanodes (unit test with > mocked > > > >> clients). > > > >> - There are more tasks under the reconciliation jira to expand > > the > > > >> types of faults being tested. > > > >> > > > >> Logging > > > >> > > > >> Logging was added on the datanodes to track reconciliation as it is > > > >> happening. The datanode application log will print a summary of > > messages > > > >> like this: > > > >> > > > >> 2025-06-10 20:13:14,570 [main] INFO keyvalue.KeyValueHandler > > > >> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning > > > >> reconciliation for container 100 with peer > > > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data > > > >> checksum is dcce847d > > > >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler > > > >> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100 > > > >> reconciled with peer > > > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum > > > >> updated from dcce847d to 16189e0b. > > > >> Missing blocks repaired: 5/5 > > > >> Missing chunks repaired: 0/0 > > > >> Corrupt chunks repaired: 10/10 > > > >> Time taken: 19 ms > > > >> 2025-06-10 20:13:14,589 [main] WARN keyvalue.KeyValueHandler > > > >> (KeyValueHandler.java:reconcileContainer(1704)) - Completed > > > >> reconciliation for container 100 with 1/1 peers. 15 blocks were > > > >> updated. Data checksum updated from dcce847d to 16189e0b > > > >> > > > >> This shows: > > > >> > > > >> - Reconciliation started between this datanode and one other peer > > for > > > >> container 100 > > > >> - After reconciliation with the peer completed, the data checksum > > of > > > >> our > > > >> container was updated > > > >> - Compared to this peer, we needed to ingest 5 missing blocks and > > > >> repair > > > >> 10 corrupt chunks. All operations were successful > > > >> - At the end we get a summary of how many changes were done to > this > > > >> container after consulting all the peers in the reconcile > request. > > In > > > >> this > > > >> case there was only one peer. > > > >> By enabling debug logging we can see the individual blocks and > > chunks > > > >> that were repaired as well. > > > >> > > > >> In the dn-container.log file, dataChecksum is now included for every > > log > > > >> line. We also get one new line in this log every time the checksum > > for a > > > >> container is updated. > > > >> > > > >> In case logs roll off, a debug tool to inspect container’s checksum > > > >> information locally on a datanode will be implemented in HDDS-13239 > > > >> <https://issues.apache.org/jira/browse/HDDS-13239>. > > > >> Metrics > > > >> > > > >> The metrics for reconciliation tasks are available as a part of > > > >> ReplicationSupervisor class which includes: > > > >> > > > >> - numRequestedContainerReconciliations - Number of reconciliation > > > tasks > > > >> - numQueuedContainerReconciliations - Number of queued tasks > > > >> - numTimeoutContainerReconciliations - Number of timed-out tasks > > > >> - numSuccessContainerReconciliations- Number of Success > > > >> - numFailureContainerReconciliations - Number of Failures > > > >> - numSkippedContainerReconciliations - Number of Skipped Tasks > > > >> > > > >> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics > > for > > > >> ReconcileContainerCommandHandler: > > > >> > > > >> - TotalRunTimeMs - The total runtime of the command handler in > > > >> milliseconds > > > >> - AvgRunTimeMs - Average run time of the command handler in > > > >> milliseconds > > > >> - QueueWaitingTaskCount - The number of queued tasks waiting for > > > >> execution > > > >> - InvocationCount - The number of times the command handler has > > been > > > >> invoked > > > >> - CommandReceivedCount - The number of received SCM commands for > > each > > > >> command type > > > >> > > > >> Other container reconciliation-related tasks are encapsulated in > > > >> ContainerMerkleTreeMetrics: > > > >> > > > >> - numMerkleTreeWriteFailure - Number of Merkle tree write failure > > > >> - numMerkleTreeReadFailure - Number of Merkle tree read failure > > > >> - numMerkleTreeDiffFailure - Number of Merkle tree diff failure > > > >> - numNoRepairContainerDiff - Number of container diff that > doesn’t > > > >> require repair > > > >> - numRepairContainerDiff - Number of container diff that require > > > repair > > > >> - merkleTreeWriteLatencyNS- Merkle tree write latency > > > >> - merkleTreeReadLatencyNS - Merkle tree read latency > > > >> - merkleTreeCreateLatencyNS - Merkle tree creation latency > > > >> - merkleTreeDiffLatencyNS - Merkle tree diff latency > > > >> > > > >> > > > >> Thanks for reviewing > > > >> > > > >> Ethan > > > >> > > > > > > > > > >