Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Tsz-Wo Nicholas Sze Wed, 18 Jun 2025 10:37:30 -0700

Hi Ethan,

Thanks for addressing my comments!


> ... Datanode code consistently identified container IDs as long values.
We can shift to using ContainerID in the datanode as well, but that would
be a

   1. change outside of reconciliation.

   We have an umbrella JIRA HDDS-12769
   <https://issues.apache.org/jira/browse/HDDS-12769>"Use ContainerID
   instead of Long in datanode".  For the new code/new data structures, please
   use ContainerID unless it is hard to do.


Look forward to merging the branch.

Tsz-Wo

On Wed, Jun 18, 2025 at 7:47 AM Ethan Rose <er...@apache.org> wrote:

> Thanks for reviewing the code Nicholas.
>
> For the “healthy” term, I agree it is not specific enough. In this PR we
> are updating it to checksumMatches in the merkle tree proto. We can update
> the term in the ScanResult in a follow-up since it is not touching the
> proto. Cases where the block is missing or unreadable will not be able to
> generate a checksum and chunk will be considered missing from the merkle
> tree and need repair anyways, so other states like IO_EXCEPTION shouldn’t
> be necessary.
>
> The BlockDeletingService is supposed to share the same
> ContainerChecksumTreeManager as the rest of the code and it does as far as
> I can tell. If you are referring to this line
> <
> https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java#L76
> >,
> that is a test-only constructor.
>
> ContainerID and its corresponding proto are only used in SCM code right
> now. Datanode code consistently identified container IDs as long values. We
> can shift to using ContainerID in the datanode as well, but that would be a
> change outside of reconciliation.
>
> For maps with Long keys, I’ve identified a few instances where we can
> document it better:
>
>    - Return values from ContainerDiffReport
>    - this field
>    <
> https://github.com/apache/ozone/blob/a355664093c634c3d04d0601b5c0302260a44c6c/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java#L46
> >
>    in ContainerMerkleTreeWriter
>    - Did you have other instances in mind?
>
> Checksum longs should be converted to hex strings in all log and json
> output using HddsUtils#checksumToString
> <
> https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java#L865
> >.
> If there’s spots we missed let us know and we can fix it. When we start
> looking at EC reconciliation and a metadata checksum is also added to
> containers, it is likely we will end up with one object wrapping all of a
> container’s checksums and its toString could handle this.
>
> Most optional usage that I’m aware of is being removed in this PR
> <https://github.com/apache/ozone/pull/8565>.
> ContainerChecksumTreeManager#readChecksumInfo still has it after that
> change, but we could remove it as part of HDDS-12824
> <https://issues.apache.org/jira/browse/HDDS-12824> which is the next
> change
> that will touch this method. I’m not sure what the preferred replacement
> here is. @Nullable?
>
> We will do a reverse merge of master into the branch before merging the
> branch back to master to resolve conflicts, instead of having the conflict
> resolution as part of the final merge commit to master. Our last merge from
> master was Jun 9, but things change fast.
>
> Proto 3 migration is a bigger effort outside of reconciliation. Currently
> all proto files use syntax = "proto2" , and I guess our options are to
> either increase that to “proto3” (which affects all protos, not just new
> ones), or put new protos in a different file with "proto3" syntax (for
> which there is currently no existing convention for).
>
> For tracking the phases of work, the initial plan was that phase 1 was for
> ratis reconciliation, phase 2 was for EC reconciliation, and phase 3 was
> for replication manager integration. If people think it is easier we can
> move the remaining lower priority items after the merge from Ratis
> reconciliation to a phase 2 jira and explicitly create new jiras for the
> later phases of work.
>
> Thanks,
> Ethan
>
> On Tue, Jun 17, 2025 at 5:23 PM Tsz-Wo Nicholas Sze <szets...@gmail.com>
> wrote:
>
> > One more thing:  how about we call all the completed work Phase 1 and the
> > work after merge Phase 2?  Then, we should update HDDS-10239 to describe
> > what was done in Phase 1 and move the unresolved JIRAs to a new umbrella
> > Phase 2 JIRA.
> >
> > Tsz-Wo
> >
> >
> > On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze <szets...@gmail.com
> >
> > wrote:
> >
> > > +1
> > > Thanks for the great work!  The feature is going to be very useful.
> Some
> > > questions/comments/suggestions below.
> > >
> > > "Healthy" related:
> > > - What does ChunkMerkleTree.isHealthy == false means?  Checksum
> > > mismatched?  Missing?
> > > - Rename ScanResult.isHealthy() to hasErrors() since all the (non-test)
> > > calls are !isHealthy().
> > > *  We should avoid the term "healthy".  It is not clear what it means
> > when
> > > isHealthy==false.
> > >   It is better to have "boolean isChecksumMatched", "boolean hasError"
> > >   or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING,
> > IO_EXCEPTION}
> > >
> > > Others:
> > > - Why is ContainerChecksumTreeManager created in both
> > BlockDeletingService
> > > and OzoneContainer?  Should they share it?
> > > - Use ContainerID instead of Long.
> > > - Add javadoc for all the maps using Long.  Describe what it is for
> (e.g.
> > > block id).
> > > - When converting a checksum (long) to a string, use hexadecimal.
> > > - Avoid Optional, which is slow and generates garbage.
> > > - Fix conflicts with the master.
> > > - Use proto 3 for new protos (but it may be hard to do.)
> > >
> > > Tsz-Wo
> > >
> > >
> > > On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <er...@apache.org> wrote:
> > >
> > >> Based on discussion in the community sync this week I want to add some
> > >> more
> > >> information.
> > >> Code
> > >>
> > >> For those interested in checking out the code, these are some of the
> > major
> > >> classes to start with:
> > >>
> > >>    - ReconcileContainerTask: This is the command on the datanode that
> is
> > >>    received from SCM to reconcile a container with a datanode’s peers.
> > It
> > >>    passes through the ReplicationSupervisor just like replication and
> > >>    reconstruction commands.
> > >>    - ContainerProtos.ContainerChecksumInfo: This is the proto format
> of
> > >> the
> > >>    new file that is written into the containers with the merkle tree
> and
> > >> list
> > >>    of deleted blocks.
> > >>    - ContainerMerkleTreeWriter: This class is used to build merkle
> trees
> > >>    chunk by chunk and generate a protobuf representation of the tree.
> > >>    - ContainerChecksumTreeManager: This class coordinates reads and
> > writes
> > >>    of ContainerChecksumInfo for containers. The diff method determines
> > >>    which repairs should be done on a container based on a peer’s
> merkle
> > >> tree.
> > >>    - KeyValueContainerCheck#scanData: This is the existing method
> called
> > >> by
> > >>    the background and on-demand container data scanners to scan a
> > >> container.
> > >>    It has been updated to build the merkle tree as it runs.
> > >>    - KeyValueHandler#reconcileContainer: This method updates the
> > container
> > >>    based on the peer’s replica.
> > >>    - Major tests for reconciliation have been added to
> > >>    TestContainerCommandReconciliation (integration test) and
> > >>    TestContainerReconciliationWithMockDatanodes (unit test with mocked
> > >>    clients).
> > >>       - There are more tasks under the reconciliation jira to expand
> the
> > >>       types of faults being tested.
> > >>
> > >> Logging
> > >>
> > >> Logging was added on the datanodes to track reconciliation as it is
> > >> happening. The datanode application log will print a summary of
> messages
> > >> like this:
> > >>
> > >> 2025-06-10 20:13:14,570 [main] INFO  keyvalue.KeyValueHandler
> > >> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning
> > >> reconciliation for container 100 with peer
> > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
> > >> checksum is dcce847d
> > >> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
> > >> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100
> > >> reconciled with peer
> > >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
> > >> updated from dcce847d to 16189e0b.
> > >> Missing blocks repaired: 5/5
> > >> Missing chunks repaired: 0/0
> > >> Corrupt chunks repaired:  10/10
> > >> Time taken: 19 ms
> > >> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
> > >> (KeyValueHandler.java:reconcileContainer(1704)) - Completed
> > >> reconciliation for container 100 with 1/1 peers. 15 blocks were
> > >> updated. Data checksum updated from dcce847d to 16189e0b
> > >>
> > >> This shows:
> > >>
> > >>    - Reconciliation started between this datanode and one other peer
> for
> > >>    container 100
> > >>    - After reconciliation with the peer completed, the data checksum
> of
> > >> our
> > >>    container was updated
> > >>    - Compared to this peer, we needed to ingest 5 missing blocks and
> > >> repair
> > >>    10 corrupt chunks. All operations were successful
> > >>    - At the end we get a summary of how many changes were done to this
> > >>    container after consulting all the peers in the reconcile request.
> In
> > >> this
> > >>    case there was only one peer.
> > >>    By enabling debug logging we can see the individual blocks and
> chunks
> > >>    that were repaired as well.
> > >>
> > >> In the dn-container.log file, dataChecksum is now included for every
> log
> > >> line. We also get one new line in this log every time the checksum
> for a
> > >> container is updated.
> > >>
> > >> In case logs roll off, a debug tool to inspect container’s checksum
> > >> information locally on a datanode will be implemented in HDDS-13239
> > >> <https://issues.apache.org/jira/browse/HDDS-13239>.
> > >> Metrics
> > >>
> > >> The metrics for reconciliation tasks are available as a part of
> > >> ReplicationSupervisor class which includes:
> > >>
> > >>    - numRequestedContainerReconciliations - Number of reconciliation
> > tasks
> > >>    - numQueuedContainerReconciliations - Number of queued tasks
> > >>    - numTimeoutContainerReconciliations - Number of timed-out tasks
> > >>    - numSuccessContainerReconciliations- Number of Success
> > >>    - numFailureContainerReconciliations - Number of Failures
> > >>    - numSkippedContainerReconciliations - Number of Skipped Tasks
> > >>
> > >> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics
> for
> > >> ReconcileContainerCommandHandler:
> > >>
> > >>    - TotalRunTimeMs - The total runtime of the command handler in
> > >>    milliseconds
> > >>    - AvgRunTimeMs - Average run time of the command handler in
> > >> milliseconds
> > >>    - QueueWaitingTaskCount - The number of queued tasks waiting for
> > >>    execution
> > >>    - InvocationCount - The number of times the command handler has
> been
> > >>    invoked
> > >>    - CommandReceivedCount - The number of received SCM commands for
> each
> > >>    command type
> > >>
> > >> Other container reconciliation-related tasks are encapsulated in
> > >> ContainerMerkleTreeMetrics:
> > >>
> > >>    - numMerkleTreeWriteFailure - Number of Merkle tree write failure
> > >>    - numMerkleTreeReadFailure - Number of Merkle tree read failure
> > >>    - numMerkleTreeDiffFailure - Number of Merkle tree diff failure
> > >>    - numNoRepairContainerDiff - Number of container diff that doesn’t
> > >>    require repair
> > >>    - numRepairContainerDiff - Number of container diff that require
> > repair
> > >>    - merkleTreeWriteLatencyNS- Merkle tree write latency
> > >>    - merkleTreeReadLatencyNS - Merkle tree read latency
> > >>    - merkleTreeCreateLatencyNS - Merkle tree creation latency
> > >>    - merkleTreeDiffLatencyNS - Merkle tree diff latency
> > >>
> > >>
> > >> Thanks for reviewing
> > >>
> > >> Ethan
> > >>
> > >
> >
>

Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Reply via email to