Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Ethan Rose Wed, 18 Jun 2025 07:47:42 -0700

Thanks for reviewing the code Nicholas.

For the “healthy” term, I agree it is not specific enough. In this PR we
are updating it to checksumMatches in the merkle tree proto. We can update
the term in the ScanResult in a follow-up since it is not touching the
proto. Cases where the block is missing or unreadable will not be able to
generate a checksum and chunk will be considered missing from the merkle
tree and need repair anyways, so other states like IO_EXCEPTION shouldn’t
be necessary.


The BlockDeletingService is supposed to share the same
ContainerChecksumTreeManager as the rest of the code and it does as far as
I can tell. If you are referring to this line
<https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/impl/BlockDeletingService.java#L76>,
that is a test-only constructor.

ContainerID and its corresponding proto are only used in SCM code right
now. Datanode code consistently identified container IDs as long values. We
can shift to using ContainerID in the datanode as well, but that would be a
change outside of reconciliation.

For maps with Long keys, I’ve identified a few instances where we can
document it better:

   - Return values from ContainerDiffReport
   - this field
   
<https://github.com/apache/ozone/blob/a355664093c634c3d04d0601b5c0302260a44c6c/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/checksum/ContainerMerkleTreeWriter.java#L46>
   in ContainerMerkleTreeWriter
   - Did you have other instances in mind?

Checksum longs should be converted to hex strings in all log and json
output using HddsUtils#checksumToString
<https://github.com/apache/ozone/blob/2b4708b70e5c9de133381a38ca7fa4b3cf3caa42/hadoop-hdds/common/src/main/java/org/apache/hadoop/hdds/HddsUtils.java#L865>.
If there’s spots we missed let us know and we can fix it. When we start
looking at EC reconciliation and a metadata checksum is also added to
containers, it is likely we will end up with one object wrapping all of a
container’s checksums and its toString could handle this.

Most optional usage that I’m aware of is being removed in this PR
<https://github.com/apache/ozone/pull/8565>.
ContainerChecksumTreeManager#readChecksumInfo still has it after that
change, but we could remove it as part of HDDS-12824
<https://issues.apache.org/jira/browse/HDDS-12824> which is the next change
that will touch this method. I’m not sure what the preferred replacement
here is. @Nullable?

We will do a reverse merge of master into the branch before merging the
branch back to master to resolve conflicts, instead of having the conflict
resolution as part of the final merge commit to master. Our last merge from
master was Jun 9, but things change fast.

Proto 3 migration is a bigger effort outside of reconciliation. Currently
all proto files use syntax = "proto2" , and I guess our options are to
either increase that to “proto3” (which affects all protos, not just new
ones), or put new protos in a different file with "proto3" syntax (for
which there is currently no existing convention for).

For tracking the phases of work, the initial plan was that phase 1 was for
ratis reconciliation, phase 2 was for EC reconciliation, and phase 3 was
for replication manager integration. If people think it is easier we can
move the remaining lower priority items after the merge from Ratis
reconciliation to a phase 2 jira and explicitly create new jiras for the
later phases of work.

Thanks,
Ethan

On Tue, Jun 17, 2025 at 5:23 PM Tsz-Wo Nicholas Sze <[email protected]>
wrote:

> One more thing:  how about we call all the completed work Phase 1 and the
> work after merge Phase 2?  Then, we should update HDDS-10239 to describe
> what was done in Phase 1 and move the unresolved JIRAs to a new umbrella
> Phase 2 JIRA.
>
> Tsz-Wo
>
>
> On Tue, Jun 17, 2025 at 12:17 PM Tsz-Wo Nicholas Sze <[email protected]>
> wrote:
>
> > +1
> > Thanks for the great work!  The feature is going to be very useful.  Some
> > questions/comments/suggestions below.
> >
> > "Healthy" related:
> > - What does ChunkMerkleTree.isHealthy == false means?  Checksum
> > mismatched?  Missing?
> > - Rename ScanResult.isHealthy() to hasErrors() since all the (non-test)
> > calls are !isHealthy().
> > *  We should avoid the term "healthy".  It is not clear what it means
> when
> > isHealthy==false.
> >   It is better to have "boolean isChecksumMatched", "boolean hasError"
> >   or enum HealthType {NO_ERROR, CHECKSUM_MISMATCHED, MISSING,
> IO_EXCEPTION}
> >
> > Others:
> > - Why is ContainerChecksumTreeManager created in both
> BlockDeletingService
> > and OzoneContainer?  Should they share it?
> > - Use ContainerID instead of Long.
> > - Add javadoc for all the maps using Long.  Describe what it is for (e.g.
> > block id).
> > - When converting a checksum (long) to a string, use hexadecimal.
> > - Avoid Optional, which is slow and generates garbage.
> > - Fix conflicts with the master.
> > - Use proto 3 for new protos (but it may be hard to do.)
> >
> > Tsz-Wo
> >
> >
> > On Tue, Jun 10, 2025 at 5:57 PM Ethan Rose <[email protected]> wrote:
> >
> >> Based on discussion in the community sync this week I want to add some
> >> more
> >> information.
> >> Code
> >>
> >> For those interested in checking out the code, these are some of the
> major
> >> classes to start with:
> >>
> >>    - ReconcileContainerTask: This is the command on the datanode that is
> >>    received from SCM to reconcile a container with a datanode’s peers.
> It
> >>    passes through the ReplicationSupervisor just like replication and
> >>    reconstruction commands.
> >>    - ContainerProtos.ContainerChecksumInfo: This is the proto format of
> >> the
> >>    new file that is written into the containers with the merkle tree and
> >> list
> >>    of deleted blocks.
> >>    - ContainerMerkleTreeWriter: This class is used to build merkle trees
> >>    chunk by chunk and generate a protobuf representation of the tree.
> >>    - ContainerChecksumTreeManager: This class coordinates reads and
> writes
> >>    of ContainerChecksumInfo for containers. The diff method determines
> >>    which repairs should be done on a container based on a peer’s merkle
> >> tree.
> >>    - KeyValueContainerCheck#scanData: This is the existing method called
> >> by
> >>    the background and on-demand container data scanners to scan a
> >> container.
> >>    It has been updated to build the merkle tree as it runs.
> >>    - KeyValueHandler#reconcileContainer: This method updates the
> container
> >>    based on the peer’s replica.
> >>    - Major tests for reconciliation have been added to
> >>    TestContainerCommandReconciliation (integration test) and
> >>    TestContainerReconciliationWithMockDatanodes (unit test with mocked
> >>    clients).
> >>       - There are more tasks under the reconciliation jira to expand the
> >>       types of faults being tested.
> >>
> >> Logging
> >>
> >> Logging was added on the datanodes to track reconciliation as it is
> >> happening. The datanode application log will print a summary of messages
> >> like this:
> >>
> >> 2025-06-10 20:13:14,570 [main] INFO  keyvalue.KeyValueHandler
> >> (KeyValueHandler.java:reconcileContainer(1595)) - Beginning
> >> reconciliation for container 100 with peer
> >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Current data
> >> checksum is dcce847d
> >> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
> >> (KeyValueHandler.java:reconcileContainer(1681)) - Container 100
> >> reconciled with peer
> >> bbc09073-ac0d-4b2f-afe4-1de5f9dc6f43(dn3/237.6.76.4). Data checksum
> >> updated from dcce847d to 16189e0b.
> >> Missing blocks repaired: 5/5
> >> Missing chunks repaired: 0/0
> >> Corrupt chunks repaired:  10/10
> >> Time taken: 19 ms
> >> 2025-06-10 20:13:14,589 [main] WARN  keyvalue.KeyValueHandler
> >> (KeyValueHandler.java:reconcileContainer(1704)) - Completed
> >> reconciliation for container 100 with 1/1 peers. 15 blocks were
> >> updated. Data checksum updated from dcce847d to 16189e0b
> >>
> >> This shows:
> >>
> >>    - Reconciliation started between this datanode and one other peer for
> >>    container 100
> >>    - After reconciliation with the peer completed, the data checksum of
> >> our
> >>    container was updated
> >>    - Compared to this peer, we needed to ingest 5 missing blocks and
> >> repair
> >>    10 corrupt chunks. All operations were successful
> >>    - At the end we get a summary of how many changes were done to this
> >>    container after consulting all the peers in the reconcile request. In
> >> this
> >>    case there was only one peer.
> >>    By enabling debug logging we can see the individual blocks and chunks
> >>    that were repaired as well.
> >>
> >> In the dn-container.log file, dataChecksum is now included for every log
> >> line. We also get one new line in this log every time the checksum for a
> >> container is updated.
> >>
> >> In case logs roll off, a debug tool to inspect container’s checksum
> >> information locally on a datanode will be implemented in HDDS-13239
> >> <https://issues.apache.org/jira/browse/HDDS-13239>.
> >> Metrics
> >>
> >> The metrics for reconciliation tasks are available as a part of
> >> ReplicationSupervisor class which includes:
> >>
> >>    - numRequestedContainerReconciliations - Number of reconciliation
> tasks
> >>    - numQueuedContainerReconciliations - Number of queued tasks
> >>    - numTimeoutContainerReconciliations - Number of timed-out tasks
> >>    - numSuccessContainerReconciliations- Number of Success
> >>    - numFailureContainerReconciliations - Number of Failures
> >>    - numSkippedContainerReconciliations - Number of Skipped Tasks
> >>
> >> Latency/Count metrics for the tasks exposed by CommandHandlerMetrics for
> >> ReconcileContainerCommandHandler:
> >>
> >>    - TotalRunTimeMs - The total runtime of the command handler in
> >>    milliseconds
> >>    - AvgRunTimeMs - Average run time of the command handler in
> >> milliseconds
> >>    - QueueWaitingTaskCount - The number of queued tasks waiting for
> >>    execution
> >>    - InvocationCount - The number of times the command handler has been
> >>    invoked
> >>    - CommandReceivedCount - The number of received SCM commands for each
> >>    command type
> >>
> >> Other container reconciliation-related tasks are encapsulated in
> >> ContainerMerkleTreeMetrics:
> >>
> >>    - numMerkleTreeWriteFailure - Number of Merkle tree write failure
> >>    - numMerkleTreeReadFailure - Number of Merkle tree read failure
> >>    - numMerkleTreeDiffFailure - Number of Merkle tree diff failure
> >>    - numNoRepairContainerDiff - Number of container diff that doesn’t
> >>    require repair
> >>    - numRepairContainerDiff - Number of container diff that require
> repair
> >>    - merkleTreeWriteLatencyNS- Merkle tree write latency
> >>    - merkleTreeReadLatencyNS - Merkle tree read latency
> >>    - merkleTreeCreateLatencyNS - Merkle tree creation latency
> >>    - merkleTreeDiffLatencyNS - Merkle tree diff latency
> >>
> >>
> >> Thanks for reviewing
> >>
> >> Ethan
> >>
> >
>

Re: [VOTE] Merge branch HDDS-10239-container-reconciliation into master

Reply via email to