Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Tsz-Wo Nicholas Sze Mon, 11 Aug 2025 10:18:15 -0700

Hi Gargi,

Thanks a lot for the detailed explanation!


> ... SCM's* *DiskBalancerManager* provides essential *centralized
coordination* ...

Why does it need centralized coordination?  Put it in another way, why does
balancing the disks in datanode A have anything to do with datanode B?

It seems that DiskBalancer should be similar to the data scanner.  It
should just keep working locally.

> DiskBalancerTask explicitly acquires a container read lock before any
file operations to prevent concurrent container state modifications like
block deletion.

But the DiskBalancerTask itself will delete the blocks when it is holding
only the readlock.  This seems like a problem as below:
1) DiskBalancerTask has acquired and started copying a container from an
old dir to a new dir.
2) A reader has retrieved the old dir file path of a block but has not yet
opened the file.
3) DiskBalancerTask has finished copying.
4) DiskBalancerTask has deleted the old dir.
5) The reader opens the file using the old dir file path and gets
FileNotFoundException.

Tsz-Wo


On Mon, Aug 11, 2025 at 5:47 AM Gargi Jaiswal
<[email protected]> wrote:

> Hi Tsz Wo Sze,
> Thank you for the review!
> I have tried to answer a few of the questions, hoping it is clear enough.
>
> *- Why does SCM need a DiskBalancerManager?  Should the DiskBalancerService
> just be a local service in Datanode?*
> *SCM's* *DiskBalancerManager* provides essential *centralized
> coordination *that
> local DataNode services cannot achieve. SCM can identify which DataNodes
> most urgently need disk balancing by calculating volume density sums and
> ranking them.
>
> In HDFS Disk Balancer, the schedule is split into two steps: PLAN and
> EXECUTE. Users first get the balance plan, then according to the plan, the
> requests are sent to datanodes to do the real balance job. These are two
> commands for the users to run, which is a little complex. So in Ozone,
> Client
> will retrieve the node information from SCM and decide the balance of
> which node or the whole cluster.
> Single point to start/stop/configure disk balancing across multiple
> DataNodes simultaneously.
> *DataNode:* Handles actual container movement (local execution)
> *SCM:* Provides coordination and cluster-wide management.
>
> *- In DiskBalancerTask, it holds the container read lock.  Would it be
> possible that a file gets deleted when another thread is reading it and
> results in an exception?*
> No, this scenario won't happen.
>
>    - DiskBalancerTask explicitly acquires a container read lock before any
>    file operations to prevent concurrent container state modifications like
>    block deletion.
>    - All deletion operations require a write lock on the container
>    (BlockDeletingTask, markContainerForDelete, etc.).
>    - ReadWriteLock semantics guarantee mutual exclusion between read and
>    write locks - while DiskBalancerTask holds the read lock during copy
>    operations, no deletion thread can acquire the required write lock.
>    - The copy operation (copyContainerData) safely reads all container
>    files under the protection of the read lock, ensuring no concurrent
>    deletions can interfere.
>
> The locking design specifically addresses this concern, as evidenced by the
> comment in the code: 'hold read lock on the container first, to avoid other
> threads to update the container state, such as block deletion.'
>
> *- In DefaultContainerChoosingPolicy, it has a CACHE for caching container
> iterators.  Would it cause ConcurrentModificationException if the
> underlying containerMap is modified?*
> *No*, CACHE will NOT cause ConcurrentModificationException when the
> underlying containerMap is modified. Here's why:
>
>    - The underlying *containerMap* is a *ConcurrentSkipListMap*, which
>    provides fail-safe iterators that don't throw
>    ConcurrentModificationException.
>    - The *cached iterator* operates on a snapshot created by the stream
>    operations (filter + sort), not directly on the live map. This snapshot
> is
>    immune to concurrent modifications.
>    - The existing test
> *testContainerDeletionAfterIteratorGeneration()* specifically
>    validates this scenario by removing containers after iterator creation
> and
>    confirms no exceptions occur.
>    - The cache invalidates exhausted iterators, ensuring fresh snapshots
>    for future requests.
>
> * - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is
> called multiple times so it can return different values.*
> Yes, a valid concurrency issue can happen. The current implementation under
> the *ReentrantLock*  prevents concurrent calls to *chooseVolume()* but
> doesn't prevent the underlying usage values from changing during execution.
> *Agreed, this part needs to be fixed.*
>
> *- Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
> StorageVolume.TMP_DIR_NAME?  If yes, use StorageVolume.TMP_DIR_NAME instead
> of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.*
>  Yes, they both use the same value "tmp".
>
> *Thank you for pointing to the below refactorings, we will do the changes.*
> - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of
> DatanodeDetailsProto
>
> - Remove DatanodeDiskBalancerInfoType since it is never used.
>
> - Remove the new Container.importContainerData(Path containerPath) method
> since it is never used.
>
> - Rename Container.copyContainerData to copyContainerDirectory.  Otherwise,
> it sounds like copy/import the ContainerData class.
>
> - In DiskBalancerService, use ContainerID instead of Long for container id.
>
> - Do not use Optional for parameters; see
>
> https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/
>
> - Use Objects.requireNonNull(..) instead of Preconditions.checkNotNull(..).
>
> On Sat, Aug 9, 2025 at 12:19 AM Tsz Wo Sze <[email protected]> wrote:
>
> > Hi Sammi and others,
> >
> > Thanks for working on the Disk Balancer feature!  It is going to be very
> > useful.
> >
> > Some comments:
> >
> > - Why does SCM need a DiskBalancerManager?  Should the
> DiskBalancerService
> > just be a local service in Datanode?
> >
> > - In DiskBalancerTask, it holds the container read lock.  Would it be
> > possible that a file gets deleted when another thread is reading it and
> > results in an exception?
> >
> > - In DefaultContainerChoosingPolicy, it has a CACHE for caching container
> > iterators.  Would it cause ConcurrentModificationException if the
> > underlying containerMap is modified?
> >
> > - In DefaultVolumeChoosingPolicy.chooseVolume(..), getCurrentUsage() is
> > called multiple times so it can return different values.
> >
> > - Is DiskBalancerService.DISK_BALANCER_TMP_DIR the same as
> > StorageVolume.TMP_DIR_NAME?  If yes, use StorageVolume.TMP_DIR_NAME
> instead
> > of adding DiskBalancerService.DISK_BALANCER_TMP_DIR.
> >
> > - Do not throw RuntimeException -- how to handle it when it happens?
> >
> > - Since this is a new feature, how about not supporting SCHEMA_V1/V2 ?
> If
> > we want to support them, we should add some tests.
> >
> > - DatanodeDiskBalancerInfoProto should use DatanodeIDProto instead of
> > DatanodeDetailsProto
> >
> > - Remove DatanodeDiskBalancerInfoType since it is never used.
> >
> > - Remove the new Container.importContainerData(Path containerPath) method
> > since it is never used.
> >
> > - Rename Container.copyContainerData to copyContainerDirectory.
> Otherwise,
> > it sounds like copy/import the ContainerData class.
> >
> > - In DiskBalancerService, use ContainerID instead of Long for container
> id.
> >
> > - Do not use Optional for parameters; see
> >
> >
> https://www.reddit.com/r/java/comments/sat1j4/opinions_on_using_optional_as_parameter/
> >
> > - Use Objects.requireNonNull(..) instead of
> Preconditions.checkNotNull(..).
> >
> > Regards,
> > Tsz-Wo
> >
> >
> > On Fri, Aug 8, 2025 at 9:57 AM Ethan Rose <[email protected]> wrote:
> >
> > > Hi Sammi, thanks for working on this feature.
> > >
> > > There should not be any incompatible changes introduced with this
> > feature.
> > >
> > > This section needs more information because it only talks about a
> feature
> > > flag, which is orthogonal to any upgrade/downgrade concerns because
> > feature
> > > flags do not prevent downgrades like finalization does.
> > >
> > >    - What happens if the feature flag is enabled, and later disabled?
> > >    - If an in-progress container move is interrupted by a downgrade,
> what
> > >    does the old version do?
> > >    - Are components persisting any state about the ongoing operations,
> > and
> > >    if so, what happens to that during a downgrade then upgrade flow?
> > >
> > > The disk balancer has a feature flag that currently defaults to false.
> In
> > > this
> > > comment <
> > https://github.com/apache/ozone/pull/8869#issuecomment-3130009094
> > > >
> > > I explained why feature flags for features that are only triggered by
> > > admins are redundant. I’ve heard other devs from HDFS also express
> > concern
> > > over everything in that project being behind feature flags and trying
> to
> > > avoid sending Ozone down that path as well. IMO if we want to indicate
> > that
> > > this is not ready for prod use, it would be better to hide the CLI
> > command
> > > <https://picocli.info/quick-guide.html#_hidden_options_and_parameters>
> > and
> > > add a disclaimer to the help message indicating this. The current
> config
> > > flag approach provides no user facing information as to why it is
> > disabled
> > > by default or what the implications of enabling it are. Additionally,
> why
> > > is this set tofalse by default but considered ready to merge with
> master?
> > >
> > > Ethan
> > >
> > > On Fri, Aug 8, 2025 at 4:53 AM Sammi Chen <[email protected]>
> wrote:
> > >
> > > > Hi Ozone developers,
> > > >
> > > > I would like to propose merging the HDDS-5713 Disk Balancer feature
> > > branch
> > > > into master.
> > > >
> > > > HDDS-5713 adds the support of the disk volume utilization balancer
> > > function
> > > > by selecting and moving containers from high utilized data volume to
> > > lower
> > > > utilized data volume, to achieve an all disk volumes' utilization
> even
> > > > state.
> > > >
> > > >
> > > > Feature Jira Link:
> > > > https://issues.apache.org/jira/browse/HDDS-5713
> > > >
> > > >
> > > > Checklist for feature branch merge:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/OZONE/Disk+Balancer+For+Datanode+-+HDDS-5713
> > > > >
> > > >
> > > >
> > > > This vote will be open for at least a week.
> > > >
> > > > Thanks,
> > > > Sammi Chen
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > >
> >
>

Re: [VOTE] Merge HDDS-5713 Disk Balancer into master

Reply via email to