Thanks Anthony

Changing the scheduler will require restarting all OSDs, right?
Using "ceph orch restart osd".

Is this done in a staggered manner, or do I need to "stagger" them myself?
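Something like the following is what I had in mind for staggering them myself, one OSD daemon at a time with a wait in between (the OSD ids are placeholders, and the health check is a crude example, not a robust settle test):

```shell
# Staggered restart sketch: one OSD daemon at a time, waiting for
# peering/degraded states to clear before moving on.
# OSD ids below are placeholders for the real ones.
for id in 0 1 2; do
    ceph orch daemon restart osd.$id
    sleep 30
    # crude settle check; adapt to your environment
    while ceph health detail | grep -Eq 'peering|degraded'; do
        sleep 30
    done
done
```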

Steven



On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net> wrote:

> What you describe sounds like expected behavior.  It’s a feature!
>
> Since … Nautilus I think, you or the autoscaler sets pg_num and the
> cluster gradually steps up pgp_num until it matches.
>
> Increasing pg_num means splitting PGs, which in turn perturbs the inputs
> to the CRUSH hash function, so data moves: backfill.
>
> Moving data on HDDs isn’t fast, especially with EC.  These are all random,
> fragmented writes, so model 70 MB/s to a given drive.
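Back-of-envelope, that per-drive figure lets you estimate how long a given amount of backfill should take. A sketch, where the 50 TiB pool size and the aggregate rate are made-up example numbers, not measurements from your cluster:

```python
# Back-of-envelope backfill-time estimate based on ~70 MB/s of random,
# fragmented writes per HDD. Pool size and drive count below are
# hypothetical example numbers.

def backfill_days(data_tib, per_drive_mibps=70, drives=6):
    """Days to backfill data_tib TiB across `drives` HDDs, each
    sustaining per_drive_mibps MiB/s of backfill writes."""
    total_mib = data_tib * 1024 * 1024        # TiB -> MiB
    seconds = total_mib / (per_drive_mibps * drives)
    return seconds / 86400

# ~50 TiB at ~300 MiB/s aggregate (in line with the 250-350 MiB/s
# recovery traffic mentioned elsewhere in this thread) is on the
# order of 2 days:
print(round(backfill_days(50, 300, 1), 1))   # -> 2.0
```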
>
> > As expected, the backfilling started ...and it never ended ...even now
> > after more than 1 week I still have about 29 pgs backfilling and 13
> > backfilling_wait
>
> Back pre-Nautilus this would have been a thundering herd of backfill.  You
> don’t know how good we have it now ;)
>
> > What worries me is that the number of backfilling PGs varies very little
> > over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
> > between 250 and 350 MiB/s
>
> The number of PGs backfillING at any given time is a function of multiple
> things, including the value of osd_max_backfills.
> EC means each write ties up 6 drives, so there’s a bit more gridlock
> compared to replicated pools.
>
> >
> > The "recovery" seems to be doing something ( but number of objects remain
> > the same )
>
> The number of objects, or the number of *misplaced/remapped* objects?
>
> Is it showing *keys* per second?  RGW stores a lot of omap data in RocksDB.
>
> > Since the recovery should run over the cluster network and the amount of
> > data in the pool is not huge, I am not sure why it takes so many days -
> > it seems stuck actually
>
> Have you reverted to the wpq scheduler?
>
> osd_op_queue = wpq
> osd_mclock_override_recovery_settings
>
> You can also increase the value of osd_max_backfills
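Concretely, the revert and the backfill tuning would look something like this; the value 3 is an arbitrary example, and note that the scheduler change only takes effect once the OSDs restart:

```shell
# Revert the op scheduler to wpq cluster-wide; requires an OSD
# restart to take effect.
ceph config set osd osd_op_queue wpq
ceph orch restart osd

# Or, staying on mclock, allow the recovery settings to be overridden:
ceph config set osd osd_mclock_override_recovery_settings true

# Then raise backfill concurrency (3 is an arbitrary example value):
ceph config set osd osd_max_backfills 3
```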
>
>
> > The only strange thing I noticed is a discrepancy between the number of
> > PGs and PGPs (pg_num and pgp_num) that the pool currently has ...and what
> > autoscale-status says
>
> It’s in the process of doing what you asked.
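You can watch pgp_num step up toward pg_num directly; "mypool" below is a placeholder for your pool name:

```shell
# Compare the pool's current PG counts with the autoscaler's view.
ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num
ceph osd pool autoscale-status
```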
>
> >
> > Any help / suggestions would be very appreciated
> >
> > What I have tried so for :
> >     increase recovery speed ( by changing mclock profile to
> > "high_recovery_ops"  and overriding various parameters)
> >     (recovery_max_active, recovery_max_active_hdd ... etc)
>
> If the default mclock scheduler is enabled, that has issues for some
> deployments. There are code improvements in the works, but for now I
> suggest reverting to wpq.
>
> >
> >     redeploying some of the OSDs that were "UP_PRIMARY but part of the
> > backfill_wait PGs
>
> Redeploying OSDs isn’t often called for, and can chum the waters.  It also
> adds a lot of backfill/recovery to what you already have going on.
>
> If you want a gentle goose when things seem stuck, you can try
>
>         ceph osd down XXX
>
> for the lead OSD of a given PG, one at a time
>
> or
>         ceph pg repeer xx.yyyy
>
> >
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
