Hi,
changing the scheduler requires an OSD restart, and by default that is
done in a staggered manner. So the command you mentioned will do that
for you.
https://docs.clyso.com/blog/2023/03/22/ceph-how-do-disable-mclock-scheduler/
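For reference, a minimal sketch of the switch on a cephadm-managed cluster (assuming you want wpq, per the blog post above):

```shell
# Persist the scheduler choice for all OSDs, then let the
# orchestrator restart them; restarts are staggered by default.
ceph config set osd osd_op_queue wpq
ceph orch restart osd
```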
Quoting Anthony D'Atri <a...@dreamsnake.net>:
I don’t *think* OSD restarts are necessary.
On Jul 11, 2025, at 1:05 PM, Steven Vacaroaia <ste...@gmail.com> wrote:
Thanks Anthony
changing the scheduler will require restarting all OSDs, right,
using "ceph orch restart osd"?
Is this done in a staggered manner, or do I need to "stagger" them myself?
Steven
On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net
<mailto:a...@dreamsnake.net>> wrote:
What you describe sounds like expected behavior. It’s a feature!
Since … Nautilus I think, you or the autoscaler sets pg_num and
the cluster gradually steps up pgp_num until it matches.
Increasing pg_num means splitting PGs, which in turn perturbs the
inputs to the CRUSH hash function, so data moves: backfill.
Moving data on HDDs isn’t fast, especially with EC. These are all
random, fragmented writes, so model 70 MB/s to a given drive.
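You can watch the pgp_num catch-up in progress; a sketch, with <pool> standing in for your pool name:

```shell
# pg_num is the target; pgp_num steps up gradually until it matches.
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
ceph osd pool autoscale-status   # the autoscaler's view of the same
```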
> As expected, the backfilling started ...and it never ended ...even now
> after more than 1 week I still have about 29 pgs backfilling and 13
> backfilling_wait
Back pre-Nautilus this would have been a thundering herd of
backfill. You don’t know how good we have it now ;)
> What worries me is that the number of backfilling PGs varies very little
> over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
> between 250 and 350 MiB
The number of PGs backfillING at any given time is a function of
multiple things, including the value of osd_max_backfills.
EC means each write ties up 6 drives, so there’s a bit more
gridlock compared to replicated pools.
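To inspect and raise that limit (a sketch; the value 3 is purely illustrative):

```shell
# Inspect the effective limit on one OSD, then raise it cluster-wide.
ceph config show osd.0 osd_max_backfills
ceph config set osd osd_max_backfills 3   # 3 is illustrative
```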
>
> The "recovery" seems to be doing something ( but number of objects remain
> the same )
The number of objects, or the number of *misplaced/remapped* objects?
Is it showing *keys* per second? RGW stores a lot of omap data in RocksDB.
> Since the recovery should run over the cluster network and the amount of
> data in the pool is not huge, I am not sure why it takes so many
> days - it seems stuck actually
Have you reverted to the wpq scheduler?
osd_op_queue = wpq
If you stay on mclock, set osd_mclock_override_recovery_settings = true
so that manual recovery/backfill tunables take effect.
You can also increase the value of osd_max_backfills.
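As a sketch, those two options look like this (the values are illustrative, not recommendations):

```shell
# Option 1: revert to wpq (takes effect after an OSD restart).
ceph config set osd osd_op_queue wpq
# Option 2: stay on mclock, but allow manual recovery/backfill tuning.
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2          # illustrative value
```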
> The only strange thing I noticed is a discrepancy between the number of PG
> and PGP that the pool currently has ...and what autoscale-status says
It’s in the process of doing what you asked.
>
> Any help / suggestions would be very appreciated
>
> What I have tried so far:
> increase recovery speed ( by changing mclock profile to
> "high_recovery_ops" and overriding various parameters)
> (recovery_max_active, recovery_max_active_hdd ... etc)
If the default mclock scheduler is enabled, that has issues for
some deployments. There are code improvements in the works, but
for now I suggest reverting to wpq.
>
> redeploying some of the OSDs that were "UP_PRIMARY" but part of the
> backfill_wait PGs
Redeploying OSDs isn’t often called for, and can chum the waters.
It also adds a lot of backfill/recovery to what you already have
going on.
If you want to give things a gentle goose when they seem stuck, you can try
ceph osd down XXX
for the lead OSD of a given PG, one at a time,
or
ceph pg repeer xx.yyyy
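A sketch of that nudge, keeping the placeholders from above (xx.yyyy is a PG id, XXX an OSD id):

```shell
# Find the acting set; the first OSD listed is the primary.
ceph pg map xx.yyyy
# Mark the primary down; it rejoins immediately and the PG re-peers.
ceph osd down XXX
# Or re-peer the PG directly:
ceph pg repeer xx.yyyy
```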
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io