Hi,
changing the scheduler requires an OSD restart, and by default that is
done in a staggered manner. So the command you mentioned will do that
for you.
https://docs.clyso.com/blog/2023/03/22/ceph-how-do-disable-mclock-scheduler/
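For reference, a minimal sketch of the switch on a cephadm-managed cluster (assuming you want wpq, per the blog post above):

```shell
# Persist the scheduler choice for all OSDs, then let the
# orchestrator restart them; restarts are staggered by default.
ceph config set osd osd_op_queue wpq
ceph orch restart osd
```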
Quoting Anthony D'Atri <a...@dreamsnake.net>:
I don’t *think* OSD restarts are necessary.
On Jul 11, 2025, at 1:05 PM, Steven Vacaroaia <ste...@gmail.com> wrote:
Thanks Anthony
changing the scheduler will require restarting all OSDs, right,
using "ceph orch restart osd"?
Is this done in a staggered manner, or do I need to "stagger" them myself?
Steven
On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net
<mailto:a...@dreamsnake.net>> wrote:
What you describe sounds like expected behavior. It’s a feature!
Since … Nautilus I think, you or the autoscaler sets pg_num and
the cluster gradually steps up pgp_num until it matches.
Increasing pg_num means splitting PGs, which in turn perturbs the
inputs to the CRUSH hash function, so data moves: backfill.
Moving data on HDDs isn’t fast, especially with EC. These are all
random, fragmented writes, so model 70 MB/s to a given drive.
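You can watch the pgp_num catch-up in progress; a sketch, with <pool> standing in for your pool name:

```shell
# pg_num is the target; pgp_num steps up gradually until it matches.
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
ceph osd pool autoscale-status   # the autoscaler's view of the same
```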
> As expected, the backfilling started ...and it never ended ...even now
> after more than 1 week I still have about 29 pgs backfilling and 13
> backfilling_wait
Back pre-Nautilus this would have been a thundering herd of
backfill. You don’t know how good we have it now ;)
> What worries me is that the number of backfilling PGs varies very little
> over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
> between 250 and 350 MiB
The number of PGs backfillING at any given time is a function of
multiple things, including the value of osd_max_backfills.
EC means each write ties up 6 drives, so there’s a bit more
gridlock compared to replicated pools.
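To inspect and raise that limit (a sketch; the value 3 is purely illustrative):

```shell
# Inspect the effective limit on one OSD, then raise it cluster-wide.
ceph config show osd.0 osd_max_backfills
ceph config set osd osd_max_backfills 3   # 3 is illustrative
```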
>
> The "recovery" seems to be doing something ( but number of objects remain
> the same )
The number of objects, or the number of *misplaced/remapped* objects?
Is it showing *keys* per second? RGW stores a lot of omap data in RocksDB.
> Since the recovery should run over the cluster network and the amount of
> data in the pool is not huge, I am not sure why it takes so many
> days - it seems stuck actually
Have you reverted to the wpq scheduler?
osd_op_queue = wpq
If you stay on mclock, set osd_mclock_override_recovery_settings = true
so that manual recovery/backfill tunables take effect.
You can also increase the value of osd_max_backfills.
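As a sketch, those two options look like this (the values are illustrative, not recommendations):

```shell
# Option 1: revert to wpq (takes effect after an OSD restart).
ceph config set osd osd_op_queue wpq
# Option 2: stay on mclock, but allow manual recovery/backfill tuning.
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2          # illustrative value
```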
> The only strange thing I noticed is a discrepancy between the number of PG
> and PGP that the pool currently has ...and what autoscale-status says
It’s in the process of doing what you asked.
>
> Any help / suggestions would be very appreciated
>
> What I have tried so far:
> increase recovery speed ( by changing mclock profile to
> "high_recovery_ops" and overriding various parameters)
> (recovery_max_active, recovery_max_active_hdd ... etc)
If the default mclock scheduler is enabled, that has issues for
some deployments. There are code improvements in the works, but
for now I suggest reverting to wpq.
>
> redeploying some of the OSDs that were "UP_PRIMARY" but part of the
> backfill_wait PGs
Redeploying OSDs isn’t often called for, and can chum the waters.
It also adds a lot of backfill/recovery to what you already have
going on.
If you want to give things a gentle goose when they seem stuck, you can try
ceph osd down XXX
for the lead OSD of a given PG, one at a time,
or
ceph pg repeer xx.yyyy
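A sketch of that nudge, keeping the placeholders from above (xx.yyyy is a PG id, XXX an OSD id):

```shell
# Find the acting set; the first OSD listed is the primary.
ceph pg map xx.yyyy
# Mark the primary down; it rejoins immediately and the PG re-peers.
ceph osd down XXX
# Or re-peer the PG directly:
ceph pg repeer xx.yyyy
```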
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io