I don’t *think* OSD restarts are necessary.

> On Jul 11, 2025, at 1:05 PM, Steven Vacaroaia <ste...@gmail.com> wrote:
> 
> Thanks Anthony
> 
> changing the scheduler will require restarting all OSDs, right? 
> Using "ceph orch restart osd".
> 
> Is this done in a staggered manner, or do I need to "stagger" them myself?
> 
> Steven
> 
> 
> 
> On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net 
> <mailto:a...@dreamsnake.net>> wrote:
>> What you describe sounds like expected behavior.  It’s a feature!
>> 
>> Since … Nautilus I think, you or the autoscaler sets pg_num and the cluster 
>> gradually steps up pgp_num until it matches.
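(For reference, you can watch pgp_num step up toward pg_num as the splits proceed. A minimal check, with `<pool>` as a placeholder for your pool name; these need a live cluster, so treat it as a sketch:

```shell
# Compare the target PG count with how far placement has caught up
ceph osd pool get <pool> pg_num
ceph osd pool get <pool> pgp_num
# The autoscaler's view of current vs. target values
ceph osd pool autoscale-status
```

When pgp_num equals pg_num, the splitting-driven backfill is done generating new work.)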
>> 
>> Increasing pg_num means splitting PGs, which in turn perturbs the inputs to 
>> the CRUSH hash function, so data moves: backfill.
>> 
>> Moving data on HDDs isn’t fast, especially with EC.  These are all random, 
>> fragmented writes, so model 70 MB/s to a given drive.
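(To put ~70 MB/s in perspective, a rough back-of-the-envelope, assuming 1 TiB of misplaced data draining through a single drive; purely illustrative numbers:

```shell
# 1 TiB = 1024*1024 MiB; at ~70 MB/s that's roughly this many hours
# (integer arithmetic; MiB-vs-MB rounding ignored for a ballpark)
echo $(( 1024 * 1024 / 70 / 3600 ))   # prints 4
```

Multiply out by how much data actually moved and divide by how many drives write concurrently, and a week-long backfill on HDDs stops looking surprising.)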
>> 
>> > As expected, the backfilling started ...and it never ended ...even now
>> > after more than 1 week I still have about 29 pgs backfilling and 13
>> > backfilling_wait
>> 
>> Back pre-Nautilus this would have been a thundering herd of backfill.  You 
>> don’t know how good we have it now ;)
>> 
>> > What worries me is that the number of backfilling PGs varies very little
>> > over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
>> > between 250 and 350 MiB/s
>> 
>> The number of PGs backfillING at any given time is a function of multiple 
>> things, including the value of osd_max_backfills.
>> EC means each write ties up 6 drives, so there’s a bit more gridlock 
>> compared to replicated pools.
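(For reference, a sketch of inspecting and raising that cap; note that while mclock is the active scheduler, changing osd_max_backfills only takes effect if the override flag is set. Raising it trades client latency for backfill speed, so bump modestly:

```shell
# Current backfill concurrency limit
ceph config get osd osd_max_backfills
# Allow manual recovery/backfill overrides while mclock is the scheduler
ceph config set osd osd_mclock_override_recovery_settings true
# Modest bump; watch client latency afterward
ceph config set osd osd_max_backfills 3
```
)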
>> 
>> > 
>> > The "recovery" seems to be doing something ( but number of objects remain
>> > the same )
>> 
>> The number of objects, or the number of *misplaced/remapped* objects?
>> 
>> Is it showing *keys* per second?  RGW stores a lot of omap data in RocksDB.
>> 
>> > Since the recovery should run over the cluster network and the amount of
>> > data in the pool is not huge, I am not sure why it takes so many days - it
>> > seems stuck actually
>> 
>> Have you reverted to the wpq scheduler?
>> 
>> osd_op_queue = wpq
>> osd_mclock_override_recovery_settings = true
>> 
>> You can also increase the value of osd_max_backfills
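(A sketch of the revert, assuming centralized config management; check what an individual OSD is actually running with afterward:

```shell
# Revert the op queue scheduler to wpq cluster-wide
ceph config set osd osd_op_queue wpq
# Verify the effective value on a given OSD, e.g. osd.0
ceph config show osd.0 osd_op_queue
```
)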
>> 
>> 
>> > The only strange thing I noticed is a discrepancy between the number of PG
>> > and PGP
>> > that the pool currently has ...and what autoscale-status says
>> 
>> It’s in the process of doing what you asked. 
>> 
>> > 
>> > Any help / suggestions would be very appreciated
>> > 
>> > What I have tried so for :
>> >     increase recovery speed ( by changing mclock profile to
>> > "high_recovery_ops"  and overriding various parameters)
>> >     (recovery_max_active, recovery_max_active_hdd ... etc)
>> 
>> If the default mclock scheduler is enabled, that has issues for some 
>> deployments. There are code improvements in the works, but for now I suggest 
>> reverting to wpq.
>> 
>> > 
>> >     redeploying some of the OSDs that were UP_PRIMARY but part of the
>> > backfill_wait PGs
>> 
>> Redeploying OSDs isn’t often called for, and can chum the waters.  It also 
>> adds a lot of backfill/recovery to what you already have going on.
>> 
>> If you want a gentle goose when things seem stuck, you can try
>> 
>>         ceph osd down XXX
>> 
>> for the lead OSD of a given PG, one at a time
>> 
>> or 
>>         ceph pg repeer xx.yyyy
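
(To find candidates, a sketch; the PG ID 12.7f below is a hypothetical placeholder:

```shell
# List PGs stuck waiting on backfill
ceph pg ls backfill_wait
# Show the up/acting sets for one PG; the first OSD in the
# acting set is the primary
ceph pg map 12.7f
```
)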
>> 
>> > 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
