Thanks Anthony. Changing the scheduler will require restarting all OSDs, right? Using "ceph orch restart osd"?
Is this done in a staggered manner, or do I need to "stagger" them myself?

Steven

On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net> wrote:

> What you describe sounds like expected behavior. It’s a feature!
>
> Since … Nautilus, I think, you or the autoscaler sets pg_num and the
> cluster gradually steps up pgp_num until it matches.
>
> Increasing pg_num means splitting PGs, which in turn perturbs the inputs
> to the CRUSH hash function, so data moves: backfill.
>
> Moving data on HDDs isn’t fast, especially with EC. These are all random,
> fragmented writes, so model 70 MB/s to a given drive.
>
> > As expected, the backfilling started ... and it never ended ... even now,
> > after more than 1 week, I still have about 29 PGs backfilling and 13 in
> > backfill_wait.
>
> Back pre-Nautilus this would have been a thundering herd of backfill. You
> don’t know how good we have it now ;)
>
> > What worries me is that the number of backfilling PGs varies very little
> > over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
> > between 250 and 350 MiB.
>
> The number of PGs backfillING at any given time is a function of multiple
> things, including the value of osd_max_backfills.
> EC means each write ties up 6 drives, so there’s a bit more gridlock
> compared to replicated pools.
>
> > The "recovery" seems to be doing something (but the number of objects
> > remains the same).
>
> The number of objects, or the number of *misplaced/remapped* objects?
>
> Is it showing *keys* per second? RGW stores a lot of omap data in RocksDB.
>
> > Since the recovery should run over the cluster network and the amount of
> > data in the pool is not huge, I am not sure why it takes so many days -
> > it seems stuck, actually.
>
> Have you reverted to the wpq scheduler?
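The revert Anthony suggests could be written as a ceph.conf-style fragment like the sketch below. The values are illustrative, not tuned recommendations, and changing osd_op_queue only takes effect after the OSDs are restarted:

```ini
# Sketch of a ceph.conf fragment reverting OSDs to the wpq scheduler.
# Illustrative only; apply via central config ("ceph config set osd ...")
# on cephadm clusters rather than editing files by hand.
[osd]
osd_op_queue = wpq
# Allow more concurrent backfills per OSD (default is 1); pick a value
# your HDDs can tolerate.
osd_max_backfills = 3
# Only relevant if you stay on mclock: permits overriding recovery options.
# osd_mclock_override_recovery_settings = true
```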
> osd_op_queue = wpq
> osd_mclock_override_recovery_settings
>
> You can also increase the value of osd_max_backfills.
>
> > The only strange thing I noticed is a discrepancy between the number of
> > PGs and PGPs that the pool currently has ... and what autoscale-status
> > says.
>
> It’s in the process of doing what you asked.
>
> > Any help / suggestions would be very appreciated.
> >
> > What I have tried so far:
> > increase recovery speed (by changing the mclock profile to
> > "high_recovery_ops" and overriding various parameters:
> > osd_recovery_max_active, osd_recovery_max_active_hdd, etc.)
>
> If the default mclock scheduler is enabled, that has issues for some
> deployments. There are code improvements in the works, but for now I
> suggest reverting to wpq.
>
> > redeploying some of the OSDs that were UP_PRIMARY but part of the
> > backfill_wait PGs
>
> Redeploying OSDs isn’t often called for, and can chum the waters. It also
> adds a lot of backfill/recovery to what you already have going on.
>
> If you want a gentle goose when things seem stuck, you can try
>
> ceph osd down XXX
>
> for the lead OSD of a given PG, one at a time,
>
> or
>
> ceph pg repeer xx.yyyy

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
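On the staggering question at the top of the thread: "ceph orch restart osd" restarts the whole OSD service, whereas restarting one daemon at a time and waiting for health between restarts is the manual stagger. A hypothetical sketch (dry run only, the commands are collected rather than executed, and the HEALTH_OK wait condition is an assumption, not a tested procedure):

```shell
#!/bin/sh
# Hypothetical staggered-restart sketch; NOT a tested procedure.
# Dry run: commands are collected into $plan instead of being executed.
osd_ids="0 1 2"                 # on a real cluster: osd_ids=$(ceph osd ls)
plan=""
for id in $osd_ids; do
    # cephadm supports restarting a single daemon by name:
    plan="$plan ceph orch daemon restart osd.$id ;"
    # On a real cluster, wait for recovery between restarts, e.g.:
    #   until ceph health | grep -q HEALTH_OK; do sleep 30; done
done
echo "$plan"
```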