I don’t *think* OSD restarts are necessary.

> On Jul 11, 2025, at 1:05 PM, Steven Vacaroaia <ste...@gmail.com> wrote:
>
> Thanks Anthony
>
> Changing the scheduler will require restarting all OSDs, right?
> Using "ceph orch restart osd"
>
> Is this done in a staggered manner, or do I need to "stagger" them myself?
>
> Steven
>
> On Fri, 11 Jul 2025 at 12:14, Anthony D'Atri <a...@dreamsnake.net> wrote:
>>
>> What you describe sounds like expected behavior. It’s a feature!
>>
>> Since … Nautilus, I think, you or the autoscaler sets pg_num and the cluster gradually steps up pgp_num until it matches.
>>
>> Increasing pg_num means splitting PGs, which in turn perturbs the inputs to the CRUSH hash function, so data moves: backfill.
>>
>> Moving data on HDDs isn’t fast, especially with EC. These are all random, fragmented writes, so model 70 MB/s to a given drive.
>>
>> > As expected, the backfilling started ... and it never ended. Even now, after more than 1 week, I still have about 29 pgs backfilling and 13 backfilling_wait
>>
>> Back pre-Nautilus this would have been a thundering herd of backfill. You don’t know how good we have it now ;)
>>
>> > What worries me is that the number of backfilling PGs varies very little over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic between 250 and 350 MiB
>>
>> The number of PGs backfillING at any given time is a function of multiple things, including the value of osd_max_backfills. EC means each write ties up 6 drives, so there’s a bit more gridlock compared to replicated pools.
>>
>> > The "recovery" seems to be doing something (but the number of objects remains the same)
>>
>> The number of objects, or the number of *misplaced/remapped* objects?
>>
>> Is it showing *keys* per second? RGW stores a lot of omap data in RocksDB.
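A rough back-of-envelope check of the "model 70 MB/s to a given drive" figure above, to see why HDD backfill runs into days. All inputs here are hypothetical placeholders, not numbers from Steven's cluster:

```python
# Back-of-envelope backfill-time estimate. All inputs are hypothetical
# placeholders; substitute the misplaced-bytes figure from `ceph status`.
misplaced_bytes = 10 * 2**40     # hypothetical: 10 TiB of data to move
per_drive_bytes_s = 70e6         # ~70 MB/s per HDD, per the thread
concurrent_drives = 8            # hypothetical: drives backfilling at once
                                 # (bounded by osd_max_backfills)

seconds = misplaced_bytes / (per_drive_bytes_s * concurrent_drives)
print(f"~{seconds / 3600:.1f} hours at best-case sustained throughput")
```

Even this optimistic figure assumes every drive sustains 70 MB/s the whole time; in practice EC fans each write out to k+m drives and backfill serializes per PG, so real clusters land well past the estimate.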
>> > Since the recovery should run over the cluster network and the amount of data in the pool is not huge, I am not sure why it takes so many days - it seems stuck actually
>>
>> Have you reverted to the wpq scheduler?
>>
>> osd_op_queue = wpq
>> osd_mclock_override_recovery_settings
>>
>> You can also increase the value of osd_max_backfills.
>>
>> > The only strange thing I noticed is a discrepancy between the pg_num and pgp_num that the pool currently has ... and what autoscale-status says
>>
>> It’s in the process of doing what you asked.
>>
>> > Any help / suggestions would be very appreciated
>> >
>> > What I have tried so far:
>> > - increasing recovery speed (by changing the mclock profile to "high_recovery_ops" and overriding various parameters: recovery_max_active, recovery_max_active_hdd, etc.)
>>
>> If the default mclock scheduler is enabled, that has issues for some deployments. There are code improvements in the works, but for now I suggest reverting to wpq.
>>
>> > - redeploying some of the OSDs that were UP_PRIMARY but part of the backfill_wait PGs
>>
>> Redeploying OSDs isn’t often called for, and can chum the waters. It also adds a lot of backfill/recovery to what you already have going on.
>>
>> If you want a gentle goose when things seem stuck, you can try
>>
>> ceph osd down XXX
>>
>> for the lead OSD of a given PG, one at a time, or
>>
>> ceph pg repeer xx.yyyy
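The wpq revert and backfill tuning above can be applied cluster-wide with `ceph config set`; a sketch, with an illustrative osd_max_backfills value rather than a recommendation for this cluster:

```shell
# Sketch: revert OSDs from mclock to the wpq scheduler, per the thread.
ceph config set osd osd_op_queue wpq

# While still on mclock, this must be true before recovery/backfill
# tunables can be overridden at all:
ceph config set osd osd_mclock_override_recovery_settings true

# Illustrative value only (default is 1); higher values speed backfill
# at the cost of client I/O latency.
ceph config set osd osd_max_backfills 3
```

These settings can be reverted later with `ceph config rm osd <option>` once backfill completes.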
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io