What you describe sounds like expected behavior. It’s a feature! Since Nautilus, I think: you (or the autoscaler) set pg_num, and the cluster gradually steps pgp_num up until it matches.
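You can watch that stepping happen by comparing the two values directly. A quick sketch; the pool name "mypool" is a placeholder, substitute your RGW data pool:

```shell
# Compare the split target with how far pgp_num has stepped so far.
# "mypool" is a placeholder -- substitute your pool's actual name.
ceph osd pool get mypool pg_num
ceph osd pool get mypool pgp_num

# The autoscaler's view, including the target it is working toward:
ceph osd pool autoscale-status
```

While the cluster is still stepping, pgp_num will trail pg_num and creep up as backfill completes.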
Increasing pg_num means splitting PGs, which in turn perturbs the inputs to the CRUSH hash function, so data moves: backfill. Moving data on HDDs isn’t fast, especially with EC. These are all random, fragmented writes, so model 70 MB/s to a given drive.

> As expected, the backfilling started ...and it never ended ...even now
> after more than 1 week I still have about 29 pgs backfilling and 13
> backfilling_wait

Back pre-Nautilus this would have been a thundering herd of backfill. You don’t know how good we have it now ;)

> What worries me is that the number of backfilling PGs varies very little
> over time e.g 28 and 12 ALTHOUGH there is constant "recovery" traffic
> between 250 and 350MiB

The number of PGs backfilling at any given time is a function of multiple things, including the value of osd_max_backfills. EC means each write ties up six drives, so there’s a bit more gridlock compared to replicated pools.

> The "recovery" seems to be doing something ( but number of objects remain
> the same )

The number of objects, or the number of *misplaced/remapped* objects? Is it showing *keys* per second? RGW stores a lot of omap data in RocksDB.

> Since the recovery should run over the cluster network and the amount of
> data in the pool is not huge, I am not sure why it takes so many days - it
> seems stuck actually

Have you reverted to the wpq scheduler?

    osd_op_queue = wpq

If you stay on mclock instead, set osd_mclock_override_recovery_settings so that your recovery tunables actually take effect. You can also increase the value of osd_max_backfills.

> The only strange thing I noticed is a discrepancy between the number of PG
> and PGP that the pool currently has ...and what autoscale-status says

It’s in the process of doing what you asked.

> Any help / suggestions would be very appreciated
>
> What I have tried so for :
> increase recovery speed ( by changing mclock profile to
> "high_recovery_ops" and overriding various parameters)
> (recovery_max_active, recovery_max_active_hdd ...
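For scale, a quick back-of-envelope check on whether that 250-350 MiB/s is plausible progress. The data volume below is a made-up example; plug in your own misplaced-data total from ceph status:

```shell
# Rough timeline check. Both numbers are hypothetical placeholders:
#   DATA_GB  -- misplaced data still to move (read it off `ceph status`)
#   RATE_MBS -- observed recovery throughput (you reported 250-350 MiB/s)
DATA_GB=4000
RATE_MBS=300
SECS=$((DATA_GB * 1024 / RATE_MBS))
echo "rough estimate: $((SECS / 3600)) hours at ${RATE_MBS} MiB/s"
```

If the arithmetic says hours but you are a week in, the reported traffic is likely dominated by omap/keys churn or repeated remapping rather than steady payload movement.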
> etc)

If the default mclock scheduler is enabled, note that it has issues for some deployments. Code improvements are in the works, but for now I suggest reverting to wpq.

> redeploying some of the OSDs that were "UP_PRIMARY but part of the
> backfill_wait PGs

Redeploying OSDs isn’t often called for, and can chum the waters. It also adds a lot of backfill/recovery to what you already have going on. If you want a gentle goose when things seem stuck, you can try

    ceph osd down XXX

for the lead OSD of a given PG, one at a time, or

    ceph pg repeer xx.yyyy

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io