Hi,

I sent another version of this message with pictures that is awaiting
moderation because of its size - apologies for that.

In the meantime I got approval to share the output of some of the commands -
see attached.

I have a Ceph 19.2.2 cluster deployed with cephadm:
7 nodes and 2 networks (cluster: 2 x 100Gb, public: 2 x 25Gb).
It provides an S3 bucket as a target for my backups.

The underlying pool (default.rgw.buckets.data) is using an EC 4+2 profile
 with a storage class for spinning disks
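
For reference, the EC profile and the rule behind that storage class can be
checked with something like this (the profile / rule names below are
placeholders, not my real ones):

    ceph osd erasure-code-profile get <ec-profile-name>
    ceph osd pool get default.rgw.buckets.data crush_rule
    ceph osd crush rule dump <crush-rule-name>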

All spinning disks keep their WAL/DB on NVMe.
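Per OSD this can be confirmed from the metadata output, e.g. the
bluefs_dedicated_db / bluefs_db_rotational fields (the OSD id is a placeholder):

    ceph osd metadata <osd-id>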

The amount of data grew pretty fast and, since I started the pool with
pg_autoscale_mode = warn, I decided to increase the number of
PGs manually (from 128 to 256).
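Roughly, that was just:

    ceph osd pool set default.rgw.buckets.data pg_num 256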

As expected, the backfilling started ... and it never ended. Even now,
after more than a week, I still have about 29 PGs backfilling and 13 in
backfill_wait.
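I am listing them with something like:

    ceph pg ls backfilling
    ceph pg dump pgs_brief | grep backfill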

What worries me is that the number of backfilling PGs varies very little
over time, e.g. 28 and 12, ALTHOUGH there is constant "recovery" traffic
of between 250 and 350 MiB/s.

There is no OSD or capacity issue (if I enable pg_autoscale_mode the
cluster health is OK).
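The checks I am basing that on are basically:

    ceph health detail
    ceph osd df tree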

The "recovery" seems to be doing something ( but number of objects remain
the same )
Since the recovery should run over the cluster network and the amount of
data in the pool is not huge, I am not sure why it takes so many days - it
seems stuck actually

The only strange thing I noticed is a discrepancy between the pool's current
pg_num / pgp_num and what autoscale-status reports.
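I compared them with something like:

    ceph osd pool get default.rgw.buckets.data pg_num
    ceph osd pool get default.rgw.buckets.data pgp_num
    ceph osd pool ls detail | grep default.rgw.buckets.data
    ceph osd pool autoscale-status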

Any help / suggestions would be much appreciated.

What I have tried so far (the corresponding commands are sketched after this
list):
     increased recovery speed (by changing the mclock profile to
"high_recovery_ops" and overriding various parameters:
recovery_max_active, recovery_max_active_hdd ... etc.)

     redeployed some of the OSDs that were UP_PRIMARY but part of the
backfill_wait PGs

     queried the PGs and looked for a "stuck" reason

     stopped scrub and deep-scrub

     repaired some of the PGs

     changed pg_autoscale_mode to true

     checked the balancer status
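
Roughly, the commands behind those steps were along these lines (PG ids are
placeholders and the override value is just an example, not my exact setting):

    ceph config set osd osd_mclock_profile high_recovery_ops
    # as far as I know this is needed for the recovery overrides to take
    # effect while the mClock scheduler is active
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_recovery_max_active_hdd 8
    ceph pg <pg-id> query
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph pg repair <pg-id>
    ceph osd pool set default.rgw.buckets.data pg_autoscale_mode on
    ceph balancer status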

Many thanks
Steven