> I increased target_max_misplaced_ratio to ensure the balancer could work out 
> all the moves:
> 
> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
> mgr                    basic     target_max_misplaced_ratio         0.300000

That’s a very high value.  It means less data gets moved more than once, but 
at the possible risk of so much concurrent backfill that performance suffers.  
Whatever floats your boat.


>>> 
>>> In order to get our PG sizes better aligned we doubled the number of PGs on 
>>> the pool with the largest PG size. The pool is HDD with DB/WAL on SATA SSD 
>>> and HDD sizes between 2TB and 20TB and PG size was ~140GB before the 
>>> doubling.
>> Please send `ceph osd dump | grep pool`
> 
> [root@lazy ~]# ceph osd dump | grep pool

Why multiple RBD pools?  I suspect that you have multiple device classes / 
media, but still…  Large numbers of pools make it more difficult to calculate 
good pg_num values when not using the autoscaler.

I suggest playing with https://docs.ceph.com/en/squid/rados/operations/pgcalc/

… setting the target PGs per OSD to 250

Note the pools with a bias value >1, typically RGW index and CephFS metadata 
pools.  This is because those pools benefit from a larger pg_num value than 
their bytes usage might otherwise indicate.  You might account for this in the 
pgcalc by giving them a larger data %, or just shoot higher for those pools 
than calculated.  I would suggest at least the number of SSD OSDs on which 
these pools are placed, rounded up to the next power of two (and maybe 
doubled).  I don’t want to assume that your cluster is entirely non-rotational.
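
For reference, the arithmetic pgcalc does is roughly the below; this is a 
rough sketch with placeholder numbers, so substitute your own OSD count, 
data % and pool size:

  # pg_num ~= target PGs per OSD * OSDs in the pool's CRUSH root * data share / size,
  # then rounded up to the next power of two.
  target_per_osd=250
  osds=576          # OSDs backing the pool's CRUSH root (placeholder)
  data_pct=40       # expected share of that root's data, in percent (placeholder)
  size=6            # replica count, or k+m for EC pools (placeholder)

  raw=$(( target_per_osd * osds * data_pct / 100 / size ))
  pg=1; while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
  echo "raw estimate: $raw -> suggested pg_num: $pg"

With those placeholder numbers the raw estimate comes out at 9600, which 
rounds up to 16384.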

> 
>  I then ran pgremapper and got misplaced to less than 1% and then the 
> balancer is slowly increasing the number again. I think those tools are 
> largely doing the same thing? I'll try doing it again.

That high max ratio explains it.  Usually 30% misplaced is an indication that 
something isn’t as expected.
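
For a quick sanity check on how uneven the distribution actually is:

  ceph balancer eval     # current distribution score; lower is better
  ceph osd df tree       # per-OSD utilization, to eyeball the spread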


> pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins 
> pg_num 1024 pgp_num 1024 autoscale_mode off last_change 2816850 lfor 
> 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 
> application rbd read_balance_score 3.97
> pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108 lfor 
> 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width 0 
> application rbd read_balance_score 6.07
> pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4 object_hash 
> rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 
> 0/1370796/2806939 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 
> application rbd read_balance_score 2.78
> pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash rjenkins 
> pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576 flags hashpspool 
> stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth 
> read_balance_score 40.00
> pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209 lfor 
> 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width 0 
> application rbd read_balance_score 8.16
> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 
> 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off 
> last_change 2832704 lfor 0/1291190/2832700 flags 
> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 
> 1 compression_algorithm snappy compression_mode aggressive application rbd
> pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5 object_hash 
> rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2722280 lfor 
> 0/0/2139786 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 
> application rbd read_balance_score 1.35
> pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins 
> pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402 lfor 0/0/2065595 
> flags hashpspool stripe_width 0 application nfs read_balance_score 8.16
> pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3 
> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 
> 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0 pg_autoscale_bias 4 
> pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 6.94
> pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3 
> object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 
> 2178931 lfor 0/2178574/2178572 flags hashpspool stripe_width 0 application 
> cephfs read_balance_score 6.07
> pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5 
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 
> 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk stripe_width 0 
> compression_algorithm zstd compression_mode aggressive application cephfs 
> read_balance_score 3.77
> pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3 
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 
> 2198980 lfor 0/0/2126134 flags hashpspool,bulk stripe_width 0 
> compression_algorithm zstd compression_mode aggressive application cephfs 
> read_balance_score 8.05
> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 
> min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 
> autoscale_mode off last_change 2816850 lfor 0/0/2139486 flags 
> hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 
> compression_algorithm zstd compression_mode aggressive application cephfs
> pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2541795 flags 
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd 
> read_balance_score 7.52
> pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3 object_hash 
> rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542174 flags 
> hashpspool stripe_width 0 compression_mode aggressive application rbd 
> read_balance_score 8.16
> pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 
> min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 
> autoscale_mode warn last_change 2542179 flags 
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
> compression_mode aggressive application rbd
> pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10 
> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 
> 2737621 flags hashpspool stripe_width 0 application rbd read_balance_score 
> 3.67
> pool 48 'rbd.nvmebulk.data' erasure profile DRCMR_k4m5_datacenter_nvmebulk 
> size 9 min_size 5 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 
> autoscale_mode off last_change 2737621 lfor 0/0/2736420 flags 
> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 
> compression_algorithm snappy compression_mode aggressive application rbd
> 
> Pool 11 is the one in question.
> 
>>> 
>>>    osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs
>>> 
>>>    pools:   18 pools, 25249 pgs
>>>    objects: 683.85M objects, 1.6 PiB
>>>    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
>>>    pgs:     842769842/3951610673 objects misplaced (21.327%)
>>>             16481 active+clean
>>>             8762  active+remapped+backfill_wait
>>>             6     active+remapped+backfilling
>> Are you *sure* that you have both the mclock override enabled and the op 
>> scheduler set to wpq at the proper scope?
> 
> Reasonably sure:
> 
> [root@ceph-flash1 ~]# ceph config dump | grep wpq
> osd advanced  osd_op_queue wpq *
> 
> [root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
> osd                    advanced  osd_mclock_override_recovery_settings         true
> osd.234                advanced  osd_mclock_override_recovery_settings         true
> 
>> Note that if you’re using a wide EC profile that will gridlock the process 
>> to an extent.
>>> 
>>>  io:
>>>    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
>>>    recovery: 153 MiB/s, 38 objects/s
>>> "
>>> 
>>> The balancer was running and seemingly making very small changes:
>>> 
>>> "
>>> [root@lazy ~]# ceph balancer status
>>> {
>>>    "active": true,
>>>    "last_optimize_duration": "0:00:01.012679",
>>>    "last_optimize_started": "Mon Apr 28 10:01:24 2025",
>>>    "mode": "upmap",
>>>    "no_optimization_needed": true,
>>>    "optimize_result": "Optimization plan created successfully",
>>>    "plans": []
>>> }
>>> "
>> The balancer has a misplaced % above which it won’t make additional changes, 
>> that defaults I think to 5%.  With 21% misplaced the balancer will be on 
>> hold.
> 
> I increased target_max_misplaced_ratio to ensure the balancer could work out 
> all the moves:
> 
> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
> mgr                    basic     target_max_misplaced_ratio         0.300000
> 
>>> 
>>> 
>>> This is going to take a while, any tips on how to escape the apparent 
>>> bottleneck?
>> Try raising
>> osd_recovery_max_active
>> osd_recovery_max_single_start
>> osd_max_backfills
>> to 2 or even 3.  I have no empirical evidence but I’ve observed that when 
>> changing back to wpq that somewhat higher than customary values for these 
>> may be needed to be effective.  Restarting the OSDs one failure domain at a 
>> time, waiting for recovery, might help according to some references.
> 
> I am reluctant to increase osd_max_backfills or osd_recovery_max_active 
> because of the small disks in the cluster and the large PG size. We've 
> historically hit problems with concurrent backfills making disks go 
> backfill_full or even full and then it is suddenly a different problem. Some 
> of the smaller drives are at ~75% utilization currently while larger drives 
> are at ~56%, which is one of the things we hope to improve upon by increasing 
> the pg_num.
> 
> I'll look at osd_recovery_max_single_start.
> 
>>> 
>>> Is having many PGs misplaced actually counter productive
>> Not so much unless you’re severely low on RAM I think, but I would suggest 
>> upmap-remapped to vanish the misplaced PGs and let the balancer do it 
>> incrementally.  If you have 21% misplaced pgremapper may not have worked as 
>> expected - I have never used it, but upmap-remapped has worked well for me, 
>> usually needing 2-3 successive runs.
> 
> The 21% was right after doubling the pg_num. I then ran pgremapper and got 
> misplaced to less than 1% and then the balancer is slowly increasing the 
> number again. I think those tools are largely doing the same thing? I'll try 
> doing it again.
> 
> Thanks.
> 
> Mvh.
> 
> Torkil
> 
>>> I was thinking it was better to let the balancer balance all it could, as 
>>> that would make all the moves available and decrease the risk of 
>>> bottlenecking.
>> Wise choice.
>>> 
>>> Thanks.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
>>> -- 
>>> Torkil Svensgaard
>>> Sysadmin
>>> MR-Forskningssektionen, afs. 714
>>> DRCMR, Danish Research Centre for Magnetic Resonance
>>> Hvidovre Hospital
>>> Kettegård Allé 30
>>> DK-2650 Hvidovre
>>> Denmark
>>> Tel: +45 386 22828
>>> E-mail: tor...@drcmr.dk
> 
> -- 
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
