[ceph-users] Re: Doubled numbers of PGs from 8192 to 16384 - backfill bottlenecked

Torkil Svensgaard Wed, 30 Apr 2025 23:12:23 -0700


On 30-04-2025 22:16, Maged Mokhtar wrote:

This is strange indeed..
1) I recommend first validating that all PGs in backfilling, as well as those in backfill_wait, are indeed (or mainly) from pool 11:
ceph pg ls backfilling

ceph pg ls backfill_wait
This is to find out whether some other pools are causing the slow backfill activity. In particular, you have 3 EC pools with size 9, and pools with a high size tend to have low active backfill counts (and scrub counts). Notice that the number of backfill_wait PGs is larger than 8192, so something else is involved.

I had drained a faulty disk and replaced it with a new one prior to increasing the number of PGs, so that would probably account for the additional misplaced PGs.

All active backfills are pool 11, but some of the ones in backfill_wait are from pool 37, which is EC 4+5.


Pool ID 11: 3365 PGs
Pool ID 37: 8 PGs
Pool ID 6: 1 PGs
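
A per-pool tally like the one above can be produced with a few lines of Python. This is only a sketch: it relies on the fact that a PG ID has the form <pool_id>.<hex>, and the sample IDs are hypothetical stand-ins for the first column of `ceph pg ls backfill_wait`.

```python
from collections import Counter


def count_pgs_by_pool(pgids):
    """Count PGs per pool; a PG ID looks like '<pool_id>.<hex>'."""
    return Counter(int(pgid.split(".", 1)[0]) for pgid in pgids)


# Hypothetical sample; real IDs would come from `ceph pg ls backfill_wait`.
sample = ["11.1a", "11.2f", "37.4", "6.7ff", "11.3"]
counts = count_pgs_by_pool(sample)

for pool_id, n in counts.most_common():
    print(f"Pool ID {pool_id}: {n} PGs")
```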

2) As recommended in an earlier post, I would increase osd_max_backfills to 3 or more. As noted above, EC pools with a larger k+m size will benefit from this. I understand you have concerns about stressing the drives, so first increase osd_recovery_sleep to 1 (a high value) to offset the larger number of backfills. It is best to monitor with iostat -dxt 5 to make sure the disk %util is not too high (above 80%); you can then adjust the above two values while monitoring iostat.

Not so much worried about strain on the drives as worried about hitting full. What we have seen in the past is small drives and large PGs causing full events with multiple backfills, because the mechanism is rather stupid.

We might have drives at 80% utilization, then we add more drives to decrease the utilization, but that initially assigns more PGs to the already rather full drives. We can't cancel the backfills, so we have to resort to reweighting, stopping OSDs, or other whack-a-mole games until utilization decreases. For our setup we used to have cron run a babysitter script to manipulate osd_max_backfills, such that >80% utilization -> osd_max_backfills = 1, >70% utilization -> osd_max_backfills = 2, etc., so we ensured we never hit full but only backfill_full.
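
For illustration, the core of such a babysitter script might look roughly like the sketch below. It is not the original script: only the >80% and >70% tiers come from the description above, the fallback value of 3 is an assumption standing in for the "etc", and the utilization figures are hypothetical (in practice they could be taken from `ceph osd df`). The sketch only builds the command lines and does not run them.

```python
def max_backfills_for(utilization_pct, default=3):
    """Map an OSD's utilization (percent) to an osd_max_backfills value."""
    if utilization_pct > 80:
        return 1
    if utilization_pct > 70:
        return 2
    # Assumption: the tiers below 70% are not given in the thread.
    return default


def plan_commands(osd_utilization):
    """Build one `ceph config set` command line per OSD, sorted by OSD id."""
    return [
        f"ceph config set osd.{osd_id} osd_max_backfills {max_backfills_for(util)}"
        for osd_id, util in sorted(osd_utilization.items())
    ]


# Hypothetical utilization figures keyed by OSD id.
commands = plan_commands({12: 82.5, 40: 75.0, 7: 56.0})
for line in commands:
    print(line)
```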

3) One strange thing is that pgp_num for the pool in question has already reached 16384. Typically, when you set pg_num to 16384, Ceph internally increases pgp_num in steps that do not cause more than target_max_misplaced_ratio (the same value used by the balancer) to be misplaced at a time, so pgp_num will lag pg_num for some time. This holds even if you increased the ratio from 0.05 to 0.3 (which is not recommended), unless maybe you have a large number of objects stored in the other pools (ceph df will show this). In that case pool 11 may not be the only significant pool, and maybe one of your EC size-9 (6+3?) pools has a lot of data and a small number of PGs (32?), so you have very large PGs that can have a dominant effect on backfill as per point 1).


I don't think our other significant pools are like that:

"
[root@ceph-e3s3 ~]# ceph df detail
--- RAW STORAGE ---
CLASS        SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd       4.1 PiB  1.6 PiB  2.5 PiB   2.5 PiB      60.31
nvme      210 TiB   57 TiB  153 TiB   153 TiB      72.90
nvmebulk  196 TiB  149 TiB   46 TiB    46 TiB      23.70
ssd        49 TiB   33 TiB   15 TiB    15 TiB      31.50
TOTAL     4.5 PiB  1.9 PiB  2.7 PiB   2.7 PiB      59.03

--- POOLS ---

POOL                ID    PGS   STORED   (DATA)   (OMAP)  OBJECTS     USED   (DATA)   (OMAP)  %USED  MAX AVAIL  QUOTA OBJECTS  QUOTA BYTES  DIRTY  USED COMPR  UNDER COMPR
rbd                  4   1024  110 TiB  110 TiB  9.8 KiB   29.24M  287 TiB  287 TiB   29 KiB  25.73    276 TiB            N/A          N/A    N/A      40 TiB       82 TiB
libvirt              5    256  3.3 TiB  3.3 TiB   60 KiB  867.13k  6.7 TiB  6.7 TiB  179 KiB  28.85    5.5 TiB            N/A          N/A    N/A     1.5 TiB      4.6 TiB
rbd_internal         6   2048  103 TiB  103 TiB  4.9 KiB   32.57M  242 TiB  242 TiB   15 KiB  22.62    276 TiB            N/A          N/A    N/A      66 TiB      132 TiB
.mgr                 8      1  4.9 GiB  4.9 GiB      0 B    1.26k  2.0 GiB  2.0 GiB      0 B   0.01    8.3 TiB            N/A          N/A    N/A     2.0 GiB      9.8 GiB
rbd_ec              10     32  8.0 MiB  8.0 MiB  1.0 KiB       27  3.4 MiB  3.4 MiB  3.1 KiB      0    5.5 TiB            N/A          N/A    N/A     1.5 MiB       23 MiB
rbd_ec_data         11  16384  1.0 PiB  1.0 PiB  2.6 KiB  279.74M  1.4 PiB  1.4 PiB  4.0 KiB  63.85    552 TiB            N/A          N/A    N/A     139 TiB      277 TiB
rbd.nvme            23   2048   95 TiB   95 TiB  3.3 KiB   25.16M  151 TiB  151 TiB  6.6 KiB  78.24     21 TiB            N/A          N/A    N/A      32 TiB       72 TiB
.nfs                25     32   20 KiB   12 KiB  7.6 KiB       68  275 KiB  252 KiB   23 KiB      0    5.5 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.cephfs.meta  31    128   15 GiB  269 MiB   15 GiB    3.05M   46 GiB  619 MiB   45 GiB   0.27    5.5 TiB            N/A          N/A    N/A      77 MiB      266 MiB
cephfs.cephfs.data  32    512    449 B    449 B      0 B  130.59M   48 KiB   48 KiB      0 B      0    5.5 TiB            N/A          N/A    N/A         0 B          0 B
cephfs.nvme.data    34     32  977 GiB  977 GiB      0 B     250k  122 GiB  122 GiB      0 B   0.28     21 TiB            N/A          N/A    N/A     122 GiB      1.9 TiB
cephfs.ssd.data     35     32  754 GiB  754 GiB      0 B    1.01M  1.7 TiB  1.7 TiB      0 B   9.29    5.5 TiB            N/A          N/A    N/A     331 GiB      864 GiB
cephfs.hdd.data     37   2048  207 TiB  207 TiB    570 B  174.93M  426 TiB  426 TiB  1.3 KiB  34.01    368 TiB            N/A          N/A    N/A      38 TiB       77 TiB
rbd.ssd             39     64  1.6 TiB  1.6 TiB  1.5 KiB  431.89k  4.2 TiB  4.2 TiB  4.5 KiB  20.42    5.5 TiB            N/A          N/A    N/A     518 GiB      1.2 TiB
rbd.ssd.ec          43     32  2.4 KiB     18 B  2.4 KiB        5   19 KiB   12 KiB  7.3 KiB      0    5.5 TiB            N/A          N/A    N/A         0 B          0 B
rbd.ssd.ec.data     44     32  1.0 TiB  1.0 TiB      0 B  269.92k  2.0 TiB  2.0 TiB      0 B  10.55    7.4 TiB            N/A          N/A    N/A     388 GiB      762 GiB
rbd.nvmebulk.ec     47     32  3.0 MiB  3.0 MiB  5.0 KiB        6  6.1 MiB  6.1 MiB   15 KiB      0    9.1 TiB            N/A          N/A    N/A     528 KiB      4.0 MiB
rbd.nvmebulk.data   48    512   23 TiB   23 TiB      0 B    6.00M   46 TiB   46 TiB      0 B  62.93     12 TiB            N/A          N/A    N/A     4.1 TiB      9.4 TiB

"

The rbd pool is the one with the largest PGs at around 100GB.
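
As a rough cross-check, the average stored data per PG can be derived from the STORED and PGS columns of the ceph df detail output. The sketch below copies four rows by hand (the helper name and unit table are illustrative, not part of any Ceph tool). It divides logical stored data by pg_num, so for erasure-coded pools the shard each OSD holds is smaller than the figure shown.

```python
# Binary unit multipliers relative to GiB.
UNITS = {"GiB": 1, "TiB": 1024, "PiB": 1024 * 1024}


def avg_pg_size_gib(stored, pg_num):
    """Average stored data per PG in GiB, from a string such as '110 TiB'."""
    value, unit = stored.split()
    return float(value) * UNITS[unit] / pg_num


# (STORED, PGS) copied from the `ceph df detail` output in this thread.
pools = {
    "rbd": ("110 TiB", 1024),
    "rbd_internal": ("103 TiB", 2048),
    "rbd_ec_data": ("1.0 PiB", 16384),
    "cephfs.hdd.data": ("207 TiB", 2048),
}
sizes = {name: avg_pg_size_gib(s, n) for name, (s, n) in pools.items()}

for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {size:.1f} GiB per PG")
```

With 8192 PGs, the same 1.0 PiB in rbd_ec_data works out to 128 GiB per PG, which is in line with the ~140GB quoted for the pool before the doubling.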

Thanks.

Best regards,

Torkil

again it is quite strange.

/maged


On 30/04/2025 00:54, Torkil Svensgaard wrote:
On 29-04-2025 22:52, Anthony D'Atri wrote:
In order to get our PG sizes better aligned we doubled the number of PGs on the pool with the largest PG size. The pool is HDD with DB/WAL on SATA SSD and HDD sizes between 2TB and 20TB, and PG size was ~140GB before the doubling.
Please send `ceph osd dump | grep pool`
[root@lazy ~]# ceph osd dump | grep pool
pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 2816850 lfor 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 3.97
pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108 lfor 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 6.07
pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/1370796/2806939 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 2.78
pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth read_balance_score 40.00
pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209 lfor 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 8.16
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2722280 lfor 0/0/2139786 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 1.35
pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402 lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs read_balance_score 8.16
pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 6.94
pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2178931 lfor 0/2178574/2178572 flags hashpspool stripe_width 0 application cephfs read_balance_score 6.07
pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 3.77
pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 8.05
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 7.52
pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542174 flags hashpspool stripe_width 0 compression_mode aggressive application rbd read_balance_score 8.16
pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542179 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_mode aggressive application rbd
pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2737621 flags hashpspool stripe_width 0 application rbd read_balance_score 3.67
pool 48 'rbd.nvmebulk.data' erasure profile DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2737621 lfor 0/0/2736420 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application rbd
Pool 11 is the one in question.
osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs
    pools:   18 pools, 25249 pgs
    objects: 683.85M objects, 1.6 PiB
    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
    pgs:     842769842/3951610673 objects misplaced (21.327%)
             16481 active+clean
             8762  active+remapped+backfill_wait
             6     active+remapped+backfilling
Are you *sure* that you have both the mclock override enabled and the op scheduler set to wpq at the proper scope?
Reasonably sure:

[root@ceph-flash1 ~]# ceph config dump | grep wpq
osd advanced  osd_op_queue wpq *
[root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
osd advanced osd_mclock_override_recovery_settings true
osd.234 advanced osd_mclock_override_recovery_settings true
Note that if you’re using a wide EC profile that will gridlock the process to an extent.
  io:
    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
    recovery: 153 MiB/s, 38 objects/s
"
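
To put the status output above in perspective, here is a back-of-the-envelope estimate using its figures. It is deliberately naive: it assumes the reported 38 objects/s stays constant and is counted in the same units as the misplaced object count, neither of which is guaranteed.

```python
# Figures copied from the `ceph status` output above.
misplaced_objects = 842_769_842
total_objects = 3_951_610_673
recovery_rate = 38  # objects per second, assumed constant

misplaced_ratio = misplaced_objects / total_objects
eta_days = misplaced_objects / recovery_rate / 86_400  # 86,400 seconds per day

print(f"misplaced: {misplaced_ratio:.3%}")
print(f"naive time to drain: {eta_days:.0f} days")
```

At that rate the backlog would take roughly 257 days to clear, which is why the concurrency of only 6 active backfills matters.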

The balancer was running and seemingly making very small changes:

"
[root@lazy ~]# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:01.012679",
    "last_optimize_started": "Mon Apr 28 10:01:24 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}
"
The balancer has a misplaced % above which it won’t make additional changes, which defaults, I think, to 5%. With 21% misplaced the balancer will be on hold.
I increased target_max_misplaced_ratio to ensure the balancer could work out all the moves:
[root@ceph-flash1 ~]# ceph config dump | grep misplaced
mgr                    basic     target_max_misplaced_ratio 0.300000
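
The effect of that setting can be modelled as a simple threshold check. The function below is a simplified illustration, not the mgr balancer's actual code; 0.05 is the documented default for target_max_misplaced_ratio, and 0.21327 is the misplaced ratio from the status output earlier in the thread.

```python
def balancer_may_optimize(misplaced_ratio, target_max_misplaced_ratio=0.05):
    """Simplified model: the balancer only plans new moves while the
    misplaced ratio is below target_max_misplaced_ratio."""
    return misplaced_ratio < target_max_misplaced_ratio


misplaced = 0.21327  # 21.327% misplaced, from `ceph status`

print(balancer_may_optimize(misplaced))        # default 0.05
print(balancer_may_optimize(misplaced, 0.30))  # raised to 0.30
```

With the default the balancer stays on hold at 21% misplaced; with the ratio raised to 0.30 it is allowed to keep planning moves.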
This is going to take a while. Any tips on how to escape the apparent bottleneck?
Try raising

osd_recovery_max_active
osd_recovery_max_single_start
osd_max_backfills
to 2 or even 3. I have no empirical evidence, but I’ve observed that when changing back to wpq, somewhat higher than customary values for these may be needed to be effective. Restarting the OSDs one failure domain at a time, waiting for recovery, might help according to some references.
I am reluctant to increase osd_max_backfills or osd_recovery_max_active because of the small disks in the cluster and the large PG size. We've historically hit problems with concurrent backfills making disks go backfill_full or even full, and then it is suddenly a different problem. Some of the smaller drives are at ~75% utilization currently while larger drives are at ~56%, which is one of the things we hope to improve upon by increasing the pg_num.
I'll look at osd_recovery_max_single_start.
Is having many PGs misplaced actually counterproductive?
Not so much unless you’re severely low on RAM, I think, but I would suggest upmap-remapped to vanish the misplaced PGs and let the balancer do it incrementally. If you have 21% misplaced, pgremapper may not have worked as expected - I have never used it, but upmap-remapped has worked well for me, usually needing 2-3 successive runs.
The 21% was right after doubling the pg_num. I then ran pgremapper and got misplaced to less than 1%, and then the balancer is slowly increasing the number again. I think those tools are largely doing the same thing? I'll try doing it again.
Thanks.

Best regards,

Torkil
I was thinking it was better to let the balancer balance all it could, as that would make all the moves available and decrease the risk of bottlenecking.
Wise choice.
Thanks.

Best regards,

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io