[ceph-users] Re: Doubled numbers of PGs from 8192 to 16384 - backfill bottlenecked

Torkil Svensgaard Tue, 29 Apr 2025 14:56:01 -0700


On 29-04-2025 22:52, Anthony D'Atri wrote:


In order to get our PG sizes better aligned we doubled the number of PGs on the 
pool with the largest PG size. The pool is HDD with DB/WAL on SATA SSD and HDD 
sizes between 2TB and 20TB and PG size was ~140GB before the doubling.



Please send `ceph osd dump | grep pool`


[root@lazy ~]# ceph osd dump | grep pool

pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 2816850 lfor 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 3.97
pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108 lfor 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 6.07
pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/1370796/2806939 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 2.78
pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth read_balance_score 40.00
pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209 lfor 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 8.16
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2722280 lfor 0/0/2139786 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 1.35
pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402 lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs read_balance_score 8.16
pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 6.94
pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2178931 lfor 0/2178574/2178572 flags hashpspool stripe_width 0 application cephfs read_balance_score 6.07
pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 3.77
pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 8.05
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 7.52
pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542174 flags hashpspool stripe_width 0 compression_mode aggressive application rbd read_balance_score 8.16
pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542179 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_mode aggressive application rbd
pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2737621 flags hashpspool stripe_width 0 application rbd read_balance_score 3.67
pool 48 'rbd.nvmebulk.data' erasure profile DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2737621 lfor 0/0/2736420 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application rbd


Pool 11 is the one in question.
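
For readers who only need the PG counts out of a dump like the one above, a small filter can reduce each pool line to its name, pg_num and pgp_num. This is a minimal sketch, assuming the line format printed by `ceph osd dump | grep pool`; the function name pool_pg_summary is hypothetical, and the sample line is the pool 11 entry cut off after autoscale_mode.

```shell
#!/usr/bin/env bash
# Print "<pool name> <pg_num> <pgp_num>" for every pool line on stdin.
# pool_pg_summary is an illustrative helper, not part of Ceph.
pool_pg_summary() {
  awk -v q="'" '$1 == "pool" {
    name = $3
    gsub(q, "", name)                 # strip the single quotes around the name
    pg = ""; pgp = ""
    for (i = 4; i < NF; i++) {
      if ($i == "pg_num")  pg  = $(i + 1)
      if ($i == "pgp_num") pgp = $(i + 1)
    }
    print name, pg, pgp
  }'
}

# Sample: the pool 11 line from the thread, shortened after autoscale_mode.
sample="pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off"
printf '%s\n' "$sample" | pool_pg_summary
```

On a live cluster the same function could be fed with `ceph osd dump | grep pool | pool_pg_summary`. A pgp_num that lags behind pg_num would indicate that the split is still being applied.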


    osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs

    pools:   18 pools, 25249 pgs
    objects: 683.85M objects, 1.6 PiB
    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
    pgs:     842769842/3951610673 objects misplaced (21.327%)
             16481 active+clean
             8762  active+remapped+backfill_wait
             6     active+remapped+backfilling


Are you *sure* that you have both the mclock override enabled and the op 
scheduler set to wpq at the proper scope?


Reasonably sure:

[root@ceph-flash1 ~]# ceph config dump | grep wpq
osd advanced  osd_op_queue wpq *

[root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
osd      advanced  osd_mclock_override_recovery_settings  true
osd.234  advanced  osd_mclock_override_recovery_settings  true
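
`ceph config dump` shows what is stored in the central config database, which is not necessarily what a running OSD is using; osd_op_queue in particular only takes effect after the OSD restarts. A hedged sketch of how the "proper scope" question could be checked per daemon follows. The run wrapper and check_scheduler_scope are hypothetical helpers, osd.234 is simply the OSD that appears in the output above, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with an admin keyring to run them for real.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# check_scheduler_scope OSD_ID
# Compare the stored value for the osd section with what one daemon reports.
check_scheduler_scope() {
  local osd_id="$1"
  run ceph config get osd osd_op_queue                       # value in the config database
  run ceph config show "osd.${osd_id}" osd_op_queue          # value the running daemon uses
  run ceph config show "osd.${osd_id}" osd_mclock_override_recovery_settings
}

check_scheduler_scope 234
```

If `ceph config show` still reports mclock_scheduler for an OSD while the database says wpq, that OSD has most likely not been restarted since the change.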

Note that if you’re using a wide EC profile that will gridlock the process to 
an extent.


  io:
    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
    recovery: 153 MiB/s, 38 objects/s
"

The balancer was running and seemingly making very small changes:

"
[root@lazy ~]# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:01.012679",
    "last_optimize_started": "Mon Apr 28 10:01:24 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Optimization plan created successfully",
    "plans": []
}
"


The balancer has a misplaced % above which it won’t make additional changes, 
that defaults I think to 5%.  With 21% misplaced the balancer will be on hold.

I increased target_max_misplaced_ratio to ensure the balancer could work out all the moves:


[root@ceph-flash1 ~]# ceph config dump | grep misplaced

mgr  basic  target_max_misplaced_ratio  0.300000
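
The threshold logic discussed here can be checked with the numbers from the status output. The sketch below recomputes the misplaced percentage from the object counts and compares it against the default ratio of 0.05 and the raised ratio of 0.30. The function names are hypothetical, and the comparison is a simplification of what the balancer module does internally.

```shell
#!/usr/bin/env bash
# misplaced_pct MISPLACED TOTAL -> percentage with three decimals
misplaced_pct() {
  awk -v m="$1" -v t="$2" 'BEGIN { printf "%.3f\n", 100 * m / t }'
}

# balancer_on_hold MISPLACED TOTAL RATIO -> "yes" if misplaced fraction exceeds RATIO
balancer_on_hold() {
  awk -v m="$1" -v t="$2" -v r="$3" 'BEGIN {
    s = (m / t > r) ? "yes" : "no"
    print s
  }'
}

# Object counts taken from the "objects misplaced" line in the status output.
misplaced_pct    842769842 3951610673        # prints 21.327
balancer_on_hold 842769842 3951610673 0.05   # default ratio: prints yes
balancer_on_hold 842769842 3951610673 0.30   # raised ratio:  prints no
```

With the ratio at 0.30 the balancer is free to plan more moves at 21% misplaced, which matches the intent described above.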



This is going to take a while, any tips on how to escape the apparent 
bottleneck?


Try raising

osd_recovery_max_active
osd_recovery_max_single_start
osd_max_backfills

to 2 or even 3.  I have no empirical evidence but I’ve observed that when 
changing back to wpq that somewhat higher than customary values for these may 
be needed to be effective.  Restarting the OSDs one failure domain at a time, 
waiting for recovery, might help according to some references.
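
The suggestion above could be applied cluster-wide roughly as follows. This is a sketch only: the run wrapper and raise_recovery_limits are hypothetical helpers, the value 2 is the lower of the two values mentioned, and nothing is executed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with an admin keyring to actually apply them.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# raise_recovery_limits VALUE -> set the three options for the whole osd section
raise_recovery_limits() {
  local value="$1" opt
  for opt in osd_recovery_max_active osd_recovery_max_single_start osd_max_backfills; do
    run ceph config set osd "$opt" "$value"
  done
}

raise_recovery_limits 2
```

Each setting can be reverted with `ceph config rm osd <option>`, which returns the option to its default.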

I am reluctant to increase osd_max_backfills or osd_recovery_max_active 
because of the small disks in the cluster and the large PG size. We've 
historically hit problems with concurrent backfills making disks go 
backfill_full or even full, and then it is suddenly a different problem. 
Some of the smaller drives are at ~75% utilization currently while 
larger drives are at ~56%, which is one of the things we hope to improve 
upon by increasing the pg_num.


I'll look at osd_recovery_max_single_start.


Is having many PGs misplaced actually counterproductive?


Not so much unless you’re severely low on RAM I think, but I would suggest 
upmap-remapped to vanish the misplaced PGs and let the balancer do it 
incrementally.  If you have 21% misplaced pgremapper may not have worked as 
expected - I have never used it, but upmap-remapped has worked well for me, 
usually needing 2-3 successive runs.

The 21% was right after doubling the pg_num. I then ran pgremapper and 
got misplaced to less than 1%, and since then the balancer has been slowly 
increasing the number again. I think those tools are largely doing the same 
thing? I'll try doing it again.


Thanks.

Best regards,

Torkil

I was thinking it was better to let the balancer balance all it could, as that 
would make all the moves available and decrease the risk of bottlenecking.


Wise choice.


Thanks.

Best regards,

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
