It is.  I thought we were discussing this within the context of reverting to wpq.

> On Apr 30, 2025, at 4:54 PM, Michel Jouvin <michel.jou...@ijclab.in2p3.fr> 
> wrote:
> 
> Hi,
> 
> There have been a few messages mentioning increasing osd_max_backfills to 
> boost the number of concurrent backfills. I thought this parameter was 
> ignored/reset when using mclock unless osd_mclock_override_recovery_settings 
> is set to true, which I understood was not recommended. Did I miss something?
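> For reference, a quick way to check this (illustrative commands, assuming 
> the settings are applied cluster-wide under the "osd" section):
> 
> ceph config get osd osd_op_queue
> ceph config get osd osd_mclock_override_recovery_settings
> ceph config get osd osd_max_backfills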
> 
> Michel
> Sent from my mobile
> On 30 April 2025 at 22:19:24, Maged Mokhtar <mmokh...@petasan.org> wrote:
> 
>> This is strange indeed.
>> 
>> 1) I recommend first validating that the PGs actively backfilling, as well
>> as those in backfill_wait, are indeed mainly from pool 11:
>> 
>> ceph pg ls backfilling
>> 
>> ceph pg ls backfill_wait
>> 
>> This is to find out whether other pools are contributing to the slow backfill
>> activity. In particular, you have 3 EC pools with size 9, and such wide pools
>> tend to have low active backfill counts (and scrub counts). Notice that the
>> number of backfill_wait PGs is larger than 8192, so something else is involved.
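>> 
>> A rough way to break those down by pool (an illustrative one-liner that
>> counts PG ids by their pool prefix):
>> 
>> ceph pg ls backfill_wait | awk '$1 ~ /^[0-9]+\./ {split($1,a,"."); c[a[1]]++} END {for (p in c) print c[p], "PGs in pool", p}'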
>> 
>> 2) As recommended in an earlier post, I would increase osd_max_backfills to
>> 3 or more. As noted above, EC pools with a larger k+m will benefit from
>> this. I understand you have concerns about stressing the drives, so first
>> increase osd_recovery_sleep to 1 (a high value) to offset the extra
>> backfills, and monitor with iostat -dxt 5 to make sure the disk %util/busy
>> does not get too high (above 80%). You can then adjust those two values
>> while keeping an eye on iostat.
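>> 
>> For example (illustrative, cluster-wide; adjust the values to taste):
>> 
>> ceph config set osd osd_max_backfills 3
>> ceph config set osd osd_recovery_sleep 1    # or the _hdd/_hybrid variant matching your OSDs
>> iostat -dxt 5                               # watch %util, back off if it stays above ~80%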
>> 
>> 3) One strange thing is that pgp_num on the pool in question has already
>> reached 16384. Typically, when you set pg_num to 16384, Ceph internally
>> increases pgp_num in steps, each of which does not exceed
>> target_max_misplaced_ratio (the same value used by the balancer), so
>> pgp_num lags pg_num for some time, even if you increased that ratio from
>> 0.05 to 0.3 (which is not recommended). One explanation would be a large
>> number of objects stored in the other pools (ceph df will show this), but
>> in that case pool 11 is not the only significant pool: perhaps one of your
>> size-9 EC pools (6+3?) has a lot of data and a small number of PGs (32?),
>> so you have very large PGs that can have a dominant effect on backfill as
>> per point 1).
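>> 
>> To confirm where the pool stands (illustrative, using the pool name from
>> your osd dump):
>> 
>> ceph osd pool get rbd_ec_data pg_num
>> ceph osd pool get rbd_ec_data pgp_num
>> ceph config get mgr target_max_misplaced_ratio
>> ceph df detail    # per-pool object counts and stored data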
>> 
>> Again, it is quite strange.
>> 
>> /maged
>> 
>> 
>> On 30/04/2025 00:54, Torkil Svensgaard wrote:
>>> 
>>> 
>>> On 29-04-2025 22:52, Anthony D'Atri wrote:
>>>> 
>>>> 
>>>>> 
>>>>> In order to get our PG sizes better aligned we doubled the number of
>>>>> PGs on the pool with the largest PG size. The pool is HDD with
>>>>> DB/WAL on SATA SSD and HDD sizes between 2TB and 20TB and PG size
>>>>> was ~140GB before the doubling.
>>>> 
>>>> 
>>>> Please send `ceph osd dump | grep pool`
>>> 
>>> [root@lazy ~]# ceph osd dump | grep pool
>>> pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash
>>> rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change
>>> 2816850 lfor 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk
>>> stripe_width 0 application rbd read_balance_score 3.97
>>> pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash
>>> rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108
>>> lfor 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width
>>> 0 application rbd read_balance_score 6.07
>>> pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
>>> last_change 2816850 lfor 0/1370796/2806939 flags
>>> hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd
>>> read_balance_score 2.78
>>> pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash
>>> rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576
>>> flags hashpspool stripe_width 0 pg_num_min 1 application
>>> mgr,mgr_devicehealth read_balance_score 40.00
>>> pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash
>>> rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209
>>> lfor 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width
>>> 0 application rbd read_balance_score 8.16
>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
>>> crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384
>>> autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
>>> fast_read 1 compression_algorithm snappy compression_mode aggressive
>>> application rbd
>>> pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5
>>> object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
>>> last_change 2722280 lfor 0/0/2139786 flags
>>> hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd
>>> read_balance_score 1.35
>>> pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash
>>> rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402
>>> lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs
>>> read_balance_score 8.16
>>> pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3
>>> object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off
>>> last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0
>>> pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
>>> cephfs read_balance_score 6.94
>>> pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3
>>> object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off
>>> last_change 2178931 lfor 0/2178574/2178572 flags hashpspool
>>> stripe_width 0 application cephfs read_balance_score 6.07
>>> pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5
>>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off
>>> last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk
>>> stripe_width 0 compression_algorithm zstd compression_mode aggressive
>>> application cephfs read_balance_score 3.77
>>> pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3
>>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off
>>> last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk
>>> stripe_width 0 compression_algorithm zstd compression_mode aggressive
>>> application cephfs read_balance_score 8.05
>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
>>> size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048
>>> pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486
>>> flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
>>> compression_algorithm zstd compression_mode aggressive application cephfs
>>> pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3
>>> object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
>>> last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0
>>> application rbd read_balance_score 7.52
>>> pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3
>>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
>>> last_change 2542174 flags hashpspool stripe_width 0 compression_mode
>>> aggressive application rbd read_balance_score 8.16
>>> pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd
>>> size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num
>>> 32 autoscale_mode warn last_change 2542179 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> compression_mode aggressive application rbd
>>> pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10
>>> object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
>>> last_change 2737621 flags hashpspool stripe_width 0 application rbd
>>> read_balance_score 3.67
>>> pool 48 'rbd.nvmebulk.data' erasure profile
>>> DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11
>>> object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off
>>> last_change 2737621 lfor 0/0/2736420 flags
>>> hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
>>> compression_algorithm snappy compression_mode aggressive application rbd
>>> 
>>> Pool 11 is the one in question.
>>> 
>>>>> 
>>>>>    osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767
>>>>> remapped pgs
>>>>> 
>>>>>    pools:   18 pools, 25249 pgs
>>>>>    objects: 683.85M objects, 1.6 PiB
>>>>>    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
>>>>>    pgs:     842769842/3951610673 objects misplaced (21.327%)
>>>>>             16481 active+clean
>>>>>             8762  active+remapped+backfill_wait
>>>>>             6     active+remapped+backfilling
>>>> 
>>>> Are you *sure* that you have both the mclock override enabled and the
>>>> op scheduler set to wpq at the proper scope?
>>> 
>>> Reasonably sure:
>>> 
>>> [root@ceph-flash1 ~]# ceph config dump | grep wpq
>>> osd advanced  osd_op_queue wpq *
>>> 
>>> [root@ceph-flash1 ~]# ceph config dump | grep
>>> osd_mclock_override_recovery_settings
>>> osd                    advanced osd_mclock_override_recovery_settings
>>>        true
>>> osd.234                advanced osd_mclock_override_recovery_settings
>>>        true
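>>> 
>>> Another check worth doing (osd_op_queue only takes effect after an OSD
>>> restart) is to confirm the value a running daemon actually uses, e.g. for
>>> the osd.234 listed above:
>>> 
>>> ceph config show osd.234 osd_op_queue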
>>> 
>>>> Note that if you’re using a wide EC profile that will gridlock the
>>>> process to an extent.
>>>> 
>>>>> 
>>>>>  io:
>>>>>    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
>>>>>    recovery: 153 MiB/s, 38 objects/s
>>>>> "
>>>>> 
>>>>> The balancer was running and seemingly making very small changes:
>>>>> 
>>>>> "
>>>>> [root@lazy ~]# ceph balancer status
>>>>> {
>>>>>    "active": true,
>>>>>    "last_optimize_duration": "0:00:01.012679",
>>>>>    "last_optimize_started": "Mon Apr 28 10:01:24 2025",
>>>>>    "mode": "upmap",
>>>>>    "no_optimization_needed": true,
>>>>>    "optimize_result": "Optimization plan created successfully",
>>>>>    "plans": []
>>>>> }
>>>>> "
>>>> 
>>>> The balancer has a misplaced % above which it won't make additional
>>>> changes, which I think defaults to 5%.  With 21% misplaced the
>>>> balancer will be on hold.
>>> 
>>> I increased target_max_misplaced_ratio to ensure the balancer could
>>> work out all the moves:
>>> 
>>> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
>>> mgr                    basic     target_max_misplaced_ratio
>>> 0.300000
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> This is going to take a while, any tips on how to escape the
>>>>> apparent bottleneck?
>>>> 
>>>> Try raising
>>>> 
>>>> osd_recovery_max_active
>>>> osd_recovery_max_single_start
>>>> osd_max_backfills
>>>> 
>>>> to 2 or even 3.  I have no empirical evidence, but I've observed that
>>>> when changing back to wpq, somewhat higher than customary values
>>>> for these may be needed to be effective. Restarting the OSDs one
>>>> failure domain at a time, waiting for recovery in between, might help
>>>> according to some references.
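>>>> 
>>>> For example (a sketch, applied cluster-wide; individual OSDs can be
>>>> targeted instead of "osd"):
>>>> 
>>>> ceph config set osd osd_max_backfills 2
>>>> ceph config set osd osd_recovery_max_active 2
>>>> ceph config set osd osd_recovery_max_single_start 2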
>>> 
>>> I am reluctant to increase osd_max_backfills or
>>> osd_recovery_max_active because of the small disks in the cluster and
>>> the large PG size. We've historically hit problems with concurrent
>>> backfills making disks go backfill_full or even full and then it is
>>> suddenly a different problem. Some of the smaller drives are at ~75%
>>> utilization currently while larger drives are at ~56%, which is one of
>>> the things we hope to improve upon by increasing the pg_num.
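>>> 
>>> For monitoring fullness during backfill, something like the following can
>>> help:
>>> 
>>> ceph osd df tree            # per-OSD %USE
>>> ceph osd dump | grep ratio  # full_ratio / backfillfull_ratio / nearfull_ratio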
>>> 
>>> I'll look at osd_recovery_max_single_start.
>>> 
>>>>> 
>>>>> Is having many PGs misplaced actually counterproductive
>>>> 
>>>> Not so much unless you're severely low on RAM, I think, but I would
>>>> suggest upmap-remapped to make the misplaced PGs vanish and let the
>>>> balancer do it incrementally. If you have 21% misplaced, pgremapper
>>>> may not have worked as expected - I have never used it, but
>>>> upmap-remapped has worked well for me, usually needing 2-3 successive
>>>> runs.
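>>>> 
>>>> A sketch of the usual workflow, assuming the upmap-remapped.py script from
>>>> the ceph-scripts repository (review the generated pg-upmap-items commands
>>>> before applying them):
>>>> 
>>>> ./upmap-remapped.py | less   # inspect the proposed upmap entries
>>>> ./upmap-remapped.py | sh     # apply; repeat until misplaced stops dropping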
>>> 
>>> The 21% was right after doubling the pg_num. I then ran pgremapper and
>>> got misplaced down to less than 1%, and the balancer is now slowly
>>> increasing the number again. I think those tools largely do the same
>>> thing? I'll try doing it again.
>>> 
>>> Thanks.
>>> 
>>> Mvh.
>>> 
>>> Torkil
>>> 
>>>> 
>>>>> I was thinking it was better to let the balancer balance all it
>>>>> could, as that would make all the moves available and decrease the
>>>>> risk of bottlenecking.
>>>> 
>>>> Wise choice.
>>>> 
>>>>> 
>>>>> Thanks.
>>>>> 
>>>>> Mvh.
>>>>> 
>>>>> Torkil
>>>>> 
>>>>> --
>>>>> Torkil Svensgaard
>>>>> Sysadmin
>>>>> MR-Forskningssektionen, afs. 714
>>>>> DRCMR, Danish Research Centre for Magnetic Resonance
>>>>> Hvidovre Hospital
>>>>> Kettegård Allé 30
>>>>> DK-2650 Hvidovre
>>>>> Denmark
>>>>> Tel: +45 386 22828
>>>>> E-mail: tor...@drcmr.dk
>>> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io