It is. I thought we were discussing within the context of reverting to wpq.
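
For reference, "reverting to wpq" boils down to something like the following; a minimal sketch, assuming the change is made cluster-wide via the central config store, and noting that osd_op_queue normally only takes effect once the OSDs have been restarted:

  ceph config set osd osd_op_queue wpq      # takes effect on OSD restart
  ceph config get osd osd_op_queue          # confirm what is stored centrally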
> On Apr 30, 2025, at 4:54 PM, Michel Jouvin <michel.jou...@ijclab.in2p3.fr> wrote:
>
> Hi,
>
> There have been a few messages mentioning increasing osd_max_backfills to
> boost the number of concurrent backfills. I thought this parameter was
> ignored/reset when using mclock (requiring osd_mclock_override_recovery_settings
> to be set to true, a value not recommended, I thought). Did I miss something?
>
> Michel
> Sent from my mobile
>
> On 30 April 2025 at 22:19:24, Maged Mokhtar <mmokh...@petasan.org> wrote:
>
>> This is strange indeed.
>>
>> 1) I recommend first making sure/validating that all PGs that are actively
>> backfilling, as well as those in backfill_wait, are indeed mainly from pool 11:
>>
>> ceph pg ls backfilling
>>
>> ceph pg ls backfill_wait
>>
>> This is to find out whether some other pools are causing this slow backfill
>> activity. Especially since you have 3 EC pools with size 9; pools with a high
>> size tend to have low active backfill counts (and scrub counts). Notice that
>> the number of PGs in backfill_wait is larger than 8192, so something else is
>> involved.
>>
>> 2) As recommended in an earlier post, I would increase osd_max_backfills to
>> 3 or more. As noted above, EC pools with a larger k+m size will benefit from
>> this. I understand you have concerns about stressing the drives, so first
>> increase osd_recovery_sleep to 1 (a high value) to offset the larger number
>> of backfills, and it is better to monitor with iostat -dxt 5 to make sure
>> the disk %util/busy is not too high (above 80%). Then you can adjust the
>> above 2 values while monitoring iostat.
>>
>> 3) One strange thing is that for the pool in question, pgp_num has already
>> reached 16384. Typically, when you set pg_num to 16384, Ceph internally
>> increases pgp_num in steps that do not cause more than
>> target_max_misplaced_ratio (the same value used by the balancer) to become
>> misplaced at a time, so pgp_num will lag pg_num for some time. Even if you
>> increased this from 0.05 to 0.3 (which is not recommended), that should not
>> have let pgp_num reach 16384 this quickly, unless maybe you have a large
>> number of objects stored in the other pools (ceph df will show this). But in
>> that case pool 11 may not be the only significant pool: maybe one of your
>> EC size 9 pools (6+3?) has a lot of data and a small number of PGs (32?), so
>> you have very large PGs that can have a dominant effect on backfill, as per
>> point 1).
>>
>> Again, it is quite strange.
>>
>> /maged
>>
>>
>> On 30/04/2025 00:54, Torkil Svensgaard wrote:
>>>
>>> On 29-04-2025 22:52, Anthony D'Atri wrote:
>>>>
>>>>>
>>>>> In order to get our PG sizes better aligned we doubled the number of
>>>>> PGs on the pool with the largest PG size. The pool is HDD with
>>>>> DB/WAL on SATA SSD, HDD sizes between 2 TB and 20 TB, and the PG size
>>>>> was ~140 GB before the doubling.
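
As a concrete, hedged version of Maged's points 1) and 2) above, something like the following could be used. The awk/cut pipeline is just one way to group PGs by pool id (the part of the PGID before the dot), and option defaults may differ between releases:

  # which pools the active and waiting backfills belong to
  ceph pg ls backfilling   | awk 'NR>1 {print $1}' | cut -d. -f1 | sort | uniq -c
  ceph pg ls backfill_wait | awk 'NR>1 {print $1}' | cut -d. -f1 | sort | uniq -c

  # allow backfill/recovery tuning under mclock (not needed when wpq is active),
  # then raise backfills cautiously and add recovery sleep to offset them
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 3
  ceph config set osd osd_recovery_sleep 1

  # on the OSD hosts, keep disk %util below ~80% and adjust the two values above accordingly
  iostat -dxt 5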
>>>>
>>>> Please send `ceph osd dump | grep pool`
>>>
>>> [root@lazy ~]# ceph osd dump | grep pool
>>> pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 2816850 lfor 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 3.97
>>> pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108 lfor 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 6.07
>>> pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/1370796/2806939 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 2.78
>>> pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth read_balance_score 40.00
>>> pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209 lfor 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 8.16
>>> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
>>> pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2722280 lfor 0/0/2139786 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 1.35
>>> pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402 lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs read_balance_score 8.16
>>> pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 6.94
>>> pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2178931 lfor 0/2178574/2178572 flags hashpspool stripe_width 0 application cephfs read_balance_score 6.07
>>> pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 3.77
>>> pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 8.05
>>> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
>>> pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 7.52
>>> pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542174 flags hashpspool stripe_width 0 compression_mode aggressive application rbd read_balance_score 8.16
>>> pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542179 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_mode aggressive application rbd
>>> pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2737621 flags hashpspool stripe_width 0 application rbd read_balance_score 3.67
>>> pool 48 'rbd.nvmebulk.data' erasure profile DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2737621 lfor 0/0/2736420 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application rbd
>>>
>>> Pool 11 is the one in question.
>>>
>>>>>   osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs
>>>>>
>>>>>   pools:   18 pools, 25249 pgs
>>>>>   objects: 683.85M objects, 1.6 PiB
>>>>>   usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
>>>>>   pgs:     842769842/3951610673 objects misplaced (21.327%)
>>>>>            16481 active+clean
>>>>>            8762  active+remapped+backfill_wait
>>>>>            6     active+remapped+backfilling
>>>>
>>>> Are you *sure* that you have both the mclock override enabled and the
>>>> op scheduler set to wpq at the proper scope?
>>>
>>> Reasonably sure:
>>>
>>> [root@ceph-flash1 ~]# ceph config dump | grep wpq
>>> osd        advanced  osd_op_queue                           wpq   *
>>>
>>> [root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
>>> osd        advanced  osd_mclock_override_recovery_settings  true
>>> osd.234    advanced  osd_mclock_override_recovery_settings  true
>>>
>>>> Note that if you're using a wide EC profile that will gridlock the
>>>> process to an extent.
>>>>
>>>>>   io:
>>>>>     client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
>>>>>     recovery: 153 MiB/s, 38 objects/s
>>>>> "
>>>>>
>>>>> The balancer was running and seemingly making very small changes:
>>>>>
>>>>> "
>>>>> [root@lazy ~]# ceph balancer status
>>>>> {
>>>>>     "active": true,
>>>>>     "last_optimize_duration": "0:00:01.012679",
>>>>>     "last_optimize_started": "Mon Apr 28 10:01:24 2025",
>>>>>     "mode": "upmap",
>>>>>     "no_optimization_needed": true,
>>>>>     "optimize_result": "Optimization plan created successfully",
>>>>>     "plans": []
>>>>> }
>>>>> "
>>>>
>>>> The balancer has a misplaced % above which it won't make additional
>>>> changes; that defaults, I think, to 5%. With 21% misplaced the
>>>> balancer will be on hold.
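
On Anthony's "proper scope" question a few messages up: `ceph config dump` shows what is stored in the central config database; a hedged way to double-check what an individual OSD daemon is actually running with (osd.234 is used here only because it already appears above) would be:

  ceph config show osd.234 osd_op_queue      # running value; wpq only appears after a restart
  ceph tell osd.234 config get osd_mclock_override_recovery_settings
  ceph tell osd.234 config get osd_max_backfills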
>>>
>>> I increased target_max_misplaced_ratio to ensure the balancer could
>>> work out all the moves:
>>>
>>> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
>>> mgr        basic     target_max_misplaced_ratio             0.300000
>>>
>>>>>
>>>>> This is going to take a while, any tips on how to escape the
>>>>> apparent bottleneck?
>>>>
>>>> Try raising
>>>>
>>>> osd_recovery_max_active
>>>> osd_recovery_max_single_start
>>>> osd_max_backfills
>>>>
>>>> to 2 or even 3. I have no empirical evidence, but I've observed that
>>>> when changing back to wpq, somewhat higher than customary values for
>>>> these may be needed to be effective. Restarting the OSDs one failure
>>>> domain at a time, waiting for recovery, might help according to some
>>>> references.
>>>
>>> I am reluctant to increase osd_max_backfills or osd_recovery_max_active
>>> because of the small disks in the cluster and the large PG size. We've
>>> historically hit problems with concurrent backfills making disks go
>>> backfill_full or even full, and then it is suddenly a different problem.
>>> Some of the smaller drives are at ~75% utilization currently while larger
>>> drives are at ~56%, which is one of the things we hope to improve upon by
>>> increasing the pg_num.
>>>
>>> I'll look at osd_recovery_max_single_start.
>>>
>>>>> Is having many PGs misplaced actually counterproductive?
>>>>
>>>> Not so much unless you're severely low on RAM, I think, but I would
>>>> suggest upmap-remapped to vanish the misplaced PGs and let the balancer
>>>> do it incrementally. If you have 21% misplaced, pgremapper may not have
>>>> worked as expected - I have never used it, but upmap-remapped has worked
>>>> well for me, usually needing 2-3 successive runs.
>>>
>>> The 21% was right after doubling the pg_num. I then ran pgremapper and got
>>> misplaced down to less than 1%, and then the balancer is slowly increasing
>>> the number again. I think those tools are largely doing the same thing?
>>> I'll try doing it again.
>>>
>>> Thanks.
>>>
>>> Mvh.
>>>
>>> Torkil
>>>
>>>>> I was thinking it was better to let the balancer balance all it could,
>>>>> as that would make all the moves available and decrease the risk of
>>>>> bottlenecking.
>>>>
>>>> Wise choice.
>>>>
>>>>> Thanks.
>>>>>
>>>>> Mvh.
>>>>>
>>>>> Torkil
>>>>>
>>>>> --
>>>>> Torkil Svensgaard
>>>>> Sysadmin
>>>>> MR-Forskningssektionen, afs. 714
>>>>> DRCMR, Danish Research Centre for Magnetic Resonance
>>>>> Hvidovre Hospital
>>>>> Kettegård Allé 30
>>>>> DK-2650 Hvidovre
>>>>> Denmark
>>>>> Tel: +45 386 22828
>>>>> E-mail: tor...@drcmr.dk
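
As an addendum on the upmap-remapped approach Anthony describes above: as I understand the script (from CERN's ceph-scripts repository; treat this as an assumption and check its README), it prints ceph CLI commands that pin misplaced PGs back to their current OSDs via pg-upmap-items, so the misplaced count drops to near zero and the balancer then removes those upmaps gradually, bounded per iteration by target_max_misplaced_ratio. Typical usage is roughly:

  # review the generated commands first, then apply; 2-3 successive runs are often needed
  ./upmap-remapped.py
  ./upmap-remapped.py | sh

  # let the balancer work through the remaining moves incrementally
  ceph balancer on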