> I increased target_max_misplaced_ratio to ensure the balancer could work out
> all the moves:
>
> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
> mgr    basic    target_max_misplaced_ratio    0.300000

That's a very high value. Less data gets moved more than once, at the possible risk of so much backfill that performance suffers. Whatever floats your boat.
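If at some point you want the balancer to go back to working in smaller increments, dropping the ratio back toward the shipped default of 5% should be enough; a minimal sketch (mgr scope, matching your config dump):

    ceph config set mgr target_max_misplaced_ratio 0.05   # 0.05 is the default
    ceph config get mgr target_max_misplaced_ratio        # confirm the active value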
>>> In order to get our PG sizes better aligned we doubled the number of PGs on
>>> the pool with the largest PG size. The pool is HDD with DB/WAL on SATA SSD,
>>> HDD sizes between 2TB and 20TB, and the PG size was ~140GB before the
>>> doubling.
>>
>> Please send `ceph osd dump | grep pool`
>
> [root@lazy ~]# ceph osd dump | grep pool

Why multiple RBD pools? I suspect that you have multiple device classes / media, but still... Large numbers of pools make it more difficult to calculate good pg_num values when not using the autoscaler.

I suggest playing with https://docs.ceph.com/en/squid/rados/operations/pgcalc/ ... setting the target PGs per OSD to 250.

Note the pools with a bias value >1, typically RGW index and CephFS metadata pools. Those pools benefit from a larger pg_num than their byte usage alone would indicate. You might account for this in pgcalc by giving them a larger data %, or just shoot higher for those pools than calculated. I would suggest at least the number of SSD OSDs on which these pools are placed, rounded up to the next power of two (and maybe doubled). I don't want to assume that your cluster is entirely non-rotational.
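To make the pgcalc arithmetic concrete: per pool it is roughly target PGs per OSD, times the number of OSDs carrying the pool, times the pool's share of their data, divided by the replica count (or k+m for EC), then rounded to a power of two. The OSD count and the 60% share below are invented placeholders for illustration, not a recommendation for any of your pools:

    # 250 * 576 * 0.60 / 6 = 14400  ->  next power of two is 16384
    ceph osd pool set <pool> pg_num 16384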
> I then ran pgremapper and got misplaced to less than 1% and then the
> balancer is slowly increasing the number again. I think those tools are
> largely doing the same thing? I'll try doing it again.

That high max ratio explains it. Usually 30% misplaced is an indication that something isn't as expected.

> pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 2816850 lfor 0/1844098/2447930 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 3.97
> pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off last_change 2824108 lfor 0/434267/1506461 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 6.07
> pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/1370796/2806939 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 2.78
> pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576 flags hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth read_balance_score 40.00
> pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 1919209 lfor 0/1180414/1180412 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 8.16
> pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5 crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384 autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384 fast_read 1 compression_algorithm snappy compression_mode aggressive application rbd
> pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2722280 lfor 0/0/2139786 flags hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd read_balance_score 1.35
> pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402 lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs read_balance_score 8.16
> pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 6.94
> pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2178931 lfor 0/2178574/2178572 flags hashpspool stripe_width 0 application cephfs read_balance_score 6.07
> pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 3.77
> pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk stripe_width 0 compression_algorithm zstd compression_mode aggressive application cephfs read_balance_score 8.05
> pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486 flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1 compression_algorithm zstd compression_mode aggressive application cephfs
> pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 7.52
> pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542174 flags hashpspool stripe_width 0 compression_mode aggressive application rbd read_balance_score 8.16
> pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2542179 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_mode aggressive application rbd
> pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2737621 flags hashpspool stripe_width 0 application rbd read_balance_score 3.67
> pool 48 'rbd.nvmebulk.data' erasure profile DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off last_change 2737621 lfor 0/0/2736420 flags hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384 compression_algorithm snappy compression_mode aggressive application rbd
>
> Pool 11 is the one in question.
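Since pool 11 is the one being split, a couple of read-only checks show where the split actually stands while the backfill churns:

    ceph osd pool get rbd_ec_data pg_num
    ceph osd pool get rbd_ec_data pgp_num
    ceph osd pool ls detail | grep "'rbd_ec_data'"

If pg_num and pgp_num differ, the split is still being applied and more remapping is queued behind what you already see.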
>
>>>
>>>    osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767 remapped pgs
>>>
>>>    pools:   18 pools, 25249 pgs
>>>    objects: 683.85M objects, 1.6 PiB
>>>    usage:   2.7 PiB used, 1.9 PiB / 4.5 PiB avail
>>>    pgs:     842769842/3951610673 objects misplaced (21.327%)
>>>             16481 active+clean
>>>             8762  active+remapped+backfill_wait
>>>             6     active+remapped+backfilling
>>
>> Are you *sure* that you have both the mclock override enabled and the op
>> scheduler set to wpq at the proper scope?
>
> Reasonably sure:
>
> [root@ceph-flash1 ~]# ceph config dump | grep wpq
> osd        advanced   osd_op_queue                            wpq   *
>
> [root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
> osd        advanced   osd_mclock_override_recovery_settings   true
> osd.234    advanced   osd_mclock_override_recovery_settings   true
>
>> Note that if you're using a wide EC profile, that will gridlock the process
>> to an extent.
>>>
>>> io:
>>>    client:   374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
>>>    recovery: 153 MiB/s, 38 objects/s
>>> "
>>>
>>> The balancer was running and seemingly making very small changes:
>>>
>>> "
>>> [root@lazy ~]# ceph balancer status
>>> {
>>>     "active": true,
>>>     "last_optimize_duration": "0:00:01.012679",
>>>     "last_optimize_started": "Mon Apr 28 10:01:24 2025",
>>>     "mode": "upmap",
>>>     "no_optimization_needed": true,
>>>     "optimize_result": "Optimization plan created successfully",
>>>     "plans": []
>>> }
>>> "
>>
>> The balancer has a misplaced % above which it won't make additional changes;
>> that defaults, I think, to 5%. With 21% misplaced the balancer will be on hold.
>
> I increased target_max_misplaced_ratio to ensure the balancer could work out
> all the moves:
>
> [root@ceph-flash1 ~]# ceph config dump | grep misplaced
> mgr        basic      target_max_misplaced_ratio              0.300000
>
>>>
>>> This is going to take a while, any tips on how to escape the apparent
>>> bottleneck?
>>
>> Try raising
>>    osd_recovery_max_active
>>    osd_recovery_max_single_start
>>    osd_max_backfills
>> to 2 or even 3. I have no empirical evidence, but I've observed that when
>> changing back to wpq, somewhat higher than customary values for these may be
>> needed to be effective. Restarting the OSDs one failure domain at a time,
>> waiting for recovery, might help according to some references.
>
> I am reluctant to increase osd_max_backfills or osd_recovery_max_active
> because of the small disks in the cluster and the large PG size. We've
> historically hit problems with concurrent backfills making disks go
> backfill_full or even full, and then it is suddenly a different problem. Some
> of the smaller drives are at ~75% utilization currently while larger drives
> are at ~56%, which is one of the things we hope to improve upon by increasing
> the pg_num.
>
> I'll look at osd_recovery_max_single_start.
>
>>> Is having many PGs misplaced actually counterproductive?
>>
>> Not so much unless you're severely low on RAM, I think, but I would suggest
>> upmap-remapped to make the misplaced PGs vanish and let the balancer do it
>> incrementally. If you have 21% misplaced, pgremapper may not have worked as
>> expected - I have never used it, but upmap-remapped has worked well for me,
>> usually needing 2-3 successive runs.
>
> The 21% was right after doubling the pg_num. I then ran pgremapper and got
> misplaced to less than 1%, and then the balancer is slowly increasing the
> number again. I think those tools are largely doing the same thing? I'll try
> doing it again.
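For the next attempt, the usual upmap-remapped pattern (going from memory here, double-check against the script's own documentation) is to eyeball the generated commands before piping them in:

    ./upmap-remapped.py          # prints ceph osd pg-upmap-items / rm-pg-upmap-items commands
    ./upmap-remapped.py | sh     # apply once the output looks sane

Repeat until the misplaced percentage stops dropping, then let the balancer take over with a lower target_max_misplaced_ratio.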
>
> Thanks.
>
> Mvh.
>
> Torkil
>
>>> I was thinking it was better to let the balancer balance all it could, as
>>> that would make all the moves available and decrease the risk of
>>> bottlenecking.
>>
>> Wise choice.
>>>
>>> Thanks.
>>>
>>> Mvh.
>>>
>>> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io