On 30-04-2025 08:11, Torkil Svensgaard wrote:
On 30/04/2025 01:08, Anthony D'Atri wrote:
I increased target_max_misplaced_ratio to ensure the balancer could
work out all the moves:
[root@ceph-flash1 ~]# ceph config dump | grep misplaced
mgr  basic  target_max_misplaced_ratio  0.300000
That’s a very high value. With it, less data gets moved more than once,
at the possible risk of too much backfill impacting performance.
Whatever floats your boat.
So perhaps not wise after all to have a large target_max_misplaced_ratio
to map out all the moves. I'm going to reduce it to the default and
clear the misplaced PGs to see if staying at a low misplaced percentage
might work better.
"
pgs: 197829350/3953318209 objects misplaced (5.004%)
21780 active+clean
3394 active+remapped+backfill_wait
75 active+remapped+backfilling
io:
client: 133 MiB/s rd, 106 MiB/s wr, 1.11k op/s rd, 727 op/s wr
recovery: 4.8 GiB/s, 1.22k objects/s
"
It could of course be something else at play here but staying at 5% max
misplaced seems to have improved the situation. Or it could be a fluke.
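To put the two status snapshots in this thread side by side, here is a rough back-of-the-envelope sketch. It simply divides the misplaced object count by the reported recovery rate, which assumes the rate stays constant and that both figures count the same kind of object, neither of which is guaranteed. The function name is made up for illustration.

```python
def backfill_eta_seconds(misplaced_objects: int, objects_per_second: float) -> float:
    """Naive time-to-finish estimate: remaining objects / current rate."""
    if objects_per_second <= 0:
        raise ValueError("recovery rate must be positive")
    return misplaced_objects / objects_per_second


# Snapshot at 21.327% misplaced: 842769842 objects, recovery at 38 objects/s.
eta_high_ratio = backfill_eta_seconds(842_769_842, 38)

# Snapshot at 5.004% misplaced: 197829350 objects, recovery at 1.22k objects/s.
eta_low_ratio = backfill_eta_seconds(197_829_350, 1_220)

print(f"high ratio snapshot: {eta_high_ratio / 86_400:.1f} days")
print(f"low ratio snapshot:  {eta_low_ratio / 3_600:.1f} hours")
```

Taken at face value that is on the order of 257 days versus about 45 hours, so the difference in recovery rate matters far more than the difference in the amount of misplaced data.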
Best regards,
Torkil
In order to get our PG sizes better aligned we doubled the number
of PGs on the pool with the largest PG size. The pool is HDD with
DB/WAL on SATA SSD and HDD sizes between 2TB and 20TB and PG size
was ~140GB before the doubling.
Please send `ceph osd dump | grep pool`
[root@lazy ~]# ceph osd dump | grep pool
Why multiple RBD pools? I suspect that you have multiple device
classes / media, but still.. Large numbers of pools make it more
difficult to calculate good pg_num values when not using the autoscaler.
Multiple device classes and use cases, but there's room for improvement.
Several of the pools aren't used and were just created to test
performance for a given configuration.
I suggest playing with
https://docs.ceph.com/en/squid/rados/operations/pgcalc/
… setting the target PGs per OSD to 250
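For readers without the calculator at hand, the core arithmetic behind it can be sketched as below. This is a simplification: it rounds to the nearest power of two, whereas the real pgcalc applies its own rounding rule, and the OSD counts and data fractions in the example are hypothetical, not values from this cluster.

```python
def nearest_power_of_two(x: float) -> int:
    """Return the power of two closest to x (ties go to the higher one)."""
    if x <= 1:
        return 1
    lower = 1 << (int(x).bit_length() - 1)
    upper = lower * 2
    return lower if (x - lower) < (upper - x) else upper


def pgcalc_pg_num(target_pgs_per_osd: int, osd_count: int,
                  data_fraction: float, pool_size: int) -> int:
    """Approximate pg_num for one pool.

    pool_size is the replica count for replicated pools, or k+m for
    erasure-coded pools. data_fraction is the share of the OSDs' data
    expected to live in this pool (0.0 to 1.0).
    """
    raw = target_pgs_per_osd * osd_count * data_fraction / pool_size
    return nearest_power_of_two(raw)


# Hypothetical: 100 OSDs, target 250 PGs per OSD, one replicated size-3 pool.
print(pgcalc_pg_num(250, 100, 1.0, 3))
# Hypothetical: same OSDs, a k4m2 pool (size 6) holding half the data.
print(pgcalc_pg_num(250, 100, 0.5, 6))
```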
There was a thread[1] last year about many PGs per OSD without any firm
conclusions, so we are going to bump our number of PGs for the largest
HDDs a lot higher than 250 while keeping an eye on the impact. Currently
sitting at something like 550 PGs for a 20TB drive.
Note the pools with a bias value >1, typical RGW index and CephFS
metadata pools. This is because those pools benefit from a larger
pg_num value than their bytes usage might otherwise indicate. You
might account for this in the pgcalc by giving larger data %, or just
shoot higher for those pools than calculated. I would suggest at
least the number of SSD OSDs on which these pools are placed, round up
to the next power of two (and maybe double). I don’t want to assume
that your cluster is entirely non-rotational.
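The rule of thumb above for index and metadata pools (at least the number of SSD OSDs, rounded up to the next power of two, maybe doubled) can be written down as a small helper. The function name and the OSD count of 100 are placeholders for illustration only.

```python
def metadata_pool_pg_floor(ssd_osd_count: int, double: bool = False) -> int:
    """Smallest power of two >= ssd_osd_count, optionally doubled."""
    if ssd_osd_count < 1:
        raise ValueError("need at least one OSD")
    floor = 1
    while floor < ssd_osd_count:
        floor *= 2
    return floor * 2 if double else floor


# Hypothetical: the metadata pool's CRUSH rule covers 100 SSD OSDs.
print(metadata_pool_pg_floor(100))               # rounded up
print(metadata_pool_pg_floor(100, double=True))  # rounded up, then doubled
```

With 100 OSDs this gives 128, or 256 when doubled; a count that is already a power of two is returned unchanged.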
Our cluster is largely rotational but moving towards flash going
forward. Thanks for the pointers, we'll go over the values.
Best regards,
Torkil
[1] https://www.mail-archive.com/ceph-users@ceph.io/msg27153.html
I then ran pgremapper and got misplaced to less than 1% and then
the balancer is slowly increasing the number again. I think those
tools are largely doing the same thing? I'll try doing it again.
That high max ratio explains it. Usually 30% misplaced is an
indication that something isn’t as expected.
pool 4 'rbd' replicated size 3 min_size 2 crush_rule 4 object_hash
rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change
2816850 lfor 0/1844098/2447930 flags
hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd
read_balance_score 3.97
pool 5 'libvirt' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode off
last_change 2824108 lfor 0/434267/1506461 flags
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
read_balance_score 6.07
pool 6 'rbd_internal' replicated size 3 min_size 2 crush_rule 4
object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
last_change 2816850 lfor 0/1370796/2806939 flags
hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd
read_balance_score 2.78
pool 8 '.mgr' replicated size 2 min_size 1 crush_rule 3 object_hash
rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 1667576
flags hashpspool stripe_width 0 pg_num_min 1 application
mgr,mgr_devicehealth read_balance_score 40.00
pool 10 'rbd_ec' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
last_change 1919209 lfor 0/1180414/1180412 flags
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
read_balance_score 8.16
pool 11 'rbd_ec_data' erasure profile DRCMR_k4m2 size 6 min_size 5
crush_rule 0 object_hash rjenkins pg_num 16384 pgp_num 16384
autoscale_mode off last_change 2832704 lfor 0/1291190/2832700 flags
hashpspool,ec_overwrites,selfmanaged_snaps,bulk stripe_width 16384
fast_read 1 compression_algorithm snappy compression_mode aggressive
application rbd
pool 23 'rbd.nvme' replicated size 2 min_size 1 crush_rule 5
object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode off
last_change 2722280 lfor 0/0/2139786 flags
hashpspool,selfmanaged_snaps,bulk stripe_width 0 application rbd
read_balance_score 1.35
pool 25 '.nfs' replicated size 3 min_size 2 crush_rule 3 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 2177402
lfor 0/0/2065595 flags hashpspool stripe_width 0 application nfs
read_balance_score 8.16
pool 31 'cephfs.cephfs.meta' replicated size 3 min_size 2 crush_rule
3 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode off
last_change 2478849 lfor 0/0/2198357 flags hashpspool stripe_width 0
pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application
cephfs read_balance_score 6.94
pool 32 'cephfs.cephfs.data' replicated size 3 min_size 2 crush_rule
3 object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off
last_change 2178931 lfor 0/2178574/2178572 flags hashpspool
stripe_width 0 application cephfs read_balance_score 6.07
pool 34 'cephfs.nvme.data' replicated size 2 min_size 1 crush_rule 5
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off
last_change 2722280 lfor 0/2147353/2147351 flags hashpspool,bulk
stripe_width 0 compression_algorithm zstd compression_mode aggressive
application cephfs read_balance_score 3.77
pool 35 'cephfs.ssd.data' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off
last_change 2198980 lfor 0/0/2126134 flags hashpspool,bulk
stripe_width 0 compression_algorithm zstd compression_mode aggressive
application cephfs read_balance_score 8.05
pool 37 'cephfs.hdd.data' erasure profile DRCMR_k4m5_datacenter_hdd
size 9 min_size 5 crush_rule 7 object_hash rjenkins pg_num 2048
pgp_num 2048 autoscale_mode off last_change 2816850 lfor 0/0/2139486
flags hashpspool,ec_overwrites,bulk stripe_width 16384 fast_read 1
compression_algorithm zstd compression_mode aggressive application
cephfs
pool 39 'rbd.ssd' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode warn
last_change 2541795 flags hashpspool,selfmanaged_snaps stripe_width 0
application rbd read_balance_score 7.52
pool 43 'rbd.ssd.ec' replicated size 3 min_size 2 crush_rule 3
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
last_change 2542174 flags hashpspool stripe_width 0 compression_mode
aggressive application rbd read_balance_score 8.16
pool 44 'rbd.ssd.ec.data' erasure profile DRCMR_k4m5_datacenter_ssd
size 9 min_size 5 crush_rule 6 object_hash rjenkins pg_num 32 pgp_num
32 autoscale_mode warn last_change 2542179 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
compression_mode aggressive application rbd
pool 47 'rbd.nvmebulk.ec' replicated size 3 min_size 2 crush_rule 10
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn
last_change 2737621 flags hashpspool stripe_width 0 application rbd
read_balance_score 3.67
pool 48 'rbd.nvmebulk.data' erasure profile
DRCMR_k4m5_datacenter_nvmebulk size 9 min_size 5 crush_rule 11
object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode off
last_change 2737621 lfor 0/0/2736420 flags
hashpspool,ec_overwrites,selfmanaged_snaps stripe_width 16384
compression_algorithm snappy compression_mode aggressive application rbd
Pool 11 is the one in question.
osd: 576 osds: 576 up (since 2h), 576 in (since 3d); 8767
remapped pgs
pools: 18 pools, 25249 pgs
objects: 683.85M objects, 1.6 PiB
usage: 2.7 PiB used, 1.9 PiB / 4.5 PiB avail
pgs: 842769842/3951610673 objects misplaced (21.327%)
16481 active+clean
8762 active+remapped+backfill_wait
6 active+remapped+backfilling
Are you *sure* that you have both the mclock override enabled and
the op scheduler set to wpq at the proper scope?
Reasonably sure:
[root@ceph-flash1 ~]# ceph config dump | grep wpq
osd advanced osd_op_queue wpq *
[root@ceph-flash1 ~]# ceph config dump | grep osd_mclock_override_recovery_settings
osd      advanced  osd_mclock_override_recovery_settings  true
osd.234  advanced  osd_mclock_override_recovery_settings  true
Note that if you’re using a wide EC profile that will gridlock the
process to an extent.
io:
client: 374 MiB/s rd, 14 MiB/s wr, 2.86k op/s rd, 410 op/s wr
recovery: 153 MiB/s, 38 objects/s
"
The balancer was running and seemingly making very small changes:
"
[root@lazy ~]# ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:01.012679",
"last_optimize_started": "Mon Apr 28 10:01:24 2025",
"mode": "upmap",
"no_optimization_needed": true,
"optimize_result": "Optimization plan created successfully",
"plans": []
}
"
The balancer has a misplaced % above which it won’t make additional
changes, that defaults I think to 5%. With 21% misplaced the
balancer will be on hold.
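As a sketch of that check, the snippet below pulls the misplaced ratio out of the `pgs:` line of the status output and compares it with target_max_misplaced_ratio (default 0.05). The helper names are made up, and the exact comparison the balancer module uses is assumed here to be "greater than or equal".

```python
import re


def misplaced_ratio(pgs_line: str) -> float:
    """Extract misplaced/total from a 'pgs: A/B objects misplaced (...)' line."""
    match = re.search(r"(\d+)/(\d+) objects misplaced", pgs_line)
    if match is None:
        return 0.0  # no misplaced objects reported
    return int(match.group(1)) / int(match.group(2))


def balancer_on_hold(ratio: float, target_max_misplaced_ratio: float = 0.05) -> bool:
    """Assumed rule: no new moves while the ratio is at or above the target."""
    return ratio >= target_max_misplaced_ratio


line = "pgs: 842769842/3951610673 objects misplaced (21.327%)"
ratio = misplaced_ratio(line)
print(f"{ratio:.3%}", balancer_on_hold(ratio))        # default 5% target
print(f"{ratio:.3%}", balancer_on_hold(ratio, 0.30))  # raised to 30%
```

With the default target the balancer would be on hold at 21%, while with the target raised to 0.30 it would keep scheduling moves.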
I increased target_max_misplaced_ratio to ensure the balancer could
work out all the moves:
[root@ceph-flash1 ~]# ceph config dump | grep misplaced
mgr  basic  target_max_misplaced_ratio  0.300000
This is going to take a while, any tips on how to escape the
apparent bottleneck?
Try raising
osd_recovery_max_active
osd_recovery_max_single_start
osd_max_backfills
to 2 or even 3. I have no empirical evidence, but I’ve observed that
after changing back to wpq, somewhat higher than customary values for
these may be needed to be effective. Restarting the OSDs one failure
domain at a time, waiting for recovery after each, might also help
according to some references.
I am reluctant to increase osd_max_backfills or
osd_recovery_max_active because of the small disks in the cluster and
the large PG size. We've historically hit problems with concurrent
backfills making disks go backfill_full or even full and then it is
suddenly a different problem. Some of the smaller drives are at ~75%
utilization currently while larger drives are at ~56%, which is one
of the things we hope to improve upon by increasing the pg_num.
I'll look at osd_recovery_max_single_start.
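The concern about small disks can be made concrete with a worst-case estimate. The sketch below assumes the ~140GB figure is the amount of data one backfill lands on a single OSD (for an EC pool the per-OSD shard may be smaller), that nothing leaves the OSD in the meantime, and that backfillfull_ratio is at its default of 0.90. The function name is invented for illustration.

```python
BACKFILLFULL_RATIO = 0.90  # Ceph default; assumed unchanged on this cluster


def projected_utilization(current: float, pg_bytes: float,
                          disk_bytes: float, concurrent_backfills: int) -> float:
    """Worst case: every concurrent backfill is incoming and completes
    before any data is moved off the OSD."""
    return current + concurrent_backfills * pg_bytes / disk_bytes


GB = 10**9
TB = 10**12

# A 2TB drive at ~75% utilization, PGs of ~140GB (before the doubling).
for backfills in (1, 2, 3):
    used = projected_utilization(0.75, 140 * GB, 2 * TB, backfills)
    state = "over backfillfull" if used > BACKFILLFULL_RATIO else "ok"
    print(f"{backfills} backfills, 140GB PGs: {used:.1%} {state}")

# Same drive after the doubling, assuming PGs shrink to ~70GB.
print(f"3 backfills, 70GB PGs: {projected_utilization(0.75, 70 * GB, 2 * TB, 3):.1%}")
```

Under these assumptions each 140GB backfill adds 7 percentage points on a 2TB drive, so three concurrent ones would cross 90%, while with 70GB PGs three would stay below it.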
Is having many PGs misplaced actually counterproductive?
Not so much unless you’re severely low on RAM I think, but I would
suggest upmap-remapped to vanish the misplaced PGs and let the
balancer do it incrementally. If you have 21% misplaced pgremapper
may not have worked as expected - I have never used it, but
upmap-remapped has worked well for me, usually needing 2-3
successive runs.
The 21% was right after doubling the pg_num. I then ran pgremapper
and got misplaced to less than 1% and then the balancer is slowly
increasing the number again. I think those tools are largely doing
the same thing? I'll try doing it again.
Thanks.
Best regards,
Torkil
I was thinking it was better to let the balancer balance all it
could, as that would make all the moves available and decrease the
risk of bottlenecking.
Wise choice.
Thanks.
Best regards,
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: tor...@drcmr.dk
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io