> What internal Ceph mechanism could explain these three performance phases?
Not necessarily internal. Please tell us exactly which model these SSDs are, and whether they have the latest firmware. I suspect they are client-class drives, not intended for sustained workloads.

> Why does one OSD initially receive more load before balancing occurs?

CRUSH is effectively a hash function, and its output is typically a bell curve. Balancing corrects for that, like a field flattener on a telescope or a collimator on a laser. As your cluster fills from empty, the balancer and autoscaler at their default settings will split and move PGs, so this isn’t an ideal scenario for benchmarking. I suggest adding the `bulk` flag to the data pools, filling the cluster at least halfway, then disabling the balancer and autoscaler so neither kicks in during your benchmarks.

Also, if you’re using the PG autoscaler with default settings, you likely have far too few PGs:

    ceph config set global mon_max_pg_per_osd 600
    ceph config set global mon_target_pg_per_osd 300

That will help the utilization even out.

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
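A sketch of the preparation steps suggested above, as a command sequence (the pool name `data` is a placeholder; substitute your actual data pool, and verify flag availability on your Ceph release):

```shell
# Mark the data pool as bulk so the autoscaler provisions a full PG
# complement up front instead of growing it as the pool fills.
# ("data" is a placeholder pool name.)
ceph osd pool set data bulk true

# ... fill the cluster to at least ~50% before benchmarking ...

# Stop background data movement so it doesn't pollute bench results:
ceph balancer off                              # disable the balancer
ceph osd pool set data pg_autoscale_mode off   # stop PG splits on this pool

# Raise the per-OSD PG limits/targets so pools get enough PGs:
ceph config set global mon_max_pg_per_osd 600
ceph config set global mon_target_pg_per_osd 300
```

Re-enable the balancer (`ceph balancer on`) and autoscaler once benchmarking is done, since both are useful in steady-state operation.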
