> On Jun 20, 2025, at 8:20 PM, Niklas Hambüchen <m...@nh2.me> wrote:
>
> I have 2 clusters; both have HDDs and SSDs. Reporting only the HDDs which
> have their own pools:
>
> "rep-cluster": hdd-pool 3-replication, 86 OSDs (16 TiB each), 1024 PGs, 78
> %RAW USED, 100 M objects
> "ec-cluster": hdd-pool erasure k=4 m=2, 58 OSDs (16 TiB each), 256 PGs, 60
> %RAW USED, 450 M objects
>
> Both are Ceph 18.2.1, Bluestore, and have the autoscaler enabled.
> As you can see, I have many small objects.
>
> My PGs-copies-per-OSD seem far off from the recommendation of 100 PGs per OSD
> (`mon_target_pg_per_osd`):
>
> rep-cluster: 35 PGs/OSD (= 1024*3/86)
> ec-cluster: 26 PGs/OSD (= 256*6/58)
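The per-OSD arithmetic quoted above can be double-checked with a quick shell calculation (pool figures are the ones from the message; integer division, so results are floors):

```shell
# PG replicas per OSD = pg_num * (replication size, or K+M for EC) / OSD count
rep_pgs=1024; rep_size=3;      rep_osds=86
ec_pgs=256;   ec_k=4; ec_m=2;  ec_osds=58

echo "rep-cluster: $(( rep_pgs * rep_size / rep_osds )) PG replicas/OSD"      # -> 35
echo "ec-cluster:  $(( ec_pgs * (ec_k + ec_m) / ec_osds )) PG replicas/OSD"   # -> 26
```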
The nomenclature here can be tricky. As I've encountered documentation of what we at least used to call the PG ratio, I've tried to describe this target as the number of *PG replicas* per OSD, because often enough folks don't multiply by the replication size / EC K+M when doing the math, which I see you've done. When there are multiple device classes and/or pools, especially with varying data protection strategies, it can get a bit complicated.

Please share `ceph osd df` for each cluster, trimmed to include only the column header and a handful of representative OSDs for each device class, plus the last two lines with the stddev. Please also share `ceph df` and `ceph balancer status`.

Check the STDDEV figure at the bottom of `ceph osd df`, though if your SSD OSDs are significantly smaller than the HDDs, that can confound the reporting. I have an RFE in to report the standard deviation per device class in addition to the cluster-wide figure. Also check the VAR column for OSDs within a device class:

# ceph osd df | head
ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP   META    AVAIL    %USE   VAR   PGS  STATUS
217  hdd    18.53969  1.00000   19 TiB  9.8 TiB  9.5 TiB  5 KiB  66 GiB  8.7 TiB  52.92  0.89  115  up
219  hdd    18.53969  1.00000   19 TiB  8.5 TiB  8.2 TiB  1 KiB  71 GiB   10 TiB  46.11  0.77  104  up
221  hdd    18.53969  1.00000   19 TiB   11 TiB   10 TiB  2 KiB  76 GiB  7.9 TiB  57.65  0.97  121  up

The VAR(iance) is each OSD's utilization relative to the average (and thus, roughly, to the average number of PG replicas). Ideally, at least within a given device class, this value will not be much more or less than 1.00. In this example the cluster had recently been doubled in size; with the grace of upmap-remapped and the balancer, data is slowly but surely being rebalanced, which is why the variances are high.

> Reporting only the HDDs which have their own pools

When one has OSDs of varying sizes and/or device classes, the balancer and pg autoscaler can be confounded to varying degrees. Since you have multiple device classes, I imagine you have CRUSH rules that constrain pools to one or the other?
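Since the per-device-class stddev isn't reported yet, here is a rough sketch of computing it yourself with awk. The sample lines mirror the hypothetical `ceph osd df` output above; in practice you would pipe the real command into the same awk program. %USE is taken as the 4th field from the end (before VAR, PGS, STATUS), which sidesteps the space-containing size columns:

```shell
awk '
  $2 == "hdd" || $2 == "ssd" {
      u = $(NF-3)                          # %USE column
      n[$2]++; s[$2] += u; ss[$2] += u * u
  }
  END {
      for (c in n) {
          m = s[c] / n[c]
          printf "%s: mean %%USE %.2f  stddev %.2f\n", c, m, sqrt(ss[c]/n[c] - m*m)
      }
  }' <<'EOF'
217  hdd  18.53969  1.00000  19 TiB  9.8 TiB  9.5 TiB  5 KiB  66 GiB  8.7 TiB  52.92  0.89  115  up
219  hdd  18.53969  1.00000  19 TiB  8.5 TiB  8.2 TiB  1 KiB  71 GiB   10 TiB  46.11  0.77  104  up
221  hdd  18.53969  1.00000  19 TiB   11 TiB   10 TiB  2 KiB  76 GiB  7.9 TiB  57.65  0.97  121  up
EOF
# -> hdd: mean %USE 52.23  stddev 4.74
```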
"rule_id": 6, "rule_name": "ssd_crush", "type": 1, "steps": [ { "op": "take", "item": -33, "item_name": "default~ssd" Are there any CRUSH rules — especially #0 default replicated rule — that do not specify a device class in this way? If so, are there any pools that select such a rule? If so, changing the default or other rules to specify a device class, or changing pools using them to use a device-class-specific rule, can help. > > So I'm at least 3x-4x off. > Why? > Should the autoscaler not have increased the PGs here? The autoscaler is a fantastic idea from a usability perspective. It is though imperfect and benefits from kaizen. My understanding is that the autoscaler won’t jump a pg_num value until the new value is (by default) a factor of 3 high or low. I suspect that his enforces a manner of hysteresis, so that small fluctuations in pool usage or OSD count don’t result in annoying flapping back and forth. > > I believe that because of this I suffer some drawbacks: > > * On ec-cluster, a PG contains ~2 TiB and ~2 M objects, causing rebalances to > happen in coarse, slow steps. That’s one big reason why the current PG ratio target of 100 is suboptimal. The guidance used to be 200, it was retconned to 100 a handful of years ago because reasons. At a time when the largest OSDs were on the order of 8TB. Today one can buy a 122TB SSD, and SKUs double that size are on the horizon. For today I suggest ceph config set global target_size_ratio 250 ceph config set global mon_max_pg_per_osd 1000 The first sets the target back to a sane value; I have a PR pending to change this default. This gives the autoscaler more room to do its thing. The second is a guardrail; it does not itself change calculations, but allows headroom for clusters with varying OSD sizes and/or failure domains of varying weights avoid irksome PG activation failures in certain scenarios. 
Also, when the cluster contains OSDs of significantly varying weights, regardless of device class, the balancer can be helped by setting:

    ceph config set mgr mgr/balancer/upmap_max_deviation 1

I suspect that the above steps will get you closer to where you want to be.

>
> Should I take some steps to force the autoscaler to increase PGs, and if yes,
> which approach would be best here?
>
> Thanks for your tips!
> Niklas
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io