Hi Anthony,
thanks for your quick reply.
> The nomenclature here can be tricky
Yes, I think the Ceph docs should get some minor updates to make the difference
between PGs and PG replicas (PG * replicationFactor) even more explicit.
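(Worked example, as I understand the terms: a pool with pg_num 1024 and replicated size 3
has 1024 PGs but 3 * 1024 = 3072 PG replicas placed across the OSDs; the PGS column of
`ceph osd df` below counts the PG replicas/shards mapped to each OSD.)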
> Please share `ceph osd df`
Please see below.
* The "rep-cluster" is fully balanced.
* The "ec-cluster" got 2 machines added (from 4 to 6) 2 days ago and is thus
rebalancing (which is where I noticed the impact of the large PGs and decided to look
into it in more detail than before). In the output below I've included 2 old machines and 1 new machine.
rep-cluster # ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
[ a machine with 10 HDDs of 16 TB each, + 2 NVMe SSDs ]
0 ssd 0.16370 1.00000 168 GiB 3.4 GiB 59 MiB 1.6 GiB 1.7 GiB 164 GiB 2.01 0.03 2 up
1 ssd 0.16370 1.00000 168 GiB 3.9 GiB 72 MiB 1.6 GiB 2.3 GiB 164 GiB 2.34 0.03 2 up
2 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 6 KiB 21 GiB 2.6 TiB 82.24 1.05 35 up
3 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 0 B 21 GiB 2.6 TiB 82.27 1.05 35 up
...
11 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 0 B 22 GiB 2.3 TiB 84.54 1.08 36 up
[ another such machine ]
12 ssd 0.16370 1.00000 168 GiB 2.7 GiB 48 MiB 818 MiB 1.9 GiB 165 GiB 1.61 0.02 1 up
13 ssd 0.16370 1.00000 168 GiB 4.5 GiB 72 MiB 2.5 GiB 2.0 GiB 163 GiB 2.71 0.03 3 up
14 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 0 B 21 GiB 2.9 TiB 79.93 1.02 34 up
15 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 0 B 21 GiB 2.9 TiB 79.86 1.02 34 up
...
23 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 6 KiB 23 GiB 2.3 TiB 84.54 1.08 36 up
TOTAL 1.3 PiB 1.0 PiB 1.0 PiB 40 GiB 1.8 TiB 295 TiB 78.11
MIN/MAX VAR: 0.00/1.08 STDDEV: 32.09
ec-cluster # ceph osd df
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
[ a machine with 10 HDDs of 16 TB each, + 2 NVMe SSDs ]
0 ssd 0.16370 1.00000 168 GiB 40 GiB 177 MiB 17 GiB 23 GiB 127 GiB 24.05 0.40 107 up
1 ssd 0.16370 1.00000 168 GiB 30 GiB 173 MiB 7.4 GiB 23 GiB 137 GiB 18.16 0.30 97 up
2 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 8 KiB 53 GiB 2.1 TiB 85.39 1.41 35 up
3 hdd 14.61089 1.00000 15 TiB 10 TiB 10 TiB 1 KiB 47 GiB 4.4 TiB 70.13 1.16 30 up
...
11 hdd 14.61089 1.00000 15 TiB 13 TiB 13 TiB 1 KiB 58 GiB 1.8 TiB 87.88 1.45 38 up
[ another such machine ]
12 ssd 0.16370 1.00000 168 GiB 28 GiB 165 MiB 9.4 GiB 19 GiB 139 GiB 16.88 0.28 90 up
13 ssd 0.16370 1.00000 168 GiB 36 GiB 174 MiB 11 GiB 25 GiB 131 GiB 21.71 0.36 103 up
14 hdd 14.61089 1.00000 15 TiB 13 TiB 12 TiB 1 KiB 55 GiB 2.1 TiB 85.75 1.42 37 up
15 hdd 14.61089 1.00000 15 TiB 13 TiB 13 TiB 1 KiB 53 GiB 2.0 TiB 86.63 1.43 39 up
...
23 hdd 14.61089 1.00000 15 TiB 12 TiB 12 TiB 1 KiB 52 GiB 2.3 TiB 84.04 1.39 38 up
[ another such machine, new (added to the cluster 2 days ago), that is currently being rebalanced to ]
86 ssd 0.16370 1.00000 168 GiB 32 GiB 161 MiB 5.7 GiB 26 GiB 136 GiB 19.06 0.32 119 up
87 ssd 0.16370 1.00000 168 GiB 38 GiB 599 MiB 12 GiB 25 GiB 130 GiB 22.41 0.37 108 up
88 hdd 14.61089 1.00000 15 TiB 2.0 TiB 1.9 TiB 1 KiB 8.9 GiB 13 TiB 13.47 0.22 5 up
89 hdd 14.61089 1.00000 15 TiB 1.9 TiB 1.8 TiB 1 KiB 7.5 GiB 13 TiB 12.94 0.21 5 up
...
97 hdd 14.61089 1.00000 15 TiB 2.9 TiB 2.9 TiB 1 KiB 12 GiB 12 TiB 19.96 0.33 6 up
TOTAL 883 TiB 534 TiB 530 TiB 368 GiB 3.0 TiB 349 TiB 60.51
MIN/MAX VAR: 0.17/1.46 STDDEV: 34.51
> STDDEV [..] if your SSD OSDs are significantly smaller than the HDDs that can
> confound the reporting
Yes, indeed the SSD OSDs are 100x smaller than the HDD OSDs.
rep-cluster # ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 1.3 PiB 289 TiB 1.0 PiB 1.0 PiB 78.46
ssd 6.2 TiB 6.1 TiB 81 GiB 81 GiB 1.28
TOTAL 1.3 PiB 296 TiB 1.0 PiB 1.0 PiB 78.11
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 1.1 GiB 222 3.4 GiB 0.06 1.9 TiB
data 2 1024 350 TiB 108.95M 1.0 PiB 88.20 47 TiB
metadata 3 16 11 GiB 680.79k 12 GiB 0.21 1.9 TiB
ec-cluster # ceph df
--- RAW STORAGE ---
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 876 TiB 343 TiB 534 TiB 534 TiB 60.87
ssd 6.6 TiB 5.6 TiB 1.1 TiB 1.1 TiB 16.24
TOTAL 883 TiB 348 TiB 535 TiB 535 TiB 60.54
--- POOLS ---
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.mgr 1 1 430 MiB 109 1.3 GiB 0.03 1.6 TiB
data 2 1024 0 B 219.20M 0 B 0 1.6 TiB
data_ec 3 256 360 TiB 236.64M 530 TiB 90.02 40 TiB
metadata 4 64 123 GiB 32.86k 369 GiB 7.11 1.6 TiB
rep-cluster # ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.018905",
"last_optimize_started": "Sat Jun 21 13:45:05 2025",
"mode": "upmap",
"no_optimization_needed": true,
"optimize_result": "Unable to find further optimization, or pool(s) pg_num is
decreasing, or distribution is already perfect",
"plans": []
}
ec-cluster # ceph balancer status
{
"active": true,
"last_optimize_duration": "0:00:00.000194",
"last_optimize_started": "Sat Jun 21 13:45:08 2025",
"mode": "upmap",
"no_optimization_needed": false,
"optimize_result": "Some objects (0.013053) are degraded; try again later",
"plans": []
}
Potentially useful to know:
* The rep-cluster has been in HEALTH_OK for a long time.
* The ec-cluster has been suffering from `37 OSD(s) experiencing BlueFS spillover` for a
long time (I have not solved that yet; I suspect that Ceph would simply like
larger DB/WAL devices on my SSDs for the size / object count I have on the
HDDs, but if so, that is unfixable for me because I use Hetzner SX134 servers).
I do not know whether the HEALTH_WARN caused by that spillover permanently
inhibits the balancer. That said, I occasionally use
https://github.com/TheJJ/ceph-balancer, which takes into account the actual
sizes of objects when balancing.
Another question:
Why do you inquire about the balancer? Does it affect the autoscaler?
So far I had thought that balancing PGs, as a concept, comes after the
choice/computation of how many PGs to use.
> I imagine you have CRUSH rules that constrain pools to one or the other?
Yes.
> Are there any CRUSH rules — especially #0 default replicated rule — that do not
> specify a device class in this way?
No, all CRUSH rules that are in use do specify a device class:
rep-cluster # for POOL in $(ceph osd pool ls); do echo -n "$POOL "; ceph osd pool get "$POOL" crush_rule; done
.mgr crush_rule: mgr_replicated_ssd_rule_datacenter
data crush_rule: rule_data_datacenter
metadata crush_rule: rule_metadata_datacenter
mgr_replicated_ssd_rule_datacenter, "type": 1, "steps":
{ "op": "take", "item": -2, "item_name": "default~ssd" }
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
rule_data_datacenter", "type": 1, "steps":
{ "op": "take", "item": -6, "item_name": "default~hdd" }
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
rule_metadata_datacenter", "type": 1, "steps":
{ "op": "take", "item": -2, "item_name": "default~ssd" },
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
ec-cluster # for POOL in $(ceph osd pool ls); do echo -n "$POOL "; ceph osd pool get "$POOL" crush_rule; done
.mgr crush_rule: mgr_replicated_ssd_rule_datacenter
data crush_rule: rule_data_ssd_datacenter
data_ec crush_rule: rule_data_ec_datacenter
metadata crush_rule: rule_metadata_datacenter
mgr_replicated_ssd_rule_datacenter, "type": 1, "steps":
{ "op": "take", "item": -2, "item_name": "default~ssd" }
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
rule_data_ssd_datacenter, "type": 1, "steps":
{ "op": "take", "item": -2, "item_name": "default~ssd" }
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
rule_data_ec_datacenter, "type": 3, "steps":
{ "op": "set_chooseleaf_tries", "num": 5 }
{ "op": "set_choose_tries", "num": 100 }
{ "op": "take", "item": -21, "item_name": "default~hdd" },
{ "op": "chooseleaf_indep", "num": 0, "type": "datacenter" }
{ "op": "emit" }
rule_metadata_datacenter, "type": 1, "steps":
{ "op": "take", "item": -2, "item_name": "default~ssd" }
{ "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
{ "op": "emit" }
> My understanding is that the autoscaler won’t jump a pg_num value until the new
> value is (by default) a factor of 3 high or low
Indeed, but isn't it factor 4x too low already?
Is there a way I can see the computations and decisions of the autoscaler?
I find it confusing that `ceph osd pool autoscale-status` does not have any
column related to OSDs, when `mon_target_pg_per_osd` is a key input to the
algorithm that controls the ratio between PGs and OSDs.
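(Side note: I assume the effective per-OSD target itself can be read with something like
# ceph config get mgr mon_target_pg_per_osd
which should show the default of 100 unless it was overridden.)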
Following https://ceph.io/en/news/blog/2022/autoscaler_tuning/ section "How do I
know what the autoscaler is doing?"
# grep 'space, bias' /var/log/ceph/ceph-mgr.backupfs-1.log
2025-06-18T14:40:26.208+0000 7faa03a726c0 0 [pg_autoscaler INFO root] Pool 'benacofs_data_ec' root_id -21 using 0.5698246259869221 of space, bias 1.0, pg target 550.8304717873581 quantized to 512 (current 256)
This seems to suggest that 512 PGs should be the target, instead of the current
256, which would bring me within factor 3x ratio.
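(For the record: 550.83 / 256 ≈ 2.15, i.e. the computed target is a bit over 2x the
current pg_num, and the quantized 512 is exactly 2x.)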
Why, then, doesn't `ceph osd pool autoscale-status` contain any info suggesting
that some autoscaling should happen?
There are no other `pg_autoscaler` log lines that suggest it is somehow giving
up.
Also, here again we have "blog-driven documentation": none of this info from the
blog seems to appear anywhere in the upstream Ceph documentation.
The blog also mentions `ceph progress`.
In that output, it is annoying that there's no time information at all.
The listed events could be recent or years old.
It's not even clear what the order is (old to new, or the other way around?).
I can use `ceph progress json` but then have to read UNIX timestamps.
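As a stopgap, individual timestamps can at least be converted by hand with GNU date,
e.g. (timestamp value purely illustrative):
# date -d @1750513505
(`date -r` on BSD/macOS).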
I filed an issue for it now: https://tracker.ceph.com/issues/71781
I also noticed that the dates in `ceph progress json` look bugged:
https://tracker.ceph.com/issues/71782
> ceph config set global target_size_ratio 250
I don't fully understand this suggestion.
Isn't target_size_ratio "relative to other pools that have target_size_ratio
set"?
https://docs.ceph.com/en/squid/rados/operations/placement-groups/#specifying-expected-pool-size
If I set it globally (thus for all pools), isn't the ratio between them still
the same?
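(For a single pool, I assume the per-pool form from those docs would be something like
`ceph osd pool set data_ec target_size_ratio 0.9`, with the 0.9 picked here purely for
illustration.)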
> The first sets the target back to a sane value
How can I check what's currently set? Currently nothing seems set at all:
# ceph osd pool get data_ec target_size_ratio
Error ENOENT: option 'target_size_ratio' is not set on pool 'data_ec'
Similarly, how can I check the `threshold` value that one can set with `ceph osd
pool set threshold`?
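My guess is that it is stored as the pg_autoscaler mgr module option, i.e. something like
# ceph config get mgr mgr/pg_autoscaler/threshold
but I have not verified that this is where `ceph osd pool set threshold` puts the value.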
(I'll send another email "Why is it still so difficult to just dump all config and
where it comes from?" to the list for this.)
Also, should I be setting `pg_autoscale_bias` to increase the number of PGs
that the autoscaler comes up with, by a fixed factor, to adjust for my small
objects?
This is suggested by
https://docs.redhat.com/en/documentation/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs
> This property is particularly used for metadata pools which might be small in
> size but have large number of objects, so scaling them faster is important for
> better performance.
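(If that is the right knob, I assume the command would be something like
`ceph osd pool set data_ec pg_autoscale_bias 4`, with the 4 chosen here purely for
illustration; apparently CephFS metadata pools get a bias of 4 by default.)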
Separately:
I read https://docs.ceph.com/en/squid/rados/operations/balancer/#throttling
I think these docs need improvement:
> There is a separate setting for how uniform the distribution of PGs must be for
> the module to consider the cluster adequately balanced. At the time of writing
> (June 2025), this value defaults to `5`
So "there is a setting, and its default value is 5" ... but what's the name of
the setting?
Is it `upmap_max_deviation` from 4 paragraphs further down?
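(If so, I assume it can be inspected and changed as a balancer module option, e.g.
`ceph config get mgr mgr/balancer/upmap_max_deviation` and
`ceph config set mgr mgr/balancer/upmap_max_deviation 1`.)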
Thanks,
Niklas