Hi Anthony,

Thanks for your quick reply.

> The nomenclature here can be tricky

Yes, I think the Ceph docs should get some minor updates to make the difference 
between PGs and PG replicas (PG * replicationFactor) even more explicit.
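
To illustrate the distinction with made-up numbers (a sketch, not data from
either of my clusters):

    # Illustration only: PGs vs. PG replicas.
    # A pool with pg_num=1024 and replicated size=3 has 1024 PGs, but
    # 1024 * 3 = 3072 PG replicas that have to land on OSDs; the PGS
    # column of `ceph osd df` counts replicas per OSD.
    PG_NUM=1024; SIZE=3; NUM_OSDS=240
    echo "PG replicas:     $((PG_NUM * SIZE))"
    echo "avg PGs per OSD: $((PG_NUM * SIZE / NUM_OSDS))"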

> Please share `ceph osd df`

Please see below.

* The "rep-cluster" is fully balanced.
* The "ec-cluster" got 2 machines added (from 4 to 6) 2 days ago and is thus 
rebalancing (which is where I noticed the impact of the large PGs and decided to look 
into it in more detail than before). I've included 2 old machines and 1 new machine.

rep-cluster # ceph osd df

    ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS

                   [ a machine with 10 HDDs of 16 TB each, + 2 NVMe SSDs  ]

      0    ssd   0.16370   1.00000  168 GiB  3.4 GiB   59 MiB  1.6 GiB  1.7 GiB  164 GiB   2.01  0.03    2      up
      1    ssd   0.16370   1.00000  168 GiB  3.9 GiB   72 MiB  1.6 GiB  2.3 GiB  164 GiB   2.34  0.03    2      up
      2    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB    6 KiB   21 GiB  2.6 TiB  82.24  1.05   35      up
      3    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB      0 B   21 GiB  2.6 TiB  82.27  1.05   35      up
      ...
     11    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB      0 B   22 GiB  2.3 TiB  84.54  1.08   36      up

                   [ another such machine ]

     12    ssd   0.16370   1.00000  168 GiB  2.7 GiB   48 MiB  818 MiB  1.9 GiB  165 GiB   1.61  0.02    1      up
     13    ssd   0.16370   1.00000  168 GiB  4.5 GiB   72 MiB  2.5 GiB  2.0 GiB  163 GiB   2.71  0.03    3      up
     14    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB      0 B   21 GiB  2.9 TiB  79.93  1.02   34      up
     15    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB      0 B   21 GiB  2.9 TiB  79.86  1.02   34      up
     ...
     23    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB    6 KiB   23 GiB  2.3 TiB  84.54  1.08   36      up

                             TOTAL  1.3 PiB  1.0 PiB  1.0 PiB   40 GiB  1.8 TiB  295 TiB  78.11
    MIN/MAX VAR: 0.00/1.08  STDDEV: 32.09


ec-cluster # ceph osd df

    ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS

                   [ a machine with 10 HDDs of 16 TB each, + 2 NVMe SSDs  ]

      0    ssd   0.16370   1.00000  168 GiB   40 GiB  177 MiB   17 GiB   23 GiB  127 GiB  24.05  0.40  107      up
      1    ssd   0.16370   1.00000  168 GiB   30 GiB  173 MiB  7.4 GiB   23 GiB  137 GiB  18.16  0.30   97      up
      2    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB    8 KiB   53 GiB  2.1 TiB  85.39  1.41   35      up
      3    hdd  14.61089   1.00000   15 TiB   10 TiB   10 TiB    1 KiB   47 GiB  4.4 TiB  70.13  1.16   30      up
      ...
     11    hdd  14.61089   1.00000   15 TiB   13 TiB   13 TiB    1 KiB   58 GiB  1.8 TiB  87.88  1.45   38      up

                   [ another such machine ]

     12    ssd   0.16370   1.00000  168 GiB   28 GiB  165 MiB  9.4 GiB   19 GiB  139 GiB  16.88  0.28   90      up
     13    ssd   0.16370   1.00000  168 GiB   36 GiB  174 MiB   11 GiB   25 GiB  131 GiB  21.71  0.36  103      up
     14    hdd  14.61089   1.00000   15 TiB   13 TiB   12 TiB    1 KiB   55 GiB  2.1 TiB  85.75  1.42   37      up
     15    hdd  14.61089   1.00000   15 TiB   13 TiB   13 TiB    1 KiB   53 GiB  2.0 TiB  86.63  1.43   39      up
     ...
     23    hdd  14.61089   1.00000   15 TiB   12 TiB   12 TiB    1 KiB   52 GiB  2.3 TiB  84.04  1.39   38      up

                   [ another such machine, new (added to the cluster 2 days ago), that is currently being rebalanced to ]

     86    ssd   0.16370   1.00000  168 GiB   32 GiB  161 MiB  5.7 GiB   26 GiB  136 GiB  19.06  0.32  119      up
     87    ssd   0.16370   1.00000  168 GiB   38 GiB  599 MiB   12 GiB   25 GiB  130 GiB  22.41  0.37  108      up
     88    hdd  14.61089   1.00000   15 TiB  2.0 TiB  1.9 TiB    1 KiB  8.9 GiB   13 TiB  13.47  0.22    5      up
     89    hdd  14.61089   1.00000   15 TiB  1.9 TiB  1.8 TiB    1 KiB  7.5 GiB   13 TiB  12.94  0.21    5      up
     ...
     97    hdd  14.61089   1.00000   15 TiB  2.9 TiB  2.9 TiB    1 KiB   12 GiB   12 TiB  19.96  0.33    6      up

                             TOTAL  883 TiB  534 TiB  530 TiB  368 GiB  3.0 TiB  349 TiB  60.51
    MIN/MAX VAR: 0.17/1.46  STDDEV: 34.51


> STDDEV [..] if your SSD OSDs are significantly smaller than the HDDs
> that can confound the reporting

Yes, indeed: the SSD OSDs are 100x smaller than the HDD OSDs.
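
(Side note: I believe recent releases accept a per-class filter for
`ceph osd df`, which would avoid mixing the two device classes into one
STDDEV; if I remember the syntax correctly:)

    # Per-class views, so the tiny SSD OSDs don't skew VAR/STDDEV:
    ceph osd df tree class hdd
    ceph osd df tree class ssd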


rep-cluster # ceph df

    --- RAW STORAGE ---
    CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
    hdd    1.3 PiB  289 TiB  1.0 PiB   1.0 PiB      78.46
    ssd    6.2 TiB  6.1 TiB   81 GiB    81 GiB       1.28
    TOTAL  1.3 PiB  296 TiB  1.0 PiB   1.0 PiB      78.11
    --- POOLS ---
    POOL               ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
    .mgr                1     1  1.1 GiB      222  3.4 GiB   0.06    1.9 TiB
    data                2  1024  350 TiB  108.95M  1.0 PiB  88.20     47 TiB
    metadata            3    16   11 GiB  680.79k   12 GiB   0.21    1.9 TiB


ec-cluster # ceph df

    --- RAW STORAGE ---
    CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
    hdd    876 TiB  343 TiB  534 TiB   534 TiB      60.87
    ssd    6.6 TiB  5.6 TiB  1.1 TiB   1.1 TiB      16.24
    TOTAL  883 TiB  348 TiB  535 TiB   535 TiB      60.54
    --- POOLS ---
    POOL               ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
    .mgr                1     1  430 MiB      109  1.3 GiB   0.03    1.6 TiB
    data                2  1024      0 B  219.20M      0 B      0    1.6 TiB
    data_ec             3   256  360 TiB  236.64M  530 TiB  90.02     40 TiB
    metadata            4    64  123 GiB   32.86k  369 GiB   7.11    1.6 TiB


rep-cluster # ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.018905",
    "last_optimize_started": "Sat Jun 21 13:45:05 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is 
decreasing, or distribution is already perfect",
    "plans": []
}


ec-cluster # ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000194",
    "last_optimize_started": "Sat Jun 21 13:45:08 2025",
    "mode": "upmap",
    "no_optimization_needed": false,
    "optimize_result": "Some objects (0.013053) are degraded; try again later",
    "plans": []
}


Potentially useful to know:

* The rep-cluster has been in HEALTH_OK for a long time.
* The ec-cluster has been suffering from `37 OSD(s) experiencing BlueFS
  spillover` for a long time. (I have not solved that yet; I suspect that
  Ceph would simply like larger DB/WAL devices on my SSDs for the data
  size / object count I have on the HDDs, but if so, that is unfixable for
  me because I use Hetzner SX134 servers.) I do not know whether the
  HEALTH_WARN caused by that spillover permanently inhibits the balancer.
  That said, I occasionally use https://github.com/TheJJ/ceph-balancer,
  which takes the actual sizes of objects into account when balancing.
  (See the spillover check sketched right after this list.)
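
For reference, the spillover check I use: a sketch, assuming the BlueFS
perf counters I remember (`db_used_bytes`, `slow_used_bytes`); run on the
OSD's host:

    # Nonzero slow_used_bytes should mean RocksDB data has spilled over
    # from the fast DB device onto the slow (HDD) device:
    ceph daemon osd.2 perf dump bluefs | jq '.bluefs | {db_used_bytes, slow_used_bytes}'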

Another question:

Why do you ask about the balancer? Does it affect the autoscaler?
So far I had thought of PG balancing as a concept that comes after the
choice/computation of how many PGs to use.

> I imagine you have CRUSH rules that constrain pools to one or the other?

Yes.

> Are there any CRUSH rules — especially #0 default replicated rule — that
> do not specify a device class in this way?

No, all CRUSH rules that are in use do specify a device class:

rep-cluster # for POOL in $(ceph osd pool ls); do echo -n "$POOL "; ceph osd pool get "$POOL" crush_rule; done

    .mgr     crush_rule: mgr_replicated_ssd_rule_datacenter
    data     crush_rule: rule_data_datacenter
    metadata crush_rule: rule_metadata_datacenter

    mgr_replicated_ssd_rule_datacenter, "type": 1, "steps":
        { "op": "take", "item": -2, "item_name": "default~ssd" }
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }
    rule_data_datacenter", "type": 1, "steps":
        { "op": "take", "item": -6, "item_name": "default~hdd" }
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }
    rule_metadata_datacenter", "type": 1, "steps":
        { "op": "take", "item": -2, "item_name": "default~ssd" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }

ec-cluster # for POOL in $(ceph osd pool ls); do echo -n "$POOL "; ceph osd pool get "$POOL" crush_rule; done

    .mgr     crush_rule: mgr_replicated_ssd_rule_datacenter
    data     crush_rule: rule_data_ssd_datacenter
    data_ec  crush_rule: rule_data_ec_datacenter
    metadata crush_rule: rule_metadata_datacenter

    mgr_replicated_ssd_rule_datacenter, "type": 1, "steps":
        { "op": "take", "item": -2, "item_name": "default~ssd" }
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }
    rule_data_ssd_datacenter, "type": 1, "steps":
        { "op": "take", "item": -2, "item_name": "default~ssd" }
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }
    rule_data_ec_datacenter, "type": 3, "steps":
        { "op": "set_chooseleaf_tries", "num": 5 }
        { "op": "set_choose_tries", "num": 100 }
        { "op": "take", "item": -21, "item_name": "default~hdd" },
        { "op": "chooseleaf_indep", "num": 0, "type": "datacenter" }
        { "op": "emit" }
    rule_metadata_datacenter, "type": 1, "steps":
        { "op": "take", "item": -2, "item_name": "default~ssd" }
        { "op": "chooseleaf_firstn", "num": 0, "type": "datacenter" }
        { "op": "emit" }


> My understanding is that the autoscaler won’t jump a pg_num value until
> the new value is (by default) a factor of 3 high or low

Indeed, but isn't it a factor of 4 too low already?

Is there a way I can see the computations and decisions of the autoscaler?

I find it confusing that `ceph osd pool autoscale-status` does not have any 
column related to OSDs, when `mon_target_pg_per_osd` is a key input to the 
algorithm that controls the ratio between PGs and OSDs.

Following https://ceph.io/en/news/blog/2022/autoscaler_tuning/,
section "How do I know what the autoscaler is doing?":

    # grep 'space, bias' /var/log/ceph/ceph-mgr.backupfs-1.log
    2025-06-18T14:40:26.208+0000 7faa03a726c0  0 [pg_autoscaler INFO root] Pool 'benacofs_data_ec' root_id -21 using 0.5698246259869221 of space, bias 1.0, pg target 550.8304717873581 quantized to 512 (current 256)

This seems to suggest that 512 PGs should be the target instead of the
current 256, which would bring me within the factor-3 ratio.
Why does `ceph osd pool autoscale-status` then not contain any info
suggesting that some autoscaling should happen?
There are no other `pg_autoscaler` logs that suggest it is somehow giving
up.
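
My current mental model of the decision rule, as a sketch (the threshold
default of 3.0 is from the blog post; the rule itself is my understanding,
not something I have verified in the source):

    # Sketch: the autoscaler only actually changes pg_num when current
    # and target differ by at least the threshold factor (default 3).
    current=256 target=512 threshold=3
    if [ $((target / current)) -ge $threshold ] || [ $((current / target)) -ge $threshold ]; then
        echo "autoscaler would adjust pg_num"
    else
        echo "no change: 512/256 = 2 < 3"
    fi

If that is right, it would explain why the log shows a quantized target of
512 while autoscale-status proposes nothing.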

Also, here again we have "blog-driven documentation": none of this info
from the blog seems to appear anywhere in the upstream Ceph documentation.

The blog post also mentions `ceph progress`.
In that output, it is annoying that there is no time information at all:
the listed events could be recent or years old, and it is not even clear
what the order is (old to new, or the other way around?).
I can use `ceph progress json`, but then I have to read UNIX timestamps.
I have now filed an issue for this: https://tracker.ceph.com/issues/71781
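
My workaround for now, without assuming specific field names in the JSON
(crude, but avoids me misremembering the schema):

    # Print anything in `ceph progress json` that looks like a UNIX epoch
    # timestamp (i.e. a number between ~2001 and ~2033) as a readable date:
    ceph progress json | jq '[.. | numbers | select(. > 1000000000 and . < 2000000000) | floor | todate]'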

I also noticed that the dates in `ceph progress json` look buggy:
https://tracker.ceph.com/issues/71782

> ceph config set global target_size_ratio 250

I don't fully understand this suggestion.

Isn't `target_size_ratio` "relative to other pools that have
target_size_ratio set"?
https://docs.ceph.com/en/squid/rados/operations/placement-groups/#specifying-expected-pool-size

If I set it globally (and thus identically for all pools), isn't the ratio
between them still the same?
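
My understanding of how the ratio works, as a sketch (the 1.0/3.0 values
are made up for illustration):

    # target_size_ratio divides expected capacity between the pools that
    # have it set: 1.0 vs. 3.0 means a 1:3 split (and PG targets follow).
    ceph osd pool set data    target_size_ratio 1.0
    ceph osd pool set data_ec target_size_ratio 3.0
    # Setting the SAME value on every pool keeps all ratios at 1:1,
    # which is why I don't see what a global 250 would change.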

> The first sets the target back to a sane value

How can I check what is currently set? Right now, nothing seems to be set at all:

    # ceph osd pool get data_ec target_size_ratio
    Error ENOENT: option 'target_size_ratio' is not set on pool 'data_ec'

Similarly, how can I check the `threshold` value that one can set with
`ceph osd pool set threshold`?
(I'll send another email to the list about this: "Why is it still so
difficult to just dump all config and where it comes from?".)
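
My best guess for reading it back, assuming `threshold` is exposed as a
pg_autoscaler mgr module option (an assumption on my part):

    # Assumed location of the autoscaler threshold (default 3.0):
    ceph config get mgr mgr/pg_autoscaler/threshold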

Also, should I be setting `pg_autoscale_bias` to increase the number of
PGs that the autoscaler comes up with by a fixed factor, to adjust for my
small objects?

This is suggested by
https://docs.redhat.com/en/documentation/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs:

> This property is particularly used for metadata pools which might be
> small in size but have large number of objects, so scaling them faster
> is important for better performance.
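
If so, I assume the per-pool knob would be used like this (the factor 4 is
an arbitrary example, not a recommendation I have seen anywhere):

    # pg_autoscale_bias multiplies the autoscaler's PG target for a pool:
    ceph osd pool set data_ec pg_autoscale_bias 4
    ceph osd pool get data_ec pg_autoscale_bias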

Separately:
I read https://docs.ceph.com/en/squid/rados/operations/balancer/#throttling
and I think these docs need improvement:

> There is a separate setting for how uniform the distribution of PGs must
> be for the module to consider the cluster adequately balanced. At the
> time of writing (June 2025), this value defaults to `5`

So "there is a setting, and its default value is 5" ... but what's the name of 
the setting?
Is it `upmap_max_deviation` from 4 paragraphs further down?
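
If it is `upmap_max_deviation`, then (assuming it is exposed as a balancer
module option, which is how I remember it) it should be inspectable and
tunable like this:

    # Default is 5 PGs of deviation per OSD; lower balances more tightly:
    ceph config get mgr mgr/balancer/upmap_max_deviation
    ceph config set mgr mgr/balancer/upmap_max_deviation 1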

Thanks,
Niklas