> Yes, I think the Ceph docs should get some minor updates to make the 
> difference between PGs and PG replicas (PG * replicationFactor) even more 
> explicit.

Please open a tracker and list places you find where this isn’t already made 
clear.

> Can we please have 1 command, that can dump all config (including from config 
> files, monitor central configuration database, all currently running 
> daemons), and nicely point out what's set and overridden where and which 
> value is in effect?


Sounds like an opportunity to enter a tracker issue or a PR.
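For what it's worth, pieces of this already exist; a sketch of the closest current commands (exact behavior varies a bit by release):

```shell
# Dump the monitors' central configuration database
ceph config dump

# Show the effective config for one running daemon, including the
# source (file / mon / override) of each value
ceph config show osd.0

# Same, but also listing options still at their compiled-in defaults
ceph config show-with-defaults osd.0

# On the host running the daemon: which values differ from default
ceph daemon osd.0 config diff
```

What's missing is a single command that merges all of those views, which is why a tracker issue is the right move.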


>  0    ssd   0.16370   1.00000  168 GiB  3.4 GiB   59 MiB  1.6 GiB  1.7 GiB  164 GiB   2.01  0.03    2      up
>  1    ssd   0.16370   1.00000  168 GiB  3.9 GiB   72 MiB  1.6 GiB  2.3 GiB  164 GiB   2.34  0.03    2      up
> 
>> STDDEV [..] if your SSD OSDs are significantly smaller than the HDDs that 
>> can confound the reporting
> 
> Yes, indeed the SSD OSDs are 100x smaller than the HDD OSDs.

What model are they that they’re that small?  Are they enterprise-quality?  
OSDs that small can present difficulties.

> Potentially useful to know:
> 
> * The rep-cluster is in HEALTH_OK for a long time.
> * The ec-cluster suffers from `37 OSD(s) experiencing BlueFS spillover` for a 
> long time

How large are those DB+WAL slices?  Please share BlueFS stats:

https://www.ibm.com/docs/en/storage-ceph/7.1.0?topic=bluefs-viewing-ceph-statistics-ceph-osds
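If you have shell access on the OSD hosts, something like the following shows the split (OSD id 0 used as an example):

```shell
# On the host running the OSD: BlueFS usage broken down across the
# DB, WAL, and slow (spillover) devices
ceph daemon osd.0 bluefs stats

# Cluster-wide view: the META column approximates per-OSD DB usage
ceph osd df tree
```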


> (I have not solved that yet; I suspect that Ceph would simply like larger 
> DB/WAL devices on my SSDs for the size / object count I have on the HDDs, but 
> if so that is unfixable for me because I use Hetzner SX134 servers). I do not 
> know if that HEALTH_WARN caused by that spillover will permanently inhibit 
> the balancer.

I wouldn’t think so, but it may be possible to address them. 
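If the DB/WAL slices can't be enlarged, a couple of things that have helped others (compaction only buys time if the DB genuinely doesn't fit on the fast device):

```shell
# Manual RocksDB compaction can pull spilled-over data back onto the
# fast device, at least temporarily
ceph tell osd.* compact

# If the health warning itself is the problem rather than the
# spillover, it can be silenced (not a fix, just quiets the check)
ceph config set osd bluestore_warn_on_bluefs_spillover false
```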


> That said, I occasionally use https://github.com/TheJJ/ceph-balancer, which 
> takes into account the actual sizes of objects when balancing.
> 
> Another question:
> 
> Why do you inquire about the balancer? Does it affect the autoscaler?

It can contribute to suboptimal PG ratios on OSDs.


>  Oddly, not listed in 
> https://docs.ceph.com/en/squid/rados/configuration/ceph-conf/#commands
>  But I think 
> https://docs.ceph.com/en/squid/rados/configuration/ceph-conf/#commands should 
> list it so that from there one can easily see it's legacy.

I look forward to your PR.

> 
>> My understanding is that the autoscaler won’t jump a pg_num value until the 
>> new value is (by default) a factor of 3 high or low
> 
> Indeed, but isn't it factor 4x too low already?

One would think.

> rep-cluster: 35 PGs/OSD (= 1024*3/86)

35 > 100/3 ≈ 33, so the pool is within the autoscaler's 3x tolerance of the default 100-PG-per-OSD target and no change is triggered.

> ec-cluster:  26 PGs/OSD (= 256*6/58)
>                   [ another such machine, new (added to the cluster 2 days 
> ago), that is currently being rebalanced to ]

I suspect that once backfill completes you'll see a ratio > 33.
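Spelling out the arithmetic from the figures quoted above (PG replicas per OSD = pg_num × replication factor, or k+m for EC, divided by OSD count):

```shell
# rep-cluster: 1024 PGs x 3 replicas across 86 OSDs
echo $(( 1024 * 3 / 86 ))   # 35

# ec-cluster: 256 PGs x 6 shards (k+m) across 58 OSDs
echo $(( 256 * 6 / 58 ))    # 26

# With the default mon_target_pg_per_osd = 100, the autoscaler only
# acts when the ratio is off by more than 3x: below ~33 or above ~300
```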


>> ceph config set global target_size_ratio 250
> 
> I don't fully understand this suggestion.

Apologies, I meant `mon_target_pg_per_osd = 250`.
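That is, assuming you want the autoscaler to aim for roughly 250 PG replicas per OSD instead of the default 100:

```shell
# Raise the autoscaler's per-OSD PG-replica target cluster-wide
ceph config set global mon_target_pg_per_osd 250
```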

> Also, should I be setting `pg_autoscale_bias` to increase the number of PGs 
> that the autoscaler comes up with, by a fixed factor, to adjust for my small 
> objects?

In most cases that should be set only for metadata / index pools.

> 
> This is suggested by
> https://docs.redhat.com/en/documentation/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs
> 
>> This property is particularly used for metadata pools which might be small 
>> in size but have large number of objects, so scaling them faster is 
>> important for better performance.

That’s a Nautilus page, so be careful using docs that old.  But yes, see above.
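If you do try it on a metadata-style pool, the bias is a per-pool property (pool name here is hypothetical):

```shell
# Tell the autoscaler to target 4x the PG count it would otherwise
# choose for this pool
ceph osd pool set cephfs.mydata.meta pg_autoscale_bias 4
```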

> 
> Separate:
> I read https://docs.ceph.com/en/squid/rados/operations/balancer/#throttling
> I think these docs need improvement:
> 
>> There is a separate setting for how uniform the distribution of PGs must be 
>> for the module to consider the cluster adequately balanced. At the time of 
>> writing (June 2025), this value defaults to `5`
> 
> So "there is a setting, and its default value is 5" ... but what's the name 
> of the setting?
> Is it `upmap_max_deviation` from 4 paragraphs further down?

Yes.
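For reference, it's a balancer module option, default 5 PGs of deviation per OSD:

```shell
# Consider the cluster balanced once no OSD deviates from the mean
# PG count by more than this many PGs
ceph config set mgr mgr/balancer/upmap_max_deviation 5
```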

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io