> On Jun 30, 2025, at 9:53 AM, Niklas Hambüchen <m...@nh2.me> wrote:
>
>> Is this cluster serving RGW? RBD? CephFS? Those pool names are unusual.
>
> Just CephFS.
> I named the pools this way, following
> https://docs.ceph.com/en/reef/cephfs/createfs/#creating-a-file-system
That page, I think, shows names like cephfs_metadata and cephfs_data. If you
later mix in RBD and RGW, you may find less-descriptive names confusing. YMMV.

>> I suggested 250.
>
> Yes, but it is actually great to have even fewer objects per PG, because then
> I'd arrive at 250k objects/PG (instead of my 2M from before), which should
> make the recovery time of an individual PG more reasonable.

Recovery AIUI proceeds at object granularity, so there's less of a benefit
there than you might think, and more PGs also mean more peering and memory
use. Note also that this target is a maximum: depending on where the
calculated pg_num values for the pools land, the effective PG ratio (as
reported by the PGS column in `ceph osd df`) will usually be lower. Sometimes
there can be too much of a good thing. I still suggest a more modest value to
start.

> So I think it's great that I get 8x more PGs.
>
> But I'd like to understand _why_ it's happening as it did, because I expected
> that a 4x increase of `*_pg_per_osd` should only be able to achieve a 4x PG
> increase.

Remember that the autoscaler is juggling the parameters of multiple pools, and
with both EC and replicated pools in the mix the calculations become nuanced.
The default max-PG-per-OSD target of 100 might constrain each pool to a
different extent, and as you raise it, each pool's calculated value may
increase independently. It's not a strict multiplier.

> I'm wondering if I had hit an autoscaler bug before (that my PGs for data_ec
> should really have been at 512 instead of 256), which would be good to report
> if so.

I don't think there's a bug as such, but rather that the default target of 100
is suboptimal in the BlueStore era, which sometimes leaves the autoscaler's
hands tied with respect to what it really should be doing.
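For what it's worth, the per-PG object counts quoted above check out. A quick
back-of-the-envelope sketch (the total object count is inferred from your
figures, roughly 512M objects in the pool, and is my assumption, not something
you stated directly):

```python
# Back-of-the-envelope check of the objects-per-PG figures above.
# Assumption: ~512M objects total, inferred from the quoted ~2M
# objects/PG at pg_num=256.
total_objects = 256 * 2_000_000

for pg_num in (256, 2048):
    print(f"pg_num={pg_num}: ~{total_objects // pg_num:,} objects per PG")
```

So the 8x jump from 256 to 2048 PGs is exactly what takes you from ~2M down to
~250k objects per PG.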
> I think your explanation with `NEW PG_NUM` having to be 3x larger for the
> autoscaler to take action
> (https://docs.ceph.com/en/latest/rados/operations/placement-groups/#viewing-pg-scaling-recommendations)
> makes sense:
>
>> It was constrained tightly, and in order to avoid flapping it only takes
>> action when the value of pg_num it sets will increase by at least 3x.
>
> E.g. if before I was at 256, and `NEW PG_NUM` was at 500 (< 768 = 3*256),
> then it would not take action; if my increase in settings by 4x would result
> in `NEW PG_NUM` being 500 * 4 = 2000, it makes sense that it then sets it to
> 2048.

pg_num values should always be a power of 2. Other values are possible, but
they lead to certain suboptimal dynamics, including the potential for
decreased balancer efficiency.

> So I think that explains it sufficiently, thanks!

Glad to help.

>> Remember that mon_max_pg_per_osd is a failsafe, it does not affect the
>> autoscaler's determinations.
>
> Yes, that makes sense.
>
>> `ceph osd pool ls detail` will show you a bit more detail - the pg_num vs
>> pgp_num values for each pool and, given your names, the application
>> association for each.
>
> For reference, here's my output:
>
> pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 6 object_hash
> rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 21504 flags
> hashpspool stripe_width 0 pg_num_min 1 application mgr,mgr_devicehealth
> read_balance_score 15.79

This pool will only ever have one PG.
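As an aside, the 3x-threshold arithmetic worked through above can be sketched
as a toy model. This is my own simplification, not the actual pg_autoscaler
code, but it reproduces the 256 -> 2048 jump:

```python
from math import sqrt

def nearest_pow2(x: float) -> int:
    """Round x to the nearest power of two in the geometric sense."""
    p = 1
    while p * sqrt(2) < x:
        p *= 2
    return p

def autoscaler_step(current_pg_num: int, ideal_pg_num: float) -> int:
    """Simplified model: only increase pg_num when the rounded ideal
    value is at least 3x the current one (the anti-flapping rule)."""
    proposed = nearest_pow2(ideal_pg_num)
    return proposed if proposed >= 3 * current_pg_num else current_pg_num

# Old targets: an ideal value of ~500 rounds to 512, but 512 < 3 * 256 = 768,
# so the autoscaler takes no action.
print(autoscaler_step(256, 500))   # 256
# Targets raised 4x: ~2000 rounds to 2048, which clears the 768 threshold.
print(autoscaler_step(256, 2000))  # 2048
```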
> pool 2 'data' replicated size 3 min_size 2 crush_rule 7 object_hash
> rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on last_change 21506 lfor
> 0/1905/12064 flags hashpspool stripe_width 0 pg_num_min 1024 application
> cephfs read_balance_score 2.67
>
> pool 3 'data_ec' erasure profile ec_profile size 6 min_size 5 crush_rule 8
> object_hash rjenkins pg_num 505 pgp_num 377 pg_num_target 2048 pgp_num_target
> 2048 autoscale_mode on last_change 24867 lfor 0/2200/24867 flags
> hashpspool,ec_overwrites stripe_width 16384 application cephfs

This pool is still in the process of being scaled, or else mon_max_pg_per_osd
has not been properly increased. Run

    ceph config dump | grep mon_max_pg_per_osd

and see whether you have this set at `global` scope. It's possible that
different values are set at the `global` and `osd` scopes, which would lead to
the global setting not taking effect.

> pool 4 'metadata' replicated size 3 min_size 2 crush_rule 9 object_hash
> rjenkins pg_num 306 pgp_num 178 pg_num_target 512 pgp_num_target 512
> autoscale_mode on last_change 24869 lfor 0/17183/24869 flags hashpspool
> stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5
> application cephfs read_balance_score 2.93

Same here.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io