> Hi Guys,
> I have 7 OSD nodes with 10x15T NVMe disks on each OSD node.
>
> To start with, I want to use only 8x15T disks on each OSD node and keep
> 2x15T disks spare in case of a disk failure and recovery event.
>
> I am going to use a 4+2 EC CephFS data pool to store data.
>
> So, with the above set-up, what will be the optimal number of placement
> groups per OSD?
>
> As per the PG calculator:
>
> (8 x 7 x 100) / 6 = 933.33; the nearest power of 2 is 1024.
>
> With 1024 placement groups distributed across 56 OSDs, that evaluates to
> approximately 18 placement groups per OSD.

With 1024 PGs, each consisting of 6 shards (4 data chunks and 2 parity
chunks), there will be 1024 x 6 = 6144 shards to spread out over 56
OSDs, giving you about 109 per OSD, which is a good number.
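
A quick shell check of the same arithmetic, if you want it (bash
integer division just truncates the .33 and the .7):

  $ echo $(( 8 * 7 * 100 / 6 ))   # 56 OSDs x 100 target PGs, over k+m=6
  933
  $ echo $(( 1024 * 6 / 56 ))     # EC shards per OSD after rounding up to 1024
  109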

There is a naming problem here: a PG can mean "a slice of a pool's
total storage", but it can also mean the actual shard of that PG that
has to live on an OSD. The latter should be around 100 per OSD; the
former should be a nice power of two.
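
You can look at both meanings directly on a running cluster: the
pool-level pg_num with "ceph osd pool get", and the per-OSD shard
count in the PGS column of "ceph osd df" (the pool name below is just
a placeholder):

  $ ceph osd pool get cephfs_data pg_num   # PGs as slices of the pool
  $ ceph osd df                            # PGS column = PG shards per OSD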

> I don't think its optimal as Ceph doc recommends 50-100 PG per OSD.

It will be fine, unless of course you have tons of other similar pools
on the same OSDs. If you made two such pools, you would have to halve
the number to 512 PGs per pool.
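
If you do split things over two pools like that, pg_num can be set at
creation time or adjusted later (pool name again a placeholder; on
releases before Nautilus you would also set pgp_num to match):

  $ ceph osd pool set cephfs_data pg_num 512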

> So, am I doing something wrong? or missing something while calculating the
> number of PGs per OSD.
>
> Also, will it be best practice to keep 2x15T disks spare on each OSD node,
> or should I use all of them?

Use all of them. If they are already in, they act as both storage and
"hot spares", making it possible for the cluster to handle a faulted
OSD without you taking any action: it moves the affected data to other
OSDs with free space by itself, rebuilding each degraded PG from the
remaining shards of the 4+2 (temporarily 3+2 or 4+1 for that PG)
layout.
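
If you want to see that self-healing in action before trusting it, you
can mark one OSD out and watch the recovery (the OSD id here is just
an example):

  $ ceph osd out 12   # backfill to the remaining OSDs starts
  $ ceph -s           # watch the degraded/misplaced counts shrink
  $ ceph osd in 12    # bring it back when you are done testing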

You can simply decide to buy more capacity once the amount of data
that 8x15T could hold is reached, even though you are running all
10x15T per host.

Also, you can't really run OSDs over 85% full for long periods,
because sad things can happen if you get faults at that point. So the
more OSDs you have, the better you can plan for expansions, and the
less impact a single failing disk has on the whole cluster. (If you
only have 10 OSDs and one fails, 10% of the data needs to be rebuilt;
if you have 100 OSDs and one fails, only 1% needs to find new OSDs to
land on.)
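
The thresholds involved are visible in the OSD map; on a cluster with
stock defaults it looks roughly like this (nearfull warns at 85%,
backfill stops at 90%, writes stop at 95%):

  $ ceph osd dump | grep ratio
  full_ratio 0.95
  backfillfull_ratio 0.9
  nearfull_ratio 0.85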

> Also, I am going to deploy 7 OSD nodes across 4 Racks and will be using the
> failure domain as "rack" to have Ceph handle the entire rack failure.

While you can build with topologies like rack, zone, dc and so on, you
don't have to. Just having the CRUSH rules use the default "host"
failure domain is very often sufficient, especially with as few OSD
hosts as you currently have. I can't speak for all the world's
colocation sites, but in my 30 years of admin work I have yet to see a
rack failure. I've seen tons of variations of disk and host failures
and crashes, and whole sites losing power, but never a single rack
failing while the rack next to it keeps running.
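
If you go with host, it can be set explicitly on the erasure-code
profile; a sketch with the numbers from this thread (profile and pool
names are made up):

  $ ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  $ ceph osd pool create cephfs_data 1024 1024 erasure ec42
  $ ceph osd pool set cephfs_data allow_ec_overwrites true  # needed for CephFS data on EC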

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
