Hi Janne,

Thanks for the explanation.

So, using all 10x15T disks on the 7 OSD nodes, the number of PGs will be:

( 10 x 7 x 100 ) / 6 = 1166.67, rounded up to the next power of 2: 2048.

So I will need to set 2048 placement groups. With 2048 PGs, there will be
12,288 pieces to be spread out on 70 OSDs, which gives me about 175.5 per
OSD. Will this be an optimal number, or should it be rounded off to 176?
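
For my own reference, here is a minimal Python sketch of that arithmetic
(the 100-PGs-per-OSD target and the 4+2 EC profile are from this thread;
the function names are just illustrative, not anything from Ceph):

    import math

    def suggested_pg_num(num_osds, ec_k, ec_m, target_per_osd=100):
        # PG-calculator style: (OSDs * target) / (k + m),
        # rounded up to the next power of two.
        raw = num_osds * target_per_osd / (ec_k + ec_m)
        return 2 ** math.ceil(math.log2(raw))

    def pg_pieces_per_osd(pg_num, ec_k, ec_m, num_osds):
        # Each PG of a k+m EC pool places k+m pieces on distinct OSDs.
        return pg_num * (ec_k + ec_m) / num_osds

    pgs = suggested_pg_num(70, 4, 2)              # 2048
    print(pg_pieces_per_osd(pgs, 4, 2, 70))       # ~175.5 pieces per OSD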


Thanks,
Gagan



On Thu, May 1, 2025 at 12:08 PM Janne Johansson <icepic...@gmail.com> wrote:

> > Hi Guys,
> >                 I have 7 OSD nodes with 10x15T NVMe disks on each OSD
> > node.
> >
> > To start with, I want to use only 8x15T disks on each OSD node and keep
> > 2x15T disks spare in case of any disk failure and recovery event.
> >
> > I am going to use a 4+2 EC CephFS data pool to store data.
> >
> > So, with the above set-up, what will be the optimal number of placement
> > groups per OSD?
> >
> > As per the PG calculator :-
> >
> > ( 8 x 7 x 100 ) / 6 = 933.33; the nearest power of 2 is 1024.
> >
> > With 1024 placement groups distributed across 56 OSDs, that evaluates to
> > approximately 18 placement groups per OSD.
>
> With 1024 PGs, each consisting of 6 parts (4 pieces of split data and
> 2 checksum pieces), there will be 6144 pieces to be spread out on 56
> OSDs, giving you 109 per OSD, a good number.
>
> There is a naming problem here: a PG can either mean "a piece of a
> pool's total storage", or the actual piece that has to live on an
> OSD. The latter should be around 100 per OSD, the former should
> be a nice power of two.
>
> > I don't think it's optimal as the Ceph docs recommend 50-100 PGs per OSD.
>
> It will be fine, unless you have tons of other similar pools on the
> same OSDs, of course. If you made two such pools, you would have to
> halve the number to 512 PGs per pool.
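
(A quick sketch of how that per-OSD count adds up across pools, assuming
both pools are 4+2 EC pools sharing the same 56 OSDs; plain Python for
illustration only, not Ceph tooling:)

    def pieces_per_osd(pools, num_osds):
        # pools: list of (pg_num, k_plus_m) tuples sharing the same OSDs.
        return sum(pg * km for pg, km in pools) / num_osds

    print(pieces_per_osd([(1024, 6)], 56))             # ~110 with one pool
    print(pieces_per_osd([(1024, 6), (1024, 6)], 56))  # ~219 -- too many
    print(pieces_per_osd([(512, 6), (512, 6)], 56))    # ~110 with two halved pools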
>
> > So, am I doing something wrong, or missing something while calculating
> > the number of PGs per OSD?
> >
> > Also, will it be best practice to keep 2x15T disks spare on each OSD
> > node, or should I use all of them?
>
> Use all of them. If they are already in, they will act as both storage
> and "hot spares", making it possible for the cluster to handle a
> faulted OSD without you taking any action: it will move the data over
> to any other OSD with free space by itself, rebuilding from the
> degraded 4+2 (now 3+2 or 4+1 for the affected PGs) state.
>
> You can just decide to buy more capacity when the amount of data that
> 8x15T could hold is used up, even when using 10x15T per host.
>
> And also, you can't really go over 85% full on OSDs for long periods,
> because sad things might happen if you get faults at that point. So the
> more OSDs you have, the better you can plan for expansions and the
> less impact a single failing disk has on the whole cluster. (i.e.,
> if you only had 10 OSDs and one fails, 10% of the data needs to be
> rebuilt, while if you had 100 OSDs and one fails, only 1% needs to
> find new OSDs to land on.)
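
(Similarly, a quick sketch of the impact/headroom arithmetic above, using
the 85% full ratio mentioned; again plain Python for illustration only:)

    def rebuild_fraction(num_osds, failed=1):
        # Share of the cluster's data that must be re-created elsewhere
        # when `failed` equally sized OSDs are lost.
        return failed / num_osds

    def survivors_stay_below_full(num_osds, used_ratio, failed=1, full_ratio=0.85):
        # The surviving OSDs must absorb the failed OSD's data; check the
        # resulting average utilisation against the full ratio.
        return used_ratio * num_osds / (num_osds - failed) <= full_ratio

    print(rebuild_fraction(10))                  # 0.10 -> 10% of the data moves
    print(rebuild_fraction(100))                 # 0.01 -> 1% of the data moves
    print(survivors_stay_below_full(70, 0.80))   # True: 80% * 70/69 ~ 81.2%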
>
> > Also, I am going to deploy 7 OSD nodes across 4 racks and will be using
> > the failure domain as "rack" to have Ceph handle an entire rack
> > failure.
>
> While you can build with topologies like rack, zone, dc and so on, you
> don't have to. Just having the crush rules use the default "host"
> failure domain is very often sufficient, especially when you have as
> few OSD hosts as you currently do. I can't speak for all the world's
> co-hosting sites, but in my 30 years of admining I've yet to see a rack
> failure. I've seen tons of versions of disk/host failures/crashes and
> whole sites going out of power, but never a single rack failing while
> the rack next to it goes on.
>
> --
> May the most significant bit of your life be positive.
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
