Hi Janne,

Thanks for the explanation. So, using all 10x15T disks on the 7 OSD nodes, the number of PGs will be:
(10 x 7 x 100) / 6 = 1166.66, rounded up to the next power of 2 is 2048. So I will need to set 2048 placement groups. With 2048 PGs there will be 12,288 pieces to be spread out over 70 OSDs, which gives me about 175.5 per OSD (a quick sketch of this arithmetic is included after the quoted mail below). Will this be an optimal number, or should it simply be rounded off to 176?

Thanks,
Gagan

On Thu, May 1, 2025 at 12:08 PM Janne Johansson <icepic...@gmail.com> wrote:
>
> > Hi Guys,
> >
> > I have 7 OSD nodes with 10x15T NVMe disks on each OSD node.
> >
> > To start with, I want to use only 8x15T disks on each OSD node and keep
> > 2x15T disks spare in case of any disk failure and recovery event.
> >
> > I am going to use a 4+2 EC CephFS data pool to store data.
> >
> > So, with the above set-up, what will be the optimal number of placement
> > groups per OSD?
> >
> > As per the PG calculator:
> >
> > (8 x 7 x 100) / 6 = 933.33, the nearest power of 2 is 1024.
> >
> > With 1024 placement groups distributed across 56 OSDs, that evaluates to
> > approximately 18 placement groups per OSD.
>
> With 1024 PGs, each consisting of 6 parts (4 pieces of split data and
> 2 checksum pieces), there will be 6144 pieces to be spread out on 56
> OSDs, giving you 109 per OSD, a good number.
>
> There is a naming problem here: a PG could either mean "a piece of a
> pool's total storage", but it is also the actual piece that has to live
> on an OSD. The latter should be around 100 per OSD, the former should
> be a nice power of two.
>
> > I don't think it's optimal, as the Ceph docs recommend 50-100 PGs per OSD.
>
> It will be fine, unless you have tons of other similar pools on the
> same OSDs of course. If you made two such pools, you would have to
> halve the number to 512 PGs in the pools.
>
> > So, am I doing something wrong, or missing something while calculating the
> > number of PGs per OSD?
> >
> > Also, will it be best practice to keep 2x15T disks spare on each OSD node,
> > or should I use all of them?
>
> Use all of them. They will act as both storage and "hot spares" if
> they are already in, making it possible for the cluster to handle a
> faulted OSD without you taking any action, moving data over to any
> other OSD with free space by itself, rebuilding from the 4+2 (now 3+2
> or 4+1 for that PG) setup.
>
> You can just decide to buy more space when the space 8x15T could bear
> is used up, even when using 10x15T per host.
>
> And also, you can't really go over 85% full on OSDs for long times,
> because sad things might happen if you get faults at that point. So
> the more OSDs you have, the better you can plan for expansions and the
> less impact a single failing disk has on the whole cluster. (I.e., if
> you only had 10 OSDs and one fails, 10% of the data needs to be
> rebuilt, while if you had 100 OSDs and one fails, only 1% needs to
> find new OSDs to land on.)
>
> > Also, I am going to deploy the 7 OSD nodes across 4 racks and will be using
> > the failure domain "rack" to have Ceph handle an entire rack failure.
>
> While you can build with topologies like rack, zone, dc and so on, you
> don't have to. Just having the crush rules use the default "host"
> failure domain is very often sufficient, especially when having as few
> OSD hosts as you currently have. I can't speak for all the world's
> cohosting sites, but in my 30 years of admining I've yet to see a rack
> failure. I've seen tons of versions of disk/host failures/crashes and
> whole sites going out of power, but never a single rack failing while
> the rack next to it goes on.
>
> --
> May the most significant bit of your life be positive.
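For reference, a minimal Python sketch of the sizing rule used in this thread (pg_num = OSDs x target-PGs-per-OSD / (k+m), rounded up to a power of two, then turned back into a per-OSD shard count). The 70-OSD / 4+2 figures are just the numbers from this thread, and the ~100-per-OSD target is the rule of thumb quoted above, not a hard limit:

    # Sketch of the PG sizing arithmetic discussed above (assumed rule of
    # thumb: ~100 PG shards per OSD, pg_num rounded up to a power of two).
    def suggest_pg_num(num_osds, k, m, target_per_osd=100):
        raw = num_osds * target_per_osd / (k + m)
        pg_num = 1
        while pg_num < raw:          # round up to the next power of two
            pg_num *= 2
        return pg_num

    def shards_per_osd(pg_num, k, m, num_osds):
        # each EC PG has k data + m coding shards, all of which land on OSDs
        return pg_num * (k + m) / num_osds

    osds, k, m = 70, 4, 2            # 7 hosts x 10 NVMe OSDs, EC 4+2
    pg_num = suggest_pg_num(osds, k, m)
    print(pg_num, round(shards_per_osd(pg_num, k, m, osds), 1))
    # prints: 2048 175.5

With a single big pool on these OSDs that lands somewhat above the ~100 figure, which is the trade-off Janne describes above; running the same sketch with 1024 PGs instead gives about 88 shards per OSD.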
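And a rough sketch of how the pool setup discussed in the thread could look on the command line. The profile, pool, and filesystem names are made up for illustration; check the Ceph documentation for your release before running anything:

    # EC profile with 4 data + 2 coding chunks and "host" as the failure
    # domain, as suggested in the reply (use crush-failure-domain=rack if
    # you do want per-rack placement)
    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host

    # CephFS data pool using that profile, with the pg_num worked out above
    ceph osd pool create cephfs_data_ec 2048 2048 erasure ec42
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true

    # attach it to an existing filesystem (here called "cephfs")
    ceph fs add_data_pool cephfs cephfs_data_ec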