> Hi Guys,
> I have 7 OSD nodes with 10X15T NVMe disks on each OSD node.
>
> To start with, I want to use only 8X15T disks on each OSD node and
> keep 2X15T disks spare in case of any disk failure and recovery event.
>
> I am going to use a 4+2 EC CephFS data pool to store data.
>
> So, with the above set-up, what will be the optimal number of
> placement groups per OSD?
>
> As per the PG calculator:
>
> (8 X 7 X 100) / 6 = 933.33, nearest power of 2 is 1024.
>
> With 1024 placement groups distributed across 56 OSDs, that evaluates
> to approximately 18 placement groups per OSD.
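Your arithmetic checks out. As a minimal sketch in plain Python (the
constants are just the figures from your mail, nothing Ceph-specific):

    import math

    osds = 8 * 7                  # 8 in-use disks per node x 7 nodes = 56
    k, m = 4, 2                   # 4+2 erasure code -> 6 shards per PG
    target = 100                  # rule of thumb: ~100 PG shards per OSD

    raw = osds * target / (k + m)            # 933.33
    pg_num = 2 ** round(math.log2(raw))      # nearest power of two: 1024

    print(pg_num,                            # 1024
          round(pg_num / osds),              # ~18 whole PGs per OSD
          pg_num * (k + m) // osds)          # ~109 PG shards per OSD

Both per-OSD figures are "correct"; they just count different things: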
With 1024 PGs, each consisting of 6 parts (4 pieces of split data and 2
parity pieces), there will be 6144 pieces to spread out over 56 OSDs,
giving you about 109 per OSD, which is a good number. There is a naming
problem here: "PG" can mean a slice of a pool's total storage, but it
can also mean the actual placement-group shard that has to live on an
OSD. The latter should be around 100 per OSD; the former should be a
nice power of two.

> I don't think it's optimal, as the Ceph docs recommend 50-100 PGs per
> OSD.

It will be fine, unless you have tons of other similar pools on the
same OSDs, of course. If you made two such pools, you would have to
halve the number to 512 PGs per pool.

> So, am I doing something wrong, or missing something while calculating
> the number of PGs per OSD?
>
> Also, will it be best practice to keep 2X15T disks spare on each OSD
> node, or should I use all of them?

Use all of them. If they are already in, they will act as both storage
and "hot spares", making it possible for the cluster to handle a
faulted OSD without you taking any action: it moves the data over to
any other OSD with free space by itself, rebuilding from the 4+2 (now
3+2 or 4+1 for the affected PGs) setup. You can simply decide to buy
more space when the capacity that 8x15T would have given you is used
up, even while running 10x15T per host.

Also, you can't really run OSDs over 85% full for long, because sad
things might happen if you get faults at that point. So the more OSDs
you have, the better you can plan for expansions, and the less impact a
single failing disk has on the whole cluster. (If you only had 10 OSDs
and one failed, 10% of the data would need to be rebuilt; with 100 OSDs
and one failure, only 1% needs to find new OSDs to land on.) A rough
sketch of this arithmetic follows at the end of this mail.

> Also, I am going to deploy the 7 OSD nodes across 4 racks and will be
> using the failure domain "rack" to have Ceph handle an entire rack
> failure.

While you can build with topologies like rack, zone, dc and so on, you
don't have to. Just having the CRUSH rules use the default "host"
failure domain is very often sufficient, especially with as few OSD
hosts as you currently have. Note also that a 4+2 profile needs at
least six separate failure domains to place its shards, so four racks
would not be enough for a rack-level rule anyway. I can't speak for all
the world's co-hosting sites, but in my 30 years as an admin I have yet
to see a rack failure. I've seen tons of variations of disk/host
failures and crashes, and whole sites losing power, but never a single
rack failing while the rack next to it goes on. (Example commands for
the host failure domain are at the very end of this mail.)
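To put rough numbers on the capacity argument (same sketch style as
above; 15T taken at face value, and real usable space will be somewhat
lower after overhead):

    # Rough capacity/rebuild arithmetic with all 70 disks in use.
    # Assumed values from this thread; not a Ceph calculation.
    osds = 10 * 7                    # 70 OSDs
    disk_tb = 15
    fill_cap = 0.85                  # stay under ~85% full per OSD
    k, m = 4, 2

    raw_tb = osds * disk_tb                        # 1050 TB raw
    usable_tb = raw_tb * fill_cap * k / (k + m)    # ~595 TB of user data

    print(round(usable_tb),                        # 595
          f"{1 / osds:.1%}")                       # ~1.4% of the data
                                                   # moves per failed OSD

With only 8 disks per node in (56 OSDs), the same arithmetic gives
about 476 TB usable and roughly 1.8% of the data to rebuild per lost
disk.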
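And if you do go with the default host failure domain, the profile and
pool creation would look something like this (a sketch only; the "ec42"
and "cephfs_data" names are made up, and the 1024 is the pg_num worked
out above):

    ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_data 1024 1024 erasure ec42
    # EC pools need overwrites enabled before CephFS can use them as a
    # data pool:
    ceph osd pool set cephfs_data allow_ec_overwrites true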