Sorry, that was a typo: I meant 4T SSDs, not 6T.

On Mon, 21 Apr, 2025, 5:18 pm Anthony D'Atri, <anthony.da...@gmail.com> wrote:
> > On Apr 21, 2025, at 6:54 AM, gagan tiwari <gagan.tiw...@mathisys-india.com> wrote:
> >
> > Hi Anthony,
> > Based on your inputs and further digging into the Ceph documentation, I am now thinking of going with 6 OSD nodes for a k=4, m=2 EC set-up.
>
> Be aware that with that architecture, when you lose one drive the cluster's capacity will decrease by that drive's capacity until it is restored.
>
> > As I mentioned, we need maximum usable space, and we are more concerned about data safety and the best read performance from the cluster. Write operations will be done on a separate storage solution via NFS.
>
> Different data sets? Almost sounds like a task for Aerospike.
>
> > So, with each OSD node having 22 x 4T enterprise SSDs
>
> No QVOs?
>
> > we will have 88T x 6 = 528T raw space. With 4+2 EC, it will hopefully provide us with 390T usable space. So that will be enough for us to start with.
>
> 6TB sounds like mixed-use 3DWPD SSDs? If so, those are almost certainly overkill. You'll be fine with read-intensive SSDs, which would be 7.6TB.
>
> Remember the below when planning usable space:
>
> * Storage vendors use base-10 units (TB) while humans mostly use base-2 units (TiB), so 528 TB = 480 TiB.
> * Ceph has nearfull, backfillfull, and full ratios. The default nearfull ratio is 85%, so you will get a warning state at roughly 408 TiB stored, OSDs will no longer accept backfill at roughly 432 TiB stored, and will no longer accept writes at 456 TiB stored.
> * With CephFS, files smaller than, say, 128KB will currently waste a noticeable fraction of raw capacity. How large are your files?
>
> > So, I need to know what the data safety level will be with the above set-up (i.e. 6 OSD nodes with 4+2 EC). How many OSD (disk) and node failures can the above set-up withstand?
>
> With the above topology, you can sustain one OSD failure at a time without losing data availability. You can sustain two overlapping OSD failures without losing data, but it will become unavailable until replication is restored.
>
> You can sustain one node being down and data will still be available. You can sustain two nodes being down without data loss.
>
> > Also, if, later, we need to add more OSD nodes to get more usable space, will we need to add same-size disks (4T), or can we add nodes with bigger disks (8T or 15T)?
>
> Above you wrote 6T but here you write 4T, which is it? Note that a read-intensive enterprise SSD will be 3.84 TB, which means 3.5 TiB.
>
> You can mix OSD drive sizes, but be aware that with a 4,2 EC profile for your bulk data you will absolutely want to add them evenly across nodes. You will want every node to have the same total capacity, otherwise some capacity may not be usable, because every node will need to place one shard of that bulk EC data.
>
> ceph config set global mon_max_pg_per_osd 1000
>
> ^ this will help avoid certain problem scenarios when mixing drive capacities.
>
> > Besides the OSD servers, we are going to have three Dell servers with 8 cores and 64G RAM each to run 3 monitor daemons, one on each server.
>
> OK. Better yet would be to also run 2 mons on the OSD servers.
>
> > One server with 4 cores, 64G RAM, and a high core frequency (4800 MHz) to run the MDS daemon.
> >
> > Please advise.
> >
> > Thanks,
> > Gagan
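For reference, a minimal sketch of how a 4+2 profile and a CephFS data pool on it might be created; the profile name "ec-4-2", pool name "cephfs_data_ec", the <fs_name> placeholder, and the PG count are illustrative assumptions, not values from this thread:

    # EC profile: 4 data + 2 coding shards, one shard per host
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host

    # Bulk data pool on that profile; the PG count is only a starting point
    ceph osd pool create cephfs_data_ec 256 256 erasure ec-4-2

    # Required for CephFS (and RBD) data on an EC pool
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true

    # Attach as an additional data pool; the metadata pool stays replicated
    ceph fs add_data_pool <fs_name> cephfs_data_ec

The nearfull/backfillfull/full thresholds mentioned above can be checked with something like:

    ceph osd dump | grep -E 'nearfull_ratio|backfillfull_ratio|full_ratio'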
> > On Tue, Apr 15, 2025 at 8:14 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
> >
>> It's a function of your use-case.
>>
>> > On Apr 14, 2025, at 8:41 AM, Anthony Fecarotta <anth...@linehaul.ai> wrote:
>> >
>> >> MDS (if you're going to CephFS vs using S3 object storage or RBD block)
>> >
>> > Hi Anthony,
>> >
>> > Can you elaborate on this remark?
>> >
>> > Should one choose between using CephFS vs S3 storage (as it pertains to best practices)?
>> >
>> > On Proxmox, I am using both CephFS and RBD.
>> >
>> > Regards,
>> > Anthony Fecarotta
>> > Founder & President
>> > anth...@linehaul.ai | 224-339-1182 | (855) 625-0300
>> > 1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
>> > www.linehaul.ai
>> >
>> > On Sun Apr 13, 2025, 04:28 PM GMT, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> >>
>> >>> On Apr 13, 2025, at 12:00 PM, Brendon Baumgartner <bren...@netcal.com> wrote:
>> >>>
>> >>>> On Apr 11, 2025, at 10:13, gagan tiwari <gagan.tiw...@mathisys-india.com> wrote:
>> >>>>
>> >>>> Hi Anthony,
>> >>>> We will be using Samsung SSD 870 QVO 8TB disks on all OSD servers.
>> >>>
>> >>> I'm a newbie to ceph and I have a 4 node cluster that doesn't have a lot of users, so downtime is easily scheduled for tinkering. I started with consumer SSDs (SATA/NVMe) because they were free and lying around. Performance was bad. Then just the NVMes, still bad. Then enterprise SSDs, still bad (relative to DAS anyway).
>> >>
>> >> Real enterprise SSDs? Enterprise NVMe, not enterprise SATA? Sellers can lie sometimes. Also be sure to update firmware to the latest; that can make a substantial difference.
>> >>
>> >> Other factors include:
>> >>
>> >> * Enough hosts and OSDs. Three hosts with one OSD each aren't going to deliver a great experience.
>> >> * At least 6GB of available physmem per NVMe OSD.
>> >> * How you measure - a 1K QD1 fsync workload is going to be more demanding than a buffered 64K QD32 workload.
>> >>>
>> >>> Each step on the journey to enterprise SSDs made things faster. The problem with the consumer stuff is the latency. Enterprise SSDs are 0-2ms. Consumer SSDs are 15-300ms. As you can see, the latency difference is significant.
>> >>
>> >> Some client SSDs are "DRAMless": they don't have the ~1GB of onboard RAM per 1TB of capacity used for the LBA indirection table. This can be a substantial issue for enterprise workloads.
>> >>
>> >>> So from my experience, I would say ceph is very slow in general compared to DAS. You need all the help you can get.
>> >>>
>> >>> If you want to use the consumer stuff, I would recommend making a slow tier (a 2nd pool with a different policy). Or I suppose just expect it to be slow in general. I still have my consumer drives installed, just configured as a 2nd tier, which is unused right now because we have an old JBOD for the 2nd tier that is much faster.
>> >>
>> >> How many drives in each?
>> >>>
>> >>> Good luck!
>> >>>
>> >>> _BB
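As a rough way to compare sync-write latency between consumer and enterprise drives like those discussed above, a QD1 fsync-style fio run along these lines is one option; the filename and size are placeholders, and pointing fio at a raw device instead of a file will destroy its data:

    # 4K random writes, queue depth 1, fsync after every write
    fio --name=synclat --filename=/mnt/test/fio.dat --size=10G \
        --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 \
        --numjobs=1 --direct=1 --fsync=1 --time_based --runtime=60

The clat (completion latency) percentiles in the output are the numbers to compare; drives without power-loss protection typically show far higher and more variable sync-write latency, which lines up with the 15-300ms vs 0-2ms figures quoted above.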
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io