> On Apr 21, 2025, at 6:54 AM, gagan tiwari <gagan.tiw...@mathisys-india.com> 
> wrote:
> 
> Hi Anthony,
> Based on your inputs and further digging into the Ceph documentation, I am now 
> thinking of going with 6 OSD nodes for a k=4, m=2 EC set-up. 

Be aware that with that architecture, when you lose one drive the cluster’s 
capacity will decrease by that drive’s capacity until it is restored (the 
commands below are a quick way to keep an eye on that).
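
Both of these are stock ceph CLI, nothing specific to your cluster; the first 
shows cluster-wide raw and per-pool usage, the second per-OSD fill:

# Cluster-wide raw, used, and per-pool stored capacity
ceph df

# Per-OSD utilization, weight, and PG count
ceph osd df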

> As I mentioned, we need maximum usable space and we are more concerned about 
> data safety and best read performance from the cluster.  Write operations 
> will be done on a separate storage solution via NFS. 

Different data sets?  Almost sounds like a task for Aerospike.

> 
> So, with each OSD node having 22X4T Enterprise SSD

No QVOs?

>  we will have 88X6 = 528T Raw Space. With 4X2 EC , it will hopefully provide 
> us with 390T usable space.   So, that will be enough for us to start with. 


6TB sounds like mixed-use 3DWPD SSDs?  If so, those are almost certainly 
overkill.  You’ll be fine with read-intensive SSDs, which would be 7.6TB.

Remember the following when planning usable space:

* Storage vendors use base-10 units (TB) while Ceph and most humans use base-2 
units (TiB), so 528 TB is roughly 480 TiB.
* Ceph has nearfull, backfillfull, and full ratios, which default to 85%, 90%, 
and 95% of raw capacity.  So you will get a warning state at roughly 408 TiB of 
raw usage, OSDs will no longer accept backfill at roughly 432 TiB, and they 
will no longer accept writes at roughly 456 TiB (quick arithmetic below).
* With CephFS, files smaller than, say, 128 KB will currently waste a noticeable 
fraction of raw capacity.  How large are your files?
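
Rough arithmetic for the above, using your 528 TB figure and the default 
ratios; just a sketch in plain shell:

# 528 TB (base-10) expressed as TiB (base-2)
echo "528 * 10^12 / 2^40" | bc -l     # ~480 TiB raw

# Default nearfull / backfillfull / full thresholds against that raw capacity
echo "480 * 0.85" | bc -l             # ~408 TiB raw used: nearfull warning
echo "480 * 0.90" | bc -l             # ~432 TiB: backfill stops
echo "480 * 0.95" | bc -l             # ~456 TiB: writes stop

# The ratios currently in effect on a running cluster
ceph osd dump | grep ratio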


> 
> So, I need to know what the data safety level will be with the above set-up 
> ( i.e. 6 OSD nodes with 4+2 EC ).  How many OSD ( disk ) and node failures 
> can the above set-up withstand? 

With the above topology, you can sustain one OSD failure at a time without 
losing data availability: with k=4, m=2 the pool’s default min_size is k+1 = 5, 
so PGs stay active with a single shard missing.  You can sustain two 
overlapping OSD failures without losing data, but the affected PGs drop below 
min_size and that data becomes unavailable until recovery restores it.

Since the failure domain is the host, the same applies per node: you can 
sustain one node being down and data will still be available, and you can 
sustain two nodes being down without data loss, though some data will be 
unavailable until recovery.
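
If the pool isn’t created yet, a sketch of a profile that enforces that 
behaviour; the profile name, pool name, and PG count below are placeholders:

# One EC shard per host, so a whole-node failure costs at most one shard per PG
ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data 256 256 erasure ec42

# Required if this EC pool will back CephFS (the metadata pool stays replicated)
ceph osd pool set cephfs_data allow_ec_overwrites true

# Default min_size is k+1 = 5: PGs stay active with one shard missing,
# go inactive (but intact) with two missing
ceph osd pool get cephfs_data min_size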

> 
> Also, if, later, we need to add more OSD nodes to get more usable space, 
> will we need to add the same size disks ( 4T ) or can we add nodes with 
> bigger disks ( 8T or 15T )? 

Above you wrote 6T but here you write 4T, so which is it?  Note that a 
read-intensive enterprise SSD will be 3.84 TB, which is about 3.5 TiB.

You can mix OSD drive sizes, but be aware that with a 4,2 EC profile for your 
bulk data you will absolutely want to add them evenly across nodes.  You will 
want every node to have the same total capacity; otherwise some capacity may 
not be usable, because with only six nodes every node will need to hold one 
shard of each stripe of that bulk EC data.

ceph config set global mon_max_pg_per_osd 1000

^ this will help avoid certain problem scenarios when mixing drive capacities, 
e.g. PGs refusing to activate because the larger OSDs exceed the default 
per-OSD PG limit.
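
Worth checking after you add the bigger drives; both commands are standard CLI:

# The PGS column shows how many PGs each OSD carries; larger OSDs attract more
ceph osd df tree

# Confirm the override above is in place
ceph config dump | grep mon_max_pg_per_osd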

> Besides the OSD servers, we are going to have three Dell servers with 8 cores 
> and 64G RAM each to run 3 monitor daemons, one on each server. 

Ok.  Better yet would be to also run 2 more mons on the OSD servers, for five 
in total.
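
If you’re deploying with cephadm, that is just a placement change; the 
hostnames below are placeholders:

# Five monitors: the three dedicated hosts plus two of the OSD hosts
ceph orch apply mon --placement="5 mon1 mon2 mon3 osd1 osd2"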

> 
> One server with 4 cores, 64G RAM, and a high core frequency ( 4800 MHz ) to 
> run the MDS daemon. 
> 
> Please advise
> 
> 
> Thanks,
> Gagan 
> 
> On Tue, Apr 15, 2025 at 8:14 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> It’s a function of your use-case.
>> 
>> 
>> > On Apr 14, 2025, at 8:41 AM, Anthony Fecarotta <anth...@linehaul.ai> wrote:
>> > 
>> >> MDS (if you’re going to CephFS vs using S3 object storage or RBD block)
>> > Hi Anthony,
>> > 
>> > Can you elaborate on this remark?
>> > 
>> > Should one choose between using CephFS vs S3 Storage (as it pertains to 
>> > best practices)?
>> > 
>> > On Proxmox, I am using both CephFS and RBD.
>> > 
>> > 
>> > Regards,
>> > Anthony Fecarotta
>> > Founder & President
>> > anth...@linehaul.ai
>> > 224-339-1182 | (855) 625-0300
>> > 1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
>> > www.linehaul.ai
>> > https://www.linkedin.com/in/anthony-fec/
>> > 
>> > On Sun Apr 13, 2025, 04:28 PM GMT, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> >> 
>> >>> On Apr 13, 2025, at 12:00 PM, Brendon Baumgartner <bren...@netcal.com> wrote:
>> >>> 
>> >>> 
>> >>>> On Apr 11, 2025, at 10:13, gagan tiwari <gagan.tiw...@mathisys-india.com> wrote:
>> >>>> 
>> >>>> Hi Anthony,
>> >>>> We will be using Samsung SSD 870 QVO 8TB disks on
>> >>>> all OSD servers.
>> >>> 
>> >>> I’m a newbie to ceph and I have a 4 node cluster and it doesn’t have a 
>> >>> lot of users so downtime is easily scheduled for tinkering. I started 
>> >>> with consumer SSDs (SATA/NVMEs) because they were free and lying around. 
>> >>> Performance was bad. Then just the NVMEs, still bad. Then enterprise 
>> >>> SSDs, still bad (relative to DAS anyway).
>> >> 
>> >> Real enterprise SSDs?  Enterprise NVMe, not enterprise SATA?  Sellers can lie 
>> >> sometimes.  Also be sure to update firmware to the latest; that can make a 
>> >> substantial difference.
>> >> 
>> >> Other factors include:
>> >> 
>> >> * Enough hosts and OSDs. Three hosts with one OSD each aren’t going to 
>> >> deliver a great experience
>> >> * At least 6GB of available physmem per NVMe OSD
>> >> * How you measure - a 1K QD1 fsync workload is going to be more demanding 
>> >> than a buffered 64K QD32 workload.
>> >>> 
>> >>> Each step on the journey to enterprise SSDs made things faster. The 
>> >>> problem with the consumer stuff is the latency. Enterprise SSDs are 
>> >>> 0-2ms. Consumer SSDs are 15-300ms. As you can see, the latency 
>> >>> difference is significant.
>> >> 
>> >> Some client SSDs are “DRAMless”: they don’t have the ~1GB of onboard RAM per 
>> >> 1TB of capacity used for the LBA indirection table.  This can be a substantial 
>> >> issue for enterprise workloads.
>> >> 
>> >>> 
>> >>> So from my experience, I would say ceph is very slow in general compared 
>> >>> to DAS. You need all the help you can get.
>> >>> 
>> >>> If you want to use the consumer stuff, I would recommend making a slow 
>> >>> tier (2nd pool with a different policy). Or I suppose just expect it to 
>> >>> be slow in general. I still have my consumer drives installed, just 
>> >>> configured as a 2nd tier which is unused right now because we have an 
>> >>> old JBOD for 2nd tier that is much faster.
>> >> 
>> >> How many drives in each?
>> >>> 
>> >>> Good luck!
>> >>> 
>> >>> _BB
>> >>> 
>> >>> 
>> 

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
