Sorry, that was a typo: I meant 4T SSDs, not 6T.

On Mon, 21 Apr, 2025, 5:18 pm Anthony D'Atri, <anthony.da...@gmail.com> wrote:
> > On Apr 21, 2025, at 6:54 AM, gagan tiwari <gagan.tiw...@mathisys-india.com> wrote:
> >
> > Hi Anthony,
> > Based on your inputs and further digging into the Ceph documentation, I am now thinking of going with 6 OSD nodes for a k=4, m=2 EC set-up.
>
> Be aware that with that architecture, when you lose one drive the cluster's capacity will decrease by that drive's capacity until it is restored.
>
> > As I mentioned, we need maximum usable space, and we are more concerned about data safety and the best read performance from the cluster. Write operations will be done on a separate storage solution via NFS.
>
> Different data sets? Almost sounds like a task for Aerospike.
>
> > So, with each OSD node having 22 x 4T enterprise SSDs
>
> No QVOs?
>
> > we will have 88T x 6 = 528T raw space. With 4+2 EC, it will hopefully provide us with 390T usable space. So that will be enough for us to start with.
>
> 6TB sounds like mixed-use 3DWPD SSDs? If so, those are almost certainly overkill. You'll be fine with read-intensive SSDs, which would be 7.6TB.
>
> Remember the below when planning usable space:
>
> * Storage vendors use base-10 units (TB) while humans mostly use base-2 units (TiB), so 528 TB = 480 TiB.
> * Ceph has nearfull, backfillfull, and full ratios. The default nearfull ratio is 85%, so you will get a warning state at roughly 408 TiB stored, OSDs will no longer accept backfill at roughly 432 TiB stored, and will no longer accept writes at 456 TiB stored.
> * With CephFS, files smaller than, say, 128KB will currently waste a noticeable fraction of raw capacity. How large are your files?
>
> > So, I need to know what the data safety level will be with the above set-up (i.e. 6 OSD nodes with 4+2 EC). How many OSD (disk) and node failures can the above set-up withstand?
>
> With the above topology, you can sustain one OSD failure at a time without losing data availability. You can sustain two overlapping OSD failures without losing data, but it will become unavailable until replication is restored.
>
> You can sustain one node being down and data will still be available. You can sustain two nodes being down without data loss.
>
> > Also, if, later, we need to add more OSD nodes to get more usable space, will we need to add same-size disks (4T), or can we add nodes with bigger disks (8T or 15T)?
>
> Above you wrote 6T but here you write 4T, which is it? Note that a read-intensive enterprise SSD will be 3.84 TB, which means 3.5 TiB.
>
> You can mix OSD drive sizes, but be aware that with a 4,2 EC profile for your bulk data you will absolutely want to add them evenly across nodes. You will want every node to have the same total capacity, otherwise some capacity may not be usable, because every node will need to place one shard of that bulk EC data.
>
> ceph config set global mon_max_pg_per_osd 1000
>
> ^ this will help avoid certain problem scenarios when mixing drive capacities.
>
> > Besides the OSD servers, we are going to have three Dell servers with 8 cores and 64G RAM each to run 3 monitor daemons, one on each server.
>
> OK. Better yet would be to also run 2 mons on the OSD servers.
>
> > One server with 4 cores, 64G RAM, and a high core frequency (4800 MHz) to run the MDS daemon.
> >
> > Please advise.
> >
> > Thanks,
> > Gagan
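For reference, a minimal sketch of how a 4+2 profile and a CephFS data pool on it might be created; the profile name "ec-4-2", pool name "cephfs_data_ec", the <fs_name> placeholder, and the PG count are illustrative assumptions, not values from this thread:

    # EC profile: 4 data + 2 coding shards, one shard per host
    ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host

    # Bulk data pool on that profile; the PG count is only a starting point
    ceph osd pool create cephfs_data_ec 256 256 erasure ec-4-2

    # Required for CephFS (and RBD) data on an EC pool
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true

    # Attach as an additional data pool; the metadata pool stays replicated
    ceph fs add_data_pool <fs_name> cephfs_data_ec

The nearfull/backfillfull/full thresholds mentioned above can be checked with something like:

    ceph osd dump | grep -E 'nearfull_ratio|backfillfull_ratio|full_ratio'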
> > On Tue, Apr 15, 2025 at 8:14 PM Anthony D'Atri <anthony.da...@gmail.com> wrote:
> >
>> It's a function of your use-case.
>>
>> > On Apr 14, 2025, at 8:41 AM, Anthony Fecarotta <anth...@linehaul.ai> wrote:
>> >
>> >> MDS (if you're going to CephFS vs using S3 object storage or RBD block)
>> >
>> > Hi Anthony,
>> >
>> > Can you elaborate on this remark?
>> >
>> > Should one choose between using CephFS vs S3 storage (as it pertains to best practices)?
>> >
>> > On Proxmox, I am using both CephFS and RBD.
>> >
>> > Regards,
>> > Anthony Fecarotta
>> > Founder & President
>> > anth...@linehaul.ai | 224-339-1182 | (855) 625-0300
>> > 1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
>> > www.linehaul.ai
>> >
>> > On Sun Apr 13, 2025, 04:28 PM GMT, Anthony D'Atri <anthony.da...@gmail.com> wrote:
>> >>
>> >>> On Apr 13, 2025, at 12:00 PM, Brendon Baumgartner <bren...@netcal.com> wrote:
>> >>>
>> >>>> On Apr 11, 2025, at 10:13, gagan tiwari <gagan.tiw...@mathisys-india.com> wrote:
>> >>>>
>> >>>> Hi Anthony,
>> >>>> We will be using Samsung SSD 870 QVO 8TB disks on all OSD servers.
>> >>>
>> >>> I'm a newbie to ceph and I have a 4 node cluster that doesn't have a lot of users, so downtime is easily scheduled for tinkering. I started with consumer SSDs (SATA/NVMe) because they were free and lying around. Performance was bad. Then just the NVMes, still bad. Then enterprise SSDs, still bad (relative to DAS anyway).
>> >>
>> >> Real enterprise SSDs? Enterprise NVMe, not enterprise SATA? Sellers can lie sometimes. Also be sure to update firmware to the latest; that can make a substantial difference.
>> >>
>> >> Other factors include:
>> >>
>> >> * Enough hosts and OSDs. Three hosts with one OSD each aren't going to deliver a great experience.
>> >> * At least 6GB of available physmem per NVMe OSD.
>> >> * How you measure - a 1K QD1 fsync workload is going to be more demanding than a buffered 64K QD32 workload.
>> >>>
>> >>> Each step on the journey to enterprise SSDs made things faster. The problem with the consumer stuff is the latency. Enterprise SSDs are 0-2ms. Consumer SSDs are 15-300ms. As you can see, the latency difference is significant.
>> >>
>> >> Some client SSDs are "DRAMless": they don't have the ~1GB of onboard RAM per 1TB of capacity used for the LBA indirection table. This can be a substantial issue for enterprise workloads.
>> >>
>> >>> So from my experience, I would say ceph is very slow in general compared to DAS. You need all the help you can get.
>> >>>
>> >>> If you want to use the consumer stuff, I would recommend making a slow tier (a 2nd pool with a different policy). Or I suppose just expect it to be slow in general. I still have my consumer drives installed, just configured as a 2nd tier, which is unused right now because we have an old JBOD for the 2nd tier that is much faster.
>> >>
>> >> How many drives in each?
>> >>>
>> >>> Good luck!
>> >>>
>> >>> _BB
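As a rough way to compare sync-write latency between consumer and enterprise drives like those discussed above, a QD1 fsync-style fio run along these lines is one option; the filename and size are placeholders, and pointing fio at a raw device instead of a file will destroy its data:

    # 4K random writes, queue depth 1, fsync after every write
    fio --name=synclat --filename=/mnt/test/fio.dat --size=10G \
        --ioengine=libaio --rw=randwrite --bs=4k --iodepth=1 \
        --numjobs=1 --direct=1 --fsync=1 --time_based --runtime=60

The clat (completion latency) percentiles in the output are the numbers to compare; drives without power-loss protection typically show far higher and more variable sync-write latency, which lines up with the 15-300ms vs 0-2ms figures quoted above.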
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io