The files are not small; sizes are in the GB and MB range. Only a few files
will be around 2.5 KB.

Thanks,
Gagan

On Mon, Apr 21, 2025 at 8:13 PM gagan tiwari <
gagan.tiw...@mathisys-india.com> wrote:

> Sorry, that was a typo; I meant 4T SSDs, not 6T.
>
> On Mon, 21 Apr, 2025, 5:18 pm Anthony D'Atri, <anthony.da...@gmail.com>
> wrote:
>
>>
>>
>> On Apr 21, 2025, at 6:54 AM, gagan tiwari <
>> gagan.tiw...@mathisys-india.com> wrote:
>>
>> Hi Anthony,
>> Based on your inputs and further digging into the Ceph documentation, I am
>> now thinking of going with 6 OSD nodes for a k=4, m=2 EC setup.
>>
>>
>> Be aware that with that architecture when you lose one drive, the
>> cluster’s capacity will decrease by that drive’s capacity until it is
>> restored.
>>
>> As I mentioned, we need maximum usable space, and we are more concerned
>> about data safety and the best read performance from the cluster. Write
>> operations will be done on a separate storage solution via NFS.
>>
>>
>> Different data sets?  Almost sounds like a task for Aerospike.
>>
>>
>> So, with each OSD node having 22 x 4T enterprise SSDs
>>
>>
>> No QVOs?
>>
>>  we will have 88T x 6 = 528T raw space. With 4+2 EC, it will hopefully
>> provide us with 390T usable space. So, that will be enough for us to
>> start with.
>>
>>
>>
>> 6TB sounds like mixed-use 3DWPD SSDs?  If so, those are almost certainly
>> overkill.  You’ll be fine with read-intensive SSDs which would be 7.6TB.
>>
>> Remember the below when planning usable space:
>>
>> * Storage vendors use base-10 units (TB) while humans mostly use base-2
>> units (TiB), so 528 TB is roughly 480 TiB.
>> * Ceph has nearfull, backfillfull, and full ratios.  The default nearfull
>> ratio is 85%, so you will get a warning state at roughly 408 TiB stored,
>> OSDs will no longer accept backfill at roughly 432 TiB stored, and will no
>> longer accept writes at roughly 456 TiB stored (arithmetic sketched below).
>> * With CephFS, files smaller than, say, 128 KB will currently waste a
>> noticeable fraction of raw capacity.  How large are your files?
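>>
>> A minimal sketch of that arithmetic in Python, assuming the default
>> nearfull/backfillfull/full ratios of 0.85/0.90/0.95 and 6 nodes of
>> 22 x 4 TB drives as discussed above:
>>
>> # Capacity-planning sketch; drive count and size are the figures quoted above.
>> raw_tb = 6 * 22 * 4                      # 528 TB, base-10 as vendors advertise
>> raw_tib = raw_tb * 10**12 / 2**40        # ~480 TiB, base-2 as Ceph reports
>>
>> nearfull, backfillfull, full = 0.85, 0.90, 0.95   # Ceph default ratios
>> print(f"raw capacity:        ~{raw_tib:.0f} TiB")
>> print(f"nearfull warning at  ~{raw_tib * nearfull:.0f} TiB stored")
>> print(f"backfill stops at    ~{raw_tib * backfillfull:.0f} TiB stored")
>> print(f"writes stop at       ~{raw_tib * full:.0f} TiB stored")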
>>
>>
>>
>> So, I need to know what the data safety level will be with the above
>> set-up (i.e. 6 OSD nodes with 4+2 EC): how many OSD (disk) and node
>> failures the above set-up can withstand.
>>
>>
>> With the above topology, you can sustain one OSD failure at a time
>> without losing data availability.  You can sustain two overlapping OSD
>> failures without losing data, but it will become unavailable until
>> replication is restored.
>>
>> You can sustain one node being down and data will still be available.
>> You can sustain two nodes being down without data loss.
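>>
>> A small sketch of that reasoning, assuming the usual EC pool default of
>> min_size = k + 1:
>>
>> # Failure-tolerance sketch for a 4+2 EC pool, failure domain = host.
>> k, m = 4, 2              # 4 data shards + 2 coding shards per object
>> min_size = k + 1         # below this many shards a PG stops serving I/O
>>
>> safe_without_data_loss = m                   # any 2 overlapping failures
>> safe_without_downtime = (k + m) - min_size   # 1 failure with defaults
>>
>> print(f"failures tolerated without data loss:  {safe_without_data_loss}")
>> print(f"failures tolerated without losing I/O: {safe_without_downtime}")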
>>
>>
>> Also, if, later, we need to add more OSD nodes to get more usable space,
>> will we need to add disks of the same size (4T), or can we add nodes with
>> bigger disks (8T or 15T)?
>>
>>
>> Above you wrote 6T but here you write 4T, which is it?  Note that a
>> read-intensive enterprise SSD will be 3.84 TB which means 3.5 TiB.
>>
>> You can mix OSD drive sizes, but be aware that with a 4,2 EC profile for
>> your bulk data you will absolutely want to add them evenly across nodes.
>> You will want every node to have the same total capacity, otherwise some
>> capacity may not be usable, because every node will need to place one shard
>> of that bulk EC data.
>>
>> ceph config set global mon_max_pg_per_osd 1000
>>
>> ^ this will help avoid certain problem scenarios when mixing drive
>> capacities.
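>>
>> A rough sketch of why uneven node capacities strand space when the EC
>> width equals the node count (failure domain = host); the node sizes here
>> are hypothetical:
>>
>> # With k+m = 6 and 6 hosts, every host must hold one shard of every PG,
>> # so placement is effectively limited by the smallest host.
>> node_capacity_tb = [88, 88, 88, 88, 88, 132]   # hypothetical: one larger node
>> k, m = 4, 2
>> assert len(node_capacity_tb) == k + m
>>
>> usable_raw_tb = len(node_capacity_tb) * min(node_capacity_tb)
>> stranded_tb = sum(node_capacity_tb) - usable_raw_tb
>> print(f"raw capacity the EC pool can actually use: ~{usable_raw_tb} TB")
>> print(f"raw capacity effectively stranded:         ~{stranded_tb} TB")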
>>
>> Besides the OSD servers, we are going to have three Dell servers with 8
>> cores and 64G RAM each to run 3 monitor daemons, one on each server.
>>
>>
>> OK.  Better yet would be to also run 2 mons on the OSD servers as well.
>>
>>
>> One server with 4 cores, 64G RAM, and a high core frequency (4800 MHz) to
>> run the MDS daemon.
>>
>> Please advise
>>
>>
>> Thanks,
>> Gagan
>>
>> On Tue, Apr 15, 2025 at 8:14 PM Anthony D'Atri <anthony.da...@gmail.com>
>> wrote:
>>
>>> It’s a function of your use-case.
>>>
>>>
>>> > On Apr 14, 2025, at 8:41 AM, Anthony Fecarotta <anth...@linehaul.ai>
>>> wrote:
>>> >
>>> >> MDS (if you’re going to CephFS vs using S3 object storage or RBD
>>> block)
>>> > Hi Anthony,
>>> >
>>> > Can you elaborate on this remark?
>>> >
>>> > Should one choose between using CephFS vs S3 Storage (as it pertains
>>> to best practices)?
>>> >
>>> > On Proxmox, I am using both CephFS and RBD.
>>> >
>>> >
>>> > Regards,
>>> > Anthony Fecarotta
>>> > Founder & President
>>> > anth...@linehaul.ai
>>> > 224-339-1182 | (855) 625-0300
>>> > 1 Mid America Plz Flr 3, Oakbrook Terrace, IL 60181
>>> > www.linehaul.ai
>>> >
>>> > On Sun Apr 13, 2025, 04:28 PM GMT, Anthony D'Atri <
>>> anthony.da...@gmail.com> wrote:
>>> >>
>>> >>> On Apr 13, 2025, at 12:00 PM, Brendon Baumgartner <
>>> bren...@netcal.com> wrote:
>>> >>>
>>> >>>
>>> >>>> On Apr 11, 2025, at 10:13, gagan tiwari <
>>> gagan.tiw...@mathisys-india.com> wrote:
>>> >>>>
>>> >>>> Hi Anthony,
>>> >>>> We will be using Samsung SSD 870 QVO 8TB disks on
>>> >>>> all OSD servers.
>>> >>>
>>> >>> I’m a newbie to ceph and I have a 4 node cluster and it doesn’t have
>>> a lot of users so downtime is easily scheduled for tinkering. I started
>>> with consumer SSDs (SATA/NVMEs) because they were free and lying around.
>>> Performance was bad. Then just the NVMEs, still bad. Then enterprise SSDs,
>>> still bad (relative to DAS anyway).
>>> >>
>>> >> Real enterprise SSDs? Enterprise NVMe, not enterprise SATA? Sellers can
>>> lie sometimes. Also be sure to update firmware to the latest; that can make
>>> a substantial difference.
>>> >>
>>> >> Other factors include:
>>> >>
>>> >> * Enough hosts and OSDs. Three hosts with one OSD each aren’t going
>>> to deliver a great experience.
>>> >> * At least 6 GB of available physmem per NVMe OSD (sketched below).
>>> >> * How you measure: a 1 KB QD1 fsync workload is going to be more
>>> demanding than a buffered 64 KB QD32 workload.
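>>> >>
>>> >> A quick sketch of that memory rule of thumb in Python; the 22-drive
>>> >> node count is the hypothetical figure from earlier in this thread:
>>> >>
>>> >> # ~6 GB of available physmem per NVMe OSD (osd_memory_target defaults
>>> >> # to ~4 GB, plus headroom for the rest of the daemon and the OS).
>>> >> osds_per_node = 22      # hypothetical: matches the 22-drive nodes above
>>> >> gb_per_osd = 6
>>> >> print(f"suggested RAM per OSD node: {osds_per_node * gb_per_osd} GB")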
>>> >>>
>>> >>> Each step on the journey to enterprise SSDs made things faster. The
>>> problem with the consumer stuff is the latency. Enterprise SSDs are 0-2ms.
>>> Consumer SSDs are 15-300ms. As you can see, the latency difference is
>>> significant.
>>> >>
>>> >> Some client SSDs are “DRAMless”: they don’t have ~1 GB of onboard RAM
>>> per 1 TB of capacity for the LBA indirection table. This can be a substantial
>>> issue for enterprise workloads.
>>> >>
>>> >>>
>>> >>> So from my experience, I would say ceph is very slow in general
>>> compared to DAS. You need all the help you can get.
>>> >>>
>>> >>> If you want to use the consumer stuff, I would recommend making a
>>> slow tier (a 2nd pool with a different policy), or I suppose just expect it
>>> to be slow in general. I still have my consumer drives installed, just
>>> configured as a 2nd tier, which is unused right now because we have an old
>>> JBOD for the 2nd tier that is much faster.
>>> >>
>>> >> How many drives in each?
>>> >>>
>>> >>> Good luck!
>>> >>>
>>> >>> _BB
>>> >>>
>>> >>>
>>>
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
