> 
> Hi Anthony,
>                         We will be using Samsung SSD 870 QVO 8TB disks on
> all OSD servers.

Your choices are yours to make, but for what it’s worth, I would not use these.

* They are client-class drives, not designed for enterprise workloads or duty
cycles
* As best I can tell they lack PLP (power loss protection), which can result
in corrupted or lost data
* QLC can be just smurfy for object storage workloads that are read-mostly, but 
can be disappointing for RBD or small objects/files
* 3-year warranty instead of the 5 years typical for enterprise SKUs
* Slow writes once the SLC cache fills; these drives are designed for
intermittent desktop workloads, not sustained enterprise workloads
* Rated endurance for a 4KB random write workload is ~0.33 DWPD over the 3-year
warranty period, which normalized to the 5-year warranty period typical of
enterprise SKUs works out to ~0.20 DWPD (quick arithmetic below)

If you expect a low write workload and have VERY limited performance 
expectations, maybe they'd work for you, but especially don't think you can 
safely do replication size=2 or EC 2+1 or 3+1.  A few months ago someone in the 
community unilaterally sent me money *begging* me to make their cluster of 
these faster.  There was nothing I could do short of recommending that they be 
replaced with a more appropriate SKU.


> One more thing , I want to know is that CephFS supports  mounting with
> FsCache on clients ?

I find some references on the net to people doing this, but have zero 
experience with it.

>  500T data stored in the cluster will be accessed by
> the jobs running on the clients nodes and we need super fast read
> performance.

Client-class media are incompatible with super fast anything.  I don’t recall 
you mentioning the network — bonded 10GE at least?
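
For context, a rough read-bandwidth sanity check per OSD node, assuming ~550
MB/s per SATA SSD and ~1,250 MB/s for a single 10GbE link (both generic
assumptions, not figures from your spec):

    # Aggregate per-node media bandwidth vs. a single 10GbE NIC (assumed figures).
    ssds_per_node = 22
    sata_ssd_read_mb_s = 550        # assumed SATA 6Gb/s sequential read
    nic_10gbe_mb_s = 10_000 / 8     # 10 Gb/s ~= 1,250 MB/s

    media_bw = ssds_per_node * sata_ssd_read_mb_s   # ~12,100 MB/s
    print(f"{nic_10gbe_mb_s / media_bw:.0%} of node media bandwidth fits one 10GbE link")

Even before the QLC write cliff comes into play, a single 10GbE port carries
roughly a tenth of what 22 SATA SSDs can stream, so bonding (or 25GbE+)
matters if "super fast read" is the goal.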

> For that we do have additional cache disk installed on all the
> clients nodes. And the way NFS V4 supports mount NFS share with FsCache on
> clients' hosts ,CephFS also supports that.

You would do better to invest in enterprise-class media for the cluster than in 
band-aids that may or may not work well.


{Good,Fast,Cheap} Pick Any Two.

Trite but so often true.

> 
> On those  4x non-OSD nodes, I will probably run ldap and HTCondor service.
> But mds node will not be used for anything other than mds daemon.
> 
> Thanks,
> Gagan
> 
> 
> 
> On Fri, Apr 11, 2025 at 8:45 PM Anthony D'Atri <anthony.da...@gmail.com>
> wrote:
> 
>> 
>> 
>>> On Apr 11, 2025, at 4:04 AM, gagan tiwari <
>> gagan.tiw...@mathisys-india.com> wrote:
>>> 
>>> Hi Anthony,
>>>                      Thanks for the reply!
>>> 
>>> We will be using  CephFS  to access  Ceph Storage from clients.  So, this
>>> will need MDS daemon also.
>> 
>> MDS is single-threaded, so unlike most Ceph daemons it benefits more from
>> a high-frequency CPU than core count.
>> 
>>> So, based on your advice, I am thinking of having 4 Dell PowerEdge
>> servers
>>> . 3 of them will run 3 Monitor daemons and one of them  will run MDS
>>> daemon.
>>> 
>>> These Dell Servers will have following hardware :-
>>> 
>>> 1. 4 cores (  8 threads )  ( Can go for 8 core and 16 threads )
>>> 
>>> 2.  64G RAM
>>> 
>>> 3. 2x4T  Samsung SSD  with RAID 1 to install OS and run monitor and
>>> metadata services.
>> 
>> That probably suffices for a small cluster.  Are those Samsungs
>> enterprise?
>> 
>> 
>>> OSD nodes will be upgraded to have 32 cores ( 64 threads ).  Disk and RAM
>>> will remain same ( 128G and 22X8T Samsung SSD )
>> 
>> Which Samsung SSD?  Using client SKUs for OSDs has a way of leading to
>> heartbreak.
>> 
>> 64 threads would be better for a 22x OSD node, though still a bit light.
>> Are these SATA or NVMe?
>> 
>>> Actually , I want to use OSD nodes to run OSD damons and not any
>>> other demons and which is why I am thinking of having 4 additional Dell
>>> servers as mentioned above.
>> 
>> Colocation of daemons is common these days, especially with smaller
>> clusters.
>> 
>>> 
>>> Please advise if this plan will be better.
>> 
>> That’ll work, but unless you already have those quite-modest 4x non-OSD
>> nodes sitting around idle you might consider just going with the OSD nodes
>> and bumping the CPU again so you can colocate all the daemons.
>> 
>>> 
>>> Thanks,
>>> Gagan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Wed, Apr 9, 2025 at 8:12 PM Anthony D'Atri <anthony.da...@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>>> 
>>>>> We would start deploying Ceph with 4 hosts ( HP Proliant servers ) each
>>>>> running RockyLinux 9.
>>>>> 
>>>>> One of the hosts called ceph-adm will be smaller one and will have
>>>>> following hardware :-
>>>>> 
>>>>> 2x4T SSD  with raid 1 to install OS on.
>>>>> 
>>>>> 8 Core with 3600MHz freq.
>>>>> 
>>>>> 64G  RAM
>>>>> 
>>>>> We are planning to run all Ceph daemons except OSD daemon like monitor
>> ,
>>>>> metadata ,etc on this host.
>>>> 
>>>> 8 core == 16 threads? Are you provisioning this node because you have it
>>>> laying around idle?
>>>> 
>>>> Note that you will want *at least* 3 Monitor (monitors) daemons, which
>>>> must be on different nodes.  5 is better, but at least 3. You’ll also
>> have
>>>> Grafana, Prometheus, MDS (if you’re going to CephFS vs using S3 object
>>>> storage or RBD block)
>>>> 
>>>> 8c is likely on the light side for all of that.  You would also benefit
>>>> from not having that node be a single point of failure.  I would
>> suggest if
>>>> you can raising this node to the spec of the planned 3x OSD nodes so you
>>>> have 4x equivalent nodes, and spread that non-OSD daemons across them.
>>>> 
>>>> Note also that your OSD nodes will also have node_exporter, crash, and
>>>> other boilerplate daemons.
>>>> 
>>>> 
>>>>> We will have 3 hosts to run OSD which will store actual data.
>>>>> 
>>>>> Each OSD host will have following hardware
>>>>> 
>>>>> 2x4T SSD  with raid 1 to install OS on.
>>>>> 
>>>>> 22X8T SSD  to store data ( OSDs ) ( without partition ). We will use
>>>> entire
>>>>> disk without partitions
>>>> 
>>>> SAS, SATA, or NVMe SSDs?  Which specific model?  You really want to
>> avoid
>>>> client (desktop) models for Ceph, but you likely do not need to pay for
>>>> higher endurance mixed-use SKUs.
>>>> 
>>>>> Each OSD host will have 128G RAM  ( No swap space )
>>>> 
>>>> Thank you for skipping swap.  Some people are really stuck in the past
>> in
>>>> that regard.
>>>> 
>>>>> Each OSD host will have 16 cores.
>>>> 
>>>> So 32 threads total?  That is very light for 22 OSDs + other daemons.
>> For
>>>> HDD OSDs a common rule of thumb is at minimum 2x threads per, for
>> SAS/SATA
>>>> SSDs, 4, for NVMe SSDs 6.  Plus margin for the OS and other processes.
>>>> 
>>>>> All 4 hosts will connect to each via 10G nic.
>>>> 
>>>> Two ports with bonding? Redundant switches?
>>>> 
>>>>> The 500T data
>>>> 
>>>> The specs you list above include 528 TB of *raw* space.  Be advised that
>>>> with three OSD nodes, you will necessarily be doing replication.  For
>>>> safety replication with size=3.  Taking into consideration TB vs TiB and
>>>> headroom, you’re looking at 133TiB of usable space.  You could go with
>>>> size=2 to get 300TB of usable space, but at increased risk of data
>>>> unavailability or loss when drives/hosts fail or reboot.
>>>> 
>>>> With at least 4 OSD nodes - even if they aren’t fully populated with
>>>> capacity drives — you could do EC for a more favorable raw:usable
>> ratio, at
>>>> the expense of slower writes and recovery.  With 4 nodes you could in
>>>> theory do 2,2 EC for 200 TiB of usable space, with 5 you could do 3,2
>> for
>>>> 240 TiB usable, etc.
>>>> 
>>>>> will be accessed by the clients. We need to have
>>>>> read performance as fast as possible.
>>>> 
>>>> Hope your SSDs are enterprise NVMe.
>>>> 
>>>>> We can't afford data loss and downtime.
>>>> 
>>>> Then no size=2 for you.
>>>> 
>>>>> So, we want to have a Ceph
>>>>> deployment  which serves our purpose.
>>>>> 
>>>>> So, please advise me if the plan that I have designed will serve our
>>>>> purpose.
>>>>> Or is there a better way , please advise that.
>>>>> 
>>>>> Thanks,
>>>>> Gagan
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> We have a HP storage server with 12 SSD of 5T each and have set-up
>>>> hardware
>>>>> RAID6 on these disks.
>>>>> 
>>>>> HP storage server has 64G RAM and 18 cores.
>>>>> 
>>>>> So, please advise how I should go about setting up Ceph on it to have
>>>> best
>>>>> read performance. We need fastest read performance.
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> Gagan
>>>> 
>>>> 
>> 
>> 
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
