> 5x 7.68TB Data Center NVMe Read Intensive AG Drive U.2 with carrier
> 10x 3.84TB Data Center NVMe Read Intensive AG Drive U.2 with carrier

> Public Network: 2x25G ports as Bond0
> Cluster Network: 2x25G ports as Bond1

Did you check the read/write throughput of those NVMe drives? It seems
like you may be bottlenecked on the network.

We deploy nodes with 4x NVMe and can push around 80-90 Gbit/s.
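
As a rough sanity check (a back-of-the-envelope sketch; the per-drive
throughput figure is an assumption for Gen4 read-intensive NVMe, not
taken from the spec above):

    # Aggregate NVMe read bandwidth vs. a 2x25G bond, in Gbit/s.
    # ~6.5 GB/s sequential read per drive is assumed; substitute the
    # datasheet figure or an fio measurement for your actual drives.
    drives = 5                       # the 5x 7.68TB layout
    per_drive_gbyte_s = 6.5          # assumption
    nvme_gbit_s = drives * per_drive_gbyte_s * 8
    bond_gbit_s = 2 * 25             # best case for the LACP bond

    print(f"NVMe aggregate: {nvme_gbit_s:.0f} Gbit/s")  # ~260
    print(f"Network bond:   {bond_gbit_s} Gbit/s")      # 50

The flash can source roughly five times what the bond can carry, so the
network is the likely ceiling.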

On Wed, Jul 9, 2025 at 11:24 PM Pripriya <pipriya1...@gmail.com> wrote:

> Hello Anthony D'Atri
> <https://lists.ceph.io/hyperkitty/users/8185fca310134bbc9ca3fef8ca01866d/>
> Thanks for your detailed reply.
>
> 1 - Yes, we have Proxmox nodes, but as you said they don't use any
> shared storage right now and run as standalone single-node Proxmox
> hosts. Now I have a task to move them into a clustered Proxmox setup
> with shared storage.
> We are deploying external Ceph to avoid depending on a single node in a
> hyperconverged setup; we also want to use Ceph for services other than
> Proxmox, so we think it's better to go with an external Ceph cluster.
>
> Sorry, I don't think I explained my Ceph cluster correctly.
>
> *Total 5 nodes,* where I want to colocate services:
> 3 nodes will have MON, MGR & OSD services colocated.
> 2 nodes will primarily run the OSD service but can be expanded to other
> services if needed, as we are planning to go with similar hardware
> specs for all the nodes.
>
> What do you think we should reserve in cores and RAM per OSD and per
> service, given that we want to populate the OSDs up to full capacity in
> the long run (24-bay NVMe chassis)?
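>
> As a rough sketch of that math (using the rules of thumb quoted below,
> 8GB RAM and 2 cores per OSD; the mon/mgr figures are assumptions,
> since Anthony suggests allotting more than 4GB):
>
>     # Per-node reservation for a fully populated 24-bay NVMe chassis.
>     osds = 22                          # 24 bays minus 2 OS drives
>     ram_gb = 16 + osds * 8 + 2 * 8     # OS + OSDs + mon/mgr at 8GB each
>     cores = 2 + osds * 2 + 2 * 2       # OS + OSDs + mon/mgr
>     print(ram_gb, "GB RAM,", cores, "cores")   # 208 GB RAM, 50 cores
>
> So a full chassis wants roughly 256GB of RAM and a CPU with 50+
> threads, which fits the 32C/64T parts discussed here.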
>
> Sure, I will explore the Dell R7615 with the 9454, or more likely a
> 32-core part (AMD EPYC 9334 2.70GHz, 32C/64T) because of cost.
>
> *RAM per node:*
> We are going with 32GB DIMMs, which will allow further capacity
> increases in the future (for now 32GB x 4 = 128GB).
>
> *OSDs per node:*
> 5x 7.68TB Data Center NVMe Read Intensive AG Drive U.2 with carrier
>
> OR
> 10x 3.84TB Data Center NVMe Read Intensive AG Drive U.2 with carrier
>
> Which one is better?
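>
> Capacity-wise the two are identical; a quick check against the ~60 TB
> goal (the 85% headroom factor follows the ratio used elsewhere in this
> thread):
>
>     # Both layouts are 38.4 TB raw per node; 5 nodes, 3x replication.
>     raw_per_node = 5 * 7.68                # == 10 * 3.84
>     usable = raw_per_node * 5 / 3 * 0.85
>     print(f"{usable:.1f} TB usable")       # ~54.4 TB
>
> The real difference is OSD count: 10 smaller drives mean more
> parallelism per node, but also more CPU and RAM reserved, per the
> sizing above.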
>
> *Networking:*
> Public Network: 2x25G ports as Bond0
> Cluster Network: 2x25G ports as Bond1
> (Proxmox will also have 2x25G ports in a bond)
>
> On Wed, Jul 9, 2025 at 8:26 PM Alex Gorbachev <a...@iss-integration.com>
> wrote:
>
> > Completely agreeing with what Anthony wrote; we see very good results
> > with at least 4 physical OSD nodes, managed and deployed by cephadm.
> > You will have 3 MONs and MGRs "hyperconverged" in the cephadm sense,
> > and run 3x replication for OSDs with an extra OSD host for n+1
> > redundancy.
> >
> > Proxmox just needs a network and a keyring to talk to this cluster.
> > You can run deployment and automation functions from a VM in Proxmox
> > that runs on local storage.
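> >
> > A minimal sketch of that keyring step (the pool and client names here
> > are hypothetical):
> >
> >     ceph auth get-or-create client.proxmox \
> >         mon 'profile rbd' \
> >         osd 'profile rbd pool=vm-pool'
> >
> > Hand the resulting key plus the mon addresses to Proxmox's storage
> > configuration.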
> >
> > --
> > Alex Gorbachev
> > https://alextelescope.blogspot.com
> >
> >
> >
> > On Wed, Jul 9, 2025 at 10:28 AM Anthony D'Atri <a...@dreamsnake.net>
> > wrote:
> >
> >>
> >> >
> >> > I am new to this thread and would like to get some suggestions on
> >> > building a new external Ceph cluster
> >>
> >> Why external?  Many Proxmox deployments are converged.  Is this an
> >> existing Proxmox cluster that currently does not use shared storage?
> >>
> >>
> >> > which will be the backend for Proxmox VMs
> >> >
> >> > I am planning to start with 5 nodes (3 MON & 2 OSD)
> >>
> >> This is not the best plan.
> >>
> >> If your data is not disposable you will want to maintain the default 3
> >> copies, which you cannot safely do on 2 OSD nodes.
> >>
> >> When deploying a very small cluster solve first for the number of nodes.
> >> You need at least 3 OSD nodes, 4 has advantages.
> >>
> >> So in your case, go converged: OSDs on all 5 nodes, and add the
> >> mon/mgr/etc ceph orch labels to all 5 so that when a node is down a
> >> replacement may be spun up.
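> >>
> >> A sketch of that labeling (hostnames are placeholders):
> >>
> >>     ceph orch host label add node1 mon    # repeat for node2..node5
> >>     ceph orch host label add node1 mgr
> >>     ceph orch apply mon --placement="label:mon"
> >>     ceph orch apply mgr --placement="label:mgr"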
> >>
> >> This would also let you deploy 5 mon instances instead of 3, which is
> >> advantageous in that you can ride out 2 mon failures without
> >> disruption (a quorum of 5 still has its majority of 3 left).
> >>
> >> > and I am expecting to start with ~60+ TB usable space.
> >>
> >> That would mean (3 * 60) / 0.85 = 211.76, i.e. ~212 TB of raw
> >> capacity; let’s see how that matches your numbers below.
> >>
> >> > estimated Storage Specs Calculator:
> >> >
> >> > RAM: 8GB/OSD Daemon, 16GB OS, 4GB for Mon & MGR, 16GB for MDS
> >>
> >> I would allot more than 4GB for mon/mgr.
> >>
> >> > CPU: 2 cores/OSD, 2 cores for OS, 2 cores per service
> >>
> >> Cores or hyperthreads?  Either way these numbers are low.
> >>
> >> > *Dell R7625, 5 nodes to start with*
> >>
> >> Dramatic overkill for a mon/mgr/MDS node.
> >>
> >> > - RAM: 128G (Plan to increase later as needed)
> >>
> >> I suggest 32GB DIMMs to maximize potential for future expansion.
> >>
> >> > - CPU: 2x AMD EPYC 9224 2.50GHz, 24C/48T, 64M Cache (200W) DDR5-4800
> >>
> >> 96 threads total per server.
> >>
> >> > - Chassis configuration: 24x 2.5in NVMe
> >>
> >> You’ll be tempted to fill those slots; each OSD past, say, 12 will
> >> decrease performance due to having to share the vcores/threads.
> >> With the above CPU choice I would go with the R7615 to save rack
> >> space, or bump up the CPU. The 9224 is the default choice on Dell’s
> >> configurator but there are lots of others available. The 9454, for
> >> example, would give you enough cores to more comfortably service an
> >> eventual 24 OSDs.
> >>
> >> Alternately consider the R7615 with, say, the 9654P. The P CPUs can’t
> >> be used in a dual-socket motherboard, so they’re usually a bit
> >> cheaper for the same specs.
> >>
> >> With EPYC CPUs you can get better performance by disabling IOMMU on the
> >> kernel command line via GRUB defaults.
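> >>
> >> For example (a sketch; verify the exact parameter for your kernel,
> >> amd_iommu=off being the commonly cited one for EPYC):
> >>
> >>     # /etc/default/grub
> >>     GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=off"
> >>
> >> then regenerate the GRUB config (update-grub on Debian-family
> >> systems) and reboot.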
> >>
> >>
> >> > - 2x 1.92TB Data Center NVMe Read Intensive AG Drive U.2 Gen4 with
> >> > carrier (OS disk, I need extra space)
> >>
> >> Okay so that will limit you to 22 OSDs with the 24-bay chassis.  You
> >> could provision BOSS-N1 for M.2 boot though.
> >>
> >> > - 5x 7.68TB Data Center NVMe Read Intensive AG Drive U2 Gen4 with
> >> Carrier
> >> > 24Gbps 512e 2.5in Hot-Plug 1DWPD , AG Drive
> >>
> >> I think you have a copy/paste error there.  The second line above sounds
> >> like a SAS SSD.
> >>
> >> So from what you wrote above, this would mean a total of 10x 7.68TB
> >> OSD drives. With 3x replication and the default headroom ratios these
> >> will give you about 22 TB of usable space, which is just 20 TiB.
> >>
> >> > - 2x Nvidia ConnectX-6 Lx Dual Port 10/25GbE SFP28, No Crypto, PCIe
> >> > Low Profile
> >>
> >> I suggest bonding them rather than running a separate replication
> >> network. Some people will use one port for public and the other for
> >> replication, but for multiple reasons that wouldn’t be ideal.
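> >>
> >> A sketch of such a bond in ifupdown terms (interface names and the
> >> address are placeholders; assumes LACP is configured on the switch
> >> side):
> >>
> >>     auto bond0
> >>     iface bond0 inet static
> >>         address 10.0.0.11/24
> >>         bond-slaves enp65s0f0 enp65s0f1
> >>         bond-mode 802.3ad
> >>         bond-miimon 100
> >>         bond-xmit-hash-policy layer3+4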
> >>
> >> >
> >> > - 1G for IPMI
> >> >
> >> > Please help me finalize these specs.
> >> >
> >> > Thanks
> >>
> >
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
