On Thu, Jul 10, 2025 at 10:26 PM Anthony D'Atri <a...@dreamsnake.net> wrote:

>
> > Now I have a task to move them into a clustered proxmox and use shared
> > storage.
>
> Gotcha, I’ve seen that before and suspect that’s what was going on.
>
> > Deploying external ceph to avoid dependency on single node in
> > hyperconverged setup, also we wanted to use ceph for other services apart
> > from ceph so we think its better to go with external ceph.
>
> Agreed.
>
> > Sorry I think I didn't explain it correctly about my ceph cluster.
> >
> > *Total 5 Nodes: *where I want to colocate services
> > 3 Nodes will have MON,MGR & OSD Services colocated
> > 2 Node will be primary for OSD service but if needed can expand it for
> > other services as we are planning to go with similar specs hardware for
> all
> > the nodes.
>
> Ah.  That isn’t what was implied by what you wrote originally and I wanted
> to help you avert disaster.
>
> > What do you think we should do with core and ram reservation for per
> osd's
> > and per services where we want to populate osd's upto full capacity in
> > longer run (24 NVME chassis)?
>
> Unless I’m missing something, you won’t have reservations as such with a
> standalone Ceph cluster.
> If you mean what you equip the node with, people have varying rules of
> thumb.
>
> I would suggest 6 vcores/hyperthreads and 8GB per NVMe OSD.
>
>
Thanks for the suggestion on cores and RAM per OSD. Since I am still in the
hardware planning phase I can request this change now, and will try to equip
the nodes as generously as possible.
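The ~6 vcores / 8 GB per NVMe OSD rule of thumb above can be turned into a quick per-node sizing check. This is only a sketch; the base allowances for the OS and colocated mon/mgr daemons are illustrative assumptions, not fixed requirements:

```python
# Rough per-node sizing from the rule of thumb in this thread:
# ~6 vcores (hyperthreads) and ~8 GB RAM per NVMe OSD, plus a base
# allowance for the OS and colocated mon/mgr daemons (assumed values).

def node_requirements(num_osds: int,
                      vcores_per_osd: int = 6,
                      ram_gb_per_osd: int = 8,
                      base_vcores: int = 4,    # OS + mon/mgr (assumption)
                      base_ram_gb: int = 16):  # OS + mon/mgr (assumption)
    """Return (vcores, ram_gb) a node should have for num_osds NVMe OSDs."""
    return (base_vcores + num_osds * vcores_per_osd,
            base_ram_gb + num_osds * ram_gb_per_osd)

# 5 OSDs per node today:
print(node_requirements(5))    # (34, 56) -> fits 64 threads / 128 GB
# A full 24-bay chassis:
print(node_requirements(24))   # (148, 208) -> would need far bigger CPUs
```

This makes it easy to see why filling all 24 bays later would outgrow a single mid-range CPU and 128 GB of RAM.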

> > Sure, I will explore the Dell R7615 with the 9454, or for cost reasons the
> > 32-core AMD EPYC 9334 (2.70GHz, 32C/64T).
>
> Just be clear about cores vs threads, it’s super easy to mix them up.
> With Ceph we mostly think in terms of vcores aka hyperthreads, which on
> most CPUs are 2x per physical core.
>
Yes, right, I am counting in hyperthreads.

> >
> > *RAM per node:*
> > We are going with 32GB DIMMs  which will allow more capacity increase in
> > future (for now 32*4=128G)
>
> Nice.  Back in the depths of time I was tasked with ordering a Sun 4/110.
> Minimum orderable RAM was 8MB.  The system had 32 slots; I figured they
> would send 8x 1MB.  Nope.  They sent 32x 256KB, filling all the slots, so
> expansion would have meant pulling low-density modules that would not have
> been useful elsewhere.  Today’s units are a thousand times larger but the
> potential is still there.
>
Agreed. I will have enough slots in my case to extend to 512+ GB.

> *OSD's per node:*
> > 5x 7.68TB Data Center NVMe Read Intensive AG Drive U2 with carrier
> >
> > OR
> > 10x3.84TB Data Center NVMe Read Intensive AG Drive U2 with Carrier
> >
> > Which one is better ?
>
> At small scale there’s a certain advantage to having more OSDs, but the
> 3.84 TB SSDs are prone to the same phenomenon as low-density memory
> modules, assuming future expansion.  With the 7.68 TB SSDs your cluster
> would have 25x OSDs right?  I suspect that would be okay for your use-case,
> so I’d probably prefer the larger so you have more potential for expansion
> without having to buy servers.
>
I am inclined toward the 7.68TB NVMe drives, which will allow me more
expansion space in the future, with:
Chassis Configuration: 2.5" Chassis with up to 24 NVMe Switched HWRAID
Drives, Dual Controller, Front PERC 12
Hard Drives (PCIe SSD/Flex Bay): 7.68TB Enterprise NVMe Read Intensive AG
Drive U.2 Gen4 with carrier
Right now I am aiming for 5 OSDs per node, so it is going to be 25 OSDs in
total.
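A quick sanity check of the usable space this initial layout gives. This is a sketch: the 0.85 usable fraction is a planning assumption (headroom below the near-full/full ratios), matching the factor used earlier in the thread:

```python
# Usable-capacity estimate for the plan above: 25x 7.68 TB NVMe,
# 3x replication, ~0.85 usable fraction for headroom (assumption).

drives = 25
drive_tb = 7.68          # decimal TB per drive
replication = 3
headroom = 0.85

raw_tb = drives * drive_tb
usable_tb = raw_tb / replication * headroom
usable_tib = usable_tb * 1e12 / 2**40   # decimal TB -> binary TiB

print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.1f} TB "
      f"(~{usable_tib:.1f} TiB)")
```

That lands at roughly 192 TB raw and ~54 TB usable, comfortably above the ~60 TB target once a few more drives are added.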

> You almost certainly don’t need “mixed use” (MU) models so the above are
> fine.
>
> With spinners there is longstanding conventional wisdom that more spindles
> are better, and even the practice of short-stroking to limit long seeks. To
> a somewhat lesser extent this applies to SAS/SATA SSDs as well.
>
> NVMe SSDs are much less prone to such bottlenecks, especially at PCIe Gen
> 4+, so unless you get into the 60+TB SKU territory my sense is that
> conserving expansion slots is usually the thing to solve for, unless your
> initial footprint is *really* tiny, like only 5 OSDs.
>
As above, initially it will be 5 OSDs per node (25 OSDs total); with future
expansion this can go higher, to ~15 OSDs per node.
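That expansion target is worth checking against the ~6 vcores per NVMe OSD rule of thumb from earlier in the thread. A rough sketch (the single 32C/64T CPU is the 9454 option discussed above):

```python
# Check the planned expansion (~15 OSDs/node) against the thread's
# ~6 vcores per NVMe OSD rule of thumb.

vcores_per_osd = 6
threads_single_9454 = 64    # 1x AMD EPYC 9454: 32C/64T

for osds in (5, 15):
    need = osds * vcores_per_osd
    verdict = "fits" if need <= threads_single_9454 else "exceeds"
    print(f"{osds} OSDs/node need ~{need} vcores "
          f"({verdict} a single 32C/64T CPU)")
```

At 15 OSDs per node the OSDs alone would want ~90 vcores, so the expansion plan implies either a dual-socket configuration or a higher-core-count CPU.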

> >
> > *Networking:*
> > Public Network: 2x25G port as Bond0
> > Cluster Network: 2x25G Port as Bond1
> > (Proxmox will also have 2x25G port in bond)
>
> Ah, so you’ll have 4, or 6 physical ports per server?  I personally prefer
> to not have a cluster network, but if you have the Capex to deploy it,
> that’s fine.
>
Yes, I have 4x 25G ports.
I would like a separate cluster network to keep replication traffic off the
public network and away from clients; some of our applications need
significant network bandwidth, so we are fine with adding extra NIC cards.
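The bandwidth arithmetic behind that choice can be sketched briefly. The ~7 GB/s per-drive figure is a typical PCIe Gen4 NVMe datasheet number used here as an assumption:

```python
# Why a separate cluster network can make sense here: a 2x25GbE bond
# tops out near 6 GB/s, while each PCIe Gen4 NVMe drive can stream
# several GB/s on its own (7 GB/s per drive is an assumed figure).

bond_gbps = 2 * 25                  # bonded line rate, Gbit/s
bond_gbytes = bond_gbps / 8         # ~6.25 GB/s
per_drive_gbytes = 7.0              # assumed Gen4 sequential read
drives_per_node = 5

print(f"bond: ~{bond_gbytes:.2f} GB/s, "
      f"local NVMe potential: ~{per_drive_gbytes * drives_per_node:.0f} GB/s")
```

Even a single drive can come close to saturating the bond, so keeping replication on its own pair of ports helps avoid starving clients.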

> > On Wed, Jul 9, 2025 at 8:26 PM Alex Gorbachev <a...@iss-integration.com>
> > wrote:
> >
> >> Completely agreeing with what Anthony wrote, and we see very good
> results
> >> with at least 4 physical OSD nodes, managed and deployed by cephadm -
> you
> >> will have 3 MONs and MGRs "hyperconverged" in cephadm sense, and run 3x
> >> replication for OSD with an extra OSD host for n+1 redundancy.
> >>
> >> Proxmox just needs a network and keyring to talk to this cluster.  You
> can
> >> run deployment and automation functions from a VM in Proxmox that runs
> on
> >> local storage.
> >>
> >> --
> >> Alex Gorbachev
> >> https://alextelescope.blogspot.com
> >>
> >>
> >>
> >> On Wed, Jul 9, 2025 at 10:28 AM Anthony D'Atri <a...@dreamsnake.net>
> wrote:
> >>
> >>>
> >>>>
> >>>> I am new to this thread would like to get some suggestions to build
> new
> >>>> external ceph  cluster
> >>>
> >>> Why external?  Many Proxmox deployments are converged.  Is this an
> >>> existing Proxmox cluster that currently does not use shared storage?
> >>>
> >>>
> >>>> which will backend for proxmox VM's
> >>>>
> >>>> I am planning to start with 5 Nodes(3 Mon & 2 OSD)
> >>>
> >>> This is not the best plan.
> >>>
> >>> If your data is not disposable you will want to maintain the default 3
> >>> copies, which you cannot safely do on 2 OSD nodes.
> >>>
> >>> When deploying a very small cluster solve first for the number of
> nodes.
> >>> You need at least 3 OSD nodes, 4 has advantages.
> >>>
> >>> So in your case, go converged: OSDs on all 5 nodes, and add the
> >>> mon/mgr/etc ceph orch labels to all 5 so that when a node is down a
> >>> replacement may be spun up.
> >>>
> >>> This would also let you deploy 5 mon instances instead of 3, which is
> >>> advantageous in that you can ride out 2 failures without disruption.
> >>>
> >>>> and I am expecting to start with ~60+ TB usable space.
> >>>
> >>> That would mean (3 * 60) / 0.85 = 211.765 ≈ 212 TB of raw capacity; let’s
> >>> see how that matches your numbers below.
> >>>
> >>>> estimated Storage Specs Calculator:
> >>>>
> >>>> RAM: 8GB/OSD Daemon, 16GB OS, 4GB for Mon & MGR, 16GB for MDS
> >>>
> >>> I would allot more than 4GB for mon/mgr.
> >>>
> >>>> cpu: 2 core/osd, 2 core for os, 2 core per services
> >>>
> >>> Cores or hyperthreads?  Either way these numbers are low.
> >>>
> >>>> *Dell R7625 5 Node to start with *
> >>>
> >>> Dramatic overkill for a mon/mgr/MDS node.
> >>>
> >>>> - RAM: 128G (Plan to increase later as needed)
> >>>
> >>> I suggest 32GB DIMMs to maximize potential for future expansion.
> >>>
> >>>> - CPU: 2x AMD EPYC 9224 2.50GHz, 24C/48T, 64M Cache (200W) DDR5-4800
> >>>
> >>> 96 threads total per server.
> >>>
> >>>> - Chassis Configuration 24x2.5 NVME
> >>>
> >>> You’ll be tempted to fill those slots; each OSD past, say, 12 will
> >>> decrease performance due to having to share the vcores/threads.
> >>> With the above CPU choice I would go with the R7615 to save rack space,
> >>> or bump up the CPU. The 9224 is the default choice on Dell’s
> configurator
> >>> but there are lots of others available. The 9454 for example would
> give you
> >>> enough cores to more comfortably service an eventual 24 OSDs.
> >>>
> >>> Alternately consider the R7615 with, say, the 9654P. The P CPUs can’t
> be
> >>> used in a dual-socket motherboard, so they’re usually a bit cheaper
> for the
> >>> same specs.
> >>>
> >>> With EPYC CPUs you can get better performance by disabling IOMMU on the
> >>> kernel command line via GRUB defaults.
> >>>
> >>>
> >>>> - 2x1.92TB Data Center NVMe Read Intensive AG Drive U2 Gen4 with
> >>> carrier (
> >>>> OS Disk, I need extra space)
> >>>
> >>> Okay so that will limit you to 22 OSDs with the 24-bay chassis.  You
> >>> could provision BOSS-N1 for M.2 boot though.
> >>>
> >>>> - 5x 7.68TB Data Center NVMe Read Intensive AG Drive U2 Gen4 with
> >>> Carrier
> >>>> 24Gbps 512e 2.5in Hot-Plug 1DWPD , AG Drive
> >>>
> >>> I think you have a copy/paste error there.  The second line above
> sounds
> >>> like a SAS SSD.
> >>>
> >>> So from what you wrote about this would intend a total of 10x 7.68TB
> OSD
> >>> drives.  With 3x replication and the default headroom ratios these will
> >>> give you about 22 TB of usable space, which is just 20 TiB.
> >>>
> >>>> - 2x Nvidia ConnectX-6 Lx Dual Port 10/25GbE SFP28, No Crypto, PCIe
> Low
> >>>> Profile
> >>>
> >>> I suggest bonding them and not having an optional replication network.
> >>> Some people will use one port for public and the other for
> replication, but
> >>> for multiple reasons that wouldn’t be ideal.
> >>>
> >>>>
> >>>> - 1G for IPMI
> >>>>
> >>>> Please help me finalize these specs.
> >>>>
> >>>> Thanks
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@ceph.io
> >>>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >>
>
>
