These things are always controversial; below are my thoughts.  Your mileage may 
vary.

> We plan to buy hardware for a new Ceph cluster and would like some 
> feedback on what we chose.
> 
> 6 nodes DELL 6715 + 6 Powervault MD2412 enclosure

I would not recommend using external drive arrays:

* You're paying money for an HBA
* Proprietary tool for managing the thing; granted, this one looks like a JBOD 
rather than one with an embedded RAID controller, which I've wrangled with before
* 24 Gb/s SAS is still SAS.  SAS or SATA purchases today pretty much lock you 
into HDDs in the future.

> 
> For each node =>
> 
> 1 CPU AMD EPYC 9475F 3.65 GHz, 48C/96T, 256M Cache (400 W)

Yikes.  Higher-frequency CPUs or those with the highest core counts tend to cost 
more in terms of oomph per euro than middle-of-the-road SKUs.  Note the 
significant list price difference compared to, say, the 9455P.  For many 
purposes, solve for $ / (cores * base clock).  ServeTheHome posts nice value 
analysis charts.  Heck, if a single-socket system meets your needs for e.g. RAM 
capacity and PCIe lanes, consider the 9555P.
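
Here's the kind of back-of-the-envelope comparison I mean, as a tiny Python 
sketch.  The prices and the 9455P base clock below are placeholders, not real 
quotes -- plug in the numbers your vendor actually gives you:

    # Value metric: currency units per core-GHz.  Lower is better.
    skus = {
        # name: (list_price, cores, base_ghz)  -- placeholder figures
        "EPYC 9475F": (4800, 48, 3.65),
        "EPYC 9455P": (3100, 48, 3.15),
    }

    for name, (price, cores, base) in skus.items():
        print(f"{name}: {price / (cores * base):.2f} per core-GHz")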

> RAM 16 GB x 16 = 256 GB

For 12 OSDs you would likely be fine with 128 GB, leaving slots empty for expansion.
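
Quick sanity check, assuming Ceph's default osd_memory_target of 4 GiB per OSD; 
the headroom figure is just my guess for OS, colocated daemons, and page cache:

    osd_count = 12
    osd_memory_target_gib = 4      # Ceph default
    headroom_gib = 16              # assumption: OS, mon/mgr, page cache
    print(osd_count * osd_memory_target_gib + headroom_gib, "GiB")  # 64 GiB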

> 4x NVIDIA Mellanox 100 GbE

Why 4x?  With 12 HDD OSDs you'd be well served by 2x 25 GbE, skipping a 
separate replication (cluster) network.
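
Rough aggregate-throughput math, assuming an optimistic ~250 MB/s sequential 
rate per HDD:

    hdd_count = 12
    per_hdd_mb_s = 250                          # optimistic sequential rate
    gbit = hdd_count * per_hdd_mb_s * 8 / 1000
    print(f"~{gbit:.0f} Gbit/s peak from the spinners")   # ~24 Gbit/s

Two 25 GbE ports cover that with room to spare, even without a separate 
cluster network.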

> HBA465e (external, 22.5 Gb/s)
> 
> 2 NVMe mixed-use 6.4 TB for DB/WAL
> 6 NVMe mixed-use 6.4 TB

Mixed-use SSDs are usually overkill.  They're typically the same hardware as 
read-intensive SSDs with the overprovisioning slider (and the price) bumped.
"Read-intensive" and "mixed-use" are marketing terms; I worked for the product 
/ marketing team of an SSD manufacturer, trust me on this ;)



> 5 NVMe read-intensive 15.36 TB
> 
> for each enclosure =>
> 
> 12 HDD 20 TB
> 
> The HDDs will be a replica 3 pool with the DB/WAL

With modern NVMe SSDs, I might suggest 1x 7.68 TB for DB/WAL offload.  You can 
go 12:1 (one NVMe device serving all twelve HDD OSDs).
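
The split works out comfortably; quick math assuming a 7.68 TB device shared 
by all twelve HDD OSDs:

    nvme_tb = 7.68
    hdd_osds = 12
    per_osd_gb = nvme_tb * 1000 / hdd_osds
    print(f"~{per_osd_gb:.0f} GB of DB/WAL per OSD")  # ~640 GB, ~3% of a 20 TB spinner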


> The mixed-use and read-intensive NVMe will be 2 different pools (we will test 
> replica 3 and EC to see which performance/storage-efficiency trade-off 
> satisfies us the most)

Why two different, small pools?  You'd be better off getting all read-intensive 
NVMe SSDs and, to keep things simple, making all of them the same capacity.  You 
would benefit from having more OSDs in a single pool, especially since this is 
a small cluster.


> This cluster will mostly be used to store block (Proxmox/Kubernetes VMs)

On the SSDs, I presume?  That workload on HDDs, especially 20 TB ones, is 
unlikely to perform adequately.


> and S3.

HDDs just for the bucket pool, right? Everything else on the SSDs?


> The CRUSH map will be like so: 3 rooms with 2 nodes per room.
> 
> So the failure domain can be room for replica 3, and host if we use EC 4+2

With Tentacle's fast EC, you *might* find that RBD-on-EC performance is 
acceptable; worth testing.

> 
> We saw that the thread requirements for NVMe OSDs are very high; is this CPU 
> good enough to carry them?

Be sure to read those references carefully.  The one that describes gains up to 
14 threads per OSD is from a very specific context.

So let's say you have 8x 15 TB NVMe RI OSDs + 12x HDD OSDs.  I might budget for 
6 and 4 threads each, respectively, with some left over for mon, mgr, rgw, etc.  
The 96-thread CPU is a bit tight given that math, but I think you'd probably be 
okay.
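
The math I'm doing there, spelled out; the per-OSD thread counts are my budget 
figures, not hard requirements:

    nvme_osds, hdd_osds = 8, 12
    osd_threads = nvme_osds * 6 + hdd_osds * 4     # 48 + 48 = 96
    daemon_threads = 8                             # assumption: mon, mgr, rgw, OS
    print(osd_threads, "+", daemon_threads, "vs 96 hardware threads")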

> 
> MONs and MGRs will be spread across the cluster and RGWs on virtual machines

Oh, RGWs on VMs would alleviate the CPU appetite a bit.  This is not an unknown 
strategy.

I would personally consider an all-NVMe chassis with a mix of TLC SSDs and 
QLC / QLC-class SSDs for the bucket pool.  You'd use fewer RUs and your recovery 
times would be dramatically improved.  Consider how long it takes to write a 
full 20 TB HDD.
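
Back-of-the-envelope, assuming an optimistic 250 MB/s sustained write (real 
backfill is far slower than a straight sequential fill):

    capacity_tb = 20
    write_mb_s = 250
    hours = capacity_tb * 1_000_000 / write_mb_s / 3600
    print(f"~{hours:.0f} hours minimum to refill one drive")  # ~22 hours; days in practice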



> 
> Thanks for your answer!
> 
> Vivien
> 
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]