Thanks for your answer!


>These things are always controversial, below are my thoughts.  Your mileage 
>may vary.

>> We plan to buy hardware for a new cluster ceph and would like some 
>> approbation on what we choose.
>>
>> 6 nodes DELL 6715 + 6 Powervault MD2412 enclosure

>I would not recommend using external drive arrays:
>
>* You're paying money for an HBA
>* Proprietary tool for managing the thing, granted this one looks like a JBOD 
>vs one with an embedded RAID controller, which I've previously wrangled with
>* 24 Gb/s SAS is still SAS.  SAS or SATA purchases today pretty much lock you 
>into HDDs in the future.

This looks like a good storage efficiency / price ratio for us today.

>>
>> For each node =>
>>
>> 1 CPU AMD EPYC 9475F 3,65 GHz, 48C/96T, 256M Cache (400 W)

>Yikes.  Higher-freq CPUs or those with the highest core counts tend to cost 
>more in terms of oomph per euro than middle-of-the-road SKUs. Note the 
>significant list price difference compared to say the 9455P.  For many 
>purposes, solve for $ / ((cores) * (base)).  Servethehome posts nice value 
>analysis charts. Heck, if a single-socket system meets your needs for e.g. 
>RAM capacity and PCIe lanes, consider the 9555P.

Thanks, we'll change the CPU based on your advice.
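
For what it's worth, here is the rough value comparison we ran after your mail. The core counts / base clocks of the alternative SKUs and all of the prices are placeholders to check against AMD's spec sheets and real quotes, not figures from Dell:

# Value rule of thumb from the thread: solve for $ / (cores * base clock).
# SKU specs for the alternatives and ALL prices below are placeholders.
skus = {
    # name: (cores, base_ghz, assumed_list_price_eur)
    "EPYC 9475F": (48, 3.65, 4800),
    "EPYC 9455P": (48, 3.15, 2600),
    "EPYC 9555P": (64, 3.20, 3700),
}

for name, (cores, base, price) in skus.items():
    print(f"{name}: {price / (cores * base):.2f} EUR per core-GHz")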

>> RAM 16Gox16 = 256Go

>For 12 OSDs you would likely be fine with 128GB, with slots empty for 
>expansion.

Alright, is that enough for MON/MGR/OSD?
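
Here is the rough per-node memory budget we sketched while asking that; the 4 GiB osd_memory_target is the Ceph default, the overhead factor and the mon/mgr/OS allowances are our own rough assumptions:

# Per-node RAM sketch, assuming the default osd_memory_target of 4 GiB per OSD.
# OSD counts follow the revised plan (8 NVMe + 12 HDD OSDs per node); the
# overhead factor and mon/mgr/OS allowances are rough assumptions of ours.
nvme_osds, hdd_osds = 8, 12
osd_target_gib = 4
overhead = 1.5                     # OSDs routinely exceed the target during recovery
mon_gib, mgr_gib, os_gib = 4, 4, 8

total = (nvme_osds + hdd_osds) * osd_target_gib * overhead + mon_gib + mgr_gib + os_gib
print(f"~{total:.0f} GiB estimated for {nvme_osds + hdd_osds} OSDs plus mon/mgr/OS")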

>> 4x 100GB NVIDIA MELLANOX 100GB

>Why 4x?  With 12 HDD OSDs you'd be well-served by 2x 25 GE, skipping a 
>replication network.

We thought that 2 x 100 GbE for the public network and 2 x 100 GbE for the private 
(replication) network would be a good setup with NVMe.
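
For context, this was our back-of-envelope bandwidth math; the per-drive throughput figures are ballpark assumptions for large sequential I/O:

# Per-node streaming throughput vs NIC capacity; the per-drive figures are
# ballpark assumptions for large sequential I/O, mixed workloads will be lower.
hdd_mb_s  = 12 * 250        # 12 HDDs at ~250 MB/s each
nvme_mb_s = 8 * 3000        # 8 NVMe at ~3 GB/s each

nic_2x25_mb_s  = 2 * 25_000 // 8
nic_2x100_mb_s = 2 * 100_000 // 8

print(f"HDDs ~{hdd_mb_s} MB/s vs 2x25GbE ~{nic_2x25_mb_s} MB/s")
print(f"NVMe ~{nvme_mb_s} MB/s vs 2x100GbE ~{nic_2x100_mb_s} MB/s")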

>> HBA465e (externe, 22,5GB/s)
>>
>> 2 NVMe mixed 6,4To DBWAL
>> 6 NVMe mixed 6,4To

>mixed-use SSDs are usually overkill.  They're usually the same hardware as 
>read-intensive SSDs with the overprovisioning slider (and the price) bumped.
>"read-intensive" and "mixed-use" are marketing terms.  I worked for the 
>product / marketing team of an SSD manufacturer, trust me on this ;)

We saw that mixed-use SSDs can achieve roughly twice the write IOPS of read-intensive 
SSDs. That is why we planned to use them for write-heavy workloads.
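
That said, your overprovisioning point does show up in the capacity math if the two parts really share the same raw NAND, which is an assumption we have not confirmed with the vendor:

# If a 6.4 TB mixed-use drive and a 7.68 TB read-intensive drive are built from
# the same raw NAND (an assumption, not confirmed with the vendor), the
# mixed-use part is simply shipping more spare area.
ri_tb, mu_tb = 7.68, 6.4
extra_spare = (ri_tb - mu_tb) / ri_tb
print(f"Extra spare area on the mixed-use part: ~{extra_spare:.0%}")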

>> 5 NVMe read 15,36To
>>
>> for each enclosure =>
>>
>> 12 HDD 20To
>>
>> HDD will be a replica 3 pool with the dbwal

>With modern NVMe SSDs, I might suggest 1x 7.6 T for DB/WAL offload. You can go 
>12:1.

Thanks.
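
For the record, the 12:1 split works out like this against the usual 1-4% DB sizing rule of thumb (the rule itself is an assumption on our side):

# One 7.68 TB NVMe shared as DB/WAL by 12x 20 TB HDD OSDs.
nvme_tb, hdd_tb, ratio = 7.68, 20, 12
db_per_osd_gb = nvme_tb * 1000 / ratio
print(f"~{db_per_osd_gb:.0f} GB of DB/WAL per OSD, "
      f"i.e. {db_per_osd_gb / (hdd_tb * 1000):.1%} of each HDD")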

>> NVME mixed and read will be 2 differents pools (we will test replica 3 and 
>> EC to see which performance/storage efficency satisfy us the most)

>Why two different, small pools?  You'd be better off getting all 
>read-intensive NVMe SSDs, and to keep things simple make all of them the same 
>capacity. You would benefit from having more OSDs in a single pool, 
>especially since this is a small cluster.

If we go with only read-intensive NVMe SSDs, that will be the plan.

>> This cluster will be mostly use to store block (VM proxmox/kubernetes)

>On the SSDs I presume?  That workload on HDDs, especially at 20T, is unlikely 
>to be adequate.

We thought that low-IOPS workloads would run fairly well on HDDs, since today our 
VMs sit on an HDD pool with an SSD cache tier.
Are HDDs with DB/WAL offload unsuitable for this type of workload in Ceph?
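
To put numbers on it, this is roughly the IOPS budget we were counting on; the per-HDD figure and the replication penalty are ballpark assumptions:

# Small-block IOPS budget for the whole HDD pool, ballpark assumptions only.
# DB/WAL offload helps metadata, but data reads/writes still land on the HDDs.
hdds = 6 * 12               # 6 enclosures x 12 drives
iops_per_hdd = 150          # assumed random IOPS for a 7.2k RPM disk
replica = 3

print(f"~{hdds * iops_per_hdd} random read IOPS, "
      f"~{hdds * iops_per_hdd // replica} client write IOPS for the entire pool")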

Full NVMe is very expensive; I'm not sure we can afford it for this cluster if EC 
turns out to be unsuitable, as we need around 400 TB.

With replica 3, the setup I showed gives around 230 TB usable on NVMe and 480 TB on HDDs.
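
Here is how we got those figures, and what EC 4+2 would change (raw capacities, ignoring the free-space headroom a cluster needs):

# Usable capacity of the proposed drives under replica 3 vs EC 4+2.
nodes = 6
nvme_raw = nodes * (6 * 6.4 + 5 * 15.36)   # data NVMe only, DB/WAL drives excluded
hdd_raw  = nodes * (12 * 20)

for name, raw in (("NVMe", nvme_raw), ("HDD", hdd_raw)):
    print(f"{name}: raw {raw:.0f} TB -> replica 3 ~{raw / 3:.0f} TB, "
          f"EC 4+2 ~{raw * 4 / 6:.0f} TB")

So EC 4+2 on the NVMe alone would already get us past the ~400 TB we need, which is why the EC question matters so much for us.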

>> and S3.

>HDDs just for the bucket pool, right? Everything else on the SSDs?

At the beginning we planned to use HDDs for both S3 and RBD.

>> The crushmap will be like so : 3 rooms with 2 nodes per room.
>>
>> So the point of failure can be for replica 3 => rooms and if we use EC 4+2 
>> => host

>With Tentacle fast EC, you *might* find that RBD on EC performance is 
>acceptable, worth testing.

If EC performance is acceptable, full NVMe could be an option.
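
This is how we reason about a whole room failing with EC 4+2 and host as the failure domain; it is our own sketch of the placement logic, worth double-checking:

# 6 hosts in 3 rooms of 2; EC k=4, m=2 with failure-domain=host places one
# chunk of every PG on each of the 6 hosts.
k, m = 4, 2
hosts, hosts_per_room = 6, 2

chunks_lost = hosts_per_room                   # a room down removes 2 chunks
readable = (k + m - chunks_lost) >= k          # any k chunks are enough to read
spare_hosts = hosts - hosts_per_room - (k + m - chunks_lost)

print(f"Room outage: data readable = {readable}, "
      f"hosts free to rebuild the lost chunks = {spare_hosts}")

So a room outage keeps the data readable, but leaves nowhere to backfill the two lost chunks until the room is back.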

>>
>> We saw that the thread needs for NVMe OSD are very expensive, does this CPU 
>> good enough to carry them ?

>Be sure to read those references carefully.  The one that describes gains up 
>to 14 threads per is in a very specific context.

>So let's say you have 8x 15T NVMe RI OSDs + 12x HDD OSDs.  I might budget for 
>6 and 4 threads each, respectively, with some left over for mon, mgr, rgw, 
>etc.  The 96 thread CPU is a bit tight given that math, but I think you'd 
>probably be okay.
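
Writing that out, the budget is indeed right at the limit:

# Per-node thread budget with the 6-per-NVMe-OSD / 4-per-HDD-OSD figures above.
nvme_osds, hdd_osds = 8, 12
osd_threads = nvme_osds * 6 + hdd_osds * 4
print(f"{osd_threads} threads for OSDs on a 96-thread CPU, "
      f"{96 - osd_threads} left for mon/mgr/OS")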

>>
>> MON and MGR will be spread across the cluster and RGW on virtual machine

>Oh, RGWs on VMs would alleviate the CPU appetite a bit.  This is not an 
>unknown strategy.

>I would personally consider an all-NVMe chassis with a mix of TLC and 
>QLC/QLC-class SSDs for the bucket pool. You'd use fewer RUs and your recovery 
>times would be dramatically improved.  Consider how long it takes to write a 
>20T HDD.

We didn't see a TLC vs QLC distinction in the SSD specs; is this something that 
matters for Ceph? Our provider doesn't list it.
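
We do take the rebuild-time point, though; even the best case for a 20 TB HDD is sobering (assuming ~250 MB/s sustained, and real backfill competing with client I/O will be slower):

# Best-case time to rewrite a failed 20 TB HDD at an assumed 250 MB/s sustained.
tb, mb_per_s = 20, 250
hours = tb * 1_000_000 / mb_per_s / 3600
print(f"~{hours:.0f} hours just to stream 20 TB sequentially")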

>>
>> Thanks for your answer !
>>
>> Vivien


________________________________
From: Anthony D'Atri <[email protected]>
Sent: Tuesday, 18 November 2025 14:20:22
To: GLE, Vivien
Cc: [email protected]
Subject: Re: [ceph-users] New Cluster Ceph

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
