>> 
>> * You're paying money for an HBA
>> * Proprietary tool for managing the thing, granted this one looks like a 
>> JBOD vs one with an embedded RAID controller, which I've previously wrangled 
>> with
>> * 24 Gb/s SAS is still SAS.  SAS or SATA purchases today pretty much lock 
>> you into HDDs in the future.
> 
> This looks like a good storage efficiency / price ratio for us today

... if it does the job and you have lots of DC space.  Figure you're going to 
have to use this gear for at least ten years.  Only once in my career have I 
seen a hardware refresh cycle actually happen.

>>> RAM 16 GB x 16 = 256 GB
> 
>> For 12 OSDs you would likely be fine with 128GB, with slots empty for 
>> expansion.
> 
> Alright, is it enough for MON/MGR/OSD?

I think so.  The default osd_memory_target is 4 GiB.  With the OSD memory 
autotuner enabled, cephadm will divvy up available memory across OSDs, taking 
mons and mgrs into account.  You'll actually have more than 12 OSDs per node 
with your architecture once the NVMe OSDs are included, so I'd go with 192 GiB.
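
If it helps, here's a rough sketch of the knobs and the arithmetic -- 
illustrative only, assuming roughly 20 OSDs per node and the default 0.7 
autotune ratio:

  # Let cephadm size osd_memory_target from host RAM
  ceph config set osd osd_memory_target_autotune true

  # Fraction of host RAM handed to OSDs; the rest is left for mon/mgr/OS
  ceph config get mgr mgr/cephadm/autotune_memory_target_ratio   # 0.7 by default

  # Back of the envelope: 0.7 * 192 GiB / ~20 OSDs is about 6.7 GiB per OSD,
  # comfortably above the 4 GiB default.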

> 
>>> 4x 100GB NVIDIA MELLANOX 100GB
> 
>> Why 4x?  With 12 HDD OSDs you'd be well-served by 2x 25 GE, skipping a 
>> replication network.
> 
> We thought that 2 x 100GB for public and 2 x 100GB for private would be a 
> good setup with NVMe

It's not a bad idea, but the money for the extra NICs and switch ports might 
be better spent on SSDs instead -- you're only going to have a few NVMe OSDs 
per node.

Also, having a separate replication network can lead to certain 
difficult-to-debug failure scenarios.

> 
>>> HBA465e (external, 22.5 GB/s)
>>> 
>>> 2 NVMe mixed 6.4 TB DB/WAL
>>> 6 NVMe mixed 6.4 TB
> 
>> mixed-use SSDs are usually overkill.  They're usually the same hardware as 
>> read-intensive SSDs with the overprovisioning slider (and the price) bumped.
>> "read-intensive" and "mixed-use" are marketing terms.  I worked for the 
>> product / marketing team of an SSD manufacturer, trust me on this ;)
> 
> We saw that mixed-use SSDs can achieve twice the IOPS of read-intensive SSDs 
> in write operations. This is why we plan to use this type of SSD for 
> high-write workloads.

Interesting, please share the spec sheets with me.  Small random writes -- 
which are what OSDs see -- usually don't differ much between the two.  But in 
any event, a mainstream modern NVMe SSD is not going to be your performance 
bottleneck; OSD code, replication, network latency, and PG counts will be.  
You'll get better performance by having more NVMe OSDs in a single pool than 
with only a few in each of two pools.
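
If you do get eval drives, a quick way to compare them on the pattern that 
matters is a steady-state 4K random-write fio run against the raw device -- 
destructive, so scratch drives only, and the device path below is just a 
placeholder:

  fio --name=4k-randwrite --filename=/dev/nvme0n1 --direct=1 \
      --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 \
      --numjobs=4 --time_based --runtime=600 --group_reporting

Run it long enough for the drive to reach steady state; the gap between RI 
and mixed-use models is usually much smaller there than the datasheet peak 
IOPS suggest.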

> 
>> Why two different, small pools?  You'd be better off getting all 
>> read-intensive NVMe SSDs, and to keep things simple make all of them the 
>> same capacity. You would benefit from having more OSDs in a single pool, 
>> especially since this is a small cluster.
> 
> If we take only read-intensive NVMe SSDs this is the plan

Groovy.

> 
>>> This cluster will be mostly used to store block (VM proxmox/kubernetes)
> 
>> On the SSDs I presume?  That workload on HDDs, especially at 20T, is 
>> unlikely to be adequate.
> 
> We thought that low-IOPS workloads would run pretty well on HDDs, as today 
> our VMs are on an HDD pool with an SSD cache tier.

Cache tier.  Ugh.  Cache tiering is deprecated in recent releases, so it's not 
a pattern to carry forward anyway.  I've run OpenStack Cinder on LFF HDDs, and 
I wouldn't call the results adequate either for boot or for data volumes.  ymmv.


> Are HDDs/DBWAL unsuitable for this type of workload in Ceph?

Depends on your workload: what will the clients be running?  If they're just 
writing archives, that's one thing.  If they're trying to run a database, 
that's another matter entirely.

> 
> Full NVMe is very expensive; I don't really know if we can do it in our 
> cluster if EC is not suitable, as we need around 400 TB =>

You don't necessarily need to buy them from your chassis vendor ;)

> 
> In replica 3, the setup that I showed is around 230 TB of NVMe and 480 TB 
> of HDDs
> 
>>> and S3.
> 
>> HDDs just for the bucket pool, right? Everything else on the SSDs?
> 
> At the beginning we planned to use HDDs for S3 and RBD
> 
>>> The crushmap will be like so: 3 rooms with 2 nodes per room.
>>> 
>>> So the point of failure can be for replica 3 => rooms and if we use EC 4+2 
>>> => host
> 
>> With Tentacle fast EC, you *might* find that RBD on EC performance is 
>> acceptable, worth testing.
> 
> If EC is acceptable full NVMe could be an option

See 
https://ceph.io/assets/pdfs/events/2025/ceph-day-london/04%20Erasure%20Coding%20Enhancements%20for%20Tentacle.pdf
 for a description of the work done for Tentacle.
There are graphs showing the performance improvements in EC.
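
If you want to kick the tires, the basic setup for RBD on EC looks something 
like this -- pool and profile names are just examples, and RBD still wants a 
replicated pool for image metadata alongside the EC data pool:

  ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
  ceph osd pool create rbd-ec-data erasure ec42
  ceph osd pool set rbd-ec-data allow_ec_overwrites true
  ceph osd pool application enable rbd-ec-data rbd
  ceph osd pool create rbd-meta          # replicated by default
  rbd pool init rbd-meta
  rbd create --size 100G --data-pool rbd-ec-data rbd-meta/testimg

Benchmark that against an image on a replica-3 pool on the same OSDs and 
you'll know pretty quickly whether it's acceptable for your VMs.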



> 
>>> 
>>> We saw that the thread needs for NVMe OSDs are very expensive; is this 
>>> CPU good enough to carry them?
> 
>> Be sure to read those references carefully.  The one that describes gains up 
>> to 14 threads per is in a very specific context.
> 
>> So let's say you have 8x 15T NVMe RI OSDs + 12x HDD OSDs.  I might budget 
>> for 6 and 4 threads each, respectively, with some left over for mon, mgr, 
>> rgw, etc.  The 96 thread CPU is a bit tight given that math, but I think 
>> you'd probably be okay.
> 
>>> 
>>> MON and MGR will be spread across the cluster and RGW on virtual machine
> 
>> Oh, RGWs on VMs would alleviate the CPU appetite a bit.  This is not an 
>> unknown strategy.
> 
>> I would personally consider an all-NVMe chassis with a mix of TLC and 
>> QLC/QLC-class SSDs for the bucket pool. You'd use fewer RUs and your 
>> recovery times would be dramatically improved.  Consider how long it takes 
>> to write a 20T HDD.
> 
> We didn't see the difference between TLC and QLC in the SSD specs, is this 
> a thing in Ceph?  Our provider didn't display them.

Chassis vendors often don't offer them, but you can buy SSDs from anyone.  
Examples:

Solidigm P5430, P5336
Micron 6500, 6550

Others offer them too.  Depending on your mix of operations these might be fine 
for your use-case.  99% of enterprise SSDs never burn more than 15% of their 
rated endurance.

For a mixed block / object workload, the P5430 and 6500 and equivalents from 
other manufacturers are worth considering.  The others are coarse-IU (large 
indirection unit) drives, which adds considerations around write size and 
alignment.



> 
>>> 
>>> Thanks for your answer !
>>> 
>>> Vivien
> 
> 
> ________________________________
> From: Anthony D'Atri <[email protected]>
> Sent: Tuesday, 18 November 2025 14:20:22
> To: GLE, Vivien
> Cc: [email protected]
> Subject: Re: [ceph-users] New Cluster Ceph
> 
> These things are always controversial, below are my thoughts.  Your mileage 
> may vary.
> 
>> We plan to buy hardware for a new Ceph cluster and would like some 
>> approval of what we chose.
>> 
>> 6 nodes DELL 6715 + 6 Powervault MD2412 enclosure
> 
> I would not recommend using external drive arrays:
> 
> * You're paying money for an HBA
> * Proprietary tool for managing the thing, granted this one looks like a JBOD 
> vs one with an embedded RAID controller, which I've previously wrangled with
> * 24 Gb/s SAS is still SAS.  SAS or SATA purchases today pretty much lock you 
> into HDDs in the future.
> 
>> 
>> For each node =>
>> 
>> 1 CPU AMD EPYC 9475F 3.65 GHz, 48C/96T, 256M Cache (400 W)
> 
> Yikes.  Higher-freq CPUs or those with the highest core counts tend to cost 
> more in terms of oomph per euro than middle-of-the-road SKUs. Note the 
> significant list price difference compared to say the 9455P.  For many 
> purposes, solve for $ / ((cores) * (base)).  Servethehome posts nice value 
> analysis charts. Heck, if a single-socket system meets your needs for e.g. 
> RAM capacity and PCIe lanes, consider the 9555P.
> 
>> RAM 16 GB x 16 = 256 GB
> 
> For 12 OSDs you would likely be fine with 128GB, with slots empty for 
> expansion.
> 
>> 4x 100GB NVIDIA MELLANOX 100GB
> 
> Why 4x?  With 12 HDD OSDs you'd be well-served by 2x 25 GE, skipping a 
> replication network.
> 
>> HBA465e (external, 22.5 GB/s)
>> 
>> 2 NVMe mixed 6.4 TB DB/WAL
>> 6 NVMe mixed 6.4 TB
> 
> mixed-use SSDs are usually overkill.  They're usually the same hardware as 
> read-intensive SSDs with the overprovisioning slider (and the price) bumped.
> "read-intensive" and "mixed-use" are marketing terms.  I worked for the 
> product / marketing team of an SSD manufacturer, trust me on this ;)
> 
> 
> 
>> 5 NVMe read 15.36 TB
>> 
>> for each enclosure =>
>> 
>> 12 HDD 20 TB
>> 
>> HDD will be a replica 3 pool with the dbwal
> 
> With modern NVMe SSDs, I might suggest 1x 7.6 T for DB/WAL offload. You can 
> go 12:1.
> 
> 
>> NVMe mixed and read will be 2 different pools (we will test replica 3 and 
>> EC to see which performance/storage efficiency satisfies us the most)
> 
> Why two different, small pools?  You'd be better off getting all 
> read-intensive NVMe SSDs, and to keep things simple make all of them the same 
> capacity. You would benefit from having more OSDs in a single pool, 
> especially since this is a small cluster.
> 
> 
>> This cluster will be mostly used to store block (VM proxmox/kubernetes)
> 
> On the SSDs I presume?  That workload on HDDs, especially at 20T, is unlikely 
> to be adequate.
> 
> 
>> and S3.
> 
> HDDs just for the bucket pool, right? Everything else on the SSDs?
> 
> 
>> The crushmap will be like so: 3 rooms with 2 nodes per room.
>> 
>> So the point of failure can be for replica 3 => rooms and if we use EC 4+2 
>> => host
> 
> With Tentacle fast EC, you *might* find that RBD on EC performance is 
> acceptable, worth testing.
> 
>> 
>> We saw that the thread needs for NVMe OSDs are very expensive; is this 
>> CPU good enough to carry them?
> 
> Be sure to read those references carefully.  The one that describes gains up 
> to 14 threads per is in a very specific context.
> 
> So let's say you have 8x 15T NVMe RI OSDs + 12x HDD OSDs.  I might budget for 
> 6 and 4 threads each, respectively, with some left over for mon, mgr, rgw, 
> etc.  The 96 thread CPU is a bit tight given that math, but I think you'd 
> probably be okay.
> 
>> 
>> MON and MGR will be spread across the cluster and RGW on virtual machine
> 
> Oh, RGWs on VMs would alleviate the CPU appetite a bit.  This is not an 
> unknown strategy.
> 
> I would personally consider an all-NVMe chassis with a mix of TLC and 
> QLC/QLC-class SSDs for the bucket pool. You'd use fewer RUs and your recovery 
> times would be dramatically improved.  Consider how long it takes to write a 
> 20T HDD.
> 
> 
> 
>> 
>> Thanks for your answer !
>> 
>> Vivien
>> 
> 
> 

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
