> On Nov 7, 2025, at 6:14 PM, Gustavo Garcia Rondina <[email protected]> 
> wrote:
> 
> Hi Anthony,
> 
> Thank you for your detailed feedback.

It’s what Community Ambassadors do :D

> 
> 
>> I recommend not mixing the OS and data.  What SKU are these SSDs? Are they 
>> enterprise-class with PLP?  If those are SAS/SATA SSDs, conventional wisdom 
>> is to not offload more than 4-5 HDD OSD WAL+DBs onto each.
> 
> The SSDs are model SSDSC2KB038TZL.

S4520.  Top tier SATA.  

> I haven't been able to find the datasheet to confirm if they have PLP.

They do for sure.  Note that when using these for WAL+DB offload, conventional 
wisdom is to not exceed 5:1 for SATA SSDs.  36:1 as you propose would likely 
yield *worse* performance than leaving WAL+DB colocated with the main OSD data 
HDD, especially if the SSDs also have to host the OS and the metadata pool.   
Conventional wisdom for NVMe SSDs has been at most 10:1, though with PCIe gen 
4+ TLC I suspect a higher ratio is ok.  
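
For reference, if you do offload at a saner ratio, ceph-volume expresses the 
split by listing the HDDs as data devices and the SSD as the shared DB device 
(the WAL rides along with the DB when only a DB device is given).  Something 
like the below, with placeholder device paths, per OSD host:

  # hypothetical 5:1 example: five HDD OSDs sharing one SATA SSD for WAL+DB
  ceph-volume lvm batch /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg \
      --db-devices /dev/sdb

cephadm users would express the same thing with data_devices / db_devices in 
an OSD service spec.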


>> Rightsizing depends in part on your workload.  Will you host a modest number 
>> of big files, a zillion tiny ones, or a mix?
> 
> It's mostly a mix, it will be attached to an HPC cluster with a variety of 
> users.

I might consider constructing your CephFS with the metadata and first data 
pools as 3x replicated pools constrained to the SSD OSDs, and a second data 
pool added for the HDDs, replicated or EC as you see fit.  Then use the file 
layout mechanism to map top-level directories to the HDD data pool.  If you had 
more capacity you might create a directory on the SSD-backed first pool for 
tiny objects, but I think you probably can’t afford that. Incent your users to 
create fewer, larger files, even if they’re tarballs.  
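
A rough sketch of that topology, assuming your OSDs already carry the right 
device classes; pool names, PG counts, and the mount point are illustrative:

  ceph osd crush rule create-replicated rule_ssd default host ssd
  ceph osd crush rule create-replicated rule_hdd default host hdd
  ceph osd pool create cephfs_metadata 64
  ceph osd pool set cephfs_metadata crush_rule rule_ssd
  ceph osd pool create cephfs_data_ssd 64
  ceph osd pool set cephfs_data_ssd crush_rule rule_ssd
  ceph osd pool create cephfs_data_hdd 1024
  ceph osd pool set cephfs_data_hdd crush_rule rule_hdd
  ceph fs new cephfs cephfs_metadata cephfs_data_ssd
  ceph fs add_data_pool cephfs cephfs_data_hdd
  # map a top-level directory to the HDD pool via a file layout xattr
  setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/cephfs/bulk

New files created under that directory then land in the HDD pool; files that 
already exist keep whatever layout they were created with.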

> 
> 
>> I suspect with your numbers that the question is moot and you'll have to 
>> forego WAL+DB offload.
> 
> I tend to agree.

Ack.  

> 
> 
>> Skip the RAID.
> 
> The reason why I have RAID in the data servers is because they have only the 
> 2x SSDs and the 18x HDDs, i.e., there is *no* dedicated M.2 or alternative 
> disk for the OS to be installed on -- overlook when the servers were spec'd, 
> and now we have no possibility of changing anything in these servers (long 
> story).

Procurement can be thorny, so I understand.  It’s unfortunate, though.  Part of 
how I reply is to help future Cephers who may come across this thread.  

We mostly haven't seen SATA SSDs larger than 7.6 T in part due to the SATA 
interface bottleneck.  A better strategy might have been a chassis with some 
NVMe slots, or AIC M.2 NVMe adapters, though enterprise grade M.2 NVMe SSDs 
will be difficult to find in the future, and even today are I think limited to 
3.84T.  

Alternately, you could have populated some of the SATA slots with SSDs instead 
of HDDs, so that there would be enough for OS, offload, and metadata.   


Orrrrr an NVMe-only chassis with something like the Micron 6500+6550.  

> 
> The options that I considered:
> 
> 1) OS in one of the 3.5 TB SSDs, use the other for DB or as OSD for metadata 
> pool. (In this case if either fails, we lose 18x OSDs, that's why I was 
> leaning for #2 below.)

Remember that Ceph does replication for you.  This strategy is effectively RAID 
on top of RAID.  It would overload your SATA SSDs, resulting in slow RADOS and 
MDS ops, and your Prometheus/Grafana stack would show false alarms.  My 
Cephalocon presentation from 2024 demonstrates this dynamic, and I’ve seen it 
on community systems.   

If mirrored, each of those SSDs would see the writes from 36 OSDs, which would 
really be ugly.  


Put another way, what happens when one of your OSD nodes throws an error or 
catches fire? Your cluster is designed to survive that.   In the unlikely event 
of an SSD failure*, you lose half that number of OSDs.  If your cluster can’t 
handle that, you have a deeper problem.  


* Download SST from Solidigm and ensure the SSDs have up to date firmware.  Or 
if Dell, DSU, etc.  


> 
> 2) RAID1 both SSDs, use 0.5 TB for the OS, then 3 TB for OSD for metadata 
> pool. (This protects against single SSD failure.)

This is sub-optimal either way, but with your constraints, what I would do is 
mirror ~500 GiB for the OS, with minimal partitioning and no swap, and use the 
balance on *each* SSD for an unmirrored OSD.  You can do this with MD or HBA 
RAID, though I personally do not favor RAID HBAs.  They are an expensive 
hassle, and the money is better spent on faster media.  
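
If you go the MD route, a minimal sketch of what I mean, with placeholder 
device names and glossing over bootloader/EFI details:

  # carve each SSD into a small OS partition plus the rest for an OSD
  parted -s /dev/sda mklabel gpt mkpart os 1MiB 500GiB mkpart osd 500GiB 100%
  parted -s /dev/sdb mklabel gpt mkpart os 1MiB 500GiB mkpart osd 500GiB 100%
  # mirror only the OS partitions
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # each SSD's remaining partition becomes its own unmirrored OSD
  ceph-volume lvm create --data /dev/sda2
  ceph-volume lvm create --data /dev/sdb2

With cephadm you’d hand the second partitions to ceph orch daemon add osd 
instead of calling ceph-volume directly.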

> 
> 3) Diskless OS install and use all disks for Ceph. (Our Ceph expertise is not 
> sufficient to attempt this.)

Impressive that you mention that.  Croit does just this, and I think some 
VMware deployments.  You need rock solid HA DHCP etc, and other aspects make it 
nontrivial to deploy.  


> 
> So, we are thinking that #2 is the one that offers better reliability in 
> terms of not having multiple single point of failure in the SSDs as RAID1 
> would protect that.

I think you’d not have a good experience; see my compromise suggestion above.  
Remind me, how many chassis total?  At a certain chassis count, the importance 
of mirrored boot (and network bonding) diminishes. 

> 
> As a side note, the RAID is hardware-managed,

Doesn’t have to be though.  I’ve experienced many issues with RAID HBAs, 
including that few deployments monitor them well.  This also greatly confounds 
and complicates fleet management and drive observability.   

> Linux sees the RAID volumes as normal disks, e.g.,
> 
> [root@data-02 ~]# lsblk -o WWN,SIZE,TYPE,VENDOR,MODEL | grep RAID
> 0x600062b101003b803043a322a6ee38d0   550G disk Lenovo   RAID 940-32i 8GB
> 0x600062b101003b803043a3138fd13cab     3T disk Lenovo   RAID 940-32i 8GB

I’m not sure offhand if Lenovo uses LSI or Adaptec HBAs; most chassis vendors 
rebadge one or the other.  It’s likely that you could set them to a passthrough 
aka HBA aka JBOD mode to expose the raw drives to the OS. 


> 
> The controller has power loss protection:
> 
> [root@data-02 ~]# /opt/MegaRAID/storcli/storcli64 /c0 show all | grep BBU

Ok so that’s an LSI HBA.  Aside from Dell’s H740P, most of those have a 
passthrough mode / personality.  

> BBU Status = 0
> BBU = Yes
> BBU = Present
> Cache When BBU Bad = Off

Does it show that the BBU is a supercap vs a battery? I think they don’t use 
batteries any more, but a supercap is likely rated for all of 3 years.  If it 
fails, you won’t know unless you monitor.  The connectors can be finicky, and 
you don’t even want to know about gas gauge firmware.  

Running

storcli64 /c0/cv show all

might give you more info.  


I’ve seen people turn that last setting (Cache When BBU Bad) on, which is a 
rather bad idea.  

That said, that cache usually is only in effect for VDs, so it isn’t even used 
for passthrough drives. One can define a single-drive RAID 0 VD around each 
HDD, but split across 36 drives that cache doesn’t give you much per drive, 
and trust me, the operational hassle is substantial.   When I see RAID HBAs I 
either just set passthrough and ignore the RoC, or consider reflashing with IT 
firmware.  The latter is fraught and you’d be on your own.  
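
If you do go passthrough, on LSI-based cards it usually looks something like 
the below; exact support varies by model and firmware, so check what the card 
reports first:

  storcli64 /c0 show all | grep -i -e jbod -e personality
  storcli64 /c0 set jbod=on    # or, on newer cards: set personality=JBOD

After that lsblk should show the drives as plain SATA devices rather than VDs.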

> 
> 
>> If you do that, your failure domains will have approximate CRUSH weights of 
>> 18, 21, and 22.8.  With replication == the number of failure domains, only 
>> 18T of the second two will be usable.
> 
> Thanks for pointing this out!
> 
> 
> Gustavo
> 
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
