Hi Anthony,

Thank you for your detailed feedback.


>I recommend not mixing the OS and data.  What SKU are these SSDs? Are they 
>enterprise-class with PLP?  If those are SAS/SATA SSDs, conventional wisdom is 
>to not offload more than 4-5 HDD OSD WAL+DBs onto each.

The SSDs are model SSDSC2KB038TZL. I haven't been able to find the datasheet to 
confirm if they have PLP.
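
For anyone following along, this is roughly how I'd pull the exact model/firmware string off each drive to match against the vendor spec sheet (the device name and the megaraid drive index below are placeholders; the -d megaraid form is only needed for drives sitting behind the RAID controller):

# directly attached drive
smartctl -i /dev/sdX

# drive behind the MegaRAID controller; N is the drive ID (DID) shown by storcli
smartctl -i -d megaraid,N /dev/sdX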


>Rightsizing depends in part on your workload.  Will you host a modest number 
>of big files, a zillion tiny ones, or a mix?

It's mostly a mix; the storage will be attached to an HPC cluster with a variety 
of users.


>I suspect with your numbers that the question is moot and you'll have to 
>forego WAL+DB offload.

I tend to agree.


>Skip the RAID.

The reason I have RAID in the data servers is that they have only the 2x SSDs 
and the 18x HDDs, i.e., there is *no* dedicated M.2 or alternative disk for the 
OS to be installed on. That was an oversight when the servers were spec'd, and 
now we have no possibility of changing anything in these servers (long story).

The options that I considered:

1) OS on one of the 3.5 TB SSDs, use the other for DB or as an OSD for the 
metadata pool. (In this case, if either SSD fails we lose that node's 18x OSDs, 
which is why I was leaning toward #2 below.)

2) RAID1 the two SSDs, use 0.5 TB for the OS and the remaining 3 TB as an OSD 
for the metadata pool. (This protects against a single SSD failure.)

3) Diskless OS install and use all disks for Ceph. (Our Ceph expertise is not 
sufficient to attempt this.)

So we are thinking that #2 offers the best reliability: RAID1 keeps a single 
SSD failure from becoming a single point of failure for the whole node. (A 
rough sketch of how I'd provision it is below.)
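
For completeness, here is roughly how I picture provisioning the 3 TB volume 
under option #2; the device name, OSD id, and pool/rule/class names are 
placeholders, and it assumes the CephFS metadata pool should live only on these 
devices:

# create an OSD on the 3 TB virtual disk (cephadm/orchestrator syntax)
ceph orch daemon add osd data-02:/dev/sdX

# give it a dedicated device class so CRUSH can target it
ceph osd crush rm-device-class osd.NN
ceph osd crush set-device-class meta_ssd osd.NN

# replicated rule restricted to that class, failure domain = host
ceph osd crush rule create-replicated meta_ssd_rule default host meta_ssd

# point the metadata pool at the rule
ceph osd pool set cephfs_metadata crush_rule meta_ssd_rule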

As a side note, the RAID is hardware-managed; Linux sees the RAID volumes as 
normal disks, e.g.,

[root@data-02 ~]# lsblk -o WWN,SIZE,TYPE,VENDOR,MODEL | grep RAID
0x600062b101003b803043a322a6ee38d0   550G disk Lenovo   RAID 940-32i 8GB
0x600062b101003b803043a3138fd13cab     3T disk Lenovo   RAID 940-32i 8GB

The controller's cache has power-loss protection (BBU):

[root@data-02 ~]# /opt/MegaRAID/storcli/storcli64 /c0 show all | grep BBU
BBU Status = 0
BBU = Yes
BBU = Present
Cache When BBU Bad = Off
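
On the same note, before putting an OSD on these volumes I'll probably also 
double-check the virtual drive cache settings with storcli, something like:

# show cache / IO policy for all virtual drives on controller 0
/opt/MegaRAID/storcli/storcli64 /c0/vall show all | grep -i cache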


>If you do that, your failure domains will have approximate CRUSH weights of 
>18, 21, and 22.8.  With replication == the number of failure domains, only 18T 
>of the second two will be usable.

Thanks for pointing this out! Indeed, with one replica per host the smallest 
failure domain (CRUSH weight 18) caps what the other two can hold, so the extra 
~3T and ~4.8T would sit idle.


Gustavo
