> On Nov 7, 2025, at 6:14 PM, Gustavo Garcia Rondina <[email protected]> wrote:
>
> Hi Anthony,
>
> Thank you for your detailed feedback.
It’s what Community Ambassadors do :D

>
>> I recommend not mixing the OS and data. What SKU are these SSDs? Are they
>> enterprise-class with PLP? If those are SAS/SATA SSDs, conventional wisdom
>> is to not offload more than 4-5 HDD OSD WAL+DBs onto each.
>
> The SSDs are model SSDSC2KB038TZL.

S4520. Top tier SATA.

> I haven't been able to find the datasheet to confirm if they have PLP.

They do for sure.

Note that when using these for WAL+DB offload, conventional wisdom is to not
exceed 5:1 for SATA SSDs. 36:1 as you propose would likely yield *worse*
performance than leaving WAL+DB colocated with the main OSD data HDD,
especially if you have to share with the OS and the metadata pool.

Conventional wisdom for NVMe SSDs has been at most 10:1, though with PCIe
gen 4+ TLC I suspect a higher ratio is ok.

>> Rightsizing depends in part on your workload. Will you host a modest number
>> of big files, a zillion tiny ones, or a mix?
>
> It's mostly a mix, it will be attached to an HPC cluster with a variety of
> users.

I might consider constructing your CephFS with the metadata and first data
pools on 3x replicated pools constrained to the SSD OSDs, and a second data
pool added for the HDDs, replicated or EC as you see fit. Then use the layout
mechanism to map top level directories to the HDD data pool. If you had more
capacity you might create a directory on the SSD first pool for tiny objects,
but I think you probably can’t afford that. Incent your users to create
fewer, larger files, even if they’re tarballs.
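Roughly like the below, as a sketch only -- the rule, pool, filesystem, and
directory names are made up, the OSDs are assumed to carry the usual hdd/ssd
device classes, and with the PG autoscaler you can skip the explicit PG
counts:

  ceph osd crush rule create-replicated rule-ssd default host ssd
  ceph osd crush rule create-replicated rule-hdd default host hdd
  ceph osd pool create cephfs_metadata 32 32 replicated rule-ssd
  ceph osd pool create cephfs_data_ssd 32 32 replicated rule-ssd
  ceph osd pool create cephfs_data_hdd 2048 2048 replicated rule-hdd
  ceph fs new hpcfs cephfs_metadata cephfs_data_ssd
  ceph fs add_data_pool hpcfs cephfs_data_hdd
  # on a mounted client, steer a top level directory at the HDD pool:
  setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/hpcfs/bulk

The layout xattr only affects files created after it is set; subdirectories
inherit it.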
>
>> I suspect with your numbers that the question is moot and you'll have to
>> forego WAL+DB offload.
>
> I tend to agree.

Ack.

>
>> Skip the RAID.
>
> The reason why I have RAID in the data servers is because they have only the
> 2x SSDs and the 18x HDDs, i.e., there is *no* dedicated M.2 or alternative
> disk for the OS to be installed on -- an oversight when the servers were
> spec'd, and now we have no possibility of changing anything in these servers
> (long story).

Procurement can be thorny, so I understand. It’s unfortunate, though. Part of
how I reply is to help future Cephers who may come across this thread.

We mostly haven't seen SATA SSDs larger than 7.6 T, in part due to the SATA
interface bottleneck. A better strategy might have been a chassis with some
NVMe slots, or AIC M.2 NVMe adapters, though enterprise grade M.2 NVMe SSDs
will be difficult to find in the future, and even today are I think limited
to 3.84T. Alternately, populating some of the SATA slots with SSDs instead of
HDDs so that there would be enough for OS, offload, and metadata. Orrrrr an
NVMe-only chassis with something like the Micron 6500+6550.

>
> The options that I considered:
>
> 1) OS in one of the 3.5 TB SSDs, use the other for DB or as OSD for metadata
> pool. (In this case if either fails, we lose 18x OSDs, that's why I was
> leaning for #2 below.)

Remember that Ceph does replication for you. This strategy is effectively RAID
on top of RAID. It would overload your SATA SSDs, resulting in slow RADOS and
MDS ops, and your Prometheus / Grafana would show false alarms. My Cephalocon
presentation from 2024 demonstrates this dynamic, and I’ve seen it on
community systems. If mirrored, each of those SSDs would see the writes from
36 OSDs, which would really be ugly.

Put another way, what happens when one of your OSD nodes throws an error or
catches fire? Your cluster is designed to survive that. In the unlikely event
of an SSD failure* you lose half that number of OSDs. If your cluster can’t
handle that, you have a deeper problem.

* Download SST from Solidigm and ensure the SSDs have up to date firmware.
  Or if Dell, DSU, etc.

>
> 2) RAID1 both SSDs, use 0.5 TB for the OS, then 3 TB for OSD for metadata
> pool. (This protects against single SSD failure.)

This is sub-optimal either way, but with your constraints, what I would do is
mirror ~500 GiB for the OS, with minimal partitioning and no swap, and use the
balance on *each* SSD for an unmirrored OSD -- rough sketch at the bottom of
this message. You can do this with MD or HBA RAID, though I personally do not
favor RAID HBAs. They are an expensive hassle, and the money is better spent
on faster media.

>
> 3) Diskless OS install and use all disks for Ceph. (Our Ceph expertise is not
> sufficient to attempt this.)

Impressive that you mention that. Croit does just this, and I think some
VMware deployments. You need rock solid HA DHCP etc., and other aspects make
it nontrivial to deploy.

>
> So, we are thinking that #2 is the one that offers better reliability in
> terms of not having multiple single point of failure in the SSDs as RAID1
> would protect that.

I think you’d not have a good experience -- see my compromise suggestion
above. Remind me, how many chassis total? At a certain chassis count, the
importance of mirrored boot (and network bonding) diminishes.

>
> As a side note, the RAID is hardware-managed,

Doesn’t have to be, though. I’ve experienced many issues with RAID HBAs,
including that few deployments monitor them well. This also greatly confounds
and complicates fleet management and drive observability.

> Linux sees the RAID volumes as normal disks, e.g.,
>
> [root@data-02 ~]# lsblk -o WWN,SIZE,TYPE,VENDOR,MODEL | grep RAID
> 0x600062b101003b803043a322a6ee38d0 550G disk Lenovo RAID 940-32i 8GB
> 0x600062b101003b803043a3138fd13cab 3T disk Lenovo RAID 940-32i 8GB

I’m not sure offhand if Lenovo uses LSI or Adaptec HBAs; most chassis vendors
rebadge one or the other. It’s likely that you could set them to a passthrough
aka HBA aka JBOD mode to expose the raw drives to the OS -- example commands
at the bottom of this message.

>
> The controller has power loss protection:
>
> [root@data-02 ~]# /opt/MegaRAID/storcli/storcli64 /c0 show all | grep BBU

Ok, so that’s an LSI HBA. Aside from Dell’s H740P, most of those have a
passthrough mode / personality.

> BBU Status = 0
> BBU = Yes
> BBU = Present
> Cache When BBU Bad = Off

Does it show that the BBU is a supercap vs a battery? I think they don’t use
batteries any more, but a supercap is likely rated for all of 3 years. If it
fails, you won’t know unless you monitor. The connectors can be finicky, and
you don’t even want to know about gas gauge firmware.

  /c0/cv show all

might give you more info.

I’ve seen people turn this last setting on, which is a rather bad idea. That
said, that cache usually is only in effect for VDs, so it isn’t even used for
passthrough drives. One can define a single drive RAID 0 VD around each HDD,
but split across 36 that doesn’t give you much cache per drive, and trust me,
the operational hassle is substantial. When I see RAID HBAs I either just set
passthrough and ignore the RoC, or consider reflashing with IT firmware. The
latter is fraught and you’d be on your own.

>
>> If you do that, your failure domains will have approximate CRUSH weights of
>> 18, 21, and 22.8. With replication == the number of failure domains, only
>> 18T of the second two will be usable.
>
> Thanks for pointing this out!
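Re the option 2 compromise above, a rough sketch of what I mean -- the device
names are placeholders, and the last step depends on whether you deploy with
ceph-volume directly or via the cephadm orchestrator:

  # on each data node: ~500 GiB for the OS, the remainder for a standalone OSD
  parted -s /dev/sda mklabel gpt mkpart os 1MiB 500GiB mkpart osd 500GiB 100%
  parted -s /dev/sdb mklabel gpt mkpart os 1MiB 500GiB mkpart osd 500GiB 100%
  # mirror only the OS partitions with MD
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  # each leftover partition becomes its own unmirrored OSD for the SSD pools
  ceph-volume lvm create --data /dev/sda2
  ceph-volume lvm create --data /dev/sdb2

In practice the installer would handle the partitioning and MD setup; the
point is that only the OS slice is mirrored, and Ceph replication protects the
OSD slices.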
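Re passthrough mode, the storcli knobs I’m thinking of -- exact syntax varies
by controller and firmware generation, and flipping modes will nuke any
existing VDs, so check the output and back up first:

  storcli64 /c0 show personality        # does this card support an HBA/JBOD personality?
  storcli64 /c0 set personality=JBOD    # tri-mode cards; usually needs a reboot
  # or, on firmware that only does per-drive JBOD:
  storcli64 /c0 set jbod=on
  storcli64 /c0/eall/sall set jbod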
>
>
> Gustavo

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
