Hi Anthony,

Thank you for your detailed feedback.
>I recommend not mixing the OS and data. What SKU are these SSDs? Are they
>enterprise-class with PLP? If those are SAS/SATA SSDs, conventional wisdom is
>to not offload more than 4-5 HDD OSD WAL+DBs onto each.

The SSDs are model SSDSC2KB038TZL. I haven't been able to find the datasheet
to confirm whether they have PLP.

>Rightsizing depends in part on your workload. Will you host a modest number
>of big files, a zillion tiny ones, or a mix?

It's mostly a mix; it will be attached to an HPC cluster with a variety of
users.

>I suspect with your numbers that the question is moot and you'll have to
>forego WAL+DB offload.

I tend to agree.

>Skip the RAID.

The reason I have RAID in the data servers is that they have only the 2x SSDs
and the 18x HDDs, i.e., there is *no* dedicated M.2 or alternative disk for
the OS to be installed on -- an oversight when the servers were spec'd, and
now we have no possibility of changing anything in these servers (long story).

The options that I considered:

1) OS on one of the 3.5 TB SSDs, use the other for DB or as an OSD for the
   metadata pool. (In this case, if either SSD fails we lose 18x OSDs, which
   is why I was leaning toward #2 below.)

2) RAID1 both SSDs, use 0.5 TB for the OS, then 3 TB as an OSD for the
   metadata pool. (This protects against a single SSD failure.)

3) Diskless OS install and use all disks for Ceph. (Our Ceph expertise is not
   sufficient to attempt this.)

So we are thinking that #2 offers the best reliability: with RAID1, a single
SSD failure does not become a single point of failure that takes out the
whole node.

As a side note, the RAID is hardware-managed; Linux sees the RAID volumes as
normal disks, e.g.,

[root@data-02 ~]# lsblk -o WWN,SIZE,TYPE,VENDOR,MODEL | grep RAID
0x600062b101003b803043a322a6ee38d0  550G disk Lenovo   RAID 940-32i 8GB
0x600062b101003b803043a3138fd13cab    3T disk Lenovo   RAID 940-32i 8GB

The controller has power loss protection:

[root@data-02 ~]# /opt/MegaRAID/storcli/storcli64 /c0 show all | grep BBU
BBU Status = 0
BBU = Yes
BBU = Present
Cache When BBU Bad = Off

>If you do that, your failure domains will have approximate CRUSH weights of
>18, 21, and 22.8. With replication == the number of failure domains, only 18T
>of the second two will be usable.

Thanks for pointing this out!

Gustavo
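A note on the open PLP question above: if smartmontools can reach the drives
through the 940-32i (SATA drives behind Broadcom/LSI controllers are usually
reachable via the megaraid passthrough), the drives' own SMART data may settle
it, since Intel DC/D3 SATA parts typically report a Power_Loss_Cap_Test
attribute for the power-loss capacitors. A minimal sketch only; the device
index (0), the /dev/sda path, and the exact attribute name are assumptions
that depend on how the controller enumerates the drives and on the firmware:

[root@data-02 ~]# smartctl -i -A -d megaraid,0 /dev/sda | grep -i -e 'device model' -e 'power_loss'

If the attribute is present and not flagged as failed, the capacitors are
there and passing their self-test; if it is absent, the vendor datasheet
remains the only way to confirm PLP.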
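For option #2, once the 3 TB RAID1 volume is brought up as an OSD, the usual
way to keep the metadata pool on those SSDs is a device-class CRUSH rule. A
sketch under two assumptions: that the SSD-backed OSDs carry (or are manually
given) the "ssd" device class, and that the metadata pool is named
cephfs_metadata:

[root@data-02 ~]# ceph osd crush rule create-replicated meta-ssd default host ssd
[root@data-02 ~]# ceph osd pool set cephfs_metadata crush_rule meta-ssd

Because the drives sit behind the RAID controller, the OSDs may not
auto-detect as "ssd"; in that case, ceph osd crush rm-device-class osd.N
followed by ceph osd crush set-device-class ssd osd.N sets the class
explicitly before creating the rule.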
