These things are always controversial; below are my thoughts. Your mileage may vary.
> We plan to buy hardware for a new cluster ceph and would like some
> approbation on what we choose.
>
> 6 nodes DELL 6715 + 6 Powervault MD2412 enclosure

I would not recommend using external drive arrays:

* You're paying money for an HBA.
* Proprietary tool for managing the thing; granted, this one looks like a
  JBOD vs. one with an embedded RAID controller, which I've previously
  wrangled with.
* 24 Gb/s SAS is still SAS. SAS or SATA purchases today pretty much lock you
  into HDDs in the future.

> For each node =>
>
> 1 CPU AMD EPYC 9475F 3,65 GHz, 48C/96T, 256M Cache (400 W)

Yikes. Higher-frequency CPUs, and those with the highest core counts, tend to
cost more in oomph per euro than middle-of-the-road SKUs. Note the significant
list-price difference compared to, say, the 9455P. For many purposes, solve
for $ / (cores * base clock); there's a quick sketch of that at the end of
this message. ServeTheHome posts nice value-analysis charts. Heck, if a
single-socket system meets your needs for e.g. RAM capacity and PCIe lanes,
consider the 9555P.

> RAM 16Gox16 = 256Go

For 12 OSDs you would likely be fine with 128GB, with slots left empty for
expansion (rough math at the end).

> 4x 100GB NVIDIA MELLANOX 100GB

Why 4x? With 12 HDD OSDs you'd be well served by 2x 25GbE, skipping a
replication network.

> HBA465e (externe, 22,5GB/s)
>
> 2 NVMe mixed 6,4To DBWAL
> 6 NVMe mixed 6,4To

Mixed-use SSDs are usually overkill. They're usually the same hardware as
read-intensive SSDs with the overprovisioning slider (and the price) bumped
up. "Read-intensive" and "mixed-use" are marketing terms. I worked for the
product / marketing team of an SSD manufacturer, trust me on this ;)

> 5 NVMe read 15,36To
>
> for each enclosure =>
>
> 12 HDD 20To
>
> HDD will be a replica 3 pool with the dbwal

With modern NVMe SSDs, I might suggest 1x 7.6T for DB/WAL offload. You can go
12:1, which works out to roughly 600GB of DB/WAL device per HDD OSD.

> NVME mixed and read will be 2 differents pools (we will test replica 3 and EC
> to see which performance/storage efficency satisfy us the most)

Why two different, small pools? You'd be better off getting all read-intensive
NVMe SSDs and, to keep things simple, making them all the same capacity. You
would benefit from having more OSDs in a single pool, especially since this is
a small cluster.

> This cluster will be mostly use to store block (VM proxmox/kubernetes)

On the SSDs, I presume? That workload on HDDs, especially at 20T, is unlikely
to be adequate.

> and S3.

HDDs just for the bucket pool, right? Everything else on the SSDs?

> The crushmap will be like so : 3 rooms with 2 nodes per room.
>
> So the point of failure can be for replica 3 => rooms and if we use EC 4+2 =>
> host

With Tentacle's fast EC, you *might* find that RBD-on-EC performance is
acceptable; worth testing.

> We saw that the thread needs for NVMe OSD are very expensive, does this CPU
> good enough to carry them ?

Be sure to read those references carefully. The one that describes gains up to
14 threads per OSD is in a very specific context. So let's say you have 8x 15T
NVMe RI OSDs + 12x HDD OSDs per node. I might budget 6 and 4 threads each,
respectively, plus some headroom for mon, mgr, RGW, etc. The 96-thread CPU is
a bit tight given that math (see the end of this message), but I think you'd
probably be okay.

> MON and MGR will be spread across the cluster and RGW on virtual machine

Oh, RGWs on VMs would alleviate the CPU appetite a bit. This is not an unknown
strategy.

I would personally consider an all-NVMe chassis with a mix of TLC and
QLC / QLC-class SSDs for the bucket pool. You'd use fewer RUs and your
recovery times would be dramatically improved (numbers below).

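Some back-of-the-envelope math on the points above, as Python sketches. First,
the $ / (cores * base clock) value metric. The 9475F core count and base clock
are taken from your quoted config; the price in the example is a made-up
placeholder, not a real list price, so plug in the quotes you actually receive.

    # Rough CPU value metric: price per (core * GHz of base clock); lower is better.
    # The price below is a PLACEHOLDER, not a real list price -- use your quotes.

    def eur_per_core_ghz(price_eur: float, cores: int, base_ghz: float) -> float:
        """Cost per core-GHz: a crude oomph-per-euro comparison between SKUs."""
        return price_eur / (cores * base_ghz)

    # EPYC 9475F: 48 cores, 3.65 GHz base, taken from the quoted config.
    print(f"9475F: {eur_per_core_ghz(5000.0, 48, 3.65):.2f} EUR per core-GHz")
    # Repeat with the quoted price / cores / base clock of, e.g., the 9455P or 9555P.
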
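On the RAM point: a rough sketch assuming Ceph's default osd_memory_target of
4 GiB per OSD; the OS/daemon overhead figure is just my placeholder.

    # Rough per-node RAM budget: OSDs at the default osd_memory_target of 4 GiB
    # each, plus a placeholder allowance for the OS and colocated daemons.
    osd_count = 12                    # plug in your real per-node OSD count
    osd_memory_target_gib = 4         # Ceph's default osd_memory_target
    os_and_daemon_overhead_gib = 16   # placeholder assumption

    total_gib = osd_count * osd_memory_target_gib + os_and_daemon_overhead_gib
    print(f"~{total_gib} GiB needed -> 128 GB per node leaves plenty of headroom")
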
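On the thread budget: the 6 and 4 threads per OSD are my rough assumptions
from above, not measured requirements.

    # Per-node OSD thread budget under the assumptions above.
    nvme_osds, threads_per_nvme_osd = 8, 6   # read-intensive NVMe OSDs
    hdd_osds, threads_per_hdd_osd = 12, 4    # HDD OSDs

    osd_threads = nvme_osds * threads_per_nvme_osd + hdd_osds * threads_per_hdd_osd
    cpu_threads = 96                         # 48C/96T EPYC 9475F

    print(f"OSD budget: {osd_threads} of {cpu_threads} hardware threads")
    print(f"Left for mon/mgr/RGW/OS: {cpu_threads - osd_threads}")
    # 96 of 96, i.e. nothing left over -- which is why I call it tight, and why
    # running RGW elsewhere helps.
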
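And on recovery time: a best-case sketch of simply rewriting a 20T HDD,
assuming a generous sustained 200 MB/s (my assumption; real backfill, with
seeks and client I/O, is slower).

    # Best-case time to rewrite a 20 TB HDD at an assumed sustained 200 MB/s.
    capacity_bytes = 20e12
    sustained_write_bps = 200e6   # optimistic: pure sequential, no seeks, no client I/O

    seconds = capacity_bytes / sustained_write_bps
    print(f"~{seconds / 3600:.0f} hours ({seconds / 86400:.1f} days) at best")
    # Roughly 28 hours best case; an NVMe-only chassis shrinks this dramatically.
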
Consider how long it takes to write a 20T HDD.

> Thanks for your answer !
>
> Vivien

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
