For RBD on HDD, putting the DB on SSD (typically shared by multiple HDDs) gives better write performance, but the SSD will wear out fast.

For S3, index on SSD with data and DB on HDD works fine.

For density, it is normally fine, but after a failure or during maintenance, recovery on dense HDD nodes will be very slow.

For EC, 8+3 is fine on HDD if there is no performance requirement. We tried 6+3 on NVMe, and CPU usage was very high, which hurt performance.

For networking, 2x50G or 2x100G is too much for HDD. 2x25G is sufficient.
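
A rough sketch of the arithmetic behind the EC and networking points, in Python. The raw-space overhead of a k+m profile is (k+m)/k. The per-HDD network share assumes an active-backup bond (as in the quoted specification), so only one link carries traffic at a time. HDD_COUNT is a hypothetical value for illustration only, not a figure from this thread.

```python
from fractions import Fraction


def ec_overhead(k: int, m: int) -> Fraction:
    """Raw bytes stored per usable byte for a k+m erasure-coded pool."""
    return Fraction(k + m, k)


def per_hdd_gbps(link_gbps: float, hdd_count: int) -> float:
    """Bandwidth share per HDD when one bonded link is active (active-backup)."""
    return link_gbps / hdd_count


# Profiles mentioned in the thread: 8+3, 6+3, 4+2.
for k, m in [(8, 3), (6, 3), (4, 2)]:
    ratio = float(ec_overhead(k, m))
    print(f"EC {k}+{m}: {ratio:.3f}x raw per usable byte, tolerates {m} lost shards")

# Hypothetical node size, chosen only to show the calculation.
HDD_COUNT = 60

# Link speeds discussed in the thread, in Gb/s per active link.
for link_gbps in (25, 50, 100):
    share = per_hdd_gbps(link_gbps, HDD_COUNT)
    print(f"{link_gbps}G active link / {HDD_COUNT} HDDs = {share:.2f} Gb/s per HDD")
```

For comparison, the quoted message reports only 2-3Gbps of steady-state traffic on 27-HDD nodes, so the per-HDD share mostly matters during recovery and rebalancing.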

Tony
________________________________________
From: Anthony D'Atri via ceph-users <[email protected]>
Sent: May 19, 2026 08:45 AM
To: Adam Prycki
Cc: [email protected]
Subject: [ceph-users] Re: Looking for recommendations on Ceph node specifications

>> .
>
> At least 16 or 32 nodes
> in 16 racks.
> Erasure coding 8+3 with failure domain on rack level.
>

Ack.

> We initially selected 8+3 over 4+2 because we expect rebuilds to take very
> long with nodes this big and we don't want to lose redundancy

Fair enough. You get more nines with m=3 for sure, though the wider profile itself will mean slower scrubs and recovery. I suspect you set mon_osd_down_out_subtree_limit?

>
>>>
>>> Splitting JBOD logically into 2 servers isn't an issue for us because we
>>> will replicate data on rack level and not host level.
>>>
>>> Common specifications for all variants
>>>
>>> 5-6GB of RAM per 1 HDD
>> Plus more for mons and other daemons? Especially MDS?
>
> Other daemons will be on some dedicated non-storage servers.

Ack. Had to ask.

> We aim for low RAM/HDD on storage nodes. Other daemons won't fit there.
>
>>> 2% of HDD capacity in NVMe devices for block.db (or none)
>>> 2x 50Gb or 2x 100Gb Ethernet per server (active-backup bonded interfaces)
>>> (CPU per OSD to be determined)
>>>
>>> Variant A1 is very unlikely to happen but we are curious what network
>>> interface speeds would you suggest for so many HDDs in one node.
>> 100GE bonded at the least. Depends on your workload.
>>>
>>> Variant A2 is the most likely the one we will choose for large deployment.
>>>
>>> Variant B1/B2 for smaller deployments.
>>>
>>> Does anyone of you run ceph on similar setups? Did you find any pitfall
>>> with it?
>>>
>>> What are your minimal recommendations for network speed per HDD, cpu per
>>> HDD, etc?
>>>
>>> In our experience most of our servers, even in large clusters, never max
>>> out the network interfaces or CPUs. We almost never rebuild or rebalance
>>> whole servers. 27 HDD nodes of our biggest CephFS cluster with EC usually
>>> have only 2-3Gbps of network traffic.
>> Your workload is archival?
>
> Yes, mostly archival.
> We have big demand for S3 and CephFS.
> But we may move to pure s3 cluster in the future.
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
