In the past we deployed HDD OSDs (without SSD block.db) and SSD
OSDs for an S3 cluster. It worked fine.
As we understand it, on a Ceph cluster of HDDs with SSD block.db,
most of the RGW metadata (index shards) will end up on the SSD block.db.
So, if we can afford it, an HDD+block.db setup for S3 would be better for
general performance and recovery speed.
Is this assumption correct?
Best regards
Adam Prycki
On 19/05/2026 18:44, Tony Liu wrote:
For RBD on HDD with the DB on SSD (typically shared by multiple HDDs), you get
better write performance, but the SSD will wear out fast. For S3, with the
index on SSD and data and DB on HDD, it will work fine. As for density, it is
normally fine, but after a failure or during maintenance, recovery on dense
HDDs will be very slow.
For EC, 8+3 is fine on HDD if there is no performance requirement. We tried 6+3
on NVMe; CPU usage was very high, which hurt performance. For networking,
2x50G or 2x100G is too much for HDD. 2x25G is sufficient.
Tony
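To put rough numbers on the networking point, here is a minimal Python sketch that compares the peak aggregate throughput of a node's HDDs with the usable bandwidth of a bonded pair of NICs. The HDD count (24) and the per-disk rate (200 MB/s sustained sequential) are assumptions chosen for illustration, not figures from this thread. The active-backup default follows the bonding mode named in the original specification.

```python
def hdd_aggregate_gbps(hdd_count: int, mb_per_s_per_hdd: float) -> float:
    """Peak aggregate disk throughput of one node in Gbit/s (1 MB/s = 0.008 Gbit/s)."""
    return hdd_count * mb_per_s_per_hdd * 8 / 1000


def usable_link_gbps(link_gbps: float, links: int, active_backup: bool = True) -> float:
    """Usable bond bandwidth: an active-backup bond carries traffic on one link only."""
    return link_gbps if active_backup else link_gbps * links


# Hypothetical node: 24 HDDs at an assumed 200 MB/s sustained sequential rate each.
disks = hdd_aggregate_gbps(24, 200)
for speed in (25, 50, 100):
    nic = usable_link_gbps(speed, links=2)
    print(f"2x{speed}G active-backup: {nic:.0f} Gbit/s usable vs "
          f"{disks:.1f} Gbit/s peak from disks")
```

The sequential peak is only an upper bound. HDD OSDs under mixed client and recovery I/O usually deliver far less, so the comparison shows the worst case the network could be asked to carry, not the expected load.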
________________________________________
From: Anthony D'Atri via ceph-users <[email protected]>
Sent: May 19, 2026 08:45 AM
To: Adam Prycki
Cc: [email protected]
Subject: [ceph-users] Re: Looking for recommendations on Ceph node
specifications
At least 16 or 32 nodes in 16 racks.
Erasure coding 8+3 with failure domain on rack level.
Ack.
We initially selected 8+3 over 4+2 because we expect rebuilds to take very long
with nodes this big and we don't want to lose redundancy.
Fair enough. You get more nines with m=3 for sure, though the wider profile
itself will mean slower scrubs and recovery. I suspect you set
mon_osd_down_out_subtree_limit?
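The trade-off between the two profiles can be sketched with simple arithmetic. The Python snippet below is an illustration only: the function name and dictionary keys are made up for this example, and it covers just chunk count, raw-space overhead and tolerated losses, not scrub or recovery time.

```python
def ec_profile(k: int, m: int) -> dict:
    """Basic arithmetic for an erasure-code profile with k data and m coding chunks."""
    return {
        "chunks": k + m,                 # failure domains (racks here) each PG must span
        "raw_per_usable": (k + m) / k,   # raw bytes stored per usable byte
        "usable_fraction": k / (k + m),  # share of raw capacity that holds data
        "tolerated_failures": m,         # chunks that can be lost without data loss
    }


for k, m in [(4, 2), (8, 3)]:
    p = ec_profile(k, m)
    print(f"{k}+{m}: {p['chunks']} chunks, {p['raw_per_usable']:.3f}x raw, "
          f"{p['usable_fraction']:.1%} usable, survives {p['tolerated_failures']} losses")
```

With a rack-level failure domain, 8+3 needs at least 11 racks, which fits within the 16 racks mentioned above. The wider profile also means every recovery has to read from 8 surviving chunks instead of 4.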
Splitting the JBOD logically into 2 servers isn't an issue for us because we
will replicate data at rack level, not host level.
Common specifications for all variants
5-6GB of RAM per 1 HDD
Plus more for mons and other daemons? Especially MDS?
Other daemons will be on some dedicated non-storage servers.
Ack. Had to ask.
We aim for low RAM/HDD on storage nodes. Other daemons won't fit there.
2% of HDD capacity in NVMe devices for block.db (or none)
2x 50Gb or 2x 100Gb Ethernet per server (active-backup bonded interfaces)
(CPU per OSD to be determined)
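As a rough illustration of what these per-HDD ratios mean for a whole node, here is a small Python sketch. The 5-6 GB of RAM per HDD and the 2% block.db ratio come from the list above. The HDD count (60) and capacity (20 TB) in the example call are hypothetical placeholders, and the function name is made up for this example.

```python
def node_sizing(hdd_count: int, hdd_tb: float,
                ram_gb_per_hdd=(5, 6), db_fraction: float = 0.02) -> dict:
    """Scale the per-HDD ratios from the spec list up to one storage node."""
    db_per_hdd_gb = hdd_tb * 1000 * db_fraction  # decimal units: 1 TB = 1000 GB
    return {
        "ram_gb_min": hdd_count * ram_gb_per_hdd[0],
        "ram_gb_max": hdd_count * ram_gb_per_hdd[1],
        "db_gb_per_hdd": db_per_hdd_gb,
        "db_tb_total": hdd_count * db_per_hdd_gb / 1000,
    }


# Hypothetical node: 60 HDDs of 20 TB each (not figures from this thread).
s = node_sizing(hdd_count=60, hdd_tb=20)
print(f"RAM: {s['ram_gb_min']}-{s['ram_gb_max']} GB")
print(f"block.db: {s['db_gb_per_hdd']:.0f} GB per HDD, {s['db_tb_total']:.1f} TB NVMe total")
```

The sketch leaves out RAM for the operating system and any NVMe overprovisioning, so treat the results as lower bounds.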
Variant A1 is very unlikely to happen, but we are curious what network interface
speeds you would suggest for so many HDDs in one node.
100GE bonded at the least. Depends on your workload.
Variant A2 is most likely the one we will choose for large deployments.
Variant B1/B2 for smaller deployments.
Do any of you run Ceph on similar setups? Did you find any pitfalls?
What are your minimal recommendations for network speed per HDD, CPU per HDD,
etc?
In our experience most of our servers, even in large clusters, never max out
the network interfaces or CPUs. We almost never rebuild or rebalance whole
servers. 27 HDD nodes of our biggest CephFS cluster with EC usually have only
2-3Gbps of network traffic.
Your workload is archival?
Yes, mostly archival.
We have big demand for S3 and CephFS.
But we may move to a pure S3 cluster in the future.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]