> On Nov 4, 2025, at 1:32 PM, Gustavo Garcia Rondina <[email protected]> wrote:
>
> Hello Ceph community,
>
> We are provisioning a new Ceph cluster and would appreciate design advice.
> Below is our current hardware:
>
> - 13 data servers each with 18 x 24 TB HDD and 2 x 3.5 TB SSD (6 servers in
> rack A, 7 in rack B)
I'd spread them over more racks if you can, and align the CRUSH topology.
Are these SSDs SAS/SATA or NVMe?

> - 3 control servers each with 2 x 960 GB SSD and 4 x 1.9 TB SSD (2 in rack A,
> 1 in rack B)

Five mons are better than three, but you could place the two additional mons
on the OSD nodes. Are these SSDs SAS/SATA or NVMe?

>
> The planned topology is:
>
> - A single CephFS backed by a data pool
> - The 3 control servers will run MON/MGR/MDS and other daemons. The 13 data
> servers will host OSDs and some extra MONs as needed.

Ack.

>
> Current idea (constraints: we cannot add hardware) is on each data server we
> will create a RAID1 from the two 3.5 TB SSDs and carve it into a 500 GB
> virtual disk for the OS and a 3 TB virtual disk for DB/WAL (Bluestore).

I recommend not mixing the OS and data. What SKU are these SSDs? Are they
enterprise-class with PLP? If those are SAS/SATA SSDs, conventional wisdom is
not to offload more than 4-5 HDD OSDs' WAL+DBs onto each.

> Each HDD will be an OSD, giving 18 x 24 TB = 432 TB raw HDD per data node.
>
> 1) Is it acceptable to host WAL/DB on a RAID1 virtual disk made from the two
> SSDs, or is that a bad idea for performance/reliability?

It's better to map half the OSDs to each SSD; RADOS handles redundancy. Some
people mirror their offload SSDs, but IMHO that mostly just burns them twice
as fast. A cephadm OSD spec sketch for that layout follows after my reply to
question 3.

>
> 2) Is 3 TB for DB/WAL per data node (~167 GB per OSD) likely to be sufficient
> for 18 x 24 TB OSDs, or should we expect to need more DB space given CephFS
> metadata/object counts? (We've seen a 2% rule cited, that would be ~8.6 TB
> per node which is much larger)

Rightsizing depends in part on your workload. Will you host a modest number
of big files, a zillion tiny ones, or a mix? I suspect that with your numbers
the question is moot and you'll have to forgo WAL+DB offload. This is one
reason to disfavor dense OSD nodes. I have a client who bought similar
systems and has added M.2 adapter PCIe AICs to provision M.2 NVMe SSDs for
metadata OSDs and WAL+DB offload. Note that AFAICT there are only two
*enterprise* M.2 SSDs available, the Micron 7450 PRO and a Samsung model, as
this form factor has been more or less superseded by E1.S.

> 3) Alternative: keep WAL/DB on the HDDs (i.e., do not separate),

The value of offloading WAL+DB depends on your use case. Ideally I would
recommend QLC SSDs for better performance and density, and in many cases
cost, but that seems moot for your deployment.

> and use the 3 TB RAID1 SSD area plus the control servers' SSDs (4 x 1.9 TB
> each) as SSD OSDs for the CephFS metadata pool. Would that be a reasonable
> approach?

You need the CephFS metadata pool, and ideally the first data pool, on SSDs.
You would create a second data pool on the HDD OSDs and use the usual means
to define your top-level subdirectories to place data there; a sketch of that
plumbing is at the bottom of this message.

Skip the RAID. Make the 4x SSDs on the control servers into 4x3 = 12 OSDs.
That is fewer than I'd recommend for the metadata pool, but you have what you
have. Be sure to bump the autoscaler target so you get more PGs on them.

Since you have 13x OSD nodes, I'd consider not mirroring the boot SSDs and
instead provisioning each as 2x OSDs, assuming that they are equivalent to
the SSDs in the control servers, i.e. all NVMe or all SATA. I might
hesitantly suggest partitioning 1 TB of that boot SSD for the OS and using
the rest as 2x OSDs as well, *iff* they are NVMe. If they are SAS/SATA, I
would not mix OSD and OS. I have a customer who does that, and it has ...
not worked out well.
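Coming back to question 1: if the sizing does work out and you keep the
offload, a cephadm OSD spec along these lines would spread the DB/WAL
volumes across the two SSDs with no RAID involved. This is only a sketch;
the service id, host pattern, and slot count are my assumptions, not
anything from your environment, so run it with --dry-run first and check
what cephadm proposes before applying it.

    # Hypothetical spec; adjust host_pattern to match your data node names.
    cat > osd-hdd-ssddb.yml <<'EOF'
    service_type: osd
    service_id: hdd-with-ssd-db
    placement:
      host_pattern: 'data*'
    spec:
      data_devices:
        rotational: 1      # the 18x 24 TB HDDs become OSDs
      db_devices:
        rotational: 0      # the 2x 3.5 TB SSDs hold DB/WAL, used directly, no RAID
      db_slots: 9          # nine DB/WAL slices per SSD, so 9 HDD OSDs share each SSD
    EOF
    ceph orch apply -i osd-hdd-ssddb.yml --dry-run

That works out to roughly 380 GB of DB/WAL per OSD from a 3.5 TB SSD, versus
the ~167 GB you'd get from the 3 TB RAID1 slice, though the 4-5 OSDs per
SAS/SATA SSD caveat above still applies.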
>
> 4) If we do SSD-backed metadata OSDs with 3x replication for the metadata
> pool, we were thinking of CRUSH domains: domain1 = 6 data servers in rack A
> (6 x 3 TB SSD OSDs), domain2 = 7 data servers in rack B (7 x 3 TB), domain3 =
> 3 control servers (12 x 1.9 TB SSD OSDs). Any pitfalls with that CRUSH layout?

If you do that, your failure domains will have approximate CRUSH weights of
18, 21, and 22.8 TB. With replication equal to the number of failure domains,
each replica must land in a different domain, so the smallest domain sets the
ceiling: only ~18 TB of the larger two will actually be usable. If you do go
with SSD metadata OSDs, a rough sketch of the rules and pool plumbing is
below.

>
> Any suggestions would be highly appreciated.
>
> Thank you,
> Gustavo
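P.S. Since questions 3 and 4 hang together, here is the sketch I mentioned of
device-class CRUSH rules and the pool layout. Everything here is a
placeholder: pool names, PG numbers, the ratio, and the mount path are made
up, and I'm using host rather than rack as the failure domain precisely
because of the imbalance described above. Treat it as a starting point, not
a recipe.

    # Device-class rules: replicate across hosts, restricted to ssd / hdd OSDs.
    ceph osd crush rule create-replicated rule-ssd default host ssd
    ceph osd crush rule create-replicated rule-hdd default host hdd

    # Metadata pool and first (default) data pool on the SSD OSDs,
    # bulk data pool on the HDD OSDs.
    ceph osd pool create cephfs_metadata 32
    ceph osd pool create cephfs_data_ssd 32
    ceph osd pool create cephfs_data_hdd 256
    ceph osd pool set cephfs_metadata crush_rule rule-ssd
    ceph osd pool set cephfs_data_ssd crush_rule rule-ssd
    ceph osd pool set cephfs_data_hdd crush_rule rule-hdd

    ceph fs new cephfs cephfs_metadata cephfs_data_ssd
    ceph fs add_data_pool cephfs cephfs_data_hdd

    # Bump the autoscaler so the small SSD pools get enough PGs,
    # and flag the big HDD pool as bulk.
    ceph osd pool set cephfs_metadata target_size_ratio 0.2
    ceph osd pool set cephfs_data_hdd bulk true

    # On a mounted client, steer a top-level directory's data to the HDD pool.
    setfattr -n ceph.dir.layout.pool -v cephfs_data_hdd /mnt/cephfs/bulk

Note that the directory layout xattr only affects files created after it is
set, so set it on your top-level directories before you start loading data.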
