[ceph-users] Re: New user with some questions...

Patrick Begou Thu, 13 Nov 2025 08:46:44 -0800

On 13/11/2025 at 15:58, Matthias Riße wrote:

Hey all!
I am evaluating Ceph at work and have some questions for people with more experience with it, to avoid making dumb mistakes... I hope this is the right place to ask.
So, to give some context, we have 5 machines with about 400 TiB of raw storage total in the form of 8 HDDs per host. The hardware is not all the same: two nodes have a rather low core count and would be dedicated entirely to Ceph, while the other three I would like to use in an HCI fashion to run VMs off of Ceph. I'd like to at least be able to take one node at a time out of rotation for maintenance purposes. I'd be using cephadm for the setup, which in my testing works really well.
As for the questions:
1. 3x replication is a tough sell IMO. For better storage utilization I am looking at EC. 2+2 would be an obvious choice, and seems to be rather performant. But what about MSR with higher values, e.g. 5+3 with 2 OSDs per host across 4 hosts? Or, taken to more of an extreme, 17+7 with 6 OSDs across 4 hosts? This should also allow me to take one host out for maintenance at a time, but has better storage utilization. Is this a stupid idea? (16+8 might make more sense, to also be able to sustain a disk failure while in maintenance.)


Hi Matthias

Just a partial answer, based on my limited experience.

I've built a small Ceph cluster with 5 nodes for capacity storage (at a research team level 😉), so with *HDDs*. I'm using EC profiles and a *CephFS* kernel mount on 3 computational clusters (approx. 1000 cores aggregated) to store simulation data. There are only 4 OSDs per node in this first round, which is low. The network is 25 Gb/s Ethernet on both servers and clients (same switch, same VLAN).

I'm using a 5+3 profile with min_size set to 6 and 2 chunks per node (on 2 different OSDs), and it works fine. I can lose one node in production. Of course it is a minimal setup, and my goal is to add a node and some OSDs.
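As a rough sanity check of the layout described above, the chunk arithmetic can be sketched in Python. The function name and dictionary keys are made up for this illustration, and the sketch only counts chunks; it does not model CRUSH placement, recovery, or OSD failures inside a node.

```python
import math


def node_loss_report(k, m, chunks_per_node, min_size):
    """Count what is left of each placement group when one whole node is down."""
    total = k + m
    remaining = total - chunks_per_node  # chunks still reachable on other nodes
    return {
        # Minimum number of nodes needed to place all chunks of one PG.
        "nodes_needed": math.ceil(total / chunks_per_node),
        "remaining_chunks": remaining,
        # Data can be reconstructed as long as at least k chunks survive.
        "data_recoverable": remaining >= k,
        # The pool keeps serving I/O only while at least min_size chunks are up.
        "pool_stays_active": remaining >= min_size,
        # Further chunk losses that could be absorbed without losing data.
        "spare_chunks": max(remaining - k, 0),
    }


# Values taken from the message: k=5, m=3, 2 chunks per node, min_size=6.
report = node_loss_report(k=5, m=3, chunks_per_node=2, min_size=6)
print(report)
```

With these inputs, 8 chunks spread 2 per node need at least 4 nodes, and losing one node leaves 6 chunks, which matches min_size=6 and still leaves one chunk of margin above k=5.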

I ran some comparisons with a 2+2 EC profile (*distributing only one chunk per node*), but its I/O performance is significantly lower when writing.

Efficiency with 24 MPI processes, each writing/reading 4 GB (k=2+m=2 set as the reference):



          k=5+m=3   k=2+m=2
Write     126%      100%
Read      95%       100%

Average performance over 10 runs (10 write runs with different file names, then 10 read runs), in MB/s:


24 x 4 GB         avg (MB/s)
k=5+m=3 write      747.72
k=5+m=3 read      1344.00
k=2+m=2 write      595.46
k=2+m=2 read      1418.00

Efficiency with 8 MPI processes, each writing/reading 8 GB (k=2+m=2 set as the reference):



          k=6+m=2   k=5+m=3   k=2+m=2
Write     137%      118%      100%
Read      98%       117%      100%

Average performance over 10 runs (10 write runs with different file names, then 10 read runs), in MB/s:


8 x 8 GB          avg (MB/s)
k=6+m=2 write      908.88
k=6+m=2 read      1227.00
k=5+m=3 write      783.10
k=5+m=3 read      1385.00
k=2+m=2 write      665.01
k=2+m=2 read      1252.00
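The efficiency percentages can be recomputed from the averages as a quick check. This is only a sketch: the dictionary layout and the function name are invented for the illustration, and the numbers are copied from the tables above.

```python
# Average throughput in MB/s, copied from the two result tables above.
AVERAGES = {
    "24 x 4 GB": {
        "k=5+m=3": {"write": 747.72, "read": 1344.00},
        "k=2+m=2": {"write": 595.46, "read": 1418.00},
    },
    "8 x 8 GB": {
        "k=6+m=2": {"write": 908.88, "read": 1227.00},
        "k=5+m=3": {"write": 783.10, "read": 1385.00},
        "k=2+m=2": {"write": 665.01, "read": 1252.00},
    },
}


def relative_efficiency(results, reference="k=2+m=2"):
    """Return each profile's throughput as a rounded percentage of the reference."""
    ref = results[reference]
    return {
        profile: {op: round(100 * value / ref[op]) for op, value in ops.items()}
        for profile, ops in results.items()
    }


for test_case, results in AVERAGES.items():
    print(test_case, relative_efficiency(results))
```

With the averages as listed, the ratios match the efficiency tables except for the 8 x 8 GB read value of k=5+m=3, which comes out near 111% rather than 117%, so one of those two numbers may contain a typo.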

The k=6+m=2 profile is not safe in production with 5 nodes and 2 chunks per node: losing one node removes 2 chunks, which already uses up all of m=2.

So in the end I am staying with k=5+m=3, which also gives more available storage than k=2+m=2.
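To put numbers on the storage side, here is a minimal sketch of the usable fraction k/(k+m) for the three profiles tested. The 400 TiB figure is the raw capacity from the original question, used only as an example input, and the result ignores metadata overhead, full ratios, and uneven OSD sizes.

```python
def usable_fraction(k, m):
    """Fraction of raw capacity that holds data for an EC profile with k data and m coding chunks."""
    return k / (k + m)


RAW_TIB = 400  # example raw capacity, taken from the original question

for k, m in [(2, 2), (5, 3), (6, 2)]:
    fraction = usable_fraction(k, m)
    print(f"k={k}+m={m}: {fraction:.1%} usable, about {RAW_TIB * fraction:.0f} TiB of {RAW_TIB} TiB raw")
```

Under these assumptions k=5+m=3 sits between the other two: more usable space than k=2+m=2, less than k=6+m=2.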

Patrick

2. I've read that I can update the crush rule of EC pools after the fact, to change both the failure domain as well as the device class of the pool. What about changing k, m, or the plugin type? My understanding is that this is not supported, but Ceph didn't stop me from doing it and it seems to do /something/ when those values are changed?
3. Right now we are using libvirt with qcow2 images on local storage. I know that with Ceph the commonly recommended way would be to use RBD instead, but we have an existing proprietary tape archive for backup purposes whose official client can, to my knowledge, only do file-based backups, and around which we already have a system to back up live snapshots of VMs. How bad of an idea would it actually be to use qcow2 on top of (kernel- or fuse-mounted?) CephFS? So far it seems to perform on par with RBD in my testing, but both also seem to fully saturate the single OSD per host I am testing with anyway.
4. In v20 there seems to be a new ec_optimizations feature for EC pools used in CephFS or RBD. Are those a good idea with this kind of (large) qcow2 image workload on top of CephFS?
5. Speaking of v20, while it is not yet the latest "active" version, extrapolating from the releases suggests that it could soon become that. Should I be waiting for it / start a new cluster with v20 now already?
Thanks for any insights you can give!

Kind regards
Matthias Riße


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]

