> On Nov 13, 2025, at 9:58 AM, Matthias Riße <[email protected]> wrote:
>
> Hey all!
>
> I am evaluating Ceph at work and have some questions for people with more
> experience with it, to avoid making dumb mistakes... I hope this is the right
> place to ask.
Guten Tag, Herr Riße.

> So, to give some context, we have 5 machines with about 400 TiB of raw
> storage total in the form of 8 HDDs per host. The hardware is not all the
> same: two nodes have a rather low core count and would be dedicated entirely
> to Ceph; the other three I would like to use in an HCI fashion to run VMs
> off of Ceph.

Depending on your virtualization strategy, you could do that with Ceph, either with a standalone cluster external to the virtualization solution or via Rook / K8s. Labeling etc. could be used to prevent compute pods from being scheduled on the more modest systems.

> I'd like to at least be able to take one node at a time out of rotation for
> maintenance purposes.

Quite prudent, and that's one reason to use all five nodes for Ceph.

> I'd be using cephadm for the setup, which in my testing works really well.
>
> As for the questions:
>
> 1. 3x replication is a tough sell IMO.

That depends on your use case. If you're using HDDs, then your performance is going to be limited anyway, though.

> For better storage utilization I am looking at EC. 2+2 would be an obvious
> choice, and seems to be rather performant.

Depending on your use case, I would agree. There is significant value in having at least one more node than the replication factor. You *could* do 3+2, but there are a number of drawbacks on 5 hosts and I would not recommend it.

> But what about MSR with higher values, e.g. 5+3 with 2 OSDs per host across
> 4 hosts? Or taken to more of an extreme, 17+7 with 6 OSDs across 4 hosts?
> This should also allow me to take one host out for maintenance at a time,
> but has better storage utilization. Is this a stupid idea? (16+8 might make
> more sense, to also be able to sustain a disk failure while in maintenance.)

It's not a stupid idea at all, but it's not the choice I would make.

With 2+2, your space amplification factor, (k+m) / k, is 2.0. With 17+7 it is 1.41, but you will find your write speed distinctly impacted, and since HDDs have very limited IOPS, your reads as well. Your scrubs won't be able to keep up, and recovery/backfill will be very slow. 16+8 wouldn't be much different in many regards.

MSR rules are very clever, props to Sam there.

There was an analysis a few years back that concluded that with EC there is a CPU benefit to values of k and m with small prime factors. 17+7 thus would be disadvantageous, though the dynamics may be different with the Fast EC changes coming in Tentacle.

Here's a table showing the space amp factor, (k+m) / k, of various EC profiles:

  profile   space amp
  2+2       2.00
  3+2       1.67
  4+2       1.50
  5+3       1.60
  8+3       1.38
  8+4       1.50
  17+7      1.41
  16+8      1.50

My sense is that the diminishing returns past, say, 8+3 aren't worth the impact to performance and especially recovery.

> 2. I've read that I can update the crush rule of EC pools after the fact, to
> change both the failure domain as well as the device class of the pool.

A delicate operation, but yes.

> What about changing k, m, or the plugin type?

That you can't do. We've tried to make this clear in the docs; please let me know if it isn't.

> My understanding is that this is not supported, but Ceph didn't stop me from
> doing it and it seems to do /something/ when those values are changed?

Crossing the streams. All bets are off. Note that you *can* change the EC profile setting on an existing pool, but that doesn't actually DO anything.

> 3. Right now we are using libvirt with qcow2 images on local storage. I know
> that with Ceph the commonly recommended way would be to use RBD instead

libvirt can use librbd quite effectively. Zillions of OpenStack instances and certain well-known VPS providers do just this.
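If you do move images off qcow2, qemu-img can write directly to RBD. A minimal sketch, assuming qemu is built with RBD support, a client keyring is in place, and placeholder pool/image names:

  # convert a local qcow2 image straight into a raw RBD image
  # ("rbd" pool and "vm01" are illustrative names)
  qemu-img convert -f qcow2 -O raw \
      /var/lib/libvirt/images/vm01.qcow2 \
      rbd:rbd/vm01

The libvirt domain then references the image via a <disk type='network'> device with protocol='rbd' instead of a file path.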
> , but we have an existing proprietary tape archive for backup purposes whose
> official client can to my knowledge only do file-based backups

You might use "rbd export" to feed into such a system. Vultr does just that.

> and around which we already have a system to back up live snapshots of VMs.
> How bad of an idea would it actually be to use qcow2 on top of (kernel- or
> fuse-mounted?) CephFS?

I don't know enough to say for sure, but this idea makes me twitch. I suspect that FUSE mounts wouldn't be as performant.

> So far it seems to perform on par with RBD in my testing, but both also seem
> to fully saturate the single OSDs per host I am testing with anyway.
>
> 4. In v20 there seems to be a new ec_optimizations feature for EC pools used
> in CephFS or RBD. Are those a good idea with this kind of (large) qcow2
> image workload on top of CephFS?

The Fast EC code is beneficial to any EC workload, to extents that vary with the EC profile, the access modality, and the workload, but I am not aware of a situation in which it would be deleterious. For production deployment, of course, you might let 20.2.0 soak for a bit, consider waiting for 20.2.1, the usual dynamics.

> 5. Speaking of v20, while it is not yet the latest "active" version,
> extrapolating the releases suggests that it could soon become that.

RSN. Really! It's sooooo close.

> Should I be waiting for it / start a new cluster with v20 now already?

Depends on your time demands. For RBD the Fast EC space amp improvements may not be as dramatic as for, say, RGW with mixed large and small objects, and you can always retrofit after an update to Tentacle. The EC *speed* improvements will be significant, though, and you can retrofit an existing pool.

> I take from that that it might be useful if at some point we need bulk
> storage for large sequential data, but definitely not as VM storage.

That workload typically does a lot of small-block random IO and does not do well with classic EC (or HDDs at all). Fast EC in Tentacle should measurably improve the performance of EC for such workloads.

> I actually just learned that we should have another ten 18 TiB drives lying
> around that we could just slot in, so with that my estimation is that we are
> more limited on compute than on raw storage with what we have. Which would
> mean 3x replication after all…

Even with Tentacle, R3 will be faster than 2+2 EC. But note that more OSDs per node means more RAM and CPU consumption. If your existing OSDs are 10TB drives, I strongly suggest taking these steps before deploying:

  mon_target_pg_per_osd = 300
  mon_max_pg_per_osd = 600
  mgr/balancer/upmap_max_deviation = 1
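With cephadm these land in the central config database. A minimal sketch of applying them, assuming the global scope is appropriate for the two mon_* options (you could also scope them more narrowly):

  ceph config set global mon_target_pg_per_osd 300
  ceph config set global mon_max_pg_per_osd 600
  ceph config set mgr mgr/balancer/upmap_max_deviation 1

Applying these before creating pools lets the PG autoscaler pick pg_num values consistent with them from the start.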
> Just to make sure though, failure domains and device classes /can/ be
> changed after the fact, right? Might become relevant whenever we add SSDs
> into the mix.

Yes, absolutely. You can use upmap-remapped.py to moderate the thundering herd of data movement that can result.

> Ack. But RBD does not yet have a way to export at least crash-consistent
> snapshots of a group of images, right? I think I saw that the export
> functionality for group snapshots is still in development…

In Tentacle, I think, or perhaps a Tentacle dot release.

> This is really a limitation of the tape backup system we have: I can't push
> more than about 8 TiB at a time to it, otherwise the connection drops and
> the backup fails.

AMANDA addressed that years ago with its staging filesystem ;)

> Thanks for any insights you can give!
>
> Kind regards
> Matthias Riße

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
