> I am evaluating Ceph at work and have some questions to people with more
> experience with it, to avoid making dumb mistakes... I hope this is the
> right place to ask.
>
> So, to give some context, we have 5 machines with about 400 TiB of raw
> storage total in the form of 8 HDDs per host. The hardware is not all
> the same, two nodes have a rather low core count which would be
> dedicated entirely to Ceph, the other three I would like to use in a HCI
> fashion to run VMs off of Ceph. I'd like to at least be able to take one
> node at a time out of rotation for maintenance purposes. I'd be using
> cephadm for the setup, which in my testing works really well.
>
> As for the questions:
>
> 1. 3x replication is a tough sell IMO. For better storage utilization I
> am looking at EC. 2+2 would be an obvious choice, and seems to be rather
> performant. But what about MSR with higher values, e.g. 5+3 with 2 OSDs
> per host across 4 hosts? Or taken to more of an extreme, 17+7 with 6
> OSDs across 4 hosts? This should also allow me to take one host out for
> maintenance at a time, but has a better storage utilization. Is this a
> stupid idea? (16+8 might make more sense, to also be able to sustain a
> disk failure while in maintenance.)

Those crazy k+m values are going to make any small I/O involve a LOT of drives. There is also a good feeling about having host as the failure domain and (with repl=3) knowing that all pieces are always on different hosts.
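For comparison, the plain 2+2 profile with host failure domain that you mention only takes a couple of commands; a minimal sketch, where the profile and pool names are just placeholders:

  # 2+2 erasure-code profile, one chunk per host (names are examples)
  ceph osd erasure-code-profile set ec-2-2 k=2 m=2 crush-failure-domain=host
  ceph osd erasure-code-profile get ec-2-2

  # data pool using that profile; overwrites must be enabled if it is to
  # back RBD images or CephFS data
  ceph osd pool create ec-data 128 128 erasure ec-2-2
  ceph osd pool set ec-data allow_ec_overwrites true

With k=2/m=2 you get 50% usable space instead of ~33% with repl=3, which is where the temptation to push k higher comes from.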
> 2. I've read that I can update the crush rule of EC pools after the
> fact, to change both the failure domain as well as the device class of
> the pool. What about changing k, m, or the plugin type? My understanding
> is that this is not supported, but Ceph didn't stop me from doing it and
> it seems to do /something/ when those values are changed?

I don't know what happened in your test, but unless something changed very recently, it should not alter the pool. You can change the k+m profile you once created such a pool from, but that will not change the pool itself: the pool copies the profile at creation time and never re-reads it. Changing repl between 3, 4 and 5 just creates or deletes copies while retaining the other replicas, but changing k+m would require the cluster to recalculate all pieces of all data in all PGs. That would technically be possible, but it would be a very long process on a large cluster.

> 3. Right now we are using libvirt with qcow2 images on local storage. I
> know that with Ceph the commonly recommended way would be to use RBD
> instead, but we have an existing proprietary tape archive for backup
> purposes whose official client can to my knowledge only do file-based
> backups, and around which we already have a system to back up live
> snapshots of VMs. How bad of an idea would it actually be to use qcow2
> on top of (kernel- or fuse-mounted?) CephFS? So far it seems to perform
> on par with RBD in my testing, but both also seem to fully saturate the
> single OSDs per host I am testing with anyway.

RBD is a lot simpler, and the MDSs need a fair bit of RAM to keep performing. It could be that tests on an idle cluster show both as good enough, but that may not hold true later on.

> 4. In v20 there seems to be a new ec_optimizations feature for ec pools
> used in CephFS or RBD. Are those a good idea with this kind of (large)
> qcow2 image workload on top of CephFS?

Normally EC pools are not the best fit for either CephFS or RBD, even if it can be made to work. Same as above: on an idle cluster the overhead might not be visible, and then again, you might never grow your usage to the point where it really starts being noticed. For our use cases we mostly let the guests do their own caching, so we give the RAM to the instances so they have a margin to use for caches and buffers. If you go with CephFS, the MDSs take over that part to some extent, so you could possibly lower the guests' RAM, but chances are you will be caching the same or similar data in multiple places. If budgets are tight, it might be worth considering.

> 5. Speaking of v20, while it is not yet the latest "active" version
> extrapolating the releases suggests that it could soon become that.
> Should I be waiting for it / start a new cluster with v20 now already?

I think I would start with v20 if the release (20.2.0) comes out fairly quickly; otherwise go with the latest 19 and upgrade when 20 comes out, so you get to practice and document that procedure too.
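The upgrade itself is a single cephadm command once the new containers are published; a rough sketch, assuming a cephadm-managed cluster and that 20.2.0 ends up under the usual quay.io/ceph/ceph image tag:

  # start a rolling upgrade to the new release (the exact tag is an
  # assumption until 20.2.0 is actually published)
  ceph orch upgrade start --image quay.io/ceph/ceph:v20.2.0

  # watch progress and overall health while the daemons are restarted
  ceph orch upgrade status
  ceph -s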
--
May the most significant bit of your life be positive.