Thanks for the fast reply!

I am evaluating Ceph at work and have some questions for people with
more experience with it, to avoid making dumb mistakes... I hope this
is the right place to ask.

So, to give some context, we have 5 machines with about 400 TiB of raw
storage total in the form of 8 HDDs per host. The hardware is not all
the same: two nodes have a rather low core count and would be
dedicated entirely to Ceph, while the other three I would like to use
in an HCI fashion to run VMs off of Ceph. I'd like to at least be able
to take one node at a time out of rotation for maintenance. I'd be
using cephadm for the setup, which in my testing works really well.

As for the questions:

1. 3x replication is a tough sell IMO. For better storage utilization I
am looking at EC. 2+2 would be an obvious choice, and seems to be rather
performant. But what about MSR with higher values, e.g. 5+3 with 2 OSDs
per host across 4 hosts? Or, taken to more of an extreme, 17+7 with 6
OSDs per host across 4 hosts? This should also allow me to take one
host out for maintenance at a time, but with better storage
utilization. Is this a stupid idea? (16+8 might make more sense, to
also be able to sustain a disk failure while in maintenance.)
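
For reference, this is roughly the profile I had in mind for the 5+3 case (pool and profile names are placeholders, and the crush-osds-per-failure-domain / crush-num-failure-domains options are what I understand the Squid MSR support to use, so please correct me if I got those wrong):

    # EC 5+3 laid out as 2 OSDs per host across 4 hosts via an MSR rule
    ceph osd erasure-code-profile set ec-5-3-msr \
        k=5 m=3 \
        crush-failure-domain=host \
        crush-osds-per-failure-domain=2 \
        crush-num-failure-domains=4
    ceph osd pool create ec-bulk 32 32 erasure ec-5-3-msr
    # only needed if RBD or CephFS data is supposed to live on the pool
    ceph osd pool set ec-bulk allow_ec_overwrites true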

Those crazy K+M values are going to make any small I/O involve A LOT
of drives. There is a good feeling about having a host failure domain
and (with repl=3) knowing that all pieces are always on different
hosts.

My takeaway from that is that it might be useful if at some point we need bulk storage for large sequential data, but definitely not as VM storage.

I actually just learned that we should have another ten 18 TiB drives lying around that we could just slot in, so my estimate is that we are more limited on compute than on raw storage with what we have. Which would mean 3x replication after all...
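
For completeness, slotting those spare drives in should just be a cephadm exercise; something along these lines is what I would expect to do (assuming the simple "consume everything" OSD spec):

    # show the unused drives cephadm sees on each host
    ceph orch device ls
    # let cephadm create OSDs on all available, unused devices
    ceph orch apply osd --all-available-devices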


2. I've read that I can update the crush rule of EC pools after the
fact, to change both the failure domain as well as the device class of
the pool. What about changing k, m, or the plugin type? My understanding
is that this is not supported, but Ceph didn't stop me from doing it and
it seems to do /something/ when those values are changed?

I don't know what you saw happen, but unless something changed very
recently, it should not alter the pool.
You can change the k+m profile you once created such a pool from, but
that will not change the pool itself. The pool copies the profile at
creation time and then never re-reads it.
Changing the size of a replicated pool (3, 4, 5, ...) just creates or
deletes copies while retaining the other replicas, but changing K+M
would require the cluster to recalculate all pieces of all data in all
PGs. That would technically be possible, but it would be a very long
process on a large cluster.

OK, then I understood that correctly. Recalculating would indeed be very expensive. I guess the planned "pool migration" feature would be the only way to do this then, at the cost of explicitly doing this recalculation.
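
For my own notes, this is how I have been double-checking which profile a pool is bound to (placeholder names), which matches the "copied at creation time" behaviour you describe:

    # the profile name recorded in the pool at creation time
    ceph osd pool get ec-bulk erasure_code_profile
    # the current contents of that profile
    ceph osd erasure-code-profile get ec-5-3-msr
    # overwriting an existing profile needs --force, and even then
    # existing pools keep the k/m they were created with
    ceph osd erasure-code-profile set ec-5-3-msr k=5 m=3 --force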

Just to make sure though, failure domains and device classes /can/ be changed after the fact, right? Might become relevant whenever we add SSDs into the mix.
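
In case it helps to be concrete, what I had in mind there is roughly the following (placeholder names again, keeping k/m identical to what the pool was created with): create a new profile purely as a template for a rule with the desired device class, then point the pool at that rule.

    # profile used only to generate the new crush rule;
    # k/m must match the existing pool
    ceph osd erasure-code-profile set ec-5-3-hdd \
        k=5 m=3 crush-failure-domain=host crush-device-class=hdd
    ceph osd crush rule create-erasure ec-bulk-hdd ec-5-3-hdd
    # switching the pool to the new rule triggers data movement
    ceph osd pool set ec-bulk crush_rule ec-bulk-hdd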


3. Right now we are using libvirt with qcow2 images on local storage. I
know that with Ceph the commonly recommended way would be to use RBD
instead, but we have an existing proprietary tape archive for backup
purposes whose official client can, to my knowledge, only do file-based
backups, and around which we already have a system to back up live
snapshots of VMs. How bad of an idea would it actually be to use qcow2
on top of (kernel- or fuse-mounted?) CephFS? So far it seems to perform
on par with RBD in my testing, but both also seem to fully saturate the
single OSD per host I am testing with anyway.
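
Concretely, the setup I have been testing looks roughly like this (paths and names are just my test values):

    # kernel mount of CephFS on a hypervisor; mount.ceph picks up the
    # monitors and keyring from /etc/ceph
    mount -t ceph :/ /mnt/cephfs -o name=admin
    # a qcow2 image for libvirt, living on CephFS
    qemu-img create -f qcow2 /mnt/cephfs/vms/testvm.qcow2 200G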

RBD is a lot simpler, and MDSs eat/require a fair bit of RAM to keep
performing well. It could be that tests on an idle cluster show both
are good enough, but that may not be true later on.
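
If you do go the CephFS route anyway, one knob worth setting up front
is the MDS cache budget; a sketch (the 8 GiB value is only an example,
not a recommendation):

    # cache target only; the MDS process will use noticeably more RAM
    # than this, so leave headroom on the host
    ceph config set mds mds_cache_memory_limit 8589934592   # 8 GiB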

Ack. But RBD does not yet have a way to export at least crash-consistent snapshots of a group of images, right? I think I saw that the export functionality for group snapshots is still in development...

This is really a limitation of the tape backup system we have: I can't push more than about 8 TiB at a time to it, otherwise the connection drops and the backup fails. And pushing RBD exports to it only works with an unofficial and unsupported client; otherwise it has to be files on a filesystem. So what we have right now is multiple 5 TiB qcow2 images for the VMs that need a lot of storage, and we LVM them together inside the VM. I could do the same with RBD, but I would need a way to make a consistent backup of them, which I don't see yet. But maybe I am just missing something?
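
To illustrate: taking the crash-consistent snapshot itself already seems doable today with image groups (placeholder names below); it is getting those snapshots out as plain files for the tape client that I do not see a supported path for yet:

    # group the images that make up one VM's LVM volume group
    rbd group create vmpool/vm01
    rbd group image add vmpool/vm01 vmpool/vm01-disk1
    rbd group image add vmpool/vm01 vmpool/vm01-disk2
    # crash-consistent snapshot across all images in the group
    rbd group snap create vmpool/vm01@backup-$(date +%F)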

Our performance requirements should overall be relatively modest, so I am inclined to just try it with CephFS and see how far it goes. We can still migrate to RBD in the future if it makes sense.
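
For the eventual migration path, my assumption is that qemu-img's rbd backend would handle it per image (names are placeholders again):

    # convert a qcow2 image on CephFS into a new RBD image
    # (with the VM powered off, or from a consistent snapshot)
    qemu-img convert -p -f qcow2 -O raw \
        /mnt/cephfs/vms/testvm.qcow2 rbd:vmpool/testvm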


5. Speaking of v20, while it is not yet the latest "active" version,
extrapolating the release history suggests that it could soon become
that. Should I wait for it / start a new cluster with v20 already?

I think I would start with v20 if the release (20.2.0) comes out
fairly quickly; otherwise go with the latest 19 and upgrade when 20
comes out, so you get to practice and document that too.
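
For what it's worth, the cephadm upgrade itself is pleasantly boring;
roughly (the exact version string depends on what is out by then):

    # rolling upgrade orchestrated by cephadm
    ceph orch upgrade start --ceph-version 20.2.0
    # watch progress and confirm completion
    ceph orch upgrade status
    ceph -s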

Makes sense. I'll be on vacation for a week, so I guess I'll see what the world looks like then :)
