On Sat 31 Aug 2024 at 15:42, Tim Holloway <t...@mousetech.com> wrote:
>
> I would greatly like to know what the rationale is for avoiding
> containers.
>
> Especially in large shops. From what I can tell, you need to use the
> containerized Ceph if you want to run multiple Ceph filesystems on a
> single host. The legacy installations only support dumping everything
> directly under /var/lib/ceph, so you'd have to invest a lot of effort
> into installing, maintaining and operating a second fsid under the
> legacy architecture.
Using two fsids on one machine is far outside our scope for the
10-or-so clusters we run. Not saying no one does it, but running
multiple cluster names on the same host used to be frowned upon, so I
guess most people took that to also cover multiple fsids running in
parallel on the same host, even when the cluster name was the same.

> The only definite argument I've ever heard in my insular world against
> containers was based on security. Yet the primary security issues
> seemed to be more because people were pulling insecure containers from
> Docker repositories. I'd expect Ceph to have safeguards. Plus Ceph
> under RHEL 9 (and 8?) will run entirely and preferably under Podman,
> which allegedly is more secure and can, in fact, run containers under
> user accounts to allow additional security. I do that myself, although
> I think the mechanisms could stand some extra polishing.

From what I see on IRC and the mailing lists, the container setup
sometimes seems to end up recreating containers with new/unique fsids,
as if it had forgotten the old cluster and decided to invent a new one
instead. This combines well (sarcastically speaking) with those Ceph
admins' inability to easily enter the containers and/or read out logs
from the old/missing containers, to figure out what happened and why
the new mon container wants to reinvent the cluster instead of joining
the existing one. I know bugs are bugs, but wrapping it all into an
extra layer is not helping new Ceph admins when it breaks. We have a
decent page on PG repairs of various kinds, but perhaps not as much on
what to do when the orchestrator isn't orchestrating?
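
I'm not the one running containers here, but as far as I understand
the cephadm side, these are the usual places to start digging (a rough
sketch; the daemon name mon.myhost and the fsid are placeholders):

  # what does cephadm think is deployed on this host, and which fsid?
  cephadm ls
  # journald logs for one daemon
  cephadm logs --name mon.myhost
  # or straight via systemd; units are named ceph-<fsid>@<type>.<id>
  journalctl -u ceph-<fsid>@mon.myhost.service
  # drop into a container with the ceph CLI and ask the cluster itself
  cephadm shell -- ceph orch ps
  cephadm shell -- ceph log last cephadm

None of which helps much if the container that held the interesting
logs is already gone, which is rather the point.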

Containers help them set the initial cluster up tons faster, but they
seem to lead people into situations where the container's ephemeral
state actively works against their ability to figure out, when things
do go wrong, what the actual cause was.

Perhaps it is clusters that were adopted into the new style, perhaps
people run the containers in the wrong way, but there is a certain
number of posts along the lines of "I pressed the button for totally
automated (re)deploy of X, Y and Z and it doesn't work". I would not
like to end up in that situation while at the same time handling real
customers who wonder why our storage is not serving IO at this moment.

Doing installs 'manually' is far from optimal, but at least I know the
logs end up under /var/log/ceph/<clustername>-<daemon><instance>.log
and they stay there even if the OSD disk is totally dead and gone.
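
If I ever end up running cephadm clusters, the first thing I would do
is turn plain file logging back on; as far as I understand it (not
something I have battle-tested myself), that is just:

  ceph config set global log_to_file true
  ceph config set global mon_cluster_log_to_file true
  # logs should then land under /var/log/ceph/<fsid>/ on each host

which at least gets you back something on disk that outlives the
container.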

-- 
May the most significant bit of your life be positive.