Thanks, Sage.  This is a terrific distillation of the challenges and benefits.

FWIW, here are a few of my own perspectives, as someone experienced with Ceph 
but with limited container experience.  To be very clear, these are 
*perceptions* not *assertions*; my goal is discussion not argument.  For 
context, I have not used a release newer than Nautilus in production, in large 
part due to containers and cephadm.


>> Containers are more complicated than packages, making debugging harder.
> 
> I think that part of this comes down to a learning curve and some
> semi-arbitrary changes to get used to (e.g., systemd unit name has
> changed; logs now in /var/log/ceph/$fsid instead of /var/log/ceph).

Indeed, if there are logs at all.  It seems that (by default?) one has to (know 
to) use journalctl to extract daemon or cluster logs, which is rather awkward 
compared to having flat files.  And do those entries go away when the daemon 
restarts or is redeployed, losing continuity?  Is logrotate used as usual, such 
that it can be adjusted?
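
For what it's worth, my understanding (please correct me) is that per-daemon 
logs can be pulled from the journal using the fsid-qualified unit name, and 
that file logging can be switched back on through central config; the fsid and 
hostname below are placeholders:

    # unit names follow ceph-<fsid>@<type>.<id>; for a mon the id is usually the short hostname
    journalctl -u ceph-$FSID@mon.$(hostname -s)

    # re-enable traditional file logging cluster-wide, if I have the option names right
    ceph config set global log_to_file true
    ceph config set global mon_cluster_log_to_file true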

If running multiple clusters on a set of hardware is deprecated, why include 
the fsid in the pathname?  This complicates scripting and monitoring / metrics 
collection.  Or have we retconned multiple clusters?
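
Concretely, anything that used to point at a fixed path now has to discover the 
fsid first.  A trivial, purely illustrative shell loop, but it's one more thing 
every script and exporter has to grow:

    # enumerate per-cluster log directories without hard-coding the fsid
    for dir in /var/log/ceph/*/; do
        fsid=$(basename "$dir")
        du -sh "$dir"    # e.g., feed per-cluster log volume into metrics
    done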

The admin sockets are under a similar path in /var/run.  I have yet to discover 
an incantation of, e.g., `ceph daemon mon.foo` that works; indeed, specifying the 
whole path to the asok yields an error about the path being too long, so I’ve 
had to make a symlink to it.  This isn’t great usability, unless of course I’m 
missing something.
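
For the record, here is the workaround I landed on, plus what I gather is the 
intended route (entering the daemon's container, where the asok sits at its 
default path).  Names and paths are illustrative:

    # symlink around the "socket path too long" error
    ln -s /var/run/ceph/$FSID/ceph-mon.foo.asok /var/run/mon.foo.asok
    ceph --admin-daemon /var/run/mon.foo.asok mon_status

    # or enter the daemon's container, where `ceph daemon` works as it used to
    cephadm enter --name mon.foo
    ceph daemon mon.foo mon_status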

>> Security (50 containers -> 50 versions of openssl to patch)
> 
> This feels like the most tangible critique.  It's a tradeoff.  We have
> had so many bugs over the years due to varying versions of our
> dependencies that containers feel like a huge win: we can finally test
> and distribute something that we know won't break due to some random
> library on some random distro.  But it means the Ceph team is on the
> hook for rebuilding our containers when the libraries inside the
> container need to be patched.

This seems REALLY congruent with the tradeoffs that accompanied shared/dynamic 
linking years ago.  Shared linking saves on binary size and lets processes share 
code pages; dynamic linking lets one update dependencies in one place (notably 
openssl, which has had plenty of exploits over time, but others too).  The flip 
side is that changes to those libraries can break applications, which is why 
we’ve long seen commercial / pre-built binaries statically linked to avoid 
regression and breakage.  Kind of a rock and a hard place situation.  Some 
assert that Ceph daemon systems should be mostly or entirely inaccessible from 
the Internet, and they usually don’t have a large set of users (or any 
customers) logging into them.  Thus it can be argued that they are less exposed 
to attack, which favors containerization somewhat.

One might say that containerization and orchestration make updates for security 
fixes trivial, but remember that in most cases such an upgrade is not against 
the immediately prior Ceph dot release, which means exposure to regressions and 
other unanticipated changes in behavior.  That is one reason why enterprises 
especially may stick with a specific dot release that works until compelled to 
move.  And updating upstream containers for security fixes puts us right back 
into the dependency-hell situation.
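
To be fair, the orchestrated path does let one target a specific release or 
image rather than whatever is latest.  Illustrative commands (the version 
number and image tag are just examples):

    ceph orch upgrade start --ceph-version 16.2.11
    # or pin to an exact container image
    ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.11
    ceph orch upgrade status

Whether upstream actually publishes a rebuilt image of the *same* dot release 
when a library CVE lands is the crux of the concern above.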

> 
> On the flip side, cephadm's use of containers offer some huge wins:
> 
> - Package installation hell is gone.

As a user I never experienced much of this, but then I was mostly installing 
packages outside of ceph-deploy et al.  With at least three different container 
technologies in play, though, are we substituting one complexity for another?

> - Upgrades/downgrades can be carefully orchestrated. With packages,
> the version change is by host, with a limbo period (and occasional
> SIGBUS) before daemons were restarted.  Now we can run new or patched
> code on individual daemons and avoid an accidental upgrade when a
> daemon restarts.

Fair enough: that limbo period was never a problem for me, but regarding careful 
orchestration, we see people on this list experiencing orchestration failures 
all the time.  Is the list a nonrepresentative sample of people’s experience?  
The opacity of said orchestration also complicates troubleshooting.
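
On the opacity point, these are the introspection commands I'm aware of (I may 
well be missing better ones); the daemon name is illustrative:

    ceph orch ps                  # what the orchestrator believes is deployed and running
    ceph orch upgrade status      # progress / state of an in-flight upgrade
    ceph log last cephadm         # recent cephadm/orchestrator log entries
    cephadm logs --name osd.0     # per-daemon logs, wrapping journalctl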

> - Ceph installations are carefully sandboxed.  Removing/scrubbing ceph
> from a host is trivial as only a handful of directories or
> configuration files are touched.

Plus of course any ancillary tools.  This seems like it would be advantageous 
in labs.  In production it’s not uncommon to reimage the entire box anyway.
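
Agreed that the scrubbing story is tidy; as I understand it, wiping a host 
comes down to something like the following (the fsid is a placeholder, and the 
second command is destructive):

    cephadm ls                                  # which daemons this host still carries
    cephadm rm-cluster --fsid $FSID --force     # remove this host's cluster daemons and data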

>  And we can safely run multiple
> clusters on the same machine without worry about bad interactions

Wasn’t it observed a few years ago that almost nobody actually did that, hence 
the deprecation of custom cluster names?

> - Cephadm deploys a bunch of non-ceph software as well to provide a
> complete storage system, including haproxy and keepalived for HA
> ingress for RGW and NFS, ganesha for NFS service, grafana, prometheus,
> node-exporter, and (soon) samba for SMB.  All neatly containerized to
> avoid bumping into other software on the host; testing and supporting
> the huge matrix of packages versions available via various distros
> would be a huge time sink.

One size fits all?  None?  Many?  Some?  Does that get in the way of sites that, 
e.g., choose nginx for LB/HA, or run their own Prometheus / Grafana infra for 
various reasons?  Is this more of the …
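
For sites in that position, my understanding is that the bundled pieces can at 
least be opted out of, though I haven't exercised this myself; the IP and 
service name below are placeholders:

    # at bootstrap time, skip prometheus/grafana/alertmanager/node-exporter entirely
    cephadm bootstrap --mon-ip 10.0.0.1 --skip-monitoring-stack

    # or remove an already-deployed bundled service later
    ceph orch rm grafana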

> We've been beat up for years about how complicated and hard Ceph is.

True.  I was told in an interview once that one needs a PhD in Ceph.  Over the 
years operators have had to rework tooling with every release, so the 
substantial retooling that comes with containers and cephadm / ceph orch can be 
daunting.  Midstream changes, and changes made for no apparent reason, contribute 
to the perception.  JSON output is supposed to be invariant, or at least 
backward compatible, yet we saw the mon clock skew reporting move for no 
apparent reason, and there have been other breaking changes; cf. the 
ceph_exporter source for more examples.
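
To make that concrete, a collector today might do something like the following 
(the jq path is illustrative); scripts written against the older health layout 
had to be rewritten when the same information moved, and that is the class of 
breakage I mean:

    # pull the clock-skew health check out of cluster status
    ceph status -f json | jq '.health.checks.MON_CLOCK_SKEW'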

> Rook and cephadm represent two of the most successful efforts to
> address usability (and not just because they enable deployment
> management via the dashboard!),

The goals here are totally worthy, to make things more turnkey.  I get that, I 
really do.  There are some wrinkles though:

* Are they successful, though?  I’m not saying they aren’t, I’m asking.  The 
frequency of cephadm / ceph orch SNAFUs posted to this list is daunting.  It 
seemed at one point that Rook would become the party line, but now it’s 
heterodox?

* Removing other complexity by introducing new complexity (containers).  There 
seems to have been an assumption here that operators already grok containers, in 
any of the three-plus flavors in play.  It’s easy to dismiss this as just a 
learning curve, but it’s a rather significant one, and assuming that the 
operator will climb it in their Copious Free Time isn’t IMHO reasonable.

* Dashboard button-pushing can make it dead simple to deploy a single simple 
configuration, but revision-controlled management of dozens of clusters is a 
different story.  Centralized config is one example (assuming the subtree limit 
bug has been fixed).  Absolutely, managing ceph.conf across system types and 
multiple clusters is a pain: brittle ERB or J2 templates, inscrutable Ansible 
errors.  But how does one link CLI-based centralized config with revision 
control and peer review of changes?  (A rough sketch of one possibility follows 
after this list.)  One thing about turnkey solutions is that they are usually 
simplistic or rigid in ways that are a bad fit for manageable enterprise 
deployment, and if we’re going to do everything for the user *and* make it 
difficult for them to dig deep or customize, then the bar for success is *very* 
high.

* Some might add ceph-ansible to that list.
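
On the revision-control question, the best I've come up with is a rough sketch 
along these lines: keep the desired options in a reviewed file, push them 
through assimilate-conf, and snapshot the live config so drift shows up as a 
diff.  File names are made up, and this is a workaround, not a blessed workflow:

    # settings.conf lives in git and goes through normal review
    ceph config assimilate-conf -i settings.conf

    # snapshot the live centralized config so drift is visible at the next review
    ceph config dump -f json-pretty > config-snapshot.json
    git diff config-snapshot.json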




