Hi folks,
having an LTS release cycle could be a great topic for the upcoming "Ceph
User + Dev Monthly meeting".
The first one is scheduled on November 18, 2021, 14:00-15:00 UTC
https://pad.ceph.com/p/ceph-user-dev-monthly-minutes
Any volunteers to extend the agenda and advocate the idea?
Thanks,
Igor
On 11/8/2021 3:21 PM, Frank Schilder wrote:
Hi all,
I followed this thread with great interest and would like to add my
opinion/experience/wishes as well.
I believe the question of packages versus containers needs a bit more context to
be really meaningful. This was already mentioned several times with regard to
documentation. I see the following three topics tightly connected (my
opinion/answers included):
1. Distribution: Packages are compulsory, containers are optional.
2. Deployment: ceph-adm (yet another deployment framework) and ceph (the actual
storage system) should be strictly separate projects.
3. Release cycles: The release cadence is way too fast, I very much miss a ceph
LTS branch with at least 10 years back-port support.
These are my short answers/wishes/expectations in this context. I will add
below some more reasoning as optional reading (warning: wall of text ahead).
1. Distribution
---------
I don't think the question is about packages versus containers, because even if
a distribution should decide not to package ceph any more, other distributors
certainly will and the user community just moves away from distributions
without ceph packages. In addition, unless Red Hat plans to move to a
source-only container where I run the good old configure - make - make install,
it will be package-based anyway, so packages are here to stay.
Therefore, as I understand it, this question is about ceph-adm versus other
deployment methods. Here, I think a container-based, ceph-adm-only deployment
is unlikely to become the no. 1 choice for everyone, for good reasons
already mentioned in earlier messages. In addition, I also believe that
development of a general deployment tool is currently not sustainable as was
mentioned by another user. My reasons for this are given in the next section.
2. Deployment
---------
In my opinion, it is really important to distinguish three components of any
open-source project: development (release cycles), distribution and deployment.
Following the good old philosophy that every tool does exactly one job and does
it well, each of these components is a separate project, because each
corresponds to a different tool.
This implies immediately that ceph documentation should not contain
documentation about packaging and deployment tools. Each of these ought to be
strictly separate. If I have a low-level problem with ceph and go to the ceph
documentation, I do not want to see ceph-adm commands. Ceph documentation
should be about ceph (the storage system) only. Such a mix-up leads to
problems, and there have already been ceph-users cases where people could not
use the documentation for troubleshooting, because it showed ceph-adm commands
but their cluster was not deployed with ceph-adm.
In this context, I would prefer if there was a separate ceph-adm-users list so
that ceph-users can focus on actual ceph problems again.
Now to the point that ceph-adm might be an unsustainable project. Although at
first glance the idea of a generic deployment tool that solves all problems
with a single command might look appealing, it is likely doomed to fail for a
simple reason that was already indicated in an earlier message: ceph deployment
is subject to a complexity paradox. Ceph has a very large configuration space
and implementing and using a generic tool that covers and understands this
configuration space is more complex than deploying any specific ceph cluster,
each of which uses only a tiny subset of the entire configuration space.
In other words: deploying a specific ceph cluster is actually not that
difficult.
Designing a ceph cluster and dimensioning all of its components is difficult,
and none of the current deployment tools helps here. There is not even a check
for suitable hardware. In addition, technology is moving fast and adapting a
generic tool to new developments in time seems a hopeless task. For example,
when will ceph-adm natively support collocated lvm OSDs with dm_cache devices?
Is it even worth trying to incorporate this?
My wish would be to keep the ceph project clean of any deployment tasks. In my
opinion, the basic ceph tooling is already doing tasks that are the
responsibility of a configuration management system and not a storage system
(e.g. deploying unit files by default instead of as an option disabled by default).
3. Release cycles
---------
Ceph is a complex system and the code is getting more complex every day. It is
very difficult to beat the curse of complexity that development and maintenance
effort grows non-linearly (exponentially?) with the number of lines of code. As
a consequence, (A) if one wants to maintain quality while adding substantial
new features, the release intervals become longer and longer. (B) If one wants
to maintain constant release intervals while adding substantial new features,
the quality will have to go down. The last option is that (C) new releases with
constant release intervals contain ever smaller increments in functionality to
maintain quality. I ignore the option of throwing more and more qualified
developers at the project as this seems unlikely and also comes with its own
complexity cost.
I'm afraid we are in scenario B. Ceph is losing its aura of being a rock-solid
system.
Just recently, there were some ceph-users emails about how dangerous it is to
upgrade to the latest stable octopus version. The upgrade itself
apparently goes well, but what happens then? I personally have too many reports
that the latest ceph versions are quite touchy and collapse in situations that
have never been a problem up to mimic (most prominently, that a simple
rebalance operation after adding disks gets OSDs to flap and can take a whole
cluster down - plenty of cases since nautilus). Stability at scale seems to
become a real issue with increasing version numbers. I myself am very hesitant
to upgrade, in particular, because there is no way back and the cycles of
potential doom are so short.
Therefore, I would very much appreciate the establishment of a ceph-LTS branch
with at least 10 years back-port support, if not longer. In addition, upgrade
procedures between LTS versions should allow a downgrade by one version as well
(move legacy data along until explicitly allowed to cut all bridges). For any
large storage system, robustness, predictability and low maintenance effort are
invaluable. For example, our cluster is very demanding compared with our other
storage systems: the OSDs have a nasty memory leak, operations get stuck in
MONs and MDSes at least once or twice a week due to race conditions, and so on.
It is currently not possible to let the cluster run unattended for months or
even years, something that is possible if not the rule with other (also
open-source) storage systems.
Fixing bugs that show up rarely and are very difficult to catch is really important for a
storage system with theoretically infinite uptime. Rolling versions over all the time and
then throwing "xyz is not supported, try with a newer version" at users when
they discover a rare problem after running for a few years is not helping to get ceph
to a level of stability that will be convincing enough in the long run.
I understand that implementing new features is more fun than bug fixing.
However, bug fixing is what makes users trust a platform. I see too many people
around me losing faith in ceph at the moment and starting to treat it as a
second- or third-class storage system. This is largely due to the short support
interval given the actual complexity of the software. Establishing an LTS
branch could win back sceptical admins who started looking for alternatives.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
--
Igor Fedotov
Ceph Lead Developer