Jordan, replies inline.

To take a snippet from your email "A little empathy for our users goes a
> long way."  While I agree clarity is important, forcing our users to
> upgrade multiple times is not in their best interest.
>


Yes – we would be moving in that direction by saying, from now on, that we aim
for online compatibility across all versions.  But how feasible that turns out
to be depends on our future actions and new versions.

The separation between "the code maintains compatibility across all versions"
and "we only actively test these upgrade paths, so that's our limited
recommendation" is what lets us reduce the "forcing our users to upgrade
multiple times".  That's the "other paths may work, but you're on your own –
do your homework" aspect.  This is a position that allows us to progress
towards something better.

For now, using the current status quo of major/minor usage as the implemented
example: this would take us to no longer needing major versions (we would just
test all upgrade paths between all currently maintained versions, CI resources
permitting).
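
To make that concrete, here is a rough back-of-the-envelope (purely
illustrative – it assumes "all upgrade paths" means every older-to-newer pair
among maintained lines, and the version list is just today's example):

from itertools import combinations

# Illustrative only: stand-ins for whatever lines are maintained at the time.
maintained = ["4.0", "4.1", "5.0", "trunk"]

# Every older->newer pair is n*(n-1)/2 paths, so the matrix grows
# quadratically as maintained lines are added (and shrinks as they EOL).
paths = list(combinations(maintained, 2))
print(len(paths), paths)   # 6 paths for 4 lines; 10 for 5; 15 for 6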

The community can change over time as well, so it's worth thinking about an
approach that can adjust to changing resources.  (This includes the effort
required to document past, present, and future behaviour, especially as
changes are made.)

To emphasise: first, I think we need to focus on maintaining compatibility in
the code (and on how and when we are willing, or need, to break it).



> At the same time, doesn't less testing resources primarily translate to
> longer test runs?
>


Too much load also saturates the testing cluster to the point where tests
become flaky and fail.  ci-cassandra.a.o is already better at exposing flaky
tests than other systems.  This is a practicality, and it is constantly being
improved, but only on volunteer time.  Donating test hardware is the simpler
ask.



> Upgrade tests don't need to be run on every commit. When I worked on Riak
> we had very comprehensive upgrade testing (pretty much the full matrix of
> versions) and we had a schedule we ran these tests on ahead of release.
>


We are already struggling to stay on top of failures and flakies with
~per-commit builds and butler.c.a.o.
I'm not against the idea of scheduled test runs, but it needs more input and
effort from people in that space to make it happen.

I am not fond of the idea of "tests ahead of release" – release managers
already do enough and are a scarce resource.  Asking them to also be the build
butler, chasing down bugs and the people to fix them, is not appropriate IMO.
I also think it's unwise without a guarantee that the contributor/committer
who introduced the bug is available at release time.
Having just one post-commit pipeline has nice benefits in simplicity; as long
as it's feasible, slow is ok (as you say above).



> Could you share some more details on the resource issues and their impacts?
>

Python Upgrade DTests and JVM Upgrade DTests.

The Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers,
each taking up to one hour.
Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), and we are
seeing builds abort because of timeouts (>1hr).  Collected timing numbers
suggest we should either double this number to 384, or simply remove upgrade
paths we test.

https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188

https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37

We have far fewer (and more effective?) JVM Upgrade DTests.
There we only need 8x medium (3 cpu, 5GB ram) servers.
https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
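
Putting those figures side by side (arithmetic only, using the numbers above;
"peak" assumes all agents of a suite run concurrently, which is an assumption
on my part):

# Footprint of one run of each suite, from the agent counts and sizes above.
suites = {
    "Python Upgrade DTests": {"agents": 192, "cpu": 7, "ram_gb": 14},  # up to ~1h each
    "JVM Upgrade DTests":    {"agents": 8,   "cpu": 3, "ram_gb": 5},
}

for name, s in suites.items():
    print(f"{name}: {s['agents'] * s['cpu']} cores, {s['agents'] * s['ram_gb']} GB ram peak")

# Python Upgrade DTests: 1344 cores, 2688 GB ram peak
# JVM Upgrade DTests: 24 cores, 40 GB ram peak
# Doubling the Python suite to 384 agents would mean 2688 cores / 5376 GB ram.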
