Jordan, replies inline.
> To take a snippet from your email "A little empathy for our users goes a
> long way." While I agree clarity is important, forcing our users to
> upgrade multiple times is not in their best interest.

Yes – we would be moving in that direction by now saying we aim for online compatibility across all versions. But how feasible that turns out to be depends on our future actions and new versions.

The separation between "the code maintains compatibility across all versions" and "we only actively test these upgrade paths, so that's our limited recommendation" is what lets us reduce the "forcing our users to upgrade multiple times". That's the "other paths may work but you're on your own – do your homework" aspect. This is a position that allows us to progress towards something better.

For now, using the current status quo of major/minor usage as the implemented example: this would progress us to no longer needing major versions (we would just test all upgrade paths for all currently maintained versions, CI resources permitting).

The community can change over time as well; it's worth thinking about an approach that is adjustable to changing resources. (This includes the effort required in documenting past, present, and future, especially as changes are made.)

I emphasise: first, I think we need to focus on maintaining compatibility in the code (and on how and when we are willing, or need, to break it).

> At the same time, doesn't less testing resources primarily translate to
> longer test runs?

Too much also saturates the testing cluster to the point where tests become flaky and fail. ci-cassandra.a.o is already better at exposing flaky tests than other systems. This is a practicality, and it's constantly being improved, but only under volunteer time. Donating test hardware is the simpler ask.

> Upgrade tests don't need to be run on every commit. When I worked on Riak
> we had very comprehensive upgrade testing (pretty much the full matrix of
> versions) and we had a schedule we ran these tests on ahead of release.

We are already struggling to stay on top of failures and flakies with ~per-commit builds and butler.c.a.o. I'm not against the idea of scheduled test runs, but it needs more input and effort from people in that space to accommodate it.

I am not fond of the idea of "tests ahead of release" – release managers already do enough and are a scarce resource. Asking them to also be the build butler, chasing down bugs and the people to fix them, is not appropriate IMO. I also think it's unwise without any guarantee that the contributor/committer who introduced the bug is available at release time.

Having just one post-commit pipeline has nice benefits in simplicity; as long as it stays feasible, slow is ok (as you say above).

> Could you share some more details on the resource issues and their impacts?

Python Upgrade DTests and JVM Upgrade DTests.

Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers, each taking up to one hour. Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), and we are seeing builds abort because of timeouts (>1hr). Collected timing numbers suggest we should double this number to 384, or simply remove some of the upgrade paths we test.
https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37

We have far fewer (and more effective?) JVM Upgrade DTests. There we only need 8x medium (3 cpu, 5GB ram) servers.
https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
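
To make the capacity trade-off above concrete, here is a back-of-envelope sketch. The 192 splits, the 384 figure, and the one-hour timeout come from the numbers above; the machine-hours-per-path figure and the splits_needed helper are illustrative assumptions of mine, not measured data and not anything that exists in the Jenkinsfile.

    import math

    TIMEOUT_H = 1.0        # per-split limit; builds abort beyond this (>1hr)
    HOURS_PER_PATH = 55.0  # ASSUMED machine-hours of python upgrade dtests per path

    def splits_needed(paths: int) -> int:
        """Splits required to keep each split under the timeout,
        assuming the work divides evenly across splits."""
        return math.ceil(paths * HOURS_PER_PATH / TIMEOUT_H)

    print(splits_needed(4))  # ~220: today's 192 splits overrun the 1hr timeout
    print(splits_needed(2))  # ~110: trimming paths fits back within 192 splits
    # Option A: keep all paths and double the pool to 384 splits (>= 220).
    # Option B: keep 192 splits and test fewer upgrade paths.

Either lever restores headroom; which one we pull comes down to donated hardware versus how much we are willing to shrink the tested matrix.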