> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers
> We have far fewer (and more effective?) JVM Upgrade DTests.
> There we only need 8x medium (3 cpu, 5GB ram) servers

Does anyone have a strong understanding of the coverage and value offered by
the python upgrade dtests vs. the in-jvm dtests? I don't, but I intuitively
have a hard time believing the difference in value matches the difference in
hardware requirements there.

> Lots and lots of words about releases from mick (<3)

Those of you who know me know my "spidey-senses" get triggered by enough
complexity, regardless of how well justified it is. I feel like our release
process has passed that threshold for me. I've been talking a lot with Mick
about this topic for a couple of weeks, and I'm curious whether the community
sees a major flaw in a proposal like the following:

• We formally support 3 releases at a time
• We only release MAJOR versions (i.e. semver major). No more "5.0, 5.1,
5.2"; it would now be "5.0, 6.0, 7.0"
• We test and support online upgrades between supported releases
• Any removal or API breakage follows a "deprecate-then-remove" cycle:
deprecate in one release, remove no earlier than the next
• We cut a release every 12 months

*Implications for operators:*
• Upgrade paths for online upgrades are simple and clear: T-2, i.e. online
upgrades from up to two majors back (a small sketch of the arithmetic is at
the end of this mail)
• The "forced" upgrade cadence to stay on supported versions is 3 years
• If you adopt v1.0, it will be supported until v4.0 comes out 36 months
later
• This gives users the flexibility to prioritize functionality vs. stability
and to balance release validation costs
• Deprecation cycles are clear, as are compatibility paths
• Release timelines and feature availability are predictable and clear

*Implications for developers on the project:*
• Support requirements for online upgrades are clear
• The opportunity cost of a feature slipping past a release date is bounded
(worst case == an 11.99-month delay before availability in a GA supported
release)
• The path to keeping the code base maintainable is clear
(deprecate-then-remove)
• CI requirements are constrained and predictable

Moving to "online upgrades supported for everything" is something I support
in principle, but I would advocate we consider it after getting a handle on
our release process.

So - what do we lose if we consider the above approach?
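To make the support-window and T-2 arithmetic above concrete, here is a
minimal sketch. To be clear, this is illustrative only - the class, the
method names, and the example version numbers are made up for this mail,
not anything in (or proposed for) the codebase:

    // Sketch of the proposed policy: one MAJOR release every 12 months,
    // the three newest majors supported, and online upgrades tested
    // between any pair of supported majors (the "T-2" rule).
    public final class ReleasePolicySketch {
        static final int CADENCE_MONTHS = 12;   // one major per year
        static final int SUPPORTED_MAJORS = 3;  // formally supported at a time

        // A major is supported while it is one of the three newest.
        static boolean isSupported(int major, int newestMajor) {
            return major <= newestMajor && newestMajor - major < SUPPORTED_MAJORS;
        }

        // An online upgrade path is tested iff both ends are currently
        // supported (which, with three supported majors, is exactly T-2).
        static boolean onlineUpgradeTested(int from, int to, int newestMajor) {
            return from < to
                && isSupported(from, newestMajor)
                && isSupported(to, newestMajor);
        }

        public static void main(String[] args) {
            int newest = 7; // the supported set is then {5.0, 6.0, 7.0}
            System.out.println("5.0 supported: " + isSupported(5, newest));  // true
            System.out.println("4.0 supported: " + isSupported(4, newest));  // false
            System.out.println("5.0 -> 7.0 tested: "
                    + onlineUpgradeTested(5, 7, newest));                    // true
            // A major stays supported until the third newer major ships:
            // SUPPORTED_MAJORS * CADENCE_MONTHS = 36 months.
            System.out.println("Support window: "
                    + (SUPPORTED_MAJORS * CADENCE_MONTHS) + " months");
        }
    }

Run as-is, it confirms the numbers in the bullets: a 36-month support window,
and any supported release can upgrade online directly to the newest one.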
On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
> Jordan, replies inline.
>
>> To take a snippet from your email, "A little empathy for our users goes a
>> long way." While I agree clarity is important, forcing our users to
>> upgrade multiple times is not in their best interest.
>
> Yes – by now saying we aim for online compatibility across all versions,
> we would be moving in that direction. But how feasible that turns out to
> be depends on our future actions and new versions.
>
> The separation between "the code maintains compatibility across all
> versions" and "we only actively test these upgrade paths, so that's our
> limited recommendation" is what lets us reduce the "forcing our users to
> upgrade multiple times". That's the "other paths may work, but you're on
> your own – do your homework" aspect. This is a position that allows us to
> progress into something better.
>
> For now, using the current status quo of major/minor usage as the
> implemented example: this would move us to no longer needing major
> versions (we would just test all upgrade paths between all currently
> maintained versions, CI resources permitting).
> The community can change over time as well, so it's worth thinking about
> an approach that adjusts to changing resources. (This includes the effort
> required to document past, present, and future versions, especially as
> changes are made.)
>
> I emphasise: first, I think we need to focus on maintaining compatibility
> in the code (and on how and when we are willing, or need, to break it).
>
>> At the same time, don't fewer testing resources primarily translate to
>> longer test runs?
>
> Too much load also saturates the testing cluster, to the point where tests
> become flaky and fail. ci-cassandra.a.o is already better at exposing
> flaky tests than other systems. This is a practicality, and it's
> constantly being improved, but only under volunteer time. Donating test
> hardware is the simpler ask.
>
>> Upgrade tests don't need to be run on every commit. When I worked on Riak
>> we had very comprehensive upgrade testing (pretty much the full matrix of
>> versions), and we had a schedule on which we ran these tests ahead of
>> release.
>
> We are already struggling to stay on top of failures and flakies with
> ~per-commit builds and butler.c.a.o.
> I'm not against the idea of scheduled test runs, but it needs more input
> and effort from people in that space to make it happen.
>
> I am not fond of the idea of "tests ahead of release" – release managers
> already do enough and are a scarce resource. Asking them to also be the
> build butler, chasing down bugs and the people to fix them, is not
> appropriate IMO. I also think it's unwise without a guarantee that the
> contributor/committer who introduced the bug is available at release time.
> Having just one post-commit pipeline has nice benefits in simplicity; as
> long as it's feasible, slow is ok (as you say above).
>
>> Could you share some more details on the resource issues and their
>> impacts?
>
> Python Upgrade DTests and JVM Upgrade DTests.
>
> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers,
> each running for up to one hour.
> Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), and we
> are seeing builds abort because of timeouts (>1hr). Collected timing
> numbers suggest we should either double this number to 384 or simply
> remove some of the upgrade paths we test.
>
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>
> We have far fewer (and more effective?) JVM Upgrade DTests.
> There we only need 8x medium (3 cpu, 5GB ram) servers.
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
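A footnote on the path-count arithmetic in Mick's mail above: if we tested
all upgrade paths between all currently maintained versions, as Mick floats,
the count grows quadratically with the number of versions. A toy
illustration (the version list is just the one Mick names; nothing here is
project code):

    // With n maintained versions, a full upgrade matrix has n*(n-1)/2
    // older->newer paths, so trimming tested paths shrinks CI cost quickly.
    public final class UpgradePathCount {
        public static void main(String[] args) {
            String[] maintained = {"4.0", "4.1", "5.0", "trunk"};
            int n = maintained.length;
            System.out.println(n + " versions => " + (n * (n - 1) / 2)
                    + " upgrade paths"); // 4 versions => 6 paths
            // Dropping one maintained version halves the count here (6 -> 3),
            // while adding one takes it to 10.
        }
    }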