We revisit this pretty much every year, so I'm inclined to keep the status
quo, which in practice amounts to doing whatever we decide ad hoc just before
we cut a release.

Before discussing at length a new policy that we'll only immediately break: if
the motivation is avoiding extra release steps, I would prefer we just avoid
those steps directly, e.g. by running upgrade tests nightly rather than
pre-commit (a rough sketch below), by making the tests faster, or by waiting
until the test matrix actually causes something to break rather than assuming
it will.
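
To make the nightly option concrete, here is a minimal sketch of a scheduled
Jenkins declarative pipeline; the agent label, cron spec, and script name are
purely illustrative assumptions, not taken from our actual Jenkinsfile:

    pipeline {
        agent { label 'cassandra' }  // illustrative agent label
        triggers {
            // hash-spread nightly run around 02:00, instead of per-commit triggering
            cron('H 2 * * *')
        }
        stages {
            stage('upgrade-dtests') {
                steps {
                    // placeholder for however the upgrade dtests are actually invoked
                    sh './run-upgrade-dtests.sh'
                }
            }
        }
    }

Something along those lines would keep the per-commit pipeline lean while
still surfacing upgrade regressions within a day.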

> On 28 Jan 2025, at 15:45, Josh McKenzie <jmcken...@apache.org> wrote:
> 
>> 
>> Python Upgrade DTests today require 192x large (7 CPUs, 14 GB RAM) servers
> 
>> We have far fewer (and more effective?) JVM Upgrade DTests.
>> There we only need 8x medium (3 CPUs, 5 GB RAM) servers
> 
> Does anyone have a strong understanding of the coverage and value offered by 
> the python upgrade dtests vs. the in-jvm dtests? I don't, but I intuitively 
> have a hard time believing the value difference matches the hardware 
> requirement difference there.
> 
>> Lots and lots of words about releases from mick (<3)
> Those of you who know me know my "spidey-senses" get triggered by enough
> complexity, regardless of how well justified it is. I feel like our release
> process has passed that threshold for me. I've been talking a lot with Mick
> about this topic for a couple of weeks, and I'm curious whether the community
> sees a major flaw with a proposal like the following:
> - We formally support 3 releases at a time
> - We only release MAJOR versions (i.e. semver major): no more "5.0, 5.1,
>   5.2"; it would now be "5.0, 6.0, 7.0"
> - We test and support online upgrades between supported releases
> - Any removal or API breakage follows a "deprecate-then-release" cycle
> - We cut a release every 12 months
> Implications for operators:
> - Upgrade paths for online upgrades are simple and clear: T-2 (online
>   upgrades supported from up to two releases back)
> - "Forced" update cadence to stay on supported versions is 3 years
> - If you adopt v1.0 it will be supported until v4.0 comes out 36 months later
> - This gives users the flexibility to prioritize functionality vs. stability
>   and to balance release validation costs
> - Deprecation cycles are clear, as are compatibility paths
> - Release timelines and feature availability are predictable and clear
> Implications for developers on the project:
> - Support requirements for online upgrades are clear
> - Opportunity cost of feature slippage relative to release date is bounded
>   (worst case == an 11.99-month delay on availability in a GA supported
>   release)
> - Path to keep the code-base maintainable is clear (deprecate-then-remove)
> - CI requirements are constrained and predictable
> Moving to "online upgrades supported for everything" is something I support
> in principle, but I would advocate we consider it after getting a handle on
> our release process.
> 
> So - what do we lose if we consider the above approach?
> 
>> On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
>> Jordan, replies inline. 
>> 
>> 
>> To take a snippet from your email "A little empathy for our users goes a 
>> long way."  While I agree clarity is important, forcing our users to upgrade 
>> multiple times is not in their best interest. 
>> 
>> 
>> Yes – by now saying we aim for online compatibility across all versions, we
>> would be moving in that direction.  But how feasible that turns out to be
>> depends on our future actions and new versions.
>> 
>> The separation between "the code maintains compatibility across all
>> versions" and "we only actively test these upgrade paths, so that's our
>> limited recommendation" is what lets us reduce the "forcing our users to
>> upgrade multiple times".  That's the "other paths may work but you're on
>> your own – do your homework" aspect.  This is a position that allows us to
>> progress into something better.
>> 
>> For now, and using the current status quo of major/minor usage as the
>> implemented example: this would progress us to no longer needing major
>> versions (we would just test all upgrade paths between all currently
>> maintained versions, CI resources permitting).
>> The community can change over time as well, so it's worth thinking about an
>> approach that is adjustable to changing resources.  (This includes the
>> effort required in documenting the past, present, and future, especially as
>> changes are made.)
>> 
>> I emphasise: first, I think we need to focus on maintaining compatibility
>> in the code (and on how and when we are willing, or need, to break it).
>> 
>>  
>> At the same time, don't fewer testing resources primarily translate into
>> longer test runs?
>> 
>> 
>> Too much load also saturates the testing cluster to the point where tests
>> become flaky and fail.  ci-cassandra.a.o is already better at exposing
>> flaky tests than other systems.  This is a practicality, and it's
>> constantly being improved, but only with volunteer time.  Donating test
>> hardware is the simpler ask.
>>  
>> Upgrade tests don't need to be run on every commit. When I worked on Riak
>> we had very comprehensive upgrade testing (pretty much the full matrix of
>> versions), and we had a schedule on which we ran these tests ahead of a
>> release.
>> 
>> 
>> We are already struggling to stay on top of failures and flakies with
>> ~per-commit builds and butler.c.a.o.
>> I'm not against the idea of scheduled test runs, but it needs more input
>> and effort from people in that space to make it work.
>> 
>> I am not fond of the idea of "tests ahead of release" – release managers
>> already do enough and are a scarce resource.  Asking them to also be the
>> build butler, chasing down bugs and the people to fix them, is not
>> appropriate IMO.  I also think it's unwise without any guarantee that the
>> contributor/committer who introduced the bug is available at release time.
>> Having just one post-commit pipeline has nice benefits in simplicity; as
>> long as it's feasible, slow is OK (as you say above).
>> 
>>  
>> Could you share some more details on the resource issues and their impacts?
>> 
>> Python Upgrade DTests and JVM Upgrade DTests.
>> 
>> Python Upgrade DTests today require 192x large (7 CPUs, 14 GB RAM) servers,
>> each taking up to one hour.
>> Currently we have too many upgrade paths (4.0, 4.1, and 5.0 to trunk), and
>> we are seeing builds abort because of timeouts (>1hr).  Collected timing
>> numbers suggest we should either double this number to 384 or simply remove
>> some of the upgrade paths we test.
>> 
>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
>>  
>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>> 
>> We have far fewer (and more effective?) JVM Upgrade DTests.
>> There we only need 8x medium (3 CPUs, 5 GB RAM) servers.
>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
>> 
>> 
> 
