My opinion is that it would be valuable to use this discussion as a forcing 
function to determine how we plan to handle releases broadly, which would in 
turn answer the "5.1 should be 6.0" question, assuming we move away from ad 
hoc per-release debate. If there's broad, strong dissent (i.e. let's have 6.0 
be the next major and talk about this topic separately) I'm happy to open 
another thread, but I didn't see clear consensus on this thread yet and was 
trying to help drive toward it.

> Depending on what “T-2” means for the online upgrade. 
For 6.0, that would mean the last 2 releases (5.0, 4.1). I think we'd need to 
make an exception for 4.0 during the change, much like we made exceptions for 
3.0 and 3.x; that means T-3 for this transition, to respect the current 
paradigm of "any adjacent major.minor to next major".

For 7.0, online upgrade would be supported for 6.0 and 5.0.

> If you mean only 4.1 and 5.0 would be online upgrade targets, I would suggest 
> we change that to T-3 so you encompass all “currently supported” releases at 
> the time the new branch is GAed.
I think that's better actually, yeah. I was originally thinking T-2 from the 
"what calendar time frame is reasonable" perspective, but saying "if you're on 
a currently supported branch you can upgrade to a release that comes out" makes 
clean intuitive sense. That'd mean:

6.0: 5.0, 4.1, 4.0 online upgrades supported. Drop support for 4.0. API 
compatibility guaranteed w/5.0.
7.0: 6.0, 5.0, 4.1 online upgrades supported. Drop support for 4.1. API 
compatibility guaranteed w/6.0.
8.0: 7.0, 6.0, 5.0 online upgrades supported. Drop support for 5.0. API 
compatibility guaranteed w/7.0.
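
To make the rule mechanical, here's a rough sketch of how that table falls out 
of an ordered release list. The helper name `upgrade_policy` and the dict keys 
are just illustrative, not anything that exists in the project, and it assumes 
the release line is exactly 4.0, 4.1, 5.0, 6.0, 7.0, 8.0:

```python
# Ordered release line assumed from the thread; 4.1 is the last minor
# before the majors-only scheme being discussed.
RELEASES = ["4.0", "4.1", "5.0", "6.0", "7.0", "8.0"]


def upgrade_policy(target, releases=RELEASES, window=3):
    """Hypothetical helper: derive the T-`window` policy for `target`."""
    i = releases.index(target)
    if i < window:
        raise ValueError(f"fewer than {window} releases precede {target}")
    # The `window` releases immediately before the target, newest first.
    previous = releases[i - window:i][::-1]
    return {
        "online_upgrade_from": previous,
        # The oldest release in the window leaves support when target GAs.
        "drops_support_for": previous[-1],
        # T-1: API compatibility is guaranteed with the prior release only.
        "api_compatible_with": previous[0],
    }


for version in ("6.0", "7.0", "8.0"):
    print(version, upgrade_policy(version))
```

With a plain "previous three releases" window, 4.0 is covered for 6.0 without 
needing a special-case exception in the rule itself.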

On Wed, Jan 29, 2025, at 12:15 PM, Jeremiah Jordan wrote:
> This got way off topic from 5.1 should be 6.0, so maybe there should be a new 
> DISCUSS thread with the correct title to have a discussion around codifying 
> our upgrade paths?
> 
> FWIW this mostly agrees with my thoughts around upgrade support.
> 
>>>> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is 
>>>> a combination of 3 simple things that I think will improve this situation 
>>>> greatly and hopefully put a nail in the coffin of the topic, improve 
>>>> things, and let us move on to more interesting topics that we can then 
>>>> re-litigate endlessly. ;)
> 
> Depending on what “T-2” means for the online upgrade.  If you mean 4.0, 4.1, 
> and 5.0 are all online upgrade supported versions for trunk, then I agree.  
> If you mean only 4.1 and 5.0 would be online upgrade targets, I would suggest 
> we change that to T-3 so you encompass all “currently supported” releases at 
> the time the new branch is GAed.
> 
> -Jeremiah
> 
> On Jan 29, 2025 at 10:49:17 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>> 
>> To clarify, when I say unspoken it includes "not consciously considered but 
>> shapes engagement patterns". I don't think there's people sitting around 
>> deeply against either the status quo or my proposal who are holding back for 
>> nefarious purposes or anything.
>> 
>> And yeah - my goal is to try and put a little more energy into this to see 
>> if we can surface pushback as I don't think it'd be appropriate to move to a 
>> VOTE thread on a proposal with essentially nil engagement. My intuition is 
>> that the properties of the status quo aren't actually what the polity wants, 
>> whether or not what I'm proposing is an improvement on that status quo.
>> 
>> On Wed, Jan 29, 2025, at 11:15 AM, Benedict wrote:
>>> 
>>> I think you’re making the mistake of assuming a representative sample of 
>>> the community participates in these debates. Sensibly, a majority of the 
>>> community sits these out, and I think on this topic that’s actually the 
>>> rational response.
>>> 
>>> That doesn’t stop folk voting for something else when the decision actually 
>>> matters, as it shouldn’t - the polity can’t bind itself after all.
>>> 
>>> Which is only to say, I applaud your optimism but it’s probably wrong to 
>>> assume there’ll be pushback that reifies the community’s revealed 
>>> preferences. There’s no reason to assume there will be, and history shows 
>>> there usually isn’t.
>>> 
>>> To be clear, I don’t think these are our “unspoken incentives” but our 
>>> collective preferences that simply can’t functionally be codified due to 
>>> the fact nobody is willing to actually argue this is a good thing. 
>>> Sometimes no individual likes what happens, but it’s what the polity 
>>> actually wants, collectively. That’s fine, let’s be at peace with it.
>>> 
>>>> On 29 Jan 2025, at 16:00, Josh McKenzie <jmcken...@apache.org> wrote:
>>>> 
>>>> I've let this topic sit in my head overnight and kind of chewed on it. 
>>>> While I agree w/the "we're doing what matches our unspoken incentives" 
>>>> angle Benedict, I think we can do better than that both for ourselves and 
>>>> our users if we apply energy here and codify something. If people come out 
>>>> with energy to push *against* that codification, that'll at least bring 
>>>> the unspoken incentives to light to work through.
>>>> 
>>>> I think it's important we release on a predictable cadence for our users. 
>>>> We've fallen short (in some cases exceptionally) on this in the past, and 
>>>> it also adds value for operators to plan out verification and adoption 
>>>> cycles. It also helps users considering different databases to see a 
>>>> predictable cadence and a healthy project. My current position is that 12 
>>>> months is a happy medium min-value, especially with a T-2 supported cycle 
>>>> since that gives users between 12 months for high appetite fast adoption 
>>>> up to 36 months for slow verification. I don't want to further pry open 
>>>> Pandora's box, but I'd love to see us cut alphas from trunk quarterly as 
>>>> well.
>>>> 
>>>> I also think it's important that our release versioning is clear and 
>>>> simple. Right now,  *to my mind*, it is not. The current matrix of:
>>>>  • Any .MINOR to next MAJOR is supported
>>>>  • Any .MAJOR to next MAJOR is supported
>>>>  • A release will be supported for some variable amount of time based on 
>>>> when we get around to new releases
>>>>  • API breaks in MAJOR changes, except when we get excited about a feature 
>>>> and want to .MAJOR to signal that in which case it may be completely 
>>>> low-risk and easy adoption, or we change JDK's and need to signal that, or 
>>>> any of another slew of caveats that require digging into NEWS.txt to see 
>>>> what the hell we're up to
>>>>  • And all of our CI pain that ensues from the above
>>>> In my opinion the above is a mess. This isn't a particularly interesting 
>>>> topic to me, and us re-litigating this on every release (even if you 
>>>> discount me agitating about it; this isn't just me making noise I think), 
>>>> is a giant waste of time and energy for a low value outcome.
>>>> 
>>>> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is 
>>>> a combination of 3 simple things that I think will improve this situation 
>>>> greatly and hopefully put a nail in the coffin of the topic, improve 
>>>> things, and let us move on to more interesting topics that we can then 
>>>> re-litigate endlessly. ;)
>>>> 
>>>> So - is anyone actively *against* the above proposal?
>>>> 
>>>> On Tue, Jan 28, 2025, at 11:34 AM, David Capwell wrote:
>>>>> I have not checked Jenkins, but we see this in another environment…
>>>>> 
>>>>> For python upgrades have we actually audited the runtime to see that the 
>>>>> time spent is doing real work?  Josh and I have spent a ton of time 
>>>>> trying to fix (and failing) an issue where the python driver blocks the 
>>>>> test and we wait 2 hours for that to timeout… this pattern is always 
>>>>> after all tests are run… what I see is python upgrades take around 30m of 
>>>>> real work, then 2h of idle blocking taking all resources…
>>>>> 
>>>>> 
>>>>>> On Jan 28, 2025, at 8:16 AM, Benedict <bened...@apache.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> My opinion? Our revealed preferences don’t match whatever ideal is being 
>>>>>> chased whenever we discuss a policy.
>>>>>> 
>>>>>> Ignoring the tick-tock debacle the community has done basically the same 
>>>>>> thing every release, only with a drift towards stricter QA and 
>>>>>> compatibility expectations with maturity.
>>>>>> 
>>>>>> That is, we have always numbered using some combination of semver and 
>>>>>> how exciting the release is, and backed all other decisions out of 
>>>>>> whatever was reasonable once that decision was made.
>>>>>> 
>>>>>> Which basically means a new major every 1 or 2 releases depending on how 
>>>>>> big the new features are. Which is actually pretty intuitive really, but 
>>>>>> isn’t a policy anyone dogmatic wants to argue for.
>>>>>> 
>>>>>>> On 28 Jan 2025, at 16:07, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>> 
>>>>>>>> We revisit this basically every year and so I’m sort of inclined to 
>>>>>>>> keep the status quo which really amounts to basically doing whatever 
>>>>>>>> we end up deciding arbitrarily before we actually cut a release. 
>>>>>>>> 
>>>>>>>> Before discussing at length a new policy we’ll only immediately break
>>>>>>> It's painful how accurate this feels. =/
>>>>>>> 
>>>>>>> Is it the complexity of these topics that's keeping us stuck or a lack 
>>>>>>> of consensus... or both?
>>>>>>> 
>>>>>>>> if the motivation is
>>>>>>> My personal motivation is that our ad hoc re-litigating of this 
>>>>>>> reactively at the last possible moment over and over is uninteresting 
>>>>>>> and feels like a giant waste of time and energy for all of us. But to 
>>>>>>> your point, if trying to formalize it doesn't yield results, that's 
>>>>>>> just objectively worse since it's adding more churn on top of a 
>>>>>>> churn-heavy process. /sigh
>>>>>>> 
>>>>>>> On Tue, Jan 28, 2025, at 11:01 AM, Benedict wrote:
>>>>>>>> 
>>>>>>>> We revisit this basically every year and so I’m sort of inclined to 
>>>>>>>> keep the status quo which really amounts to basically doing whatever 
>>>>>>>> we end up deciding arbitrarily before we actually cut a release. 
>>>>>>>> 
>>>>>>>> Before discussing at length a new policy we’ll only immediately break, 
>>>>>>>> if the motivation is avoiding extra release steps, I would prefer we 
>>>>>>>> just avoid extra release steps by eg running nightly upgrade tests 
>>>>>>>> rather than pre commit, or making the tests faster, or waiting until 
>>>>>>>> the test matrix actually causes anything to break rather than assuming 
>>>>>>>> it will.
>>>>>>>> 
>>>>>>>>> On 28 Jan 2025, at 15:45, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>> 
>>>>>>>>>> Python Upgrade DTests today requires 192x large (7 cpu, 14GB ram) 
>>>>>>>>>> servers
>>>>>>>>> 
>>>>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests.
>>>>>>>>>> There we only need 8x medium (3 cpu, 5GB ram) servers
>>>>>>>>> 
>>>>>>>>> Does anyone have a strong understanding of the coverage and value 
>>>>>>>>> offered by the python upgrade dtests vs. the in-jvm dtests? I don't, 
>>>>>>>>> but I intuitively have a hard time believing the value difference 
>>>>>>>>> matches the hardware requirement difference there.
>>>>>>>>> 
>>>>>>>>>> Lots and lots of words about releases from mick (<3)
>>>>>>>>> Those of you who know me know my "spidey-senses" get triggered by 
>>>>>>>>> enough complexity regardless of how well justified. I feel like our 
>>>>>>>>> release process has passed this threshold for me. Been talking a lot 
>>>>>>>>> with Mick about this topic for a couple weeks and I'm curious if the 
>>>>>>>>> community sees a major flaw with a proposal like the following:
>>>>>>>>>  • We formally support 3 releases at a time
>>>>>>>>>  • We only release MAJOR (i.e. semver major). No more "5.0, 5.1, 
>>>>>>>>> 5.2", would now be "5.0, 6.0, 7.0"
>>>>>>>>>  • We test and support online upgrades between supported releases
>>>>>>>>>  • Any removal or API breakage follows a "deprecate-then-remove" 
>>>>>>>>> cycle
>>>>>>>>>  • We cut a release every 12 months
>>>>>>>>> *Implications for operators:*
>>>>>>>>>  • Upgrade paths for online upgrades are simple and clear. T-2.
>>>>>>>>>  • "Forced" update cadence to stay on supported versions is 3 years
>>>>>>>>>    • If you adopt v1.0 it will be supported until v4.0 comes out 36 
>>>>>>>>> months later
>>>>>>>>>    • This gives users the flexibility to prioritize functionality vs. 
>>>>>>>>> stability and to balance release validation costs
>>>>>>>>>  • Deprecation cycles are clear as are compatibility paths.
>>>>>>>>>  • Release timelines and feature availability are predictable and 
>>>>>>>>> clear
>>>>>>>>> *Implications for developers on the project:*
>>>>>>>>>  • Support requirements for online upgrades are clear
>>>>>>>>>  • Opportunity cost of feature slippage relative to release date is 
>>>>>>>>> balanced (worst-case == 11.99 month delay on availability in GA 
>>>>>>>>> supported release)
>>>>>>>>>  • Path to keep code-base maintainable is clear 
>>>>>>>>> (deprecate-then-remove)
>>>>>>>>>  • CI requirements are constrained and predictable
>>>>>>>>> Moving to an "online upgrades supported for everything" model is something I 
>>>>>>>>> support in principle, but would advocate we consider after getting a 
>>>>>>>>> handle on our release process.
>>>>>>>>> 
>>>>>>>>> So - what do we lose if we consider the above approach?
>>>>>>>>> 
>>>>>>>>> On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
>>>>>>>>>> Jordan, replies inline. 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> To take a snippet from your email "A little empathy for our users 
>>>>>>>>>>> goes a long way."  While I agree clarity is important, forcing our 
>>>>>>>>>>> users to upgrade multiple times is not in their best interest. 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Yes – we would be moving in that direction by now saying we aim for 
>>>>>>>>>> online compatibility across all versions.   But how feasible that 
>>>>>>>>>> turns out to be depends on our future actions and new versions.  
>>>>>>>>>> 
>>>>>>>>>> The separation between "the code maintains compatibility across all 
>>>>>>>>>> versions" versus "we only actively test these upgrade paths so 
>>>>>>>>>> that's our limited recommendation"  is here what lets us reduce the 
>>>>>>>>>> "forcing our users to upgrade multiple times".  That's the "other 
>>>>>>>>>> paths may work but you're on your own – do your homework" aspect.   
>>>>>>>>>> This is a position that allows us to progress into something better.
>>>>>>>>>> 
>>>>>>>>>> For now, and using the current status quo of major/minor usage as 
>>>>>>>>>> the implemented example: this would progress us to no longer needing 
>>>>>>>>>> major versions (we would just test all upgrade paths for all current 
>>>>>>>>>> maintained versions, CI resources permitting).
>>>>>>>>>> The community can change over time as well, it's worth thinking 
>>>>>>>>>> about an approach that is adjustable to changing resources.  (This 
>>>>>>>>>> includes efforts required in documenting past, present, future, 
>>>>>>>>>> especially as changes are made.)
>>>>>>>>>> 
>>>>>>>>>> I emphasise, first I think we need to be focusing on maintaining 
>>>>>>>>>> compatibility in the code (and how and when we are willing/needing 
>>>>>>>>>> to break it).
>>>>>>>>>> 
>>>>>>>>>>  
>>>>>>>>>>> At the same time, doesn't less testing resources primarily 
>>>>>>>>>>> translate to longer test runs?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Too much also saturates the testing cluster to a point where tests 
>>>>>>>>>> become flaky and fail.  ci-cassandra.a.o is already better at 
>>>>>>>>>> exposing flaky tests than other systems.  This is a practicality, 
>>>>>>>>>> and it's constantly being improved, but only under volunteer time.  
>>>>>>>>>> Donating test hardware is the simpler ask.
>>>>>>>>>>  
>>>>>>>>>>> Upgrade tests don't need to be run on every commit. When I worked 
>>>>>>>>>>> on Riak we had very comprehensive upgrade testing (pretty much the 
>>>>>>>>>>> full matrix of versions) and we had a schedule we ran these tests 
>>>>>>>>>>> on ahead of release.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> We are already struggling to stay on top of failures and flakies 
>>>>>>>>>> with ~per-commit builds and butler.c.a.o
>>>>>>>>>> I'm not against the idea of scheduled test runs, but it needs more 
>>>>>>>>>> input and effort from people in that space to accommodate it.
>>>>>>>>>> 
>>>>>>>>>> I am not fond of the idea of "tests ahead of release" – release 
>>>>>>>>>> managers already do enough and are a scarce resource.  Asking them 
>>>>>>>>>> to also be the build butler and chase down bugs and people to fix 
>>>>>>>>>> them is not appropriate IMO.   I also think it's unwise without 
>>>>>>>>>> guarantee that the contributor/committer that created the bug is 
>>>>>>>>>> available at release time.  Having just one post-commit pipeline has 
>>>>>>>>>> nice benefits in simplicity, as long as it's feasible then slow is 
>>>>>>>>>> ok (as you say above).
>>>>>>>>>> 
>>>>>>>>>>  
>>>>>>>>>>> Could you share some more details on the resource issues and their 
>>>>>>>>>>> impacts?
>>>>>>>>>> 
>>>>>>>>>> Python Upgrade DTests and JVM Upgrade DTests.
>>>>>>>>>> 
>>>>>>>>>> Python Upgrade DTests today requires 192x large (7 cpu, 14GB ram) 
>>>>>>>>>> servers, each taking up to one hour.
>>>>>>>>>> Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), 
>>>>>>>>>> and are seeing builds abort because of timeouts (>1hr).  Collected 
>>>>>>>>>> timing numbers suggest we should double this number to 384, or 
>>>>>>>>>> simply remove upgrade paths we test.
>>>>>>>>>> 
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
>>>>>>>>>>  
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>>>>>>>>>> 
>>>>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests.
>>>>>>>>>> There we only need 8x medium (3 cpu, 5GB ram) servers.
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>> 
>> 
