I think you’re making the mistake of assuming a representative sample of the 
community participates in these debates. Sensibly, a majority of the community 
sits these out, and I think on this topic that’s actually the rational response.

That doesn’t stop folk voting for something else when the decision actually 
matters, as it shouldn’t - the polity can’t bind itself after all.

Which is only to say: I applaud your optimism, but it’s probably wrong to 
assume there’ll be pushback that surfaces the community’s revealed 
preferences. History shows there usually isn’t any.

To be clear, I don’t think these are our “unspoken incentives” but our 
collective preferences, which simply can’t functionally be codified because 
nobody is willing to actually argue that this is a good thing. Sometimes no 
individual likes what happens, but it’s what the polity actually wants, 
collectively. That’s fine; let’s be at peace with it.

> On 29 Jan 2025, at 16:00, Josh McKenzie <jmcken...@apache.org> wrote:
> 
> 
> I've let this topic sit in my head overnight and kind of chewed on it. 
> While I agree w/the "we're doing what matches our unspoken incentives" 
> angle, Benedict, I think we can do better than that, both for ourselves and 
> our users, if we apply energy here and codify something. If people come out 
> with energy to push against that codification, that'll at least bring the 
> unspoken incentives to light so we can work through them.
> 
> I think it's important we release on a predictable cadence for our users. 
> We've fallen short on this in the past (in some cases exceptionally). A 
> predictable cadence lets operators plan out verification and adoption 
> cycles, and it shows users evaluating different databases a predictable, 
> healthy project. My current position is that 12 months is a happy medium 
> as a minimum, especially with a T-2 support cycle, since that gives users 
> anywhere from 12 months for high-appetite fast adoption up to 36 months 
> for slow verification. I don't want to further pry open Pandora's box, but 
> I'd love to see us cut alphas from trunk quarterly as well.
> 
> I also think it's important that our release versioning is clear and simple. 
> Right now, to my mind, it is not. The current matrix of:
> - Any .MINOR to next MAJOR is supported
> - Any .MAJOR to next MAJOR is supported
> - A release will be supported for some variable amount of time based on 
>   when we get around to new releases
> - API breaks land in MAJOR changes, except when we get excited about a 
>   feature and want to .MAJOR to signal that (in which case it may be 
>   completely low-risk and easy to adopt), or we change JDKs and need to 
>   signal that, or any of another slew of caveats that require digging into 
>   NEWS.txt to see what the hell we're up to
> - And all of our CI pain that ensues from the above
> In my opinion the above is a mess. This isn't a particularly interesting 
> topic to me, and re-litigating it on every release (even if you discount 
> me agitating about it - I don't think this is just me making noise) is a 
> giant waste of time and energy for a low-value outcome.
> 
> T-2 online upgrade supported, T-1 API compatible, and deprecate-then-remove: 
> a combination of 3 simple things that I think will improve this situation 
> greatly, hopefully put a nail in the coffin of this topic, and let us move 
> on to more interesting topics that we can then re-litigate endlessly. ;)
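> 
> To make those three things concrete, here is a minimal sketch (hypothetical 
> helper names, not project code) of what T-2/T-1 would mean for a pair of 
> semver majors:
> 
>     # T-2: online upgrade supported from up to two majors back.
>     def online_upgrade_supported(from_major: int, to_major: int) -> bool:
>         return 0 < to_major - from_major <= 2
> 
>     # T-1: full API compatibility maintained with the previous major.
>     def api_compatible(from_major: int, to_major: int) -> bool:
>         return to_major - from_major == 1
> 
>     assert online_upgrade_supported(5, 7)      # 5.0 -> 7.0: one online hop
>     assert not online_upgrade_supported(4, 7)  # 4.0 -> 7.0: too far back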
> 
> So - is anyone actively against the above proposal?
> 
> On Tue, Jan 28, 2025, at 11:34 AM, David Capwell wrote:
>> I have not checked Jenkins, but we see this in another environment…
>> 
>> For python upgrades, have we actually audited the runtime to see that the 
>> time spent is doing real work? Josh and I have spent a ton of time trying 
>> (and failing) to fix an issue where the python driver blocks the test and 
>> we wait 2 hours for that to time out… this pattern always shows up after 
>> all tests are run… what I see is python upgrades taking around 30m of real 
>> work, then 2h of idle blocking holding all resources…
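>> 
>> A cheap way to confirm and cap that post-run block (a sketch, assuming the 
>> dtests run under pytest; illustrative only, not current dtest code) is to 
>> arm a faulthandler watchdog when the session finishes, so a process kept 
>> alive by driver threads dumps its stacks and exits instead of idling for 2h:
>> 
>>     # conftest.py sketch (hypothetical addition)
>>     import faulthandler
>> 
>>     def pytest_sessionfinish(session, exitstatus):
>>         # All tests are done here. If non-daemon driver threads keep the
>>         # process alive more than 5 minutes past this point, dump every
>>         # thread's stack and force-exit instead of blocking for 2h.
>>         faulthandler.dump_traceback_later(300, exit=True)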
>> 
>> 
>> Sent from my iPhone
>> 
>>> On Jan 28, 2025, at 8:16 AM, Benedict <bened...@apache.org> wrote:
>>> 
>>> 
>>> My opinion? Our revealed preferences don’t match whatever ideal is being 
>>> chased whenever we discuss a policy.
>>> 
>>> Ignoring the tick-tock debacle, the community has done basically the same 
>>> thing every release, only with a drift towards stricter QA and 
>>> compatibility expectations as the project has matured.
>>> 
>>> That is, we have always numbered using some combination of semver and how 
>>> exciting the release is, and backed all other decisions out of whatever was 
>>> reasonable once that decision was made.
>>> 
>>> Which basically means a new major every 1 or 2 releases, depending on how 
>>> big the new features are. That’s actually pretty intuitive really, but it 
>>> isn’t a policy anyone dogmatic wants to argue for.
>>> 
>>>> On 28 Jan 2025, at 16:07, Josh McKenzie <jmcken...@apache.org> wrote:
>>>> 
>>>>> We revisit this basically every year and so I’m sort of inclined to keep 
>>>>> the status quo which really amounts to basically doing whatever we end up 
>>>>> deciding arbitrarily before we actually cut a release. 
>>>>> 
>>>>> Before discussing at length a new policy we’ll only immediately break
>>>> It's painful how accurate this feels. =/
>>>> 
>>>> Is it the complexity of these topics that's keeping us stuck or a lack of 
>>>> consensus... or both?
>>>> 
>>>>> if the motivation is
>>>> My personal motivation is that our ad hoc re-litigating of this - 
>>>> reactively, at the last possible moment, over and over - is 
>>>> uninteresting and feels like a giant waste of time and energy for all of 
>>>> us. But to your point, if trying to formalize it doesn't yield results, 
>>>> that's just objectively worse, since it adds more churn on top of a 
>>>> churn-heavy process. /sigh
>>>> 
>>>> On Tue, Jan 28, 2025, at 11:01 AM, Benedict wrote:
>>>>> 
>>>>> We revisit this basically every year and so I’m sort of inclined to keep 
>>>>> the status quo which really amounts to basically doing whatever we end up 
>>>>> deciding arbitrarily before we actually cut a release. 
>>>>> 
>>>>> Before discussing at length a new policy we’ll only immediately break: 
>>>>> if the motivation is avoiding extra release steps, I would prefer we 
>>>>> just avoid extra release steps by, e.g., running upgrade tests nightly 
>>>>> rather than pre-commit, making the tests faster, or waiting until the 
>>>>> test matrix actually causes something to break rather than assuming it 
>>>>> will.
>>>>> 
>>>>>>> On 28 Jan 2025, at 15:45, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>> 
>>>>>>> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) 
>>>>>>> servers
>>>>>> 
>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests.
>>>>>>> There we only need 8x medium (3 cpu, 5GB ram) servers
>>>>>> 
>>>>>> Does anyone have a strong understanding of the coverage and value 
>>>>>> offered by the python upgrade dtests vs. the in-jvm dtests? I don't, but 
>>>>>> I intuitively have a hard time believing the value difference matches 
>>>>>> the hardware requirement difference there.
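>>>>>> 
>>>>>> For scale, straight arithmetic from the numbers quoted above: 192 x 7 
>>>>>> cpu = 1344 cores (and 192 x 14GB = 2688GB ram) for the python upgrade 
>>>>>> dtests, versus 8 x 3 cpu = 24 cores (and 8 x 5GB = 40GB ram) for the 
>>>>>> JVM ones - roughly a 56x difference in cores.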
>>>>>> 
>>>>>>> Lots and lots of words about releases from mick (<3)
>>>>>> Those of you who know me know my "spidey-senses" get triggered by 
>>>>>> enough complexity, regardless of how well justified it is. I feel our 
>>>>>> release process has passed that threshold for me. I've been talking a 
>>>>>> lot with Mick about this topic for a couple of weeks, and I'm curious 
>>>>>> whether the community sees a major flaw in a proposal like the 
>>>>>> following:
>>>>>> - We formally support 3 releases at a time
>>>>>> - We only release MAJOR (i.e. semver major). No more "5.0, 5.1, 5.2"; 
>>>>>>   it would now be "5.0, 6.0, 7.0"
>>>>>> - We test and support online upgrades between supported releases
>>>>>> - Any removal or API breakage follows a deprecate-then-remove cycle
>>>>>> - We cut a release every 12 months
>>>>>> 
>>>>>> Implications for operators:
>>>>>> - Upgrade paths for online upgrades are simple and clear: T-2
>>>>>> - "Forced" update cadence to stay on supported versions is 3 years: if 
>>>>>>   you adopt v1.0 it will be supported until v4.0 comes out 36 months 
>>>>>>   later (see the sketch below). This gives users the flexibility to 
>>>>>>   prioritize functionality vs. stability and to balance release 
>>>>>>   validation costs
>>>>>> - Deprecation cycles are clear, as are compatibility paths
>>>>>> - Release timelines and feature availability are predictable and clear
>>>>>> 
>>>>>> Implications for developers on the project:
>>>>>> - Support requirements for online upgrades are clear
>>>>>> - Opportunity cost of feature slippage relative to release date is 
>>>>>>   bounded (worst case: an 11.99-month delay on availability in a GA 
>>>>>>   supported release)
>>>>>> - Path to keep the code-base maintainable is clear 
>>>>>>   (deprecate-then-remove)
>>>>>> - CI requirements are constrained and predictable
>>>>>> 
>>>>>> Moving to "online upgrades supported for everything" is something I 
>>>>>> support in principle, but I would advocate we consider it after 
>>>>>> getting a handle on our release process.
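>>>>>> 
>>>>>> As a rough illustration of that 36-month window (a minimal sketch, 
>>>>>> hypothetical names only, assuming the cadence above: one major every 
>>>>>> 12 months, three releases supported at a time):
>>>>>> 
>>>>>>     RELEASE_CADENCE_MONTHS = 12
>>>>>>     SUPPORTED_RELEASES = 3
>>>>>> 
>>>>>>     def support_remaining(adopted_major: int, current_major: int) -> int:
>>>>>>         """Months of support left for adopted_major once current_major is GA."""
>>>>>>         eol_major = adopted_major + SUPPORTED_RELEASES
>>>>>>         return max(0, eol_major - current_major) * RELEASE_CADENCE_MONTHS
>>>>>> 
>>>>>>     # Adopt v1.0 the day it ships: supported until v4.0, 36 months later.
>>>>>>     assert support_remaining(1, 1) == 36
>>>>>>     # Adopt v1.0 only when v3.0 is current: 12 months left before v4.0.
>>>>>>     assert support_remaining(1, 3) == 12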
>>>>>> 
>>>>>> So - what do we lose if we consider the above approach?
>>>>>> 
>>>>>>> On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
>>>>>>> Jordan, replies inline. 
>>>>>>> 
>>>>>>> 
>>>>>>> To take a snippet from your email "A little empathy for our users goes 
>>>>>>> a long way."  While I agree clarity is important, forcing our users to 
>>>>>>> upgrade multiple times is not in their best interest. 
>>>>>>> 
>>>>>>> 
>>>>>>> Yes – we would be moving in that direction by saying we now aim for 
>>>>>>> online compatibility across all versions. But how feasible that turns 
>>>>>>> out to be depends on our future actions and new versions.
>>>>>>> 
>>>>>>> The separation between "the code maintains compatibility across all 
>>>>>>> versions" and "we only actively test these upgrade paths, so that's 
>>>>>>> our limited recommendation" is what lets us reduce the "forcing our 
>>>>>>> users to upgrade multiple times". That's the "other paths may work but 
>>>>>>> you're on your own – do your homework" aspect. This is a position that 
>>>>>>> allows us to progress into something better.
>>>>>>> 
>>>>>>> For now, using the current status quo of major/minor usage as the 
>>>>>>> implemented example: this would progress us to no longer needing major 
>>>>>>> versions (we would just test all upgrade paths across all currently 
>>>>>>> maintained versions, CI resources permitting).
>>>>>>> The community can change over time as well; it's worth thinking about 
>>>>>>> an approach that is adjustable to changing resources. (This includes 
>>>>>>> the effort required to document past, present, and future policy, 
>>>>>>> especially as changes are made.)
>>>>>>> 
>>>>>>> I emphasise: first, I think we need to focus on maintaining 
>>>>>>> compatibility in the code (and on how and when we are willing, or 
>>>>>>> need, to break it).
>>>>>>> 
>>>>>>>  
>>>>>>> At the same time, doesn't having fewer testing resources primarily 
>>>>>>> translate to longer test runs?
>>>>>>> 
>>>>>>> 
>>>>>>> Too much testing also saturates the cluster to the point where tests 
>>>>>>> become flaky and fail. ci-cassandra.a.o is already better at exposing 
>>>>>>> flaky tests than other systems. This is a practicality, and it's 
>>>>>>> constantly being improved, but only on volunteer time. Donating test 
>>>>>>> hardware is the simpler ask.
>>>>>>>  
>>>>>>> Upgrade tests don't need to be run on every commit. When I worked on 
>>>>>>> Riak we had very comprehensive upgrade testing (pretty much the full 
>>>>>>> matrix of versions) and we had a schedule we ran these tests on ahead 
>>>>>>> of release.
>>>>>>> 
>>>>>>> 
>>>>>>> We are already struggling to stay on top of failures and flakies with 
>>>>>>> ~per-commit builds and butler.c.a.o
>>>>>>> I'm not against the idea of scheduled test runs, but it needs more 
>>>>>>> input and effort from people in that space to make it happen.
>>>>>>> 
>>>>>>> I am not fond of the idea of "tests ahead of release" – release 
>>>>>>> managers already do enough and are a scarce resource. Asking them to 
>>>>>>> also be the build butler, chasing down bugs and the people to fix 
>>>>>>> them, is not appropriate IMO. I also think it's unwise without a 
>>>>>>> guarantee that the contributor/committer who introduced a bug will be 
>>>>>>> available at release time. Having just one post-commit pipeline has 
>>>>>>> nice benefits in simplicity; as long as it's feasible, slow is ok (as 
>>>>>>> you say above).
>>>>>>> 
>>>>>>>  
>>>>>>> Could you share some more details on the resource issues and their 
>>>>>>> impacts?
>>>>>>> 
>>>>>>> Python Upgrade DTests and JVM Upgrade DTests.
>>>>>>> 
>>>>>>> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) 
>>>>>>> servers, each taking up to one hour.
>>>>>>> Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), 
>>>>>>> and we are seeing builds abort because of timeouts (>1hr). Collected 
>>>>>>> timing numbers suggest we should double this number to 384, or simply 
>>>>>>> remove some of the upgrade paths we test.
>>>>>>> 
>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
>>>>>>>  
>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>>>>>>> 
>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests.
>>>>>>> There we only need 8x medium (3 cpu, 5GB ram) servers.
>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
> 
