My opinion is that it would be valuable to take this discussion as a forcing function to determine how we plan to handle releases broadly, and to answer the "5.1 should be 6.0" question from that, assuming we move away from ad hoc per-release debate. If there's strong, broad dissent (i.e. "let's have 6.0 be the next major and talk about this topic separately"), I'm happy to open another thread, but I didn't see clear consensus on this thread yet and was trying to help drive toward it.
> Depending on what “T-2” means for the online upgrade.

For 6.0, that would mean the last 2 majors (5.0, 4.1). I think we'd need to make an exception for 4.0 during the change, much like we made exceptions for 3.0 and 3.x (i.e. T-3), to respect the current paradigm of "any adjacent major.minor to next major" during this transition. For 7.0, online upgrade would be supported from 6.0 and 5.0.

> If you mean only 4.1 and 5.0 would be online upgrade targets, I would suggest we change that to T-3 so you encompass all “currently supported” releases at the time the new branch is GAed.

I think that's better actually, yeah. I was originally thinking T-2 from the "what calendar time frame is reasonable" perspective, but saying "if you're on a currently supported branch you can upgrade to a release that comes out" makes clean intuitive sense. That'd mean:

6.0: 5.0, 4.1, 4.0 online upgrades supported. Drop support for 4.0. API compatibility guaranteed w/5.0.
7.0: 6.0, 5.0, 4.1 online upgrades supported. Drop support for 4.1. API compatibility guaranteed w/6.0.
8.0: 7.0, 6.0, 5.0 online upgrades supported. Drop support for 5.0. API compatibility guaranteed w/7.0.
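To make that rule concrete in code terms (purely illustrative; the names below are made up and this isn't proposing anything for the tree), a new major's online-upgrade sources are just whatever was still in support when it GAs:

    import java.util.Arrays;
    import java.util.List;

    // Throwaway sketch only (hypothetical names, nothing from the codebase):
    // "currently supported branch" == "valid online-upgrade source" for a new major.
    public class UpgradePolicySketch
    {
        // Assumption from the proposal: we formally support 3 release branches at a time.
        static final int SUPPORTED_BRANCHES = 3;

        // A new major's online-upgrade sources are the (up to) 3 releases immediately
        // before it in the release line, i.e. everything still in support when it GAs.
        // (Assumes newMajor is actually in the release line.)
        static List<String> onlineUpgradeSources(List<String> releaseLine, String newMajor)
        {
            int i = releaseLine.indexOf(newMajor);
            return releaseLine.subList(Math.max(0, i - SUPPORTED_BRANCHES), i);
        }

        public static void main(String[] args)
        {
            List<String> line = Arrays.asList("4.0", "4.1", "5.0", "6.0", "7.0", "8.0");
            System.out.println(onlineUpgradeSources(line, "6.0")); // [4.0, 4.1, 5.0]
            System.out.println(onlineUpgradeSources(line, "7.0")); // [4.1, 5.0, 6.0]
            System.out.println(onlineUpgradeSources(line, "8.0")); // [5.0, 6.0, 7.0]
        }
    }

Once we only cut majors, "T-3" and "all currently supported releases" describe the same set, which is the appeal.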
On Wed, Jan 29, 2025, at 12:15 PM, Jeremiah Jordan wrote:
> This got way off topic from "5.1 should be 6.0", so maybe there should be a new DISCUSS thread with the correct title to have a discussion around codifying our upgrade paths?
>
> FWIW this mostly agrees with my thoughts around upgrade support.
>
>>>> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is a combination of 3 simple things that I think will improve this situation greatly and hopefully put a nail in the coffin of the topic, improve things, and let us move on to more interesting topics that we can then re-litigate endlessly. ;)
>
> Depending on what “T-2” means for the online upgrade. If you mean 4.0, 4.1, and 5.0 are all online upgrade supported versions for trunk, then I agree. If you mean only 4.1 and 5.0 would be online upgrade targets, I would suggest we change that to T-3 so you encompass all “currently supported” releases at the time the new branch is GAed.
>
> -Jeremiah
>
> On Jan 29, 2025 at 10:49:17 AM, Josh McKenzie <jmcken...@apache.org> wrote:
>>
>> To clarify, when I say unspoken it includes "not consciously considered but shapes engagement patterns". I don't think there are people sitting around deeply against either the status quo or my proposal who are holding back for nefarious purposes or anything.
>>
>> And yeah - my goal is to try and put a little more energy into this to see if we can surface pushback, as I don't think it'd be appropriate to move to a VOTE thread on a proposal with essentially nil engagement. My intuition is that the properties of the status quo aren't actually what the polity wants, whether or not what I'm proposing is an improvement on that status quo.
>>
>> On Wed, Jan 29, 2025, at 11:15 AM, Benedict wrote:
>>>
>>> I think you’re making the mistake of assuming a representative sample of the community participates in these debates. Sensibly, a majority of the community sits these out, and I think on this topic that’s actually the rational response.
>>>
>>> That doesn’t stop folk voting for something else when the decision actually matters, as it shouldn’t - the polity can’t bind itself, after all.
>>>
>>> Which is only to say, I applaud your optimism, but it’s probably wrong to assume there’ll be pushback that reifies the community’s revealed preferences. There’s no reason to assume there will be, and history shows there usually isn’t.
>>>
>>> To be clear, I don’t think these are our “unspoken incentives” but our collective preferences, which simply can’t functionally be codified because nobody is willing to actually argue this is a good thing. Sometimes no individual likes what happens, but it’s what the polity actually wants, collectively. That’s fine; let’s be at peace with it.
>>>
>>>> On 29 Jan 2025, at 16:00, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>
>>>> I've let this topic sit in my head overnight and kind of chewed on it. While I agree w/the "we're doing what matches our unspoken incentives" angle, Benedict, I think we can do better than that, both for ourselves and our users, if we apply energy here and codify something. If people come out with energy to push *against* that codification, that'll at least bring the unspoken incentives to light to work through.
>>>>
>>>> I think it's important we release on a predictable cadence for our users. We've fallen short (in some cases exceptionally) on this in the past, and a predictable cadence adds value for operators planning out verification and adoption cycles. It also helps users considering different databases to see a predictable cadence and a healthy project. My current position is that 12 months is a happy medium minimum, especially with a T-2 supported cycle, since that gives users anywhere from 12 months for high-appetite fast adoption up to 36 months for slow verification. I don't want to further pry open Pandora's box, but I'd love to see us cut alphas from trunk quarterly as well.
>>>>
>>>> I also think it's important that our release versioning is clear and simple. Right now, *to my mind*, it is not. The current matrix of:
>>>> • Any .MINOR to next MAJOR is supported
>>>> • Any .MAJOR to next MAJOR is supported
>>>> • A release will be supported for some variable amount of time based on when we get around to new releases
>>>> • API breaks in MAJOR changes, except when we get excited about a feature and want to .MAJOR to signal that (in which case it may be completely low-risk and easy adoption), or we change JDKs and need to signal that, or any of another slew of caveats that require digging into NEWS.txt to see what the hell we're up to
>>>> • And all of our CI pain that ensues from the above
>>>> In my opinion the above is a mess. This isn't a particularly interesting topic to me, and our re-litigating it on every release (even if you discount me agitating about it; this isn't just me making noise, I think) is a giant waste of time and energy for a low-value outcome.
>>>>
>>>> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is a combination of 3 simple things that I think will improve this situation greatly and hopefully put a nail in the coffin of the topic, improve things, and let us move on to more interesting topics that we can then re-litigate endlessly. ;)
>>>>
>>>> So - is anyone actively *against* the above proposal?
>>>>
>>>> On Tue, Jan 28, 2025, at 11:34 AM, David Capwell wrote:
>>>>> I have not checked Jenkins, but we see this in another environment…
>>>>>
>>>>> For python upgrades, have we actually audited the runtime to see that the time spent is doing real work?
>>>>> Josh and I have spent a ton of time trying (and failing) to fix an issue where the python driver blocks the test and we wait 2 hours for that to time out… this pattern is always after all tests are run… what I see is python upgrades take around 30m of real work, then 2h of idle blocking taking all resources…
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Jan 28, 2025, at 8:16 AM, Benedict <bened...@apache.org> wrote:
>>>>>>
>>>>>> My opinion? Our revealed preferences don’t match whatever ideal is being chased whenever we discuss a policy.
>>>>>>
>>>>>> Ignoring the tick-tock debacle, the community has done basically the same thing every release, only with a drift towards stricter QA and compatibility expectations with maturity.
>>>>>>
>>>>>> That is, we have always numbered using some combination of semver and how exciting the release is, and backed all other decisions out of whatever was reasonable once that decision was made.
>>>>>>
>>>>>> Which basically means a new major every 1 or 2 releases, depending on how big the new features are. Which is actually pretty intuitive really, but isn’t a policy anyone dogmatic wants to argue for.
>>>>>>
>>>>>>> On 28 Jan 2025, at 16:07, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>
>>>>>>>> We revisit this basically every year and so I’m sort of inclined to keep the status quo, which really amounts to basically doing whatever we end up deciding arbitrarily before we actually cut a release.
>>>>>>>>
>>>>>>>> Before discussing at length a new policy we’ll only immediately break
>>>>>>> It's painful how accurate this feels. =/
>>>>>>>
>>>>>>> Is it the complexity of these topics that's keeping us stuck, or a lack of consensus... or both?
>>>>>>>
>>>>>>>> if the motivation is
>>>>>>> My personal motivation is that our ad hoc re-litigating of this, reactively, at the last possible moment, over and over, is uninteresting and feels like a giant waste of time and energy for all of us. But to your point, if trying to formalize it doesn't yield results, that's just objectively worse since it's adding more churn on top of a churn-heavy process. /sigh
>>>>>>>
>>>>>>> On Tue, Jan 28, 2025, at 11:01 AM, Benedict wrote:
>>>>>>>>
>>>>>>>> We revisit this basically every year and so I’m sort of inclined to keep the status quo, which really amounts to basically doing whatever we end up deciding arbitrarily before we actually cut a release.
>>>>>>>>
>>>>>>>> Before discussing at length a new policy we’ll only immediately break: if the motivation is avoiding extra release steps, I would prefer we just avoid extra release steps by e.g. running nightly upgrade tests rather than pre-commit, or making the tests faster, or waiting until the test matrix actually causes anything to break rather than assuming it will.
>>>>>>>>
>>>>>>>>> On 28 Jan 2025, at 15:45, Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers
>>>>>>>>>
>>>>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests. There we only need 8x medium (3 cpu, 5GB ram) servers
>>>>>>>>>
>>>>>>>>> Does anyone have a strong understanding of the coverage and value offered by the python upgrade dtests vs. the in-jvm dtests?
>>>>>>>>> I don't, but I intuitively have a hard time believing the value difference matches the hardware requirement difference there.
>>>>>>>>>
>>>>>>>>>> Lots and lots of words about releases from mick (<3)
>>>>>>>>> Those of you who know me know my "spidey-senses" get triggered by enough complexity, regardless of how well justified. I feel like our release process has passed that threshold for me. I've been talking a lot with Mick about this topic for a couple of weeks, and I'm curious if the community sees a major flaw with a proposal like the following:
>>>>>>>>> • We formally support 3 releases at a time
>>>>>>>>> • We only release MAJOR (i.e. semver major). No more "5.0, 5.1, 5.2"; it would now be "5.0, 6.0, 7.0"
>>>>>>>>> • We test and support online upgrades between supported releases
>>>>>>>>> • Any removal or API breakage follows a "deprecate-then-remove" cycle
>>>>>>>>> • We cut a release every 12 months
>>>>>>>>> *Implications for operators:*
>>>>>>>>> • Upgrade paths for online upgrades are simple and clear. T-2.
>>>>>>>>> • "Forced" update cadence to stay on supported versions is 3 years
>>>>>>>>> • If you adopt v1.0 it will be supported until v4.0 comes out 36 months later
>>>>>>>>> • This gives users the flexibility to prioritize functionality vs. stability and to balance release validation costs
>>>>>>>>> • Deprecation cycles are clear, as are compatibility paths
>>>>>>>>> • Release timelines and feature availability are predictable and clear
>>>>>>>>> *Implications for developers on the project:*
>>>>>>>>> • Support requirements for online upgrades are clear
>>>>>>>>> • Opportunity cost of feature slippage relative to release date is balanced (worst case == 11.99-month delay on availability in a GA supported release)
>>>>>>>>> • Path to keep the code-base maintainable is clear (deprecate-then-remove)
>>>>>>>>> • CI requirements are constrained and predictable
>>>>>>>>> Moving to "online upgrades supported for everything" is something I support in principle, but I would advocate we consider it after getting a handle on our release process.
>>>>>>>>>
>>>>>>>>> So - what do we lose if we consider the above approach?
>>>>>>>>>
>>>>>>>>> On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
>>>>>>>>>> Jordan, replies inline.
>>>>>>>>>>
>>>>>>>>>>> To take a snippet from your email: "A little empathy for our users goes a long way." While I agree clarity is important, forcing our users to upgrade multiple times is not in their best interest.
>>>>>>>>>>
>>>>>>>>>> Yes – we would be moving in that direction by now saying we aim for online compatibility across all versions. But how feasible that turns out to be depends on our future actions and new versions.
>>>>>>>>>>
>>>>>>>>>> The separation between "the code maintains compatibility across all versions" versus "we only actively test these upgrade paths, so that's our limited recommendation" is what lets us reduce the "forcing our users to upgrade multiple times". That's the "other paths may work but you're on your own – do your homework" aspect. This is a position that allows us to progress into something better.
>>>>>>>>>>
>>>>>>>>>> For now, and using the current status quo of major/minor usage as the implemented example: this would progress us to no longer needing major versions (we would just test all upgrade paths for all currently maintained versions, CI resources permitting).
>>>>>>>>>> The community can change over time as well, so it's worth thinking about an approach that is adjustable to changing resources. (This includes the effort required in documenting past, present, and future, especially as changes are made.)
>>>>>>>>>>
>>>>>>>>>> I emphasise: first, I think we need to focus on maintaining compatibility in the code (and on how and when we are willing/needing to break it).
>>>>>>>>>>
>>>>>>>>>>> At the same time, doesn't less testing resources primarily translate to longer test runs?
>>>>>>>>>>
>>>>>>>>>> Too much load also saturates the testing cluster to the point where tests become flaky and fail. ci-cassandra.a.o is already better at exposing flaky tests than other systems. This is a practicality, and it's constantly being improved, but only on volunteer time. Donating test hardware is the simpler ask.
>>>>>>>>>>
>>>>>>>>>>> Upgrade tests don't need to be run on every commit. When I worked on Riak we had very comprehensive upgrade testing (pretty much the full matrix of versions), and we had a schedule we ran these tests on ahead of release.
>>>>>>>>>>
>>>>>>>>>> We are already struggling to stay on top of failures and flakies with ~per-commit builds and butler.c.a.o.
>>>>>>>>>> I'm not against the idea of scheduled test runs, but it needs more input and effort from people in that space to make it work.
>>>>>>>>>>
>>>>>>>>>> I am not fond of the idea of "tests ahead of release" – release managers already do enough and are a scarce resource. Asking them to also be the build butler and chase down bugs, and people to fix them, is not appropriate IMO. I also think it's unwise without a guarantee that the contributor/committer that created the bug is available at release time. Having just one post-commit pipeline has nice benefits in simplicity; as long as it's feasible, slow is ok (as you say above).
>>>>>>>>>>
>>>>>>>>>>> Could you share some more details on the resource issues and their impacts?
>>>>>>>>>>
>>>>>>>>>> Python Upgrade DTests and JVM Upgrade DTests.
>>>>>>>>>>
>>>>>>>>>> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers, each taking up to one hour.
>>>>>>>>>> Currently we have too many upgrade paths (4.0, 4.1, and 5.0, to trunk), and are seeing builds abort because of timeouts (>1hr). Collected timing numbers suggest we should either double this number to 384, or simply remove some of the upgrade paths we test.
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>>>>>>>>>>
>>>>>>>>>> We have far fewer (and more effective?) JVM Upgrade DTests. There we only need 8x medium (3 cpu, 5GB ram) servers.
>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
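P.S. Since "deprecate-then-remove" is carrying a lot of weight in the proposal above, here's a tiny sketch of what I mean in practice. The names are made up purely for illustration, not anything in the codebase:

    // Hypothetical names, only to make the "deprecate-then-remove" cycle concrete.
    // Release N ships both the old and the new form, with the old one flagged:
    public class CompactionOptions
    {
        /** New form, introduced in release N. */
        public int throughputMiBPerSec = 64;

        /**
         * Old form: still honoured in release N so existing configs keep working,
         * but documented (NEWS.txt) and warned about as slated for removal.
         * @deprecated since N, use {@link #throughputMiBPerSec}; removed in N+1.
         */
        @Deprecated
        public Integer throughputMbPerSec = null;
    }
    // Release N+1 then deletes throughputMbPerSec outright: anyone upgrading along a
    // supported path has had at least one full release of deprecation warnings first.

Nothing novel, just making sure we all mean the same thing by the term.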