Re: [DISCUSS] 5.1 should be 6.0

Jeremiah Jordan Wed, 29 Jan 2025 09:15:32 -0800

This got way off topic from "5.1 should be 6.0", so maybe there should be a
new DISCUSS thread with the correct title to have a discussion around
codifying our upgrade paths?


FWIW this mostly agrees with my thoughts around upgrade support.

> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is
> a combination of 3 simple things that I think will improve this situation
> greatly and hopefully put a nail in the coffin of the topic, improve
> things, and let us move on to more interesting topics that we can then
> re-litigate endlessly. ;)
>
>
That depends on what “T-2” means for online upgrades.  If you mean 4.0,
4.1, and 5.0 are all online upgrade supported versions for trunk, then I
agree.  If you mean only 4.1 and 5.0 would support online upgrades to trunk,
I would suggest we change that to T-3 so it encompasses all “currently
supported” releases at the time the new branch is GAed.
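
To make the two readings concrete, here is a minimal sketch in Python. The
release list and the online_upgrade_sources helper are hypothetical
illustrations, not anything in the Cassandra code base. It assumes "T" is the
release cut from trunk and that T-n covers the n releases immediately before it.

```python
# Hypothetical illustration: the release order is taken from this thread.
RELEASES = ["4.0", "4.1", "5.0", "trunk"]


def online_upgrade_sources(target, n, releases=RELEASES):
    """Return the n releases immediately preceding `target`, oldest first."""
    idx = releases.index(target)
    return releases[max(0, idx - n):idx]


# T-2 reading: only the two most recent releases can upgrade online to trunk.
print(online_upgrade_sources("trunk", 2))  # ['4.1', '5.0']

# T-3 reading: every release named in the thread can upgrade online to trunk.
print(online_upgrade_sources("trunk", 3))  # ['4.0', '4.1', '5.0']
```

Under these assumptions the T-2 reading drops 4.0 as an upgrade source, which
is the gap the T-3 suggestion is meant to close.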

-Jeremiah

On Jan 29, 2025 at 10:49:17 AM, Josh McKenzie wrote:

> To clarify, when I say unspoken it includes "not consciously considered
> but shapes engagement patterns". I don't think there are people sitting
> around deeply against either the status quo or my proposal who are holding
> back for nefarious purposes or anything.
>
> And yeah - my goal is to try and put a little more energy into this to see
> if we can surface pushback as I don't think it'd be appropriate to move to
> a VOTE thread on a proposal with essentially nil engagement. My intuition
> is that the properties of the status quo aren't actually what the polity
> wants, whether or not what I'm proposing is an improvement on that status
> quo.
>
> On Wed, Jan 29, 2025, at 11:15 AM, Benedict wrote:
>
>
> I think you’re making the mistake of assuming a representative sample of
> the community participates in these debates. Sensibly, a majority of the
> community sits these out, and I think on this topic that’s actually the
> rational response.
>
> That doesn’t stop folk voting for something else when the decision
> actually matters, as it shouldn’t - the polity can’t bind itself after all.
>
> Which is only to say, I applaud your optimism but it’s probably wrong to
> assume there’ll be pushback that reifies the community’s revealed
> preferences. There’s no reason to assume there will be, and history shows
> there usually isn’t.
>
> To be clear, I don’t think these are our “unspoken incentives” but our
> collective preferences that simply can’t functionally be codified due to
> the fact nobody is willing to actually argue this is a good thing.
> Sometimes no individual likes what happens, but it’s what the polity
> actually wants, collectively. That’s fine, let’s be at peace with it.
>
> On 29 Jan 2025, at 16:00, Josh McKenzie wrote:
>
> I've let this topic sit in my head overnight and kind of chewed on it.
> While I agree w/the "we're doing what matches our unspoken incentives"
> angle Benedict, I think we can do better than that both for ourselves and
> our users if we apply energy here and codify something. If people come out
> with energy to push *against* that codification, that'll at least bring
> the unspoken incentives to light to work through.
>
> I think it's important we release on a predictable cadence for our users.
> We've fallen short (in some cases exceptionally) on this in the past, and
> it also adds value for operators to plan out verification and adoption
> cycles. It also helps users considering different databases to see a
> predictable cadence and a healthy project. My current position is that 12
> months is a happy-medium minimum, especially with a T-2 supported cycle,
> since that gives users anywhere from 12 months for high-appetite fast
> adoption up to 36 months for slow verification. I don't want to further pry open
> Pandora's box, but I'd love to see us cut alphas from trunk quarterly as
> well.
>
> I also think it's important that our release versioning is clear and
> simple. Right now,  *to my mind*, it is not. The current matrix of:
>
>    - Any .MINOR to next MAJOR is supported
>    - Any .MAJOR to next MAJOR is supported
>    - A release will be supported for some variable amount of time based
>    on when we get around to new releases
>    - API breaks in MAJOR changes, except when we get excited about a
>    feature and want a .MAJOR to signal that, in which case it may be
>    completely low-risk and easy adoption, or we change JDKs and need to
>    signal that, or any of a slew of other caveats that require digging into
>    NEWS.txt to see what the hell we're up to
>    - And all of our CI pain that ensues from the above
>
> In my opinion the above is a mess. This isn't a particularly interesting
> topic to me, and us re-litigating this on every release (even if you
> discount me agitating about it; this isn't just me making noise, I think)
> is a giant waste of time and energy for a low-value outcome.
>
> T-2 online upgrade supported, T-1 API compatible, deprecate-then-remove is
> a combination of 3 simple things that I think will improve this situation
> greatly and hopefully put a nail in the coffin of the topic, improve
> things, and let us move on to more interesting topics that we can then
> re-litigate endlessly. ;)
>
> So - is anyone actively *against* the above proposal?
>
> On Tue, Jan 28, 2025, at 11:34 AM, David Capwell wrote:
>
> I have not checked Jenkins, but we see this in another environment…
>
> For python upgrades, have we actually audited the runtime to see that the
> time spent is doing real work?  Josh and I have spent a ton of time trying
> (and failing) to fix an issue where the python driver blocks the test and
> we wait 2 hours for that to time out… this pattern is always after all tests
> are run… what I see is python upgrades take around 30m of real work, then
> 2h of idle blocking taking all resources…
>
> On Jan 28, 2025, at 8:16 AM, Benedict wrote:
>
> My opinion? Our revealed preferences don’t match whatever ideal is being
> chased whenever we discuss a policy.
>
> Ignoring the tick-tock debacle, the community has done basically the same
> thing every release, only with a drift towards stricter QA and
> compatibility expectations with maturity.
>
> That is, we have always numbered using some combination of semver and how
> exciting the release is, and backed all other decisions out of whatever was
> reasonable once that decision was made.
>
> Which basically means a new major every 1 or 2 releases depending on how
> big the new features are. Which is actually pretty intuitive really, but
> isn’t a policy anyone dogmatic wants to argue for.
>
> On 28 Jan 2025, at 16:07, Josh McKenzie wrote:
>
> > We revisit this basically every year and so I’m sort of inclined to keep
> > the status quo which really amounts to basically doing whatever we end up
> > deciding arbitrarily before we actually cut a release.
> >
> > Before discussing at length a new policy we’ll only immediately break
>
> It's painful how accurate this feels. =/
>
> Is it the complexity of these topics that's keeping us stuck or a lack of
> consensus... or both?
>
> > if the motivation is
>
> My personal motivation is that our ad hoc re-litigating of this reactively
> at the last possible moment over and over is uninteresting and feels like a
> giant waste of time and energy for all of us. But to your point, if trying
> to formalize it doesn't yield results, that's just objectively worse since
> it's adding more churn on top of a churn-heavy process. /sigh
>
> On Tue, Jan 28, 2025, at 11:01 AM, Benedict wrote:
>
>
> We revisit this basically every year and so I’m sort of inclined to keep
> the status quo which really amounts to basically doing whatever we end up
> deciding arbitrarily before we actually cut a release.
>
> Before discussing at length a new policy we’ll only immediately break, if
> the motivation is avoiding extra release steps, I would prefer we just
> avoid extra release steps by eg running nightly upgrade tests rather than
> pre commit, or making the tests faster, or waiting until the test matrix
> actually causes anything to break rather than assuming it will.
>
> On 28 Jan 2025, at 15:45, Josh McKenzie wrote:
>
> > Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers
> >
> > We have far fewer (and more effective?) JVM Upgrade DTests.
> > There we only need 8x medium (3 cpu, 5GB ram) servers
>
>
> Does anyone have a strong understanding of the coverage and value offered
> by the python upgrade dtests vs. the in-jvm dtests? I don't, but I
> intuitively have a hard time believing the value difference matches the
> hardware requirement difference there.
>
> Lots and lots of words about releases from mick (<3)
>
> Those of you who know me know my "spidey-senses" get triggered by enough
> complexity regardless of how well justified. I feel like our release
> process has passed this threshold for me. Been talking a lot with Mick
> about this topic for a couple weeks and I'm curious if the community sees a
> major flaw with a proposal like the following:
>
>    - We formally support 3 releases at a time
>    - We only release MAJOR (i.e. semver major). No more "5.0, 5.1, 5.2",
>    would now be "5.0, 6.0, 7.0"
>    - We test and support online upgrades between supported releases
>    - Any removal or API breakage follows a "deprecate-then-remove" cycle
>    - We cut a release every 12 months
>
> *Implications for operators:*
>
>    - Upgrade paths for online upgrades are simple and clear. T-2.
>    - "Forced" update cadence to stay on supported versions is 3 years
>       - If you adopt v1.0 it will be supported until v4.0 comes out 36
>       months later
>       - This gives users the flexibility to prioritize functionality vs.
>       stability and to balance release validation costs
>       - Deprecation cycles are clear as are compatibility paths.
>    - Release timelines and feature availability are predictable and clear
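
The support-window arithmetic in the operator bullets above can be sketched in
a few lines of Python. The constant and function names are hypothetical; the
only inputs taken from the proposal are the 12-month cadence and the three
concurrently supported releases, and majors are assumed to increment by one
per release.

```python
RELEASE_INTERVAL_MONTHS = 12  # proposal: cut a release every 12 months
SUPPORTED_RELEASES = 3        # proposal: formally support 3 releases at a time


def end_of_support_major(adopted_major, supported=SUPPORTED_RELEASES):
    """Major version whose GA ends support for `adopted_major`."""
    return adopted_major + supported


def support_window_months(interval=RELEASE_INTERVAL_MONTHS,
                          supported=SUPPORTED_RELEASES):
    """Months between a release's GA and the GA that ends its support."""
    return interval * supported


print(end_of_support_major(1))   # 4, i.e. v1.0 is supported until v4.0
print(support_window_months())   # 36
```

Changing either constant shows how the forced upgrade cadence moves, for
example a 4-release window at the same 12-month interval gives 48 months.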
>
> *Implications for developers on the project:*
>
>    - Support requirements for online upgrades are clear
>    - Opportunity cost of feature slippage relative to release date is
>    balanced (worst-case == 11.99 month delay on availability in GA supported
>    release)
>    - Path to keep code-base maintainable is clear (deprecate-then-remove)
>    - CI requirements are constrained and predictable
>
> Moving to an "online upgrades supported for everything" policy is something I
> support in principle, but would advocate we consider after getting a handle
> on our release process.
>
> So - what do we lose if we consider the above approach?
>
> On Tue, Jan 28, 2025, at 8:23 AM, Mick Semb Wever wrote:
>
> Jordan, replies inline.
>
>
> > To take a snippet from your email "A little empathy for our users goes a
> > long way."  While I agree clarity is important, forcing our users to
> > upgrade multiple times is not in their best interest.
>
>
>
> Yes – we would be moving in that direction by now saying we aim for online
> compatibility across all versions.   But how feasible that turns out to be
> depends on our future actions and new versions.
>
> The separation between "the code maintains compatibility across all
> versions" and "we only actively test these upgrade paths so that's our
> limited recommendation" is what lets us reduce the "forcing our users
> to upgrade multiple times".  That's the "other paths may work but you're on
> your own – do your homework" aspect.   This is a position that allows us to
> progress into something better.
>
> For now, and using the current status quo of major/minor usage as the
> implemented example: this would progress us to no longer needing major
> versions (we would just test all upgrade paths for all current maintained
> versions, CI resources permitting).
> The community can change over time as well, so it's worth thinking about an
> approach that is adjustable to changing resources.  (This includes the effort
> required in documenting past, present, and future, especially as changes are
> made.)
>
> I emphasise, first I think we need to be focusing on maintaining
> compatibility in the code (and how and when we are willing/needing to break
> it).
>
>
>
> > At the same time, don't fewer testing resources primarily translate to
> > longer test runs?
>
>
>
> Too much also saturates the testing cluster to a point where tests become
> flaky and fail.  ci-cassandra.a.o is already better at exposing flaky tests
> than other systems.  This is a practicality, and it's constantly being
> improved, but only under volunteer time.  Donating test hardware is
> the simpler ask.
>
>
> > Upgrade tests don't need to be run on every commit. When I worked on Riak
> > we had very comprehensive upgrade testing (pretty much the full matrix of
> > versions) and we had a schedule we ran these tests on ahead of release.
>
>
>
> We are already struggling to stay on top of failures and flakies with
> ~per-commit builds and butler.c.a.o
> I'm not against the idea of scheduled test runs, but it needs more input
> and effort from people in that space for CI to accommodate them.
>
> I am not fond of the idea of "tests ahead of release" – release managers
> already do enough and are a scarce resource.  Asking them to also be the
> build butler and chase down bugs and people to fix them is not appropriate
> IMO.   I also think it's unwise without a guarantee that the
> contributor/committer that created the bug is available at release time.
> Having just one post-commit pipeline has nice benefits in simplicity, as
> long as it's feasible then slow is ok (as you say above).
>
>
>
> > Could you share some more details on the resource issues and their impacts?
>
>
> Python Upgrade DTests and JVM Upgrade DTests.
>
> Python Upgrade DTests today require 192x large (7 cpu, 14GB ram) servers,
> each taking up to one hour.
> Currently we have too many upgrade paths (4.0, 4.1, 5.0, to trunk), and
> are seeing builds abort because of timeouts (>1hr).  Collected timing
> numbers suggest we should double this number to 384, or simply remove
> some of the upgrade paths we test.
>
>
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L185-L188
>
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L37
>
> We have far fewer (and more effective?) JVM Upgrade DTests.
> There we only need 8x medium (3 cpu, 5GB ram) servers.
> https://github.com/apache/cassandra/blob/trunk/.jenkins/Jenkinsfile#L177
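
For a rough sense of scale, the server counts and sizes quoted above can be
totalled in a short Python sketch. The Fleet class and variable names are
hypothetical; the figures are the ones given in this message, and the totals
are plain multiplication rather than measured usage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    """A group of identically sized CI servers (hypothetical helper)."""
    servers: int
    cpus_per_server: int
    ram_gb_per_server: int

    @property
    def total_cpus(self):
        return self.servers * self.cpus_per_server

    @property
    def total_ram_gb(self):
        return self.servers * self.ram_gb_per_server


python_dtests = Fleet(servers=192, cpus_per_server=7, ram_gb_per_server=14)
python_dtests_doubled = Fleet(servers=384, cpus_per_server=7, ram_gb_per_server=14)
jvm_dtests = Fleet(servers=8, cpus_per_server=3, ram_gb_per_server=5)

for name, fleet in [("python upgrade dtests", python_dtests),
                    ("python upgrade dtests (doubled)", python_dtests_doubled),
                    ("jvm upgrade dtests", jvm_dtests)]:
    print(f"{name}: {fleet.total_cpus} cpu, {fleet.total_ram_gb} GB ram")

# CPU ratio between the two suites at today's sizing.
print(python_dtests.total_cpus / jvm_dtests.total_cpus)
```

These totals say nothing about test coverage or value; they only quantify the
hardware gap between the two suites.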