It's clear from discussion on this list that the current "storage_compatibility_mode" implementation and upgrade path for 5.0 is a source of real and legitimate user pain, and is likely to result in many organizations slowing their adoption of the release. I'd love to discuss on dev@ how we can improve this (e.g., enabling transitions via nodetool/JMX without process restarts); or, ideally, obviating the need to manually advance between them in 5.1/6.0 via TCM -- which might produce an even smoother upgrade path for 4.x users that requires minimal user action.
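For anyone who hasn't been through it yet, the mechanism today is a static cassandra.yaml setting, sketched roughly below (option names are as I recall them from the 5.0 defaults, so treat this as illustrative rather than authoritative); each step is a config edit plus a full rolling restart:

    # cassandra.yaml (5.0) -- the current three-step path, one rolling restart per step
    # Step 1: upgrade binaries 4.x -> 5.0 with the compatibility default left in place
    storage_compatibility_mode: CASSANDRA_4
    # Step 2: once every node is on 5.0, change the value and roll the cluster again
    # storage_compatibility_mode: UPGRADING
    # Step 3: once every node has been through UPGRADING, change it once more and roll again
    # storage_compatibility_mode: NONE

A nodetool/JMX transition would turn the second and third rolls into live switches; nothing like that exists today, which is exactly the pain described below.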
– Scott

On Dec 18, 2024, at 9:43 AM, Jon Haddad <j...@rustyrazorblade.com> wrote:

> We (Wikimedia) have had more (major) upgrades go wrong in some way than right. Any significant upgrade is going to be weeks, if not months, in the making, with careful testing, a phased rollout, and a workable plan for rollback. We'd never entertain doing more than one at a time; it's just way too many moving parts.

The question wasn't about why upgrades are hard, it was about why a rolling restart of the cluster is hard. They're different things.

* Yes, upgrades should go through a rigorous qualification process.
* No, rolling restarts shouldn't be a major endeavor.

If an organization has thousands of Cassandra nodes, it should also have tooling to perform rolling restarts of a cluster, whether one node, multiple nodes in a rack, or an entire rack at a time. I consider this fundamental to operating Cassandra at scale. I've worked with organizations that have had this dialed in well, and ones that have done it by hand. The ones that did 1K nodes by hand really hated rolling restarts. The ones that did it well didn't care at all, because it was behind automation, and we'd do it whenever we needed to, not just during off hours.
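To make that concrete, the core of the tooling doesn't need to be fancy. Here's a rough sketch of the kind of loop I mean; the host list, the systemd unit name, and the plain-ssh transport are placeholders, and real tooling would add proper health checks, rack-aware parallelism, and retries:

    #!/usr/bin/env python3
    """Minimal rolling-restart sketch: drain, restart, wait for Up/Normal, move on."""
    import subprocess
    import time

    HOSTS = ["cassandra-01", "cassandra-02", "cassandra-03"]  # placeholder inventory
    SERVICE = "cassandra"                                     # placeholder systemd unit

    def ssh(host, *cmd):
        """Run a command on a remote host and return its stdout."""
        result = subprocess.run(["ssh", host, *cmd], check=True,
                                capture_output=True, text=True)
        return result.stdout

    def ring_is_healthy(host):
        """True when every node reported by `nodetool status` on `host` is UN."""
        try:
            status = ssh(host, "nodetool", "status")
        except subprocess.CalledProcessError:
            return False  # node not answering yet, e.g. mid-restart
        node_lines = [line for line in status.splitlines()
                      if len(line) >= 2 and line[0] in "UD" and line[1] in "NLJM"]
        return bool(node_lines) and all(line.startswith("UN") for line in node_lines)

    def restart_node(host):
        ssh(host, "nodetool", "drain")                      # flush memtables, stop accepting traffic
        ssh(host, "sudo", "systemctl", "restart", SERVICE)  # bounce the process
        time.sleep(30)                                      # give the JVM time to go down and come back
        while not ring_is_healthy(host):
            time.sleep(10)

    if __name__ == "__main__":
        for host in HOSTS:
            print(f"restarting {host}")
            restart_node(host)

Wrap that in whatever scheduling and change control your org requires and a rolling restart stops being an event.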
Jon

On Wed, Dec 18, 2024 at 9:27 AM Eric Evans <john.eric.ev...@gmail.com> wrote:

On Tue, Dec 17, 2024 at 2:37 PM Paul Chandler <p...@redshots.com> wrote:

It is a mixture of things, really. Firstly, it is a legacy issue: there have been performance problems during upgrades in the past. These have now been fixed, but it is not easy to regain trust in the process. Secondly, there are some very large clusters involved, 1300+ nodes across multiple physical datacenters; in this case, any upgrades are done only out of hours, and only one datacenter per day. So a normal upgrade cycle will take multiple weeks, and this one will take 3 times as long. This is a very large organisation with some very fixed rules and processes, so the Cassandra team does need to fit within these constraints, and we have limited ability to influence any changes.

I can second all of this. We (Wikimedia) have had more (major) upgrades go wrong in some way than right. Any significant upgrade is going to be weeks, if not months, in the making, with careful testing, a phased rollout, and a workable plan for rollback. We'd never entertain doing more than one at a time; it's just way too many moving parts.

But even forgetting these constraints, in a previous organisation (100+ clusters) which had very good automation for this sort of thing, I can still see this process taking 3 times as long to complete as a normal upgrade, and this does take up operators' time. I can see the advantages of the 3-stage process, and all things being equal I would recommend that process as being safer; however, I am getting a lot of pushback whenever we discuss the upgrade process.

Thanks

Paul

> On 17 Dec 2024, at 19:24, Jon Haddad <rustyrazorbl...@apache.org> wrote:
> Just curious, why is a rolling restart difficult? Is it a tooling issue, stability, or just overall fear of messing with things?
>
> You *should* be able to do a rolling restart without it being an issue. I look at this as a fundamental workflow that every C* operator should have available, and you should be able to do one without there being any concern.
>
> Jon
>
> On 2024/12/17 16:01:06 Paul Chandler wrote:
>> All,
>>
>> We are getting a lot of pushback on the 3-stage process of going through the three compatibility modes to upgrade to Cassandra 5. This basically means 3 rolling restarts of a cluster, which will be difficult for some of our large multi-DC clusters.
>>
>> Having researched this, it looks like, if you are not going to create large TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. This seems to be the same as it would have been going from 4.0 -> 4.1.
>>
>> Is there any reason why this should not be done? Has anyone had experience of upgrading in this way?

--
Eric Evans
john.eric.ev...@gmail.com