Yeah, the issue with the yaml being out of sync is the same as with any other JMX change, such as compaction throughput or threads. You'd have to deploy the config and apply the change via JMX; otherwise the node could come back up with the old setting after a restart and run into an issue.
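As a rough illustration of the yaml/JMX drift described above, here is a minimal Python sketch. It assumes you have already collected the on-disk settings (parsed from cassandra.yaml) and the live settings (read via JMX or nodetool) into plain dictionaries; the function name `find_drift` and the sample values are hypothetical and only there to show the comparison.

```python
from typing import Any, Dict, Tuple


def find_drift(yaml_settings: Dict[str, Any],
               runtime_settings: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {setting: (yaml_value, runtime_value)} for every setting whose
    live value differs from what cassandra.yaml would apply after a restart."""
    drift = {}
    for name, runtime_value in runtime_settings.items():
        yaml_value = yaml_settings.get(name)  # None if absent from the yaml
        if yaml_value != runtime_value:
            drift[name] = (yaml_value, runtime_value)
    return drift


# Illustrative values only: pretend compaction throughput was raised at
# runtime but the deployed yaml was never updated.
on_disk = {"compaction_throughput": "64MiB/s", "concurrent_compactors": 4}
live = {"compaction_throughput": "128MiB/s", "concurrent_compactors": 4}

for name, (yaml_value, runtime_value) in find_drift(on_disk, live).items():
    print(f"{name}: yaml={yaml_value!r} runtime={runtime_value!r} "
          "(a restart would revert this)")
```

Any setting this reports is one where a restart would silently undo the runtime change, which is the same risk a JMX-switchable compatibility mode would carry.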
I think there's probably room for improvement here in the future. I can't think of any reason why we need to do this at the config level and not at the cluster level, but I haven't thought about it too deeply.

Jon

On Wed, Dec 18, 2024 at 3:51 AM Paul Chandler <p...@redshots.com> wrote:

> The ability to move through the SCM via the nodetool would definitely help
> in this situation. I can see there being an issue if the cassandra.yaml is
> not changed, as the node could revert back to an older mode if the node is
> restarted.
>
> Would there be any other potential problems with exposing it like this?
>
> On 17 Dec 2024, at 22:12, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> > Secondly there are some very large clusters involved, 1300+ nodes across
> > multiple physical datacenters, in this case any upgrades are only done out
> > of hours and only one datacenter per day. So a normal upgrade cycle will
> > take multiple weeks, and this one will take 3 times as long.
>
> If you only restart one machine at a time, then yes, this will take a
> while. It's better with these environments to restart an entire rack at
> once. This should significantly cut down on the time it takes to restart a
> cluster. This is how all large orgs I've worked in roll out big changes.
>
> Regardless, it might be possible to make the compatibility mode something
> that can be changed without a restart, through JMX. While it would solve
> your immediate problem by avoiding it, I'd strive to solve the underlying
> problem that your org is running Cassandra with unnecessary limitations
> and practices that make your life harder.
>
> Jon
>
>
> On Tue, Dec 17, 2024 at 12:37 PM Paul Chandler <p...@redshots.com> wrote:
>
>> Hi Jon,
>>
>> It is a mixture of things really, firstly it is a legacy issue where
>> there have been performance problems in the past during upgrades, these
>> have now been fixed, but it is not easy to regain the trust in the process.
>>
>> Secondly there are some very large clusters involved, 1300+ nodes across
>> multiple physical datacenters, in this case any upgrades are only done out
>> of hours and only one datacenter per day. So a normal upgrade cycle will
>> take multiple weeks, and this one will take 3 times as long.
>>
>> This is a very large organisation with some very fixed rules and
>> processes, so the Cassandra team does need to fit within these constraints
>> and we have limited ability to influence any changes.
>>
>> But even forgetting these constraints, in a previous organisation (100+
>> clusters) which had very good automation for this sort of thing, I can
>> still see this process taking 3 times as long to complete as a normal
>> upgrade, and this does take up operators' time.
>>
>> I can see the advantages of the 3-stage process, and all things being equal I
>> would recommend that process as being safer, however I am getting a lot of
>> pushback whenever we discuss the upgrade process.
>>
>> Thanks
>>
>> Paul
>>
>> > On 17 Dec 2024, at 19:24, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>> >
>> > Just curious, why is a rolling restart difficult? Is it a tooling
>> > issue, stability, just overall fear of messing with things?
>> >
>> > You *should* be able to do a rolling restart without it being an
>> > issue. I look at this as a fundamental workflow that every C* operator
>> > should have available, and you should be able to do them without there
>> > being any concern.
>> >
>> > Jon
>> >
>> >
>> > On 2024/12/17 16:01:06 Paul Chandler wrote:
>> >> All,
>> >>
>> >> We are getting a lot of pushback on the 3-stage process of going
>> >> through the three compatibility modes to upgrade to Cassandra 5. This
>> >> basically means 3 rolling restarts of a cluster, which will be difficult
>> >> for some of our large multi DC clusters.
>> >>
>> >> Having researched this, it looks like, if you are not going to create
>> >> large TTLs, it would be possible to go straight from C*4 to C*5 with SCM
>> >> NONE. This seems to be the same as it would have been going from 4.0 -> 4.1
>> >>
>> >> Is there any reason why this should not be done? Has anyone had
>> >> experience of upgrading in this way?
>> >>
>> >> Thanks
>> >>
>> >> Paul Chandler