Yeah, the issue with the yaml being out of sync is consistent with any
other JMX change, such as compaction throughput / threads, etc.  You'd have
to deploy the config and apply the change via JMX; otherwise you'd risk
the node coming back up with the old setting after a restart.
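
As a rough illustration of that drift risk, here is a minimal Python sketch
that compares what is written in the yaml with what a node is running with.
The function name find_drift and both dictionaries are hypothetical, and the
values are made up.  How the runtime values get collected (nodetool, a JMX
client, etc.) is deliberately left out.

```python
def find_drift(yaml_settings, runtime_settings):
    """Return {name: (yaml_value, runtime_value)} for settings that differ.

    Both arguments are plain dicts; gathering them from a real node is
    outside the scope of this sketch.
    """
    drift = {}
    for name, runtime_value in runtime_settings.items():
        yaml_value = yaml_settings.get(name)  # None if missing from the yaml
        if yaml_value != runtime_value:
            drift[name] = (yaml_value, runtime_value)
    return drift


# Hypothetical example: throughput was raised at runtime but the yaml
# was never updated, so a restart would bring back the old value.
yaml_settings = {"compaction_throughput": "64MiB/s", "concurrent_compactors": 4}
runtime_settings = {"compaction_throughput": "128MiB/s", "concurrent_compactors": 4}

drift = find_drift(yaml_settings, runtime_settings)
for name, (on_disk, live) in drift.items():
    print(f"{name}: yaml={on_disk} runtime={live}")
```

Anything this reports is a setting that would silently change on the next
restart unless the yaml is deployed as well.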

I think there's probably room for improvement here in the future.  I can't
think of any reason why we need to do this at the config level and not at
the cluster level, but I haven't thought about it too deeply.

Jon



On Wed, Dec 18, 2024 at 3:51 AM Paul Chandler <p...@redshots.com> wrote:

> The ability to move through the SCM via the nodetool would definitely help
> in this situation. I can see there being an issue if the cassandra.yaml is
> not changed, as the node could revert to an older mode if the node is
> restarted.
>
> Would there be any other potential problems with exposing it like this?
>
> On 17 Dec 2024, at 22:12, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> > Secondly there are some very large clusters involved, 1300+ nodes across
> > multiple physical datacenters, in this case any upgrades are only done out
> > of hours and only one datacenter per day. So a normal upgrade cycle will
> > take multiple weeks, and this one will take 3 times as long.
>
> If you only restart one machine at a time, then yes, this will take a
> while.  It's better with these environments to restart an entire rack at
> once.  This should significantly cut down on the time it takes to restart a
> cluster.  This is how all large orgs I've worked in roll out big changes.
>
> Regardless, it might be possible to make the compatibility mode something
> that can be changed without a restart, through JMX.  While it would solve
> your immediate problem by avoiding it, I'd strive to solve the underlying
> problem that your org is running Cassandra with unnecessary limitations
> and practices that make your life harder.
>
> Jon
>
>
> On Tue, Dec 17, 2024 at 12:37 PM Paul Chandler <p...@redshots.com> wrote:
>
>> Hi Jon,
>>
>> It is a mixture of things really, firstly it is a legacy issue where
>> there have been performance problems in the past during upgrades, these
>> have now been fixed, but it is not easy to regain the trust in the process.
>>
>> Secondly there are some very large clusters involved, 1300+ nodes across
>> multiple physical datacenters, in this case any upgrades are only done out
>> of hours and only one datacenter per day. So a normal upgrade cycle will
>> take multiple weeks, and this one will take 3 times as long.
>>
>> This is a very large organisation with some very fixed rules and
>> processes, so the Cassandra team does need to fit within these constraints
>> and we have limited ability to influence any changes.
>>
>> But even forgetting these constraints, in a previous organisation ( 100+
>> clusters ) which had very good automation for this sort of thing, I can
>> still see this process taking 3 times as long to complete as a normal
>> upgrade, and this does take up operators' time.
>>
>> I can see the advantages of the 3 stage process, and all things being equal I
>> would recommend that process as being safer, however I am getting a lot of
>> push back whenever we discuss the upgrade process.
>>
>> Thanks
>>
>> Paul
>>
>> > On 17 Dec 2024, at 19:24, Jon Haddad <rustyrazorbl...@apache.org>
>> wrote:
>> >
>> > Just curious, why is a rolling restart difficult?  Is it a tooling
>> issue, stability, just overall fear of messing with things?
>> >
>> > You *should* be able to do a rolling restart without it being an
>> issue.  I look at this as a fundamental workflow that every C* operator
>> should have available, and you should be able to do them without there
>> being any concern.
>> >
>> > Jon
>> >
>> >
>> > On 2024/12/17 16:01:06 Paul Chandler wrote:
>> >> All,
>> >>
>> >> We are getting a lot of push back on the 3 stage process of going
>> through the three compatibility modes to upgrade to Cassandra 5. This
>> basically means 3 rolling restarts of a cluster, which will be difficult
>> for some of our large multi DC clusters.
>> >>
>> >> Having researched this, it looks like, if you are not going to create
>> large TTLs, it would be possible to go straight from C*4 to C*5 with SCM
>> NONE. This seems to be the same as it would have been going from 4.0 -> 4.1
>> >>
>> >> Is there any reason why this should not be done? Has anyone had
>> experience of upgrading in this way?
>> >>
>> >> Thanks
>> >>
>> >> Paul Chandler
>> >>
>> >>
>>
>>
>
