Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Paul Chandler
Hi Jon,

It is a mixture of things really. Firstly, it is a legacy issue: there have 
been performance problems in the past during upgrades. These have now been 
fixed, but it is not easy to regain trust in the process.

Secondly, there are some very large clusters involved: 1300+ nodes across 
multiple physical datacenters. In this case, any upgrades are only done out of 
hours, and only one datacenter per day, so a normal upgrade cycle will take 
multiple weeks, and this one will take 3 times as long.

This is a very large organisation with some very fixed rules and processes, so 
the Cassandra team does need to fit within these constraints, and we have 
limited ability to influence any changes. 

But even forgetting these constraints, in a previous organisation (100+ 
clusters) which had very good automation for this sort of thing, I can still 
see this process taking 3 times as long to complete as a normal upgrade, and 
this does take up operators' time. 

I can see the advantages of the 3 stage process, and all things being equal I 
would recommend that process as being safer; however, I am getting a lot of 
push back whenever we discuss the upgrade process.

Thanks 

Paul

> On 17 Dec 2024, at 19:24, Jon Haddad  wrote:
> 
> Just curious, why is a rolling restart difficult?  Is it a tooling issue, 
> stability, just overall fear of messing with things?
> 
> You *should* be able to do a rolling restart without it being an issue.  I 
> look at this as a fundamental workflow that every C* operator should have 
> available, and you should be able to do them without there being any concern. 
> 
> Jon
> 
> 
> On 2024/12/17 16:01:06 Paul Chandler wrote:
>> All,
>> 
>> We are getting a lot of push back on the 3 stage process of going through 
>> the three compatibility modes to upgrade to Cassandra 5. This basically 
>> means 3 rolling restarts of a cluster, which will be difficult for some of 
>> our large multi DC clusters.
>> 
>> Having researched this, it looks like, if you are not going to create large 
>> TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. 
>> This seems to be the same as it would have been going from 4.0 -> 4.1.
>> 
>> Is there any reason why this should not be done? Has anyone had experience 
>> of upgrading in this way?
>> 
>> Thanks 
>> 
>> Paul Chandler
>> 
>> 



Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread C. Scott Andreas

Hi Jeff,

Repair is not a prerequisite for upgrading from 3.x to 4.x (but it's always 
recommended to run as a continuous process). Repair is not supported between 
nodes running different major versions, so it should be disabled during the 
upgrade. There are quite a few fixes for hung repair sessions and performance 
improvements in 4.x as well. You may find that repair runs more smoothly for 
you after upgrading.
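
A rough way to confirm nothing is repairing before the first node is bounced 
(a sketch based on my recollection of nodetool output; adjust for your 
tooling):

    # on each node: no anti-entropy (repair) work active or pending
    nodetool tpstats | grep AntiEntropy    # active/pending should be 0
    # ...and no repair streams in flight
    nodetool netstats | grep -i repair     # should print nothing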

– Scott

> On Dec 17, 2024, at 4:30 PM, Jeff Masud wrote:
> We have similar issues with 3.x repairs, and run manually as well as with
> Reaper. Can someone tell me, if I cannot get a table repaired because it is
> locking up a node, is it still possible to upgrade to 4.0?

Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Jeff Masud
We have similar issues with 3.x repairs, and run manually as well as with 
Reaper.  Can someone tell me, if I cannot get a table repaired because it is 
locking up a node, is it still possible to upgrade to 4.0?  
 
Jeff
 
 
From: Jon Haddad 
Reply-To: 
Date: Tuesday, December 17, 2024 at 2:20 PM
To: 
Subject: Re: Cassandra 5 Upgrade - Storage Compatibility Modes
 
I strongly suggest moving to 4.0 and to set up Reaper.  Managing repairs 
yourself is a waste of time, and you're almost certainly not doing it 
optimally.  
 
Jon
 
On Tue, Dec 17, 2024 at 12:40 PM Miguel Santos-Lopez  wrote:
We haven’t had the chance to upgrade to 4, let alone 5. Has there been a big 
change wrt repairs since the old days of 3.11? :-)
 
In my experience the problems have been on the one hand a performance & latency 
hit, and on the other a lack of flexibility in the tooling: often I had repairs 
failing, and the only option I know of using plain nodetool is to restart the 
repair again. I ended up wrapping the call to nodetool in a bash script 
allowing only selected keyspaces and tables to be repaired. In this way I get a 
clear picture of what failed and can then do a reliable “resume” with very 
little extra effort.
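
Roughly, the wrapper amounts to the following (a simplified sketch from 
memory; the file names are made up):

    #!/usr/bin/env bash
    # Repair an explicit list of keyspace/table pairs one at a time,
    # recording successes so a failed run can be resumed where it stopped.
    DONE=/var/tmp/repair.done
    touch "$DONE"
    while read -r ks tbl; do
      grep -q "^$ks $tbl$" "$DONE" && continue   # already repaired: skip on resume
      if nodetool repair -pr "$ks" "$tbl"; then
        echo "$ks $tbl" >> "$DONE"
      else
        echo "repair failed: $ks.$tbl" >&2
        exit 1
      fi
    done < tables_to_repair.txt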
 
I would also add the time it takes. Afaik you don’t want to run more than two 
repairs at the same time. Depending on the load and number of nodes it easily 
becomes a tedious task.
 
My view might well be biased by running that old version on a less than optimal 
cluster (improved only a couple of weeks ago), so I still have to see how it 
translates to repairs. 
 
 
 
Miguel A. Santos
Senior Platform Engineer
 
e mlo...@ims.tech
w ims.tech
t +1 226 339 8357
 
 
 


 
 


From: Josh McKenzie 
Sent: Tuesday, December 17, 2024 3:11:06 PM
To: user@cassandra.apache.org 
Subject: Re: Cassandra 5 Upgrade - Storage Compatibility Modes 
 
It's kind of a shame we don't have rolling restart functionality built into 
the database / sidecar. I know we've discussed that in the past.

+1 to Jon's question - clients (e.g. java driver, etc) should be able to handle 
disconnects gracefully and route to other coordinators, with the only 
application-facing symptom being a blip in latency. Are you seeing something 
else more painful, or is it more just not having the built-in tooling / 
instrumentation to make it a clean reproducible operation?
 
On Tue, Dec 17, 2024, at 2:24 PM, Jon Haddad wrote:
Just curious, why is a rolling restart difficult?  Is it a tooling issue, 
stability, just overall fear of messing with things?
 
You *should* be able to do a rolling restart without it being an issue.  I look 
at this as a fundamental workflow that every C* operator should have available, 
and you should be able to do them without there being any concern. 
 
Jon
 
 
On 2024/12/17 16:01:06 Paul Chandler wrote:
> All,
> 
> We are getting a lot of push back on the 3 stage process of going through the 
> three compatibility modes to upgrade to Cassandra 5. This basically means 3 
> rolling restarts of a cluster, which will be difficult for some of our large 
> multi DC clusters.
> 
> Having researched this, it looks like, if you are not going to create large 
> TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. 
> This seems to be the same as it would have been going from 4.0 -> 4.1.
> 
> Is there any reason why this should not be done? Has anyone had experience of 
> upgrading in this way?
> 
> Thanks 
> 
> Paul Chandler
> 
>

Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Josh McKenzie
It's kind of a shame we don't have rolling restart functionality built into 
the database / sidecar. I know we've discussed that in the past.

+1 to Jon's question - clients (e.g. java driver, etc) should be able to handle 
disconnects gracefully and route to other coordinators, with the only 
application-facing symptom being a blip in latency. Are you seeing something 
else more painful, or is it more just not having the built-in tooling / 
instrumentation to make it a clean reproducible operation?
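
With the 4.x Java driver most of that is configuration; a sketch of the 
relevant knobs in the driver's application.conf (parameter names from memory, 
so double-check against the driver docs):

    # datastax-java-driver application.conf (illustrative values)
    datastax-java-driver {
      advanced.reconnection-policy {
        class = ExponentialReconnectionPolicy   # back off, then reconnect
        base-delay = 1 second
        max-delay = 60 seconds
      }
      # retries pick a different coordinator when the current one drops
      advanced.retry-policy.class = DefaultRetryPolicy
    }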

On Tue, Dec 17, 2024, at 2:24 PM, Jon Haddad wrote:
> Just curious, why is a rolling restart difficult?  Is it a tooling issue, 
> stability, just overall fear of messing with things?
> 
> You *should* be able to do a rolling restart without it being an issue.  I 
> look at this as a fundamental workflow that every C* operator should have 
> available, and you should be able to do them without there being any concern. 
> 
> Jon
> 
> 
> On 2024/12/17 16:01:06 Paul Chandler wrote:
> > All,
> > 
> > We are getting a lot of push back on the 3 stage process of going through 
> > the three compatibility modes to upgrade to Cassandra 5. This basically 
> > means 3 rolling restarts of a cluster, which will be difficult for some of 
> > our large multi DC clusters.
> > 
> > Having researched this, it looks like, if you are not going to create large 
> > TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. 
> > This seems to be the same as it would have been going from 4.0 -> 4.1.
> > 
> > Is there any reason why this should not be done? Has anyone had experience 
> > of upgrading in this way?
> > 
> > Thanks 
> > 
> > Paul Chandler
> > 
> >  
> 


Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Paul Chandler
All,

We are getting a lot of push back on the 3 stage process of going through the 
three compatibility modes to upgrade to Cassandra 5. This basically means 3 
rolling restarts of a cluster, which will be difficult for some of our large 
multi DC clusters.

Having researched this, it looks like, if you are not going to create large 
TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. This 
seems to be the same as it would have been going from 4.0 -> 4.1.
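
For reference, my understanding of the three stages is roughly the following 
(one rolling restart per stage; the setting and values are as I read them in 
the 5.0 cassandra.yaml, so treat this as a sketch):

    # cassandra.yaml, edited before each of the three rolling restarts:
    storage_compatibility_mode: CASSANDRA_4   # stage 1: 5.0 binaries, 4.x-compatible storage
    storage_compatibility_mode: UPGRADING     # stage 2: once every node is running 5.0
    storage_compatibility_mode: NONE          # stage 3: full 5.0 behaviour (e.g. TTLs past 2038)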

Is there any reason why this should not be done? Has anyone had experience of 
upgrading in this way?

Thanks 

Paul Chandler

 

Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Jon Haddad
Just curious, why is a rolling restart difficult?  Is it a tooling issue, 
stability, just overall fear of messing with things?

You *should* be able to do a rolling restart without it being an issue.  I look 
at this as a fundamental workflow that every C* operator should have available, 
and you should be able to do them without there being any concern. 
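
The basic loop I have in mind is something like this (a sketch; assumes SSH 
access and systemd, with a made-up host list file):

    # one node at a time: drain, restart, wait for Up/Normal, move on
    for host in $(cat cluster-hosts.txt); do
      ssh "$host" 'nodetool drain && sudo systemctl restart cassandra'
      # block until the node reports UN in nodetool status again
      until nodetool status | awk -v h="$host" '$1 == "UN" && $2 == h' | grep -q .; do
        sleep 10
      done
    done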

Jon


On 2024/12/17 16:01:06 Paul Chandler wrote:
> All,
> 
> We are getting a lot of push back on the 3 stage process of going through the 
> three compatibility modes to upgrade to Cassandra 5. This basically means 3 
> rolling restarts of a cluster, which will be difficult for some of our large 
> multi DC clusters.
> 
> Having researched this, it looks like, if you are not going to create large 
> TTLs, it would be possible to go straight from C*4 to C*5 with SCM NONE. 
> This seems to be the same as it would have been going from 4.0 -> 4.1.
> 
> Is there any reason why this should not be done? Has anyone had experience of 
> upgrading in this way?
> 
> Thanks 
> 
> Paul Chandler
> 
>  


Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Jon Haddad
> Secondly, there are some very large clusters involved: 1300+ nodes across
> multiple physical datacenters. In this case, any upgrades are only done out
> of hours, and only one datacenter per day, so a normal upgrade cycle will
> take multiple weeks, and this one will take 3 times as long.

If you only restart one machine at a time, then yes, this will take a
while.  It's better in these environments to restart an entire rack at
once.  This should significantly cut down on the time it takes to restart a
cluster.  This is how all large orgs I've worked in roll out big changes.
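
Something along these lines (a sketch; hypothetical host-list files, and it
assumes rack-aware replication with RF=3, so the other racks still serve
quorum while one rack bounces):

    # restart a whole rack in parallel, then verify before the next rack
    for host in $(cat rack-a-hosts.txt); do
      ssh "$host" 'nodetool drain && sudo systemctl restart cassandra' &
    done
    wait
    nodetool status   # confirm every node is back UN before starting rack-b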

Regardless, it might be possible to make the compatibility mode something
that can be changed without a restart, through JMX.  While that would
sidestep your immediate problem, I'd strive to solve the underlying one:
your org is running Cassandra with unnecessarily limiting practices that
make your life harder.

Jon


On Tue, Dec 17, 2024 at 12:37 PM Paul Chandler  wrote:

> Hi Jon,
>
> It is a mixture of things really. Firstly, it is a legacy issue: there have
> been performance problems in the past during upgrades. These have now been
> fixed, but it is not easy to regain trust in the process.
>
> Secondly, there are some very large clusters involved: 1300+ nodes across
> multiple physical datacenters. In this case, any upgrades are only done out
> of hours, and only one datacenter per day, so a normal upgrade cycle will
> take multiple weeks, and this one will take 3 times as long.
>
> This is a very large organisation with some very fixed rules and
> processes, so the Cassandra team does need to fit within these constraints,
> and we have limited ability to influence any changes.
>
> But even forgetting these constraints, in a previous organisation (100+
> clusters) which had very good automation for this sort of thing, I can
> still see this process taking 3 times as long to complete as a normal
> upgrade, and this does take up operators' time.
>
> I can see the advantages of the 3 stage process, and all things being equal
> I would recommend that process as being safer; however, I am getting a lot
> of push back whenever we discuss the upgrade process.
>
> Thanks
>
> Paul
>
> > On 17 Dec 2024, at 19:24, Jon Haddad  wrote:
> >
> > Just curious, why is a rolling restart difficult?  Is it a tooling
> > issue, stability, just overall fear of messing with things?
> >
> > You *should* be able to do a rolling restart without it being an issue.
> > I look at this as a fundamental workflow that every C* operator should
> > have available, and you should be able to do them without there being
> > any concern.
> >
> > Jon
> >
> >
> > On 2024/12/17 16:01:06 Paul Chandler wrote:
> >> All,
> >>
> >> We are getting a lot of push back on the 3 stage process of going
> >> through the three compatibility modes to upgrade to Cassandra 5. This
> >> basically means 3 rolling restarts of a cluster, which will be difficult
> >> for some of our large multi DC clusters.
> >>
> >> Having researched this, it looks like, if you are not going to create
> >> large TTLs, it would be possible to go straight from C*4 to C*5 with SCM
> >> NONE. This seems to be the same as it would have been going from 4.0 -> 4.1.
> >>
> >> Is there any reason why this should not be done? Has anyone had
> >> experience of upgrading in this way?
> >>
> >> Thanks
> >>
> >> Paul Chandler
> >>
> >>
>
>


Re: Cassandra 5 Upgrade - Storage Compatibility Modes

2024-12-17 Thread Jon Haddad
I strongly suggest moving to 4.0 and to set up Reaper.  Managing repairs
yourself is a waste of time, and you're almost certainly not doing it
optimally.

Jon

On Tue, Dec 17, 2024 at 12:40 PM Miguel Santos-Lopez 
wrote:

> We haven’t had the chance to upgrade to 4, let alone 5. Has there been a
> big change wrt repairs since the old days of 3.11? :-)
>
> In my experience the problems have been on the one hand a performance &
> latency hit, and on the other a lack of flexibility in the tooling: often I
> had repairs failing, and the only option I know of using plain nodetool is
> to restart the repair again. I ended up wrapping the call to nodetool in a
> bash script allowing only selected keyspaces and tables to be repaired.
> In this way I get a clear picture of what failed and can then do a
> reliable “resume” with very little extra effort.
>
> I would also add the time it takes. Afaik you don’t want to run more than
> two repairs at the same time. Depending on the load and number of nodes
> it easily becomes a tedious task.
>
> My view might well be biased by running that old version on a less than
> optimal cluster (improved only a couple of weeks ago), so I still have to
> see how it translates to repairs.
>
>
>
> *Miguel A. Santos*
>
> *Senior Platform Engineer*
>
>
>
> *e* mlo...@ims.tech 
> *w* ims.tech 
>
> *t* +1 226 339 8357
>
>
>
> --
> *From:* Josh McKenzie 
> *Sent:* Tuesday, December 17, 2024 3:11:06 PM
> *To:* user@cassandra.apache.org 
> *Subject:* Re: Cassandra 5 Upgrade - Storage Compatibility Modes
>
> It's kind of a shame we don't have rolling restart functionality built into
> the database / sidecar. I know we've discussed that in the past.
>
> +1 to Jon's question - clients (e.g. java driver, etc) should be able to
> handle disconnects gracefully and route to other coordinators, with the
> only application-facing symptom being a blip in latency. Are you seeing
> something else more painful, or is it more just not having the built-in
> tooling / instrumentation to make it a clean reproducible operation?
>
> On Tue, Dec 17, 2024, at 2:24 PM, Jon Haddad wrote:
>
> Just curious, why is a rolling restart difficult?  Is it a tooling issue,
> stability, just overall fear of messing with things?
>
> You *should* be able to do a rolling restart without it being an issue.  I
> look at this as a fundamental workflow that every C* operator should have
> available, and you should be able to do them without there being any
> concern.
>
> Jon
>
>
> On 2024/12/17 16:01:06 Paul Chandler wrote:
> > All,
> >
> > We are getting a lot of push back on the 3 stage process of going
> > through the three compatibility modes to upgrade to Cassandra 5. This
> > basically means 3 rolling restarts of a cluster, which will be difficult
> > for some of our large multi DC clusters.
> >
> > Having researched this, it looks like, if you are not going to create
> > large TTLs, it would be possible to go straight from C*4 to C*5 with SCM
> > NONE. This seems to be the same as it would have been going from 4.0 -> 4.1.
> >
> > Is there any reason why this should not be done? Has anyone had
> > experience of upgrading in this way?
> >
> > Thanks
> >
> > Paul Chandler
> >
> >
>
>
>