OK, it seems I didn’t explain it too well, but yes, it is the three rolling restarts required as part of the upgrade that are causing the push-back. My message was a bit vague about the use cases because there are confidentiality agreements in place, so I can’t share too much.
We have had problems in the past with rolling restarts. This was on the very large cluster, and from what I remember, when a node restarted it was under huge load for a while, due to the large number of gossip messages accumulated from all the other nodes. At the same time there were a large number of clients trying to reconnect, and the bcrypt (struggling to remember if that is the correct name) hashing was taking a lot of processing, which meant the first clients to connect were seeing very high latencies while the rest of the connections were processed. This is all old history and has been fixed, so it is not really what the question was about; however, those old problems have a bad legacy in the memory of the people that matter. Hence the push-back we have now.

I would like to thank Jeff for pointing out that my pain could be legitimate, and to thank everyone else for answering too.

I have tried Jon’s suggestion of accessing the storage compatibility mode through JMX/nodetool, and I managed to get the setting changed while the node was up, removing the need for a restart. However, the sstable format is only configured at startup, so the node continues to write nb* sstables. This is not really a subject for this list, so I will follow Scott’s suggestion and start a thread on the dev list to discuss it. I will do that tomorrow.

I think the answer to my original question is that no, nobody has gone straight from C* 4 to C* 5 (none), and it is not recommended.

Thanks everyone
Paul

> On 18 Dec 2024, at 18:45, Eric Evans <john.eric.ev...@gmail.com> wrote:
>
> On Wed, Dec 18, 2024 at 12:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
> I think we're talking about different things.
>
> > Yes, and Paul clarified that it wasn't (just) an issue of having to do rolling restarts, but the work involved in doing an upgrade. Were it only the case that the hardest part of doing an upgrade was the rolling restart...
> > From several messages ago:
> >
> > > This basically means 3 rolling restarts of a cluster, which will be difficult for some of our large multi DC clusters.
> >
> > The discussion was specifically about rolling restarts and how storage compatibility mode requires them, which in this environment was described as difficult. The difficulty of the rest of the process is irrelevant here, because it's the same regardless of how you approach storage compatibility mode. My point is that rolling restarts should not be difficult if you have the right automation, which you seem to agree with.
> >
> > Want to discuss the difficulty of upgrading in general? I'm all for improving it. It's just not what this thread is about.
>
> You're right, I'm at least partly conflating other (recent) dev threads about upgrade trajectories, sorry about that. It still reads to me though as an issue of change management (vis-a-vis what's happening that has us restarting) versus the mechanics of rolling restarts, and that was what I was alluding to. If it is strictly about rolling restart logistics, I am a) surprised (I didn't know this was a problem for anyone), and b) will sit quietly now and try to understand why that is. :)
>
> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com> wrote:
>
> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
> > We (Wikimedia) have had more (major) upgrades go wrong in some way, than right. Any significant upgrade is going to be weeks —if not months— in the making, with careful testing, a phased rollout, and a workable plan for rollback. We'd never entertain doing more than one at a time, it's just way too many moving parts.
>
> The question wasn't about why upgrades are hard, it was about why a rolling restart of the cluster is hard. They're different things.
>
> Yes, and Paul clarified that it wasn't (just) an issue of having to do rolling restarts, but the work involved in doing an upgrade. Were it only the case that the hardest part of doing an upgrade was the rolling restart...
>
> --
> Eric Evans
> john.eric.ev...@gmail.com