On Wed, Dec 18, 2024 at 12:26 PM Jeff Jirsa <jji...@gmail.com> wrote:
> I think this is one of those cases where if someone tells us they’re
> feeling pain, instead of telling them it shouldn’t be painful, we try to
> learn a bit more about the pain.
>
> For example, both you and Scott expressed surprise at the concern of
> rolling restarts (you repeatedly; Scott mentioned that repair isn’t
> required for upgrade - or restart), but I could see how teams consider it
> required, especially if they're doing low consistency writes with lots of
> DCs and relying on repair-after-bounce for visibility (because hints may
> not keep up before they time out, for example), so it kinda makes sense
> that people are seeing this as a huge endeavor.
>
> So, to borrow from Ted Lasso, can we be curious not judgmental (
> https://www.youtube.com/watch?v=5x0PzUoJS-U )?
>
> WikiMedia folks - has the rate of major-upgrades-going-wrong been steady
> the whole time? Anecdotally, I thought they got much better around 4.0?

Our 3.11 -> 4.1 upgrade encountered CASSANDRA-18559
<https://issues.apache.org/jira/browse/CASSANDRA-18559> (still open). The
test cluster we used to vet the release has only a single data center, so
this wasn't something we saw until we moved to production. It wasn't
disruptive (it didn't bring down a cluster or cause data loss), and the
work-around was ultimately simple (we set internode_encryption: all), but
it cost us a few days. Strictly speaking it probably *was* the easiest
major upgrade we've had, but anxiety still runs high.
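In case it saves anyone else the digging: the work-around amounts to
flipping internode_encryption in the server_encryption_options block of
cassandra.yaml. A rough sketch (the keystore/truststore lines below are
placeholders, not our actual settings):

    # cassandra.yaml -- encrypt all internode connections, not just cross-DC
    server_encryption_options:
        internode_encryption: all   # valid values: none, dc, rack, all
        keystore: conf/.keystore            # placeholder path
        keystore_password: changeit         # placeholder
        truststore: conf/.truststore        # placeholder path
        truststore_password: changeit       # placeholder

IIRC this is only read at node startup, so applying it is a rolling
restart of its own.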
> For teams that are hesitant to do rolling restarts, is there a specific
> part of the restart or upgrade you consider high risk or high effort?
> Beyond corporate / enterprise change policy / change windows (which I
> get), is there something that has caused pain in the past that you’re
> optimizing around now?
>
>
> On Dec 18, 2024, at 10:12 AM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I think we're talking about different things.
>
>
> Yes, and Paul clarified that it wasn't (just) an issue of having to do
> rolling restarts, but the work involved in doing an upgrade. Were it only
> the case that the hardest part of doing an upgrade was the rolling
> restart...
>
> From several messages ago:
>
>
> This basically means 3 rolling restarts of a cluster, which will be
> difficult for some of our large multi DC clusters.
>
> The discussion was specifically about rolling restarts and how storage
> compatibility mode requires them, which in this environment was described
> as difficult. The difficulty of the rest of the process is irrelevant
> here, because it's the same regardless of how you approach storage
> compatibility mode. My point is that rolling restarts should not be
> difficult if you have the right automation, which you seem to agree with.
>
> Want to discuss the difficulty of upgrading in general? I'm all for
> improving it. It's just not what this thread is about.
>
> Jon
>
>
> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com>
> wrote:
>
>> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>>> > We (Wikimedia) have had more (major) upgrades go wrong in some way,
>>> than right. Any significant upgrade is going to be weeks —if not months—
>>> in the making, with careful testing, a phased rollout, and a workable
>>> plan for rollback. We'd never entertain doing more than one at a time,
>>> it's just way too many moving parts.
>>>
>>> The question wasn't about why upgrades are hard, it was about why a
>>> rolling restart of the cluster is hard. They're different things.
>>>
>>
>> Yes, and Paul clarified that it wasn't (just) an issue of having to do
>> rolling restarts, but the work involved in doing an upgrade. Were it only
>> the case that the hardest part of doing an upgrade was the rolling
>> restart...
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
>>

--
Eric Evans
john.eric.ev...@gmail.com