I think this is one of those cases where, if someone tells us they’re feeling pain, instead of telling them it shouldn’t be painful, we should try to learn a bit more about the pain.
For example, both you and Scott expressed surprise at the concern over rolling restarts (you repeatedly; Scott mentioned that repair isn’t required for upgrade, or restart). But I could see how teams consider it required, especially if they're doing low-consistency writes across lots of DCs and relying on repair-after-bounce for visibility (because hints may not keep up before they time out, for example), so it kinda makes sense that people are seeing this as a huge endeavor.

So, to borrow from Ted Lasso, can we be curious, not judgmental ( https://www.youtube.com/watch?v=5x0PzUoJS-U )?

Wikimedia folks - has the rate of major-upgrades-going-wrong been steady the whole time? Anecdotally, I thought things got much better around 4.0?

For teams that are hesitant to do rolling restarts, is there a specific part of the restart or upgrade you consider high risk or high effort? Beyond corporate / enterprise change policy / change windows (which I get), is there something that has caused pain in the past that you’re optimizing around now?

> On Dec 18, 2024, at 10:12 AM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I think we're talking about different things.
>
> > Yes, and Paul clarified that it wasn't (just) an issue of having to do
> > rolling restarts, but the work involved in doing an upgrade. Were it only
> > the case that the hardest part of doing an upgrade was the rolling
> > restart...
>
> From several messages ago:
>
> > This basically means 3 rolling restarts of a cluster, which will be
> > difficult for some of our large multi DC clusters.
>
> The discussion was specifically about rolling restarts and how storage
> compatibility mode requires them, which in this environment was described as
> difficult. The difficulty of the rest of the process is irrelevant here,
> because it's the same regardless of how you approach storage compatibility
> mode. My point is that rolling restarts should not be difficult if you have
> the right automation, which you seem to agree with.
>
> Want to discuss the difficulty of upgrading in general? I'm all for
> improving it. It's just not what this thread is about.
>
> Jon
>
> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com> wrote:
>>
>> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> > We (Wikimedia) have had more (major) upgrades go wrong in some way, than
>>> > right. Any significant upgrade is going to be weeks —if not months— in
>>> > the making, with careful testing, a phased rollout, and a workable plan
>>> > for rollback. We'd never entertain doing more than one at a time, it's
>>> > just way too many moving parts.
>>>
>>> The question wasn't about why upgrades are hard, it was about why a rolling
>>> restart of the cluster is hard. They're different things.
>>
>> Yes, and Paul clarified that it wasn't (just) an issue of having to do
>> rolling restarts, but the work involved in doing an upgrade. Were it only
>> the case that the hardest part of doing an upgrade was the rolling restart...
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
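
P.S. On Jon's automation point: a basic rolling-restart loop really can be small. Below is a rough sketch that bounces one node at a time, draining before the restart and waiting for everything to come back Up/Normal before moving on. The hostnames, SSH setup, and `cassandra` systemd unit name are placeholders rather than anything described in this thread, and the hint-window check just assumes the default three hours.

```python
#!/usr/bin/env python3
"""Rough sketch of a one-node-at-a-time rolling restart.

Assumptions (not from this thread): nodes are reachable over SSH with
passwordless sudo, Cassandra runs as the `cassandra` systemd unit, and
`nodetool` is on the PATH on each host.
"""
import subprocess
import time

NODES = ["cass-dc1-a", "cass-dc1-b", "cass-dc2-a"]   # hypothetical hostnames
HINT_WINDOW_SECONDS = 3 * 60 * 60                    # default 3h hint window


def ssh(host: str, command: str) -> str:
    """Run a command on a node over SSH and return its stdout."""
    result = subprocess.run(
        ["ssh", host, command], check=True, capture_output=True, text=True
    )
    return result.stdout


def cluster_looks_healthy(host: str) -> bool:
    """True if `nodetool status`, as seen from `host`, shows every node Up/Normal."""
    try:
        status = ssh(host, "nodetool status")
    except subprocess.CalledProcessError:
        return False  # JMX not up yet; the node is still starting
    node_lines = [
        line for line in status.splitlines()
        if line[:2] in ("UN", "DN", "UJ", "DJ", "UL", "DL", "UM", "DM")
    ]
    return bool(node_lines) and all(line.startswith("UN") for line in node_lines)


def rolling_restart() -> None:
    for host in NODES:
        started = time.time()
        ssh(host, "nodetool drain")                    # flush memtables, stop taking traffic
        ssh(host, "sudo systemctl restart cassandra")  # bounce the process
        while not cluster_looks_healthy(host):
            if time.time() - started > 900:
                raise TimeoutError(f"{host} did not come back Up/Normal in 15 minutes")
            time.sleep(10)
        downtime = time.time() - started
        if downtime > HINT_WINDOW_SECONDS:
            # Past the hint window, hints for this node are dropped, so only
            # repair (or read repair) will restore missed low-CL writes.
            print(f"WARNING: {host} exceeded the hint window ({downtime:.0f}s down)")


if __name__ == "__main__":
    rolling_restart()
```

The hint-window warning at the end is the part that speaks to the repair-after-bounce concern above: as long as each node's downtime stays well inside that window, hints should cover the low-consistency writes it missed.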