I think this is one of those cases where if someone tells us they’re feeling 
pain, instead of telling them it shouldn’t be painful, we try to learn a bit 
more about the pain.

For example, both you and Scott expressed surprise at the concern about rolling 
restarts (you repeatedly; Scott mentioned that repair isn't required for an 
upgrade or a restart). But I can see how teams would consider it required, 
especially if they're doing low-consistency writes across lots of DCs and 
relying on repair-after-bounce for visibility (because hints may not keep up 
before they time out, for example), so it makes sense that people see this as 
a huge endeavor.
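
To make the hint concern concrete, here's a rough back-of-envelope in Python. 
The write rate and per-node downtime are made-up numbers purely for 
illustration; the only real values are the cassandra.yaml defaults called out 
in the comments:

    # Sketch: can hint replay catch a bounced node up on its own, or does the
    # team effectively need a repair afterwards for read visibility?
    # The workload numbers are assumptions, not measurements from any real cluster.

    missed_write_rate_kb_s = 5_000   # assumed write volume a down node misses
    downtime_minutes = 20            # assumed drain + upgrade + restart per node

    hint_backlog_kb = missed_write_rate_kb_s * downtime_minutes * 60

    # cassandra.yaml defaults: hinted_handoff_throttle_in_kb = 1024 (per delivery
    # thread, throttled further as the cluster grows) and
    # max_hints_delivery_threads = 2, so very roughly:
    replay_rate_kb_s = 1024 * 2

    catchup_hours = hint_backlog_kb / replay_rate_kb_s / 3600
    print(f"~{hint_backlog_kb / 1e6:.1f} GB of hints, ~{catchup_hours:.1f}h to replay")

Multiply that by a few hundred nodes across several DCs, and remember that 
coordinators stop generating hints at all for a node that's been down longer 
than max_hint_window_in_ms (3 hours by default). If replay lags, or hints were 
never written, low-consistency reads can miss those writes until a repair runs, 
which is presumably why some teams treat repair-after-bounce as a required part 
of the upgrade.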

So, to borrow from Ted Lasso, can we be curious, not judgmental 
(https://www.youtube.com/watch?v=5x0PzUoJS-U)?

Wikimedia folks - has the rate of major-upgrades-going-wrong been steady the 
whole time? Anecdotally, I thought they got much better around 4.0?

For teams that are hesitant to do rolling restarts, is there a specific part of 
the restart or upgrade you consider high risk or high effort? Beyond corporate 
/ enterprise change policy / change windows (which I get), is there something 
that has caused pain in the past that you’re optimizing around now? 



> On Dec 18, 2024, at 10:12 AM, Jon Haddad <j...@rustyrazorblade.com> wrote:
> 
> I think we're talking about different things. 
> 
> >  Yes, and Paul clarified that it wasn't (just) an issue of having to do 
> > rolling restarts, but the work involved in doing an upgrade.  Were it only 
> > the case that the hardest part of doing an upgrade was the rolling 
> > restart...
> 
> From several messages ago:
> 
> > This basically means 3 rolling restarts of a cluster, which will be 
> > difficult for some of our large multi DC clusters.
> 
> The discussion was specifically about rolling restarts and how storage 
> compatibility mode requires them, which in this environment was described as 
> difficult.  The difficulty of the rest of the process is irrelevant here, 
> because it's the same regardless of how you approach storage compatibility 
> mode.  My point is that rolling restarts should not be difficult if you have 
> the right automation, which you seem to agree with.
> 
> Want to discuss the difficulty of upgrading in general?  I'm all for 
> improving it.  It's just not what this thread is about.
> 
> Jon
> 
> 
> 
> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com> wrote:
>> 
>> 
>> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com> wrote:
>>> > We (Wikimedia) have had more (major) upgrades go wrong in some way, than 
>>> > right.  Any significant upgrade is going to be weeks —if not months— in 
>>> > the making, with careful testing, a phased rollout, and a workable plan 
>>> > for rollback.  We'd never entertain doing more than one at a time, it's 
>>> > just way too many moving parts.
>>> 
>>> The question wasn't about why upgrades are hard, it was about why a rolling 
>>> restart of the cluster is hard.  They're different things.
>> 
>> Yes, and Paul clarified that it wasn't (just) an issue of having to do 
>> rolling restarts, but the work involved in doing an upgrade.  Were it only 
>> the case that the hardest part of doing an upgrade was the rolling restart...
>> 
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
