On Wed, Dec 18, 2024 at 12:12 PM Jon Haddad <j...@rustyrazorblade.com> wrote:
> I think we're talking about different things.
>
> > Yes, and Paul clarified that it wasn't (just) an issue of having to do
> > rolling restarts, but the work involved in doing an upgrade. Were it only
> > the case that the hardest part of doing an upgrade was the rolling
> > restart...
>
> From several messages ago:
>
> > This basically means 3 rolling restarts of a cluster, which will be
> > difficult for some of our large multi DC clusters.
>
> The discussion was specifically about rolling restarts and how storage
> compatibility mode requires them, which in this environment was described
> as difficult. The difficulty of the rest of the process is irrelevant here,
> because it's the same regardless of how you approach storage compatibility
> mode. My point is that rolling restarts should not be difficult if you
> have the right automation, which you seem to agree with.
>
> Want to discuss the difficulty of upgrading in general? I'm all for
> improving it. It's just not what this thread is about.
>

You're right, I'm at least partly conflating other (recent) dev threads
about upgrade trajectories, sorry about that. It still reads to me though
as an issue of change management (vis-a-vis what's happening that has us
restarting) versus the mechanics of rolling restarts, and that was what I
was alluding to. If it *is* strictly about rolling restart logistics, I am
a) surprised (I didn't know this was a problem for anyone), and b) will sit
quietly now and try to understand why that is. :)

> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com>
> wrote:
>
>>
>>
>> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>>> > We (Wikimedia) have had more (major) upgrades go wrong in some way,
>>> > than right. Any significant upgrade is going to be weeks —if not months—
>>> > in the making, with careful testing, a phased rollout, and a workable plan
>>> > for rollback. We'd never entertain doing more than one at a time, it's
>>> > just way too many moving parts.
>>>
>>> The question wasn't about why upgrades are hard, it was about why a
>>> rolling restart of the cluster is hard. They're different things.
>>>
>>
>> Yes, and Paul clarified that it wasn't (just) an issue of having to do
>> rolling restarts, but the work involved in doing an upgrade. Were it only
>> the case that the hardest part of doing an upgrade was the rolling
>> restart...
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
>>
>

--
Eric Evans
john.eric.ev...@gmail.com