On Wed, Dec 18, 2024 at 12:26 PM Jeff Jirsa <jji...@gmail.com> wrote:

> I think this is one of those cases where if someone tells us they’re
> feeling pain, instead of telling them it shouldn’t be painful, we try to
> learn a bit more about the pain.
>
> For example, both you and Scott expressed surprise at the concern about
> rolling restarts (you repeatedly; Scott mentioned that repair isn't
> required for upgrade, or restart).  But I could see how teams consider it
> required, especially if they're doing low-consistency writes with lots of
> DCs and relying on repair-after-bounce for visibility (because hints may
> not keep up before they time out, for example), so it kinda makes sense
> that people see this as a huge endeavor.
>
> So, to borrow from Ted Lasso, can we be curious not judgmental (
> https://www.youtube.com/watch?v=5x0PzUoJS-U ) ?
>
> WikiMedia folks - has the rate of major-upgrades-going-wrong been steady
> the whole time? Anecdotally, I thought they got much better around 4.0?
>

Our 3.11 -> 4.1 upgrade encountered CASSANDRA-18559
<https://issues.apache.org/jira/browse/CASSANDRA-18559> (still open).  The
test cluster we used to vet the release has only a single data center, so
this wasn't something we saw until we moved to production.  It wasn't
disruptive (it didn't bring down the cluster or cause data loss), and the
work-around was ultimately simple (we set internode_encryption: all), but
it cost us a few days.  Strictly speaking it probably *was* the easiest
major upgrade we've had, but anxiety still runs high.
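
For anyone who hits the same issue, the workaround amounts to one line in
cassandra.yaml (a minimal sketch; the rest of server_encryption_options,
keystore paths and so on, is deployment-specific and omitted here):

    server_encryption_options:
      # Workaround for CASSANDRA-18559: encrypt all internode connections
      # (it was set to a more restrictive value before).
      internode_encryption: all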


> For teams that are hesitant to do rolling restarts, is there a specific
> part of the restart or upgrade you consider high risk or high effort?
> Beyond corporate / enterprise change policy / change windows (which I get),
> is there something that has caused pain in the past that you’re optimizing
> around now?
>
>
>
> On Dec 18, 2024, at 10:12 AM, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I think we're talking about different things.
>
> > Yes, and Paul clarified that it wasn't (just) an issue of having to do
> > rolling restarts, but the work involved in doing an upgrade.  Were it only
> > the case that the hardest part of doing an upgrade was the rolling
> > restart...
>
> From several messages ago:
>
> > This basically means 3 rolling restarts of a cluster, which will be
> difficult for some of our large multi DC clusters.
>
> The discussion was specifically about rolling restarts and how storage
> compatibility mode requires them, which in this environment was described
> as difficult.  The difficulty of the rest of the process is irrelevant here,
> because it's the same regardless of how you approach storage compatibility
> mode.  My point is that rolling restarts should not be difficult if you
> have the right automation, which you seem to agree with.
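
(For readers following along: as I understand it, the "3 rolling restarts"
mentioned earlier come from stepping the new storage_compatibility_mode
setting in cassandra.yaml during a 4.x -> 5.0 upgrade; the sequence below is
a sketch, not a definitive procedure.)

    # cassandra.yaml, one value per rolling restart
    storage_compatibility_mode: CASSANDRA_4   # bounce 1: upgrade binaries to 5.0
    storage_compatibility_mode: UPGRADING     # bounce 2: once every node runs 5.0
    storage_compatibility_mode: NONE          # bounce 3: drop 4.x-era compatibility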
>
> Want to discuss the difficulty of upgrading in general?  I'm all for
> improving it.  It's just not what this thread is about.
>
> Jon
>
>
>
> On Wed, Dec 18, 2024 at 10:01 AM Eric Evans <john.eric.ev...@gmail.com>
> wrote:
>
>>
>>
>> On Wed, Dec 18, 2024 at 11:43 AM Jon Haddad <j...@rustyrazorblade.com>
>> wrote:
>>
>>> > We (Wikimedia) have had more (major) upgrades go wrong in some way
>>> > than right.  Any significant upgrade is going to be weeks, if not months,
>>> > in the making, with careful testing, a phased rollout, and a workable plan
>>> > for rollback.  We'd never entertain doing more than one at a time; it's
>>> > just way too many moving parts.
>>>
>>> The question wasn't about why upgrades are hard; it was about why a
>>> rolling restart of the cluster is hard.  They're different things.
>>>
>>
>> Yes, and Paul clarified that it wasn't (just) an issue of having to do
>> rolling restarts, but the work involved in doing an upgrade.  Were it only
>> the case that the hardest part of doing an upgrade was the rolling
>> restart...
>>
>> --
>> Eric Evans
>> john.eric.ev...@gmail.com
>>
>
>

-- 
Eric Evans
john.eric.ev...@gmail.com
