Thank you for clarifying.

-- Simon
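For the archive: the ZooKeeper-based coordination Justin recommends below is configured through the pluggable quorum manager in broker.xml (available since Artemis 2.18, so it applies to 2.30.0). This is a minimal sketch, not a drop-in config — the ZooKeeper hostnames in connect-string are placeholders, and element details may differ between versions, so check the HA documentation for your release:

```xml
<!-- broker.xml on the primary: a sketch, not a drop-in config -->
<ha-policy>
  <replication>
    <primary>
      <manager>
        <!-- the default manager implementation coordinates via ZooKeeper -->
        <properties>
          <!-- placeholder ensemble; 3+ ZooKeeper nodes give a real quorum -->
          <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
        </properties>
      </manager>
    </primary>
  </replication>
</ha-policy>

<!-- broker.xml on the backup mirrors this under <backup>, plus: -->
<!--   <allow-failback>true</allow-failback>                     -->
```

With the activation lock held in ZooKeeper, only one broker can be active at a time, so a network split between the two brokers can no longer produce two live brokers.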
On Wed, Mar 6, 2024 at 7:23 PM Justin Bertram <jbert...@apache.org> wrote:
> > I did however assume that since backup was able to vote the quorum with
> > 1 vote and decide to become active, it would also, once they did
> > re-establish connection later (which they did), just pass back the
> > active role to the primary when I have allow-failback.
>
> If that assumption were correct then split brain wouldn't really be a
> problem.
>
> > ...would you use the to determine network-check-ping-command if there
> > was a network split
>
> It looks like something was left out of your question here. That said, I
> wouldn't personally recommend using pings, but I know some users have
> employed them with success.
>
> > ...or the vote-on-replication-failure on the primary node to shut it
> > down and manually recover them later?
>
> That's a valid option.
>
> Personally I'd recommend what I mentioned in my previous email - using
> ZooKeeper to coordinate your primary and backup.
>
>
> Justin
>
> On Wed, Mar 6, 2024 at 11:44 AM Simon Valter <si...@valter.info> wrote:
> > At the moment only two nodes are available and it was acceptable with
> > some administrative intervention.
> >
> > I did however assume that since backup was able to vote the quorum with
> > 1 vote and decide to become active, it would also, once they did
> > re-establish connection later (which they did), just pass back the
> > active role to the primary when I have allow-failback.
> >
> > But I guess that would be, like you say, a problem of who has the
> > up-to-date data, as we did see clients fail over to the backup.
> >
> > If I'm not able to add a 3rd node, would you use the to determine
> > network-check-ping-command if there was a network split, or the
> > vote-on-replication-failure on the primary node to shut it down and
> > manually recover them later?
> >
> > On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:
> > > Do you have any mitigation in place for split brain? Typically you'd
> > > use ZooKeeper with a single primary/backup pair of brokers. Otherwise
> > > you'd need 3 primary/backup pairs to establish a proper quorum.
> > >
> > > To be clear, once split brain occurs administrative intervention is
> > > required to resolve the situation. The brokers by themselves can't
> > > determine which broker has more up-to-date data so they can't
> > > automatically decide which broker should take over.
> > >
> > >
> > > Justin
> > >
> > > On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
> > > > I'd like to hear your thoughts on this.
> > > >
> > > > My setup is as follows:
> > > >
> > > > I have a setup similar to the replicated-failback-static example.
> > > >
> > > > I run the following version: apache-artemis-2.30.0
> > > >
> > > > JDK is Java 17.
> > > >
> > > > It's on 2 nodes running Windows 2022. (I have 3 environments; it
> > > > happened across them all at different times. Currently I have kept
> > > > one environment in this state; sadly it's not in DEBUG.)
> > > >
> > > > SSL transport is in use.
> > > >
> > > > Nodes are placed in the same subnet on VMware infrastructure.
> > > >
> > > > NTP/time is in sync on the nodes.
> > > >
> > > > The ActiveMQ service has not been restarted for 84 days; after 2
> > > > days of uptime this happened:
> > > >
> > > > After a split brain, replication stopped. Both brokers are LIVE,
> > > > can see each other, and are connected again, but failback did not
> > > > happen.
> > > >
> > > > I have tested and seen failback happen previously, but this exact
> > > > scenario seems to have caused some bad state?
> > > >
> > > > Logs and screenshots showcasing the issue have been attached.
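For completeness, the two alternatives discussed in the thread also map to broker.xml settings. A hedged sketch — the address, timings, and ping syntax below are placeholders (the ping command in particular is OS-specific), so verify against the documentation for your version:

```xml
<!-- Option 1: network pinger - shut down a broker that loses the network
     rather than letting it stay live while isolated -->
<network-check-period>10000</network-check-period>    <!-- check every 10s -->
<network-check-timeout>1000</network-check-timeout>   <!-- per-ping timeout -->
<!-- placeholder: a stable address outside the brokers, e.g. the gateway -->
<network-check-list>10.0.0.254</network-check-list>
<!-- Windows-style ping; %d = timeout, %s = address -->
<network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>

<!-- Option 2: on the primary, shut down when replication to the backup
     fails and the vote is lost, then recover manually -->
<ha-policy>
  <replication>
    <master>
      <vote-on-replication-failure>true</vote-on-replication-failure>
    </master>
  </replication>
</ha-policy>
```

Either way, as Justin notes, a two-node setup cannot safely decide by itself which side has the newest data; these settings only narrow the window for split brain, they don't resolve one after the fact.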