> I did, however, assume that since the backup was able to win the quorum
> vote with a single vote and become active, it would also, once the nodes
> re-established their connection later (which they did), pass the active
> role back to the primary, since I have allow-failback enabled.

If that assumption were correct then split brain wouldn't really be a
problem.

> ...would you use the  to determine network-check-ping-command if there
> was a network split

It looks like something was left out of your question here. That said, I
wouldn't personally recommend using pings, but I know some users have
employed them with success.
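For reference, the ping-based approach uses the broker's built-in network
health check, configured in broker.xml. A minimal sketch (the address and
intervals are placeholders, and since you're on Windows you'd override the
default Unix-style ping command; check the Network Isolation chapter of the
docs for your version):

```xml
<!-- broker.xml: periodically ping a reference address (e.g. the default
     gateway); if it becomes unreachable the broker stops itself rather
     than staying live on the isolated side of a split -->
<network-check-period>10000</network-check-period>   <!-- check every 10s -->
<network-check-timeout>1000</network-check-timeout>  <!-- each ping must answer within 1s -->
<network-check-list>10.0.0.1</network-check-list>    <!-- placeholder reference address -->
<!-- Windows ping syntax; %d is the timeout, %s the address -->
<network-check-ping-command>cmd /C ping -n 1 -w %d %s</network-check-ping-command>
```

The main caveat is that the reference address itself becomes a single point
of failure for the check, which is one reason I don't generally recommend it.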

> ...or the vote-on-replication-failure on the primary node to shut it down
> and manually recover them later ?

That's a valid option.
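With the classic replication quorum that option sits in the primary's
ha-policy, roughly like this (a sketch; element names have varied across
releases, e.g. `master` in older ones, so verify against the docs for 2.30.0):

```xml
<!-- broker.xml on the primary: if the replication connection to the backup
     fails, hold a vote; a primary that cannot win the vote shuts itself
     down, leaving it for manual recovery as you describe -->
<ha-policy>
   <replication>
      <primary>
         <vote-on-replication-failure>true</vote-on-replication-failure>
         <!-- with only two nodes there is no real third voter, so expect
              the primary to shut down on any replication failure -->
         <quorum-size>1</quorum-size>
      </primary>
   </replication>
</ha-policy>
```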

Personally I'd recommend what I mentioned in my previous email - using
ZooKeeper to coordinate your primary and backup.
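For completeness, the ZooKeeper-based setup replaces the classic quorum with
the pluggable manager. A rough sketch of the primary side (the connect-string
is a placeholder three-node ensemble, and the manager class name has changed
between releases, so check the High Availability chapter for your exact
version):

```xml
<!-- broker.xml on the primary: delegate the activation decision to an
     external ZooKeeper ensemble instead of voting among the brokers -->
<ha-policy>
   <replication>
      <primary>
         <manager>
            <!-- verify this class name against the docs for your release -->
            <class-name>org.apache.activemq.artemis.quorum.zookeeper.CuratorDistributedPrimitiveManager</class-name>
            <properties>
               <!-- placeholder: your ZooKeeper ensemble -->
               <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>
```

The backup gets the mirror-image configuration. Since ZooKeeper itself holds
the lock, a two-broker pair is sufficient and split brain is prevented even
when the brokers can't see each other.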


Justin

On Wed, Mar 6, 2024 at 11:44 AM Simon Valter <si...@valter.info> wrote:

> At the moment only two nodes are available, and it was considered
> acceptable to require some administrative intervention.
>
> I did, however, assume that since the backup was able to win the quorum
> vote with a single vote and become active, it would also, once the nodes
> re-established their connection later (which they did), pass the active
> role back to the primary, since I have allow-failback enabled.
>
> But I guess that would be, as you say, a problem of determining who has
> the up-to-date data, as we did see clients fail over to the backup.
>
> If I'm not able to add a 3rd node, would you use the  to determine
> network-check-ping-command if there was a network split, or the
> vote-on-replication-failure
> on the primary node to shut it down and manually recover them later ?
>
> On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:
>
> > Do you have any mitigation in place for split brain? Typically you'd use
> > ZooKeeper with a single primary/backup pair of brokers. Otherwise you'd
> > need 3 primary/backup pairs to establish a proper quorum.
> >
> > To be clear, once split brain occurs administrative intervention is
> > required to resolve the situation. The brokers by themselves can't
> > determine which broker has more up-to-date data so they can't
> automatically
> > decide which broker should take over.
> >
> >
> > Justin
> >
> > On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
> >
> > > I'd like to hear your thoughts on this.
> > >
> > > My setup is as follows:
> > >
> > > I have a setup similar to the replicated-failback-static example
> > >
> > > I run the following version: apache-artemis-2.30.0
> > >
> > > JDK is java 17
> > >
> > > It's on 2 nodes running Windows Server 2022. (I have 3 environments; it
> > > happened across them all at different times. Currently I have kept 1
> > > environment in this state; sadly its logging is not at DEBUG level.)
> > >
> > > ssl transport is in use
> > >
> > > nodes are placed in the same subnet on vmware infrastructure
> > >
> > > ntp/time is in sync on the nodes
> > >
> > > The ActiveMQ service has not been restarted for 84 days; this happened
> > > after 2 days of uptime:
> > >
> > > After a split brain, replication stopped. Both nodes are LIVE, can see
> > > each other, and are connected again, but failback did not happen.
> > >
> > > I have tested and seen failback happen previously, but this exact
> > > scenario seems to have caused some bad state?
> > >
> > > Logs and screenshots showcasing the issue have been attached.
> > >
> >
>
