At the moment only two nodes are available, and that was considered acceptable
even if it requires some administrative intervention.

I did, however, assume that since the backup was able to win the quorum vote with
1 vote and decide to become active, it would also hand the active role back to the
primary once they re-established their connection later (which they did), since I
have allow-failback enabled.
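
For reference, the failback-related settings follow the replicated-failback-static
example, roughly like this (a trimmed sketch rather than my exact broker.xml,
using the element names from that example):

    <!-- primary (live) broker.xml, ha-policy sketch -->
    <ha-policy>
       <replication>
          <master>
             <!-- on restart, check whether another server is already active
                  before becoming active, so failback can happen -->
             <check-for-live-server>true</check-for-live-server>
          </master>
       </replication>
    </ha-policy>

    <!-- backup broker.xml, ha-policy sketch -->
    <ha-policy>
       <replication>
          <slave>
             <!-- hand the active role back to the primary when it returns -->
             <allow-failback>true</allow-failback>
          </slave>
       </replication>
    </ha-policy>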

But I guess that would be, like you say, a problem of who has the up-to-date data,
as we did see clients fail over to the backup.

If I'm not able to add a 3rd node, would you use the network-check-ping-command to
determine whether there was a network split, or vote-on-replication-failure on the
primary node to shut it down and manually recover it later?
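
To be concrete, what I have in mind on the primary looks roughly like this (just a
sketch with placeholder values based on my reading of the broker.xml reference, not
something I've tested):

    <core xmlns="urn:activemq:core">

       <!-- option 1: network pinger - if none of the listed addresses respond,
            the broker shuts itself down rather than staying active in a split -->
       <network-check-list>10.0.0.1</network-check-list>    <!-- e.g. the subnet gateway -->
       <network-check-period>10000</network-check-period>   <!-- ms between checks -->
       <network-check-timeout>1000</network-check-timeout>  <!-- ms per attempt -->
       <!-- optionally override the ping invocation for Windows; %d/%s are filled in
            by the broker (timeout and address) - would need to be verified on 2.30 -->
       <network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>

       <!-- option 2: when the replication connection to the backup is lost, the
            primary requests a vote to stay active and shuts itself down if it
            cannot win, leaving manual recovery for later -->
       <ha-policy>
          <replication>
             <master>
                <vote-on-replication-failure>true</vote-on-replication-failure>
             </master>
          </replication>
       </ha-policy>

    </core>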

On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:

> Do you have any mitigation in place for split brain? Typically you'd use
> ZooKeeper with a single primary/backup pair of brokers. Otherwise you'd
> need 3 primary/backup pairs to establish a proper quorum.
>
> To be clear, once split brain occurs administrative intervention is
> required to resolve the situation. The brokers by themselves can't
> determine which broker has more up-to-date data so they can't automatically
> decide which broker should take over.
>
>
> Justin
>
> On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
>
> > I'd like to hear your thoughts on this.
> >
> > My setup is as follows:
> >
> > I have a setup similar to the replicated-failback-static example
> >
> > I run the following version: apache-artemis-2.30.0
> >
> > JDK is java 17
> >
> > It's on 2 nodes running Windows 2022 (I have 3 environments; it
> > happened across them all at different times. Currently I have kept 1
> > environment in this state; sadly it's not in DEBUG)
> >
> > ssl transport is in use
> >
> > nodes are placed in the same subnet on vmware infrastructure
> >
> > ntp/time is in sync on the nodes
> >
> > The activemq service has not been restarted for 84 days; after 2 days of uptime
> > this happened:
> >
> > After a split brain, replication stopped; both brokers are LIVE, can see each
> > other and are connected again, but failback did not happen.
> >
> > I have tested and seen failback happen previously, but this exact scenario
> > seems to have caused some bad state?
> >
> > Logs and screenshots showcasing the issue have been attached.
> >
>
