At the moment only two nodes are available, and that has been acceptable with some administrative intervention.
I did however assume that since the backup was able to win the quorum vote
with its 1 vote and decide to become active, it would likewise, once the
nodes re-established their connection later (which they did), hand the
active role back to the primary, since I have allow-failback enabled. But I
guess that would, like you say, be a problem of who has the up-to-date
data, as we did see clients fail over to the backup.

If I'm not able to add a 3rd node, would you use network-check-ping-command
to determine whether there was a network split, or vote-on-replication-failure
on the primary node to shut it down and manually recover later? (I've put a
rough broker.xml sketch of both options at the bottom, below the quoted
thread.)

On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:

> Do you have any mitigation in place for split brain? Typically you'd use
> ZooKeeper with a single primary/backup pair of brokers. Otherwise you'd
> need 3 primary/backup pairs to establish a proper quorum.
>
> To be clear, once split brain occurs administrative intervention is
> required to resolve the situation. The brokers by themselves can't
> determine which broker has more up-to-date data so they can't
> automatically decide which broker should take over.
>
>
> Justin
>
> On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
>
> > I'd like to hear your thoughts on this.
> >
> > My setup is as follows:
> >
> > I have a setup similar to the replicated-failback-static example.
> >
> > I run the following version: apache-artemis-2.30.0
> >
> > JDK is Java 17.
> >
> > It's on 2 nodes running Windows 2022. (I have 3 environments, and it
> > happened across them all at different times; currently I have kept 1
> > environment in this state, though sadly it's not in DEBUG.)
> >
> > SSL transport is in use.
> >
> > Nodes are placed in the same subnet on VMware infrastructure.
> >
> > NTP/time is in sync on the nodes.
> >
> > The ActiveMQ service has not been restarted for 84 days; after 2 days of
> > uptime this happened:
> >
> > After a split brain, replication stopped. Both brokers are LIVE, can see
> > each other, and are connected again, but failback did not happen.
> >
> > I have tested and seen failback happen previously, but this exact
> > scenario seems to have caused some bad state?
> >
> > Logs and screenshots showcasing the issue have been attached.
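
For reference, this is roughly what I had in mind for the two options. It's
an untested sketch against our 2.30 broker.xml; the 10.0.0.1 target is just
a placeholder (e.g. the default gateway), and the Windows-style ping command
is a guess on my part, since the default is platform-specific:

    <!-- Option 1: network pinger. If the broker cannot reach the address
         in network-check-list it assumes it is the isolated side and
         shuts itself down instead of staying live. -->
    <network-check-period>10000</network-check-period>
    <network-check-timeout>1000</network-check-timeout>
    <network-check-list>10.0.0.1</network-check-list>
    <!-- only needed if the platform default ping doesn't fit; %d is the
         timeout and %s the address -->
    <network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>

    <!-- Option 2: have the primary vote when it loses the replication
         connection. With a single pair there are no other brokers to form
         a real quorum, so in practice this would just make the primary
         shut down rather than stay live alone, and we'd recover manually. -->
    <ha-policy>
       <replication>
          <master>
             <check-for-live-server>true</check-for-live-server>
             <vote-on-replication-failure>true</vote-on-replication-failure>
          </master>
       </replication>
    </ha-policy>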
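
And if I could add a small 3rd machine running only ZooKeeper, rather than
a full 3rd broker pair, my understanding of the pluggable quorum setup you
mention is roughly the following (zk1/zk2/zk3 are placeholder hostnames,
and the backup side would carry the same manager block plus allow-failback):

    <ha-policy>
       <replication>
          <primary>
             <manager>
                <properties>
                   <property key="connect-string"
                             value="zk1:2181,zk2:2181,zk3:2181"/>
                </properties>
             </manager>
          </primary>
       </replication>
    </ha-policy>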