> I did however assume that since the backup was able to win the quorum vote
> with 1 vote and decide to become active, it would also, once the connection
> was re-established later (which it was), just pass the active role back to
> the primary since I have allow-failback.
If that assumption were correct then split brain wouldn't really be a
problem.

> ...would you use the to determine network-check-ping-command if there was
> a network split

It looks like something was left out of your question here. That said, I
wouldn't personally recommend using pings, but I know some users have
employed them with success.

> ...or the vote-on-replication-failure on the primary node to shut it down
> and manually recover them later?

That's a valid option. Personally I'd recommend what I mentioned in my
previous email - using ZooKeeper to coordinate your primary and backup.


Justin

On Wed, Mar 6, 2024 at 11:44 AM Simon Valter <si...@valter.info> wrote:
> At the moment only two nodes are available and it was acceptable with some
> administrative intervention.
>
> I did however assume that since the backup was able to win the quorum vote
> with 1 vote and decide to become active, it would also, once the connection
> was re-established later (which it was), just pass the active role back to
> the primary since I have allow-failback.
>
> But I guess that would be, like you say, a problem of who has the
> up-to-date data, as we did see clients fail over to the backup.
>
> If I'm not able to add a 3rd node, would you use the to determine
> network-check-ping-command if there was a network split, or the
> vote-on-replication-failure on the primary node to shut it down and
> manually recover them later?
>
> On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:
> >
> > Do you have any mitigation in place for split brain? Typically you'd use
> > ZooKeeper with a single primary/backup pair of brokers. Otherwise you'd
> > need 3 primary/backup pairs to establish a proper quorum.
> >
> > To be clear, once split brain occurs administrative intervention is
> > required to resolve the situation. The brokers by themselves can't
> > determine which broker has more up-to-date data so they can't
> > automatically decide which broker should take over.
> >
> > Justin
> >
> > On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
> > > I'd like to hear your thoughts on this.
> > >
> > > My setup is as follows:
> > >
> > > I have a setup similar to the replicated-failback-static example.
> > >
> > > I run the following version: apache-artemis-2.30.0
> > >
> > > JDK is Java 17.
> > >
> > > It's on 2 nodes running Windows 2022 (I have 3 environments, and it
> > > happened across them all at different times; currently I have kept 1
> > > environment in this state, sadly it's not in DEBUG).
> > >
> > > SSL transport is in use.
> > >
> > > Nodes are placed in the same subnet on VMware infrastructure.
> > >
> > > NTP/time is in sync on the nodes.
> > >
> > > The ActiveMQ service has not been restarted for 84 days; after 2 days
> > > of uptime this happened:
> > >
> > > After a split brain, replication stopped and both are LIVE and can see
> > > each other and are connected again, but failback did not happen.
> > >
> > > I have tested and seen failback happen previously, but this exact
> > > scenario seems to have caused some bad state?
> > >
> > > Logs and screenshots showcasing the issue have been attached.
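[Editor's note] The ZooKeeper-based coordination Justin recommends is the
pluggable lock manager ("pluggable quorum") configured in broker.xml. A
minimal sketch, assuming a three-node ZooKeeper ensemble reachable at the
placeholder hostnames zk1/zk2/zk3 (verify element names against the HA
documentation for your Artemis version):

```xml
<!-- Primary broker: delegate the activation decision to the ZooKeeper
     ensemble instead of having the brokers vote among themselves. -->
<ha-policy>
   <replication>
      <primary>
         <manager>
            <properties>
               <!-- zk1/zk2/zk3 are placeholder hostnames -->
               <property key="connect-string"
                         value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
      </primary>
   </replication>
</ha-policy>

<!-- Backup broker: same manager, with failback enabled so the original
     primary resumes the active role once it returns and re-syncs. -->
<ha-policy>
   <replication>
      <backup>
         <manager>
            <properties>
               <property key="connect-string"
                         value="zk1:2181,zk2:2181,zk3:2181"/>
            </properties>
         </manager>
         <allow-failback>true</allow-failback>
      </backup>
   </replication>
</ha-policy>
```

Because activation is arbitrated by an external ensemble with its own
majority, a single primary/backup pair is enough on the broker side, which
addresses the two-node constraint described in the thread.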
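[Editor's note] For reference, the two mitigations Simon asks about are
broker.xml settings. A sketch only, with placeholder addresses and timings;
as Justin notes, neither fully prevents split brain with two nodes:

```xml
<!-- Network health check: the broker pings the listed address and shuts
     itself down while nothing on the list is reachable.
     192.168.1.1 is a placeholder (e.g. the default gateway). -->
<network-check-list>192.168.1.1</network-check-list>
<network-check-period>5000</network-check-period>   <!-- ms between checks -->
<network-check-timeout>1000</network-check-timeout> <!-- ms per attempt -->
<!-- Optional override of the ping command (Windows-style example shown;
     verify the default/placeholders for your platform and version). -->
<network-check-ping-command>cmd /C ping -n 1 -w %d %s</network-check-ping-command>

<!-- Classic replication: when the primary loses the replication link it
     asks for a quorum vote and shuts down if it loses, so it can be
     recovered manually later instead of running split-brained. -->
<ha-policy>
   <replication>
      <master>
         <vote-on-replication-failure>true</vote-on-replication-failure>
      </master>
   </replication>
</ha-policy>
```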