Thank you for clarifying.

-- Simon
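For the archive: the ZooKeeper-based coordination Justin recommends below is configured through the pluggable quorum manager in broker.xml (available since Artemis 2.18, so it applies to 2.30.0). This is a minimal sketch, not a drop-in config — the ZooKeeper hostnames in connect-string are placeholders, and element details may differ between versions, so check the HA documentation for your release:

```xml
<!-- broker.xml on the primary: a sketch, not a drop-in config -->
<ha-policy>
  <replication>
    <primary>
      <manager>
        <!-- the default manager implementation coordinates via ZooKeeper -->
        <properties>
          <!-- placeholder ensemble; 3+ ZooKeeper nodes give a real quorum -->
          <property key="connect-string" value="zk1:2181,zk2:2181,zk3:2181"/>
        </properties>
      </manager>
    </primary>
  </replication>
</ha-policy>

<!-- broker.xml on the backup mirrors this under <backup>, plus: -->
<!--   <allow-failback>true</allow-failback>                     -->
```

With the activation lock held in ZooKeeper, only one broker can be active at a time, so a network split between the two brokers can no longer produce two live brokers.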
On Wed, Mar 6, 2024 at 7:23 PM Justin Bertram <jbert...@apache.org> wrote:
> > I did however assume that since backup was able to vote the quorum with
> > 1 vote and decide to become active, it would also, once they did
> > re-establish connection later (which they did), just pass back the
> > active role to the primary when I have allow-failback.
>
> If that assumption were correct then split brain wouldn't really be a
> problem.
>
> > ...would you use the to determine network-check-ping-command if there
> > was a network split
>
> It looks like something was left out of your question here. That said, I
> wouldn't personally recommend using pings, but I know some users have
> employed them with success.
>
> > ...or the vote-on-replication-failure on the primary node to shut it
> > down and manually recover them later?
>
> That's a valid option.
>
> Personally I'd recommend what I mentioned in my previous email - using
> ZooKeeper to coordinate your primary and backup.
>
>
> Justin
>
> On Wed, Mar 6, 2024 at 11:44 AM Simon Valter <si...@valter.info> wrote:
> > At the moment only two nodes are available and it was acceptable with
> > some administrative intervention.
> >
> > I did however assume that since backup was able to vote the quorum with
> > 1 vote and decide to become active, it would also, once they did
> > re-establish connection later (which they did), just pass back the
> > active role to the primary when I have allow-failback.
> >
> > But I guess that would be, like you say, a problem of who has the
> > up-to-date data, as we did see clients fail over to the backup.
> >
> > If I'm not able to add a 3rd node, would you use the to determine
> > network-check-ping-command if there was a network split, or the
> > vote-on-replication-failure on the primary node to shut it down and
> > manually recover them later?
> >
> > On Wed, Mar 6, 2024 at 5:49 PM Justin Bertram <jbert...@apache.org> wrote:
> > > Do you have any mitigation in place for split brain? Typically you'd
> > > use ZooKeeper with a single primary/backup pair of brokers. Otherwise
> > > you'd need 3 primary/backup pairs to establish a proper quorum.
> > >
> > > To be clear, once split brain occurs administrative intervention is
> > > required to resolve the situation. The brokers by themselves can't
> > > determine which broker has more up-to-date data so they can't
> > > automatically decide which broker should take over.
> > >
> > >
> > > Justin
> > >
> > > On Wed, Mar 6, 2024 at 8:11 AM Simon Valter <si...@valter.info> wrote:
> > > > I'd like to hear your thoughts on this.
> > > >
> > > > My setup is as follows:
> > > >
> > > > I have a setup similar to the replicated-failback-static example.
> > > >
> > > > I run the following version: apache-artemis-2.30.0
> > > >
> > > > JDK is Java 17.
> > > >
> > > > It's on 2 nodes running Windows 2022. (I have 3 environments; it
> > > > happened across them all at different times. Currently I have kept
> > > > one environment in this state; sadly it's not in DEBUG.)
> > > >
> > > > SSL transport is in use.
> > > >
> > > > Nodes are placed in the same subnet on VMware infrastructure.
> > > >
> > > > NTP/time is in sync on the nodes.
> > > >
> > > > The ActiveMQ service has not been restarted for 84 days; after 2
> > > > days of uptime this happened:
> > > >
> > > > After a split brain, replication stopped. Both brokers are LIVE,
> > > > can see each other, and are connected again, but failback did not
> > > > happen.
> > > >
> > > > I have tested and seen failback happen previously, but this exact
> > > > scenario seems to have caused some bad state?
> > > >
> > > > Logs and screenshots showcasing the issue have been attached.
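For completeness, the two alternatives discussed in the thread also map to broker.xml settings. A hedged sketch — the address, timings, and ping syntax below are placeholders (the ping command in particular is OS-specific), so verify against the documentation for your version:

```xml
<!-- Option 1: network pinger - shut down a broker that loses the network
     rather than letting it stay live while isolated -->
<network-check-period>10000</network-check-period>    <!-- check every 10s -->
<network-check-timeout>1000</network-check-timeout>   <!-- per-ping timeout -->
<!-- placeholder: a stable address outside the brokers, e.g. the gateway -->
<network-check-list>10.0.0.254</network-check-list>
<!-- Windows-style ping; %d = timeout, %s = address -->
<network-check-ping-command>ping -n 1 -w %d %s</network-check-ping-command>

<!-- Option 2: on the primary, shut down when replication to the backup
     fails and the vote is lost, then recover manually -->
<ha-policy>
  <replication>
    <master>
      <vote-on-replication-failure>true</vote-on-replication-failure>
    </master>
  </replication>
</ha-policy>
```

Either way, as Justin notes, a two-node setup cannot safely decide by itself which side has the newest data; these settings only narrow the window for split brain, they don't resolve one after the fact.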