We need only one queue, so we can't have three master/slave pairs listening to the same queue. The network ping configuration (several UNIX servers listed as ping targets) detected that the network was unhealthy, so Artemis went to sleep. allow-failback on the slave was set to false to avoid the flip-flop problem. The problem arises if the following sequence occurs:

1. Primary/master starts and becomes active.
2. Slave starts as backup.
3. Master is isolated from the network.
4. Slave becomes active.
5. Master recovers from network isolation.
6. Master wakes up but detects an active server, so it announces itself as backup.
7. Slave is isolated from the network.
8. Master becomes active.
9. Slave recovers from network isolation.
10. Slave wakes up, but because allow-failback = false and there is no check-for-live-server configuration for it, the slave becomes active while the master is also active.
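To make the sequence easier to follow, here is a minimal Python sketch of it as a toy state machine. It models only the behaviour reported in this thread and is not how Artemis is implemented. The names `Broker`, `isolate`, `recover` and `run_scenario` are made up for illustration, and the `checks_for_live` flag on the slave is hypothetical, since the slave has no such setting today.

```python
from dataclasses import dataclass


@dataclass
class Broker:
    name: str
    checks_for_live: bool          # stands in for check-for-live-server
    state: str = "stopped"         # "stopped", "live", "backup" or "asleep"
    before_sleep: str = "stopped"  # state held when the network ping failed


def start_master(master, peer):
    # A master that checks for a live server backs off if the peer is live.
    if master.checks_for_live and peer.state == "live":
        master.state = "backup"
    else:
        master.state = "live"


def isolate(broker, peer):
    # The network ping fails, so the isolated broker goes to sleep.
    broker.before_sleep = broker.state
    broker.state = "asleep"
    # A backup that loses its live broker activates.
    if broker.before_sleep == "live" and peer.state == "backup":
        peer.state = "live"


def recover(broker, peer):
    # On waking, only a broker that checks for a live server backs off;
    # otherwise it resumes whatever it was doing before it slept.
    if broker.checks_for_live and peer.state == "live":
        broker.state = "backup"
    else:
        broker.state = broker.before_sleep


def run_scenario(slave_checks_for_live):
    master = Broker("master", checks_for_live=True)
    slave = Broker("slave", checks_for_live=slave_checks_for_live)
    start_master(master, slave)  # step 1
    slave.state = "backup"       # step 2: a slave always starts as backup
    isolate(master, slave)       # steps 3-4
    recover(master, slave)       # steps 5-6
    isolate(slave, master)       # steps 7-8
    recover(slave, master)       # steps 9-10
    return master.state, slave.state


if __name__ == "__main__":
    print(run_scenario(False))  # the configuration that was tested
    print(run_scenario(True))   # hypothetical check on the slave
```

In this toy model, `run_scenario(False)` ends with both brokers live, which is the split brain described above, while the hypothetical `run_scenario(True)` leaves the slave as backup. The model assumes the two brokers can see each other once the isolation ends, so it says nothing about the cases where only a proper quorum helps.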
-----Original Message-----
From: Justin Bertram <jbert...@apache.org>
Sent: Thursday, April 7, 2022 10:42 PM
To: users@activemq.apache.org
Subject: [EXTERNAL] Re: Configuration check-for-live-server recommendation for backup server

The check-for-live-server controls what happens when a master broker is *started*. If it's false then it will activate even if there is already another broker on the network with its ID, but if it's true then it will check first and if it finds another broker on the network with its ID then it will become a backup to that broker.

On the other hand, a replication slave *always* starts as a backup no matter what.

If you want to mitigate split brain in this case then you need a proper quorum. In order to get this you can either configure 3 master/slave pairs or you can integrate with ZooKeeper via the pluggable quorum vote replication configuration [1]. A single master/slave pair simply cannot avoid split brain in every possible situation.

Justin

[1] https://gcc02.safelinks.protection.outlook.com/?url=https%3A%2F%2Factivemq.apache.org%2Fcomponents%2Fartemis%2Fdocumentation%2Flatest%2Fha.html%23pluggable-quorum-vote-replication-configurations&data=04%7C01%7Crahman.gunawan%40nasa.gov%7C3c40f2b290a24597b5fd08da19097938%7C7005d45845be48ae8140d43da96dd17b%7C0%7C0%7C637849825728388437%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=P%2Ba74ZA6TYwrZy383kpu0Z8f%2BJqjcmjperyL5efsVdI%3D&reserved=0

On Wed, Apr 6, 2022 at 10:06 AM Gunawan, Rahman (GSFC-703.H)[Halvik Corp] <rahman.guna...@nasa.gov.invalid> wrote:

> Hi,
> I would like to recommend to add configuration "<check-for-live-server>"
> to backup server.
> I tested artemis replication mode with the following configuration:
>
> Primary:
> <ha-policy>
>    <replication>
>       <master>
>          <vote-on-replication-failure>true</vote-on-replication-failure>
>          <check-for-live-server>true</check-for-live-server>
>       </master>
>    </replication>
> </ha-policy>
>
> Backup server:
> <ha-policy>
>    <replication>
>       <slave>
>          <allow-failback>false</allow-failback>
>       </slave>
>    </replication>
> </ha-policy>
>
> We also enable ping on both primary and backup server.
>
> 1. When the network card in primary was disabled, after around 2
> minutes, the backup server went live while the primary server was
> still isolated from network.
>
> 2. After network card in primary server was enabled, artemis in
> primary woke up but it detected a live server was already active so it
> announced as backup.
>
> 3. Then, network card in the backup server was disabled, after around
> 2 minutes, the primary server went live while the backup server was
> still isolated from network.
>
> 4. After network card in the backup server was enabled, the backup
> server woke but because there was no configuration to check for live
> server, it went live while the primary server also live (split brain issue).
>
> Any reason why the backup server doesn't have configuration
> "<check-for-live-server>"?
> Thanks
>
> Regards,
> Rahman