Hello,
I’m working on an Artemis cluster with HA/replication and I have a question about the default quorum implementation.

The cluster is configured on 3 servers (s1, s2, s3) in 3 different physical locations (latency < 5 ms), each server hosting a primary and a backup, which I will call p1, p2, p3 and b1, b2, b3 respectively. So we have:

s1 hosting p1, b1
s2 hosting p2, b2
s3 hosting p3, b3

The primaries and backups are paired into replication groups so as to prevent data loss if a server fails:

p1 paired with b2 in group p1b2
p2 paired with b3 in group p2b3
p3 paired with b1 in group p3b1

check-for-active-server is set to true on the primary brokers and allow-failback is set to true on the backup brokers. To prevent any split brain, vote-on-replication-failure is set to true on every broker, with a quorum-size of -1.

When the connection between a primary and its backup fails, a vote is triggered on both sides. If the backup obtains 2 successful answers, it goes live; otherwise it stays a backup. If the primary obtains 2 successful answers, it stays live; otherwise it stops itself, since in that case its backup is expected to have gone live with the majority of the remaining live brokers.

The problem I have is what happens when I lose a complete server (server/site stopped or crashed, or connection loss), let’s say s2. The vote on the affected backup b3 lets it go live, as p1 and p3 answer positively. BUT the vote on the primary p1 fails: it cannot get 2 successful answers because p2 is down, so p1 shuts itself down.

That is not what I would expect from a quorum vote, as 2 of the 3 usually active live brokers are still up, which is a majority. p1 would only need an answer from p3 during the vote to confirm that there is no split brain, but as the vote only counts the “other live brokers”, it fails.
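To make the counting concrete, here is a minimal Python sketch of the vote as I described it above. It is only a model of the arithmetic, not Artemis code: the names (SERVERS, LIVE, REQUIRED_ANSWERS, reachable_voters, vote_succeeds) are made up for illustration, and the threshold of 2 answers is simply the value I observe in my setup.

```python
# Hypothetical model of the vote described above; not Artemis source code.
SERVERS = {"s1": {"p1", "b1"}, "s2": {"p2", "b2"}, "s3": {"p3", "b3"}}
LIVE = {"p1", "p2", "p3"}   # brokers that are live before the failure
REQUIRED_ANSWERS = 2        # answers needed for a vote to succeed (as observed)


def reachable_voters(voter, failed_server):
    """Live brokers, other than the voter itself, that can still answer."""
    down = SERVERS[failed_server]
    return sorted(LIVE - down - {voter})


def vote_succeeds(voter, failed_server):
    """True if the voter collects enough answers from the other live brokers."""
    return len(reachable_voters(voter, failed_server)) >= REQUIRED_ANSWERS


if __name__ == "__main__":
    # Losing s2 takes down p2 and b2; b3 and p1 both start a vote.
    for voter in ("b3", "p1"):
        print(voter, reachable_voters(voter, "s2"), vote_succeeds(voter, "s2"))
```

In this model b3 can ask p1 and p3, so its vote passes, while p1 can only ask p3 (it does not count itself), so its vote fails even though 2 of the 3 usual live brokers are still up.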
Now, what I would like to understand is why, during a vote triggered by a live broker, only the “other live brokers” are counted in the result. In our cluster, as long as the majority of the live brokers are able to communicate, a vote should not result in a live broker stopping itself. Did I miss something in the configuration that would prevent this case, or is that the expected behavior of the primary quorum vote? If so, is the solution to switch to another quorum implementation such as ZooKeeper?

Regards,
Gaëtan