> That case is not what I would expect from a quorum vote, as 2 of the 3
> usually active brokers are still live, so the majority of them.

The problem is that one of the two active brokers is the one initiating the
vote, so it can't participate. That leaves just 1 active broker to vote, which
can't pass.

> ...why during a vote triggered by a live broker, only the “other live
> brokers” are counted in the result?

Only active brokers participate in a vote. I can't remember exactly why that's
the case off the top of my head.

> Did I miss something to prevent that case within the configuration or is
> that the expected behavior from the primary quorum voting?

Given your configuration and the fact that you're losing a primary and a
backup simultaneously, I think you're seeing the expected behavior.

> If so, is the solution to switch to another quorum implementation such as
> ZooKeeper?

If you did not set vote-on-replication-failure on the primary brokers to true,
I don't think you'd have this problem. Also, if you didn't colocate a primary
and a backup on the same server, I don't think you'd have this problem. That
said, generally speaking ZooKeeper is the recommended option at this point
since it eliminates the need for 3 pairs (which most folks don't need anyway).

Hope that helps!

Justin

On Thu, Jan 23, 2025 at 4:50 AM Gaëtan Caumartin <caumartin.gae...@gmail.com> wrote:

> Hello,
>
> I’m working on an Artemis cluster with ha/replication and I have a question
> about the quorum default implementation.
>
> The cluster I’m using is configured on 3 servers (s1, s2, s3) in 3
> different physical locations (latency < 5ms), each server hosting a primary
> and a backup that I will respectively call p1,p2,p3 and b1,b2,b3.
>
> So we have:
> s1 hosting p1,b1
> s2 hosting p2,b2
> s3 hosting p3,b3
>
> The primary and backup are paired into replication groups in a way that
> prevents data loss in case of a server failure:
> p1 paired with b2 in group p1b2
> p2 paired with b3 in group p2b3
> p3 paired with b1 in group p3b1
>
> check-for-active-server is set to true on the primary brokers and
> allow-failback is set to true on the backup brokers.
>
> To prevent any split brain, vote-on-replication-failure is set to true on
> every broker, with quorum-size set to -1.
>
> When there is a failure between a primary and its backup, a vote is
> triggered on both sides.
>
> For the backup, if it obtains 2 successful answers it will go live.
> If not, it will stay as a backup.
> For the primary, if it obtains 2 successful answers it will stay live.
> If not, it will stop itself, as in that case it’s expected that its backup
> is now live with the majority of the remaining live brokers.
>
> The problem I have now arises if I lose a complete server (server/site
> stopped / crashed or connection loss), let’s say s2.
> The vote on the affected backup b3 will let it go live, as p1 and p3 will
> positively answer.
> BUT the vote on the primary p1 will fail, as it is unable to get 2
> successful answers with p2 down. p1 ends up shutting itself down.
>
> That case is not what I would expect from a quorum vote, as 2 of the 3
> usually active brokers are still live, so the majority of them.
> p1 would only need an answer from p3 during the vote to confirm that there
> is no split brain, but as the vote is only counting the “other live brokers”
> it ends as a fail.
>
> Now, what I would like to understand is why during a vote triggered by a
> live broker, only the “other live brokers” are counted in the result? In
> our cluster, as long as the majority of the live brokers are able to
> communicate, a vote should not result in a live broker stopping itself.
>
> Did I miss something to prevent that case within the configuration or is
> that the expected behavior from the primary quorum voting? If so, is the
> solution to switch to another quorum implementation such as ZooKeeper?
>
> Regards,
>
> Gaëtan
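
For anyone reading this thread later, the failure scenario above can be replayed with a small sketch. It is a simplified model of the vote as described in these messages (only active brokers answer, the initiator cannot answer its own vote, and 2 successful answers are needed), not Artemis's actual implementation. The names `HOSTS`, `PRIMARIES`, `REQUIRED_ANSWERS`, `active_after_failure` and `vote_passes` are hypothetical and exist only for illustration.

```python
# Simplified model of the vote described in this thread; NOT Artemis's code.
# All names below are hypothetical and used only for illustration.

REQUIRED_ANSWERS = 2  # per the original message: 2 successful answers needed

# Which brokers run on which server, as described in the original message.
HOSTS = {"s1": {"p1", "b1"}, "s2": {"p2", "b2"}, "s3": {"p3", "b3"}}

# Brokers that are active (live) before any failure.
PRIMARIES = {"p1", "p2", "p3"}


def active_after_failure(failed_server):
    """Return the active brokers that survive the loss of one server."""
    return PRIMARIES - HOSTS[failed_server]


def vote_passes(initiator, active):
    """Only active brokers answer, and the initiator cannot answer itself."""
    answers = active - {initiator}
    return len(answers) >= REQUIRED_ANSWERS


active = active_after_failure("s2")        # p2 and b2 are gone
b3_goes_live = vote_passes("b3", active)   # b3 lost its primary p2
p1_stays_live = vote_passes("p1", active)  # p1 lost its backup b2
print(b3_goes_live, p1_stays_live)         # prints: True False
```

Under these assumptions b3 collects answers from p1 and p3 and goes live, while p1 can only be answered by p3 and therefore stops itself, which matches the behavior reported above.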