Hello,


I’m working on an Artemis cluster with HA replication and I have a question
about the default quorum implementation.



The cluster I’m using runs on 3 servers (s1, s2, s3) in 3 different
physical locations (latency < 5 ms), each server hosting a primary and a
backup, which I will call p1, p2, p3 and b1, b2, b3 respectively.



So we have:

s1 hosting p1,b1

s2 hosting p2,b2

s3 hosting p3,b3



The primaries and backups are paired into replication groups so that the
failure of a single server does not cause data loss:

p1 paired with b2 in group p1b2

p2 paired with b3 in group p2b3

p3 paired with b1 in group p3b1
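
To make the pairing concrete, here is a minimal Python sketch of the layout. It is only an illustration: the names HOSTS, GROUPS, pairing_is_safe and groups_surviving are made up for this mail and are not Artemis configuration or API.

```python
# Illustrative model of the topology described above; not Artemis code.
# Which server hosts each broker.
HOSTS = {"p1": "s1", "b1": "s1", "p2": "s2", "b2": "s2", "p3": "s3", "b3": "s3"}

# Replication groups as (primary, backup) pairs.
GROUPS = {"p1b2": ("p1", "b2"), "p2b3": ("p2", "b3"), "p3b1": ("p3", "b1")}


def pairing_is_safe():
    """True if no group keeps its primary and its backup on the same server."""
    return all(HOSTS[p] != HOSTS[b] for p, b in GROUPS.values())


def groups_surviving(failed_server):
    """For each group, list the brokers still running after one server fails."""
    return {
        name: [broker for broker in pair if HOSTS[broker] != failed_server]
        for name, pair in GROUPS.items()
    }
```

With this layout every group keeps at least one member when any single server is lost, which is the property the pairing is meant to give.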



check-for-active-server is set to true on the primary brokers and
allow-failback is set to true on the backup brokers.



To prevent split brain, vote-on-replication-failure is set to true on
every broker, with quorum-size set to -1.



When replication between a primary and its backup fails, a vote is
triggered on both sides.



For the backup, if it obtains 2 successful answers it will go live.

If not it will stay as a backup.

For the primary, if it obtains 2 successful answers it will stay live.

If not, it will stop itself, since in that case its backup is expected to
have gone live with the majority of the remaining live brokers.



The problem I have now is what happens if I lose a complete server
(server/site stopped or crashed, or connection lost), let’s say s2.

The vote on the affected backup b3 will let it go live, as p1 and p3 will
answer positively.

BUT the vote on the primary p1 will fail: it cannot get 2 successful
answers because p2 is down, so p1 ends up shutting itself down.
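
The arithmetic behind this can be sketched in a few lines of Python. This only models the behavior as I described it above (2 answers required, only the other live brokers counted); REQUIRED_ANSWERS and vote are names I made up, not Artemis code, and the real implementation may differ.

```python
# Illustrative model of the vote as described in this mail; not Artemis code.
REQUIRED_ANSWERS = 2  # answers needed in this 3-primary cluster, as observed


def vote(voter, live_brokers):
    """Count answers from the *other* live brokers and compare to the threshold."""
    answers = [broker for broker in live_brokers if broker != voter]
    return len(answers) >= REQUIRED_ANSWERS


# s2 is lost: p2 and b2 are gone, p1 and p3 are still live.
live_after_s2_loss = ["p1", "p3"]

b3_goes_live = vote("b3", live_after_s2_loss)   # p1 and p3 both answer
p1_stays_live = vote("p1", live_after_s2_loss)  # only p3 can answer
```

In this model b3_goes_live is True and p1_stays_live is False: p1 never counts itself, so with p2 down it can collect at most one answer.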



That is not what I would expect from a quorum vote, as we still have 2 of
the 3 normally active live brokers, which is a majority.

p1 would only need an answer from p3 during the vote to confirm that there
is no split brain, but as the vote only counts the “other live brokers”,
it fails.



Now, what I would like to understand is why, during a vote triggered by a
live broker, only the “other live brokers” are counted in the result. In
our cluster, as long as the majority of the live brokers are able to
communicate, a vote should not result in a live broker stopping itself.



Did I miss something in the configuration that would prevent this case, or
is this the expected behavior of the primary quorum vote? If so, is the
solution to switch to another quorum implementation such as ZooKeeper?



Regards,



Gaëtan
