> That case is not what I would expect from a quorum vote as we still have
> 2 live brokers out of the 3 usually active, so the majority of them.

The problem is that one of the two active brokers is the one initiating the
vote, so it can't participate. That leaves just 1 active broker to vote,
which isn't enough for the vote to pass.
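
To make the arithmetic concrete, here is a rough sketch of the vote as it is
described in this thread. It is a simplified model, not the broker's actual
voting code: the function name vote_passes is made up, and the threshold of
2 answers comes from the description in the original message.

```python
# Simplified model of the vote described in this thread; NOT Artemis source code.
# Assumption: a voter needs `required` positive answers from *other* active
# brokers and never counts itself.

def vote_passes(voter, active_brokers, required=2):
    """Return True if `voter` can collect enough answers from the other active brokers."""
    answers = [b for b in active_brokers if b != voter]
    return len(answers) >= required

# s2 is lost, so p2 is gone; p1 and p3 are the remaining active brokers.
active = {"p1", "p3"}

print(vote_passes("p1", active))  # False: p1 can only ask p3
print(vote_passes("b3", active))  # True: b3 is a backup, so it can ask p1 and p3
```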

> ...why during a vote triggered by a live broker, only the “other live
> brokers” are counted in the result?

Only active brokers participate in a vote. I can't remember exactly why
that's the case off the top of my head.

> Did I miss something in the configuration that would prevent that case, or
> is that the expected behavior of the primary quorum voting?

Given your configuration and the fact that you're losing a primary and a
backup simultaneously, I think you're seeing the expected behavior.

> If so, is the solution to switch to another quorum implementation such as
> ZooKeeper?

If you did not set vote-on-replication-failure to true on the primary
brokers, I don't think you'd have this problem. Likewise, if you didn't
colocate a primary and a backup on the same server, I don't think you'd have
this problem.
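
As a rough illustration of the colocation point, the sketch below compares
your layout with a hypothetical one where each broker runs on its own host.
It is a simplified model, not Artemis code: the host names h1..h6 and the
helper function are made up, every primary is assumed to be active before
the failure, a voter is assumed to need 2 answers from other active brokers,
and the vote is assumed to happen before any backup takes over.

```python
# Simplified model, NOT Artemis source code. The "spread" layout with six
# hosts (h1..h6) is hypothetical and only there for comparison.

REQUIRED = 2  # answers a voting primary needs from other active brokers

def primaries_that_fail_vote(layout, pairs, lost_server):
    """Primaries that survive the loss, lose their backup, and then lose the vote."""
    down = {broker for broker, server in layout.items() if server == lost_server}
    active = {p for p in pairs if p not in down}
    failed = []
    for primary, backup in pairs.items():
        if primary in active and backup in down:
            others = active - {primary}  # the voter does not count itself
            if len(others) < REQUIRED:
                failed.append(primary)
    return sorted(failed)

pairs = {"p1": "b2", "p2": "b3", "p3": "b1"}  # primary -> its backup

colocated = {"p1": "s1", "b1": "s1", "p2": "s2", "b2": "s2", "p3": "s3", "b3": "s3"}
spread = {"p1": "h1", "p2": "h2", "p3": "h3", "b1": "h4", "b2": "h5", "b3": "h6"}

for server in ("s1", "s2", "s3"):
    print(server, primaries_that_fail_vote(colocated, pairs, server))
for server in sorted(set(spread.values())):
    print(server, primaries_that_fail_vote(spread, pairs, server))
```

In this model, losing any one of s1, s2 or s3 leaves one surviving primary
without enough voters, while losing any single host in the spread layout
does not.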

That said, generally speaking ZooKeeper is the recommended option at this
point since it eliminates the need for 3 pairs (which most folks don't need
anyway).


Hope that helps!


Justin

On Thu, Jan 23, 2025 at 4:50 AM Gaëtan Caumartin <caumartin.gae...@gmail.com>
wrote:

> Hello,
>
>
>
> I’m working on an Artemis cluster with ha/replication and I have a question
> about the quorum default implementation.
>
>
>
> The cluster I’m using is configured on 3 servers (s1, s2, s3) on 3
> different physical locations (latency < 5ms), each server hosting a primary
> and a backup that I will respectively call p1,p2,p3 and b1,b2,b3.
>
>
>
> So we have:
>
> s1 hosting p1,b1
>
> s2 hosting p2,b2
>
> s3 hosting p3,b3
>
>
>
> The primary and backup are paired into replication groups in a way to
> prevent data loss in case of a server failure:
>
> p1 paired with b2 in group p1b2
>
> p2 paired with b3 in group p2b3
>
> p3 paired with b1 in group p3b1
>
>
>
> check-for-active-server is set to true on the primary brokers and
> allow-failback is set to true on the backup brokers.
>
>
>
> To prevent any split brain, vote-on-replication-failure is set to true on
> every broker, with a quorum-size of -1.
>
>
>
> When there is a failure between a primary and its backup, a vote is
> triggered on both sides.
>
>
>
> For the backup, if it obtains 2 successful answers it will go live.
>
> If not it will stay as a backup.
>
> For the primary, if it obtains 2 successful answers it will stay live.
>
> If not, it will stop itself, as in that case its backup is expected to be
> live now with the majority of the remaining live brokers.
>
>
>
> The problem I have now appears if I lose a complete server (server/site
> stopped / crashed or connection loss), let’s say s2.
>
> The vote on the affected backup b3 will let it go live as p1 and p3 will
> positively answer.
>
> BUT the vote on the primary p1 will fail, unable to get 2 successful
> answers as p2 is down, ending with p1 shutting itself down.
>
>
>
> That case is not what I would expect from a quorum vote as we still have 2
> live brokers out of the 3 usually active, so the majority of them.
>
> p1 would only need an answer from p3 during the vote to confirm that there
> is no split brain but as the vote is only counting the “other live brokers”
> it ends as a fail.
>
>
>
> Now, what I would like to understand is why, during a vote triggered by a
> live broker, only the “other live brokers” are counted in the result. In
> our cluster, as long as the majority of the live brokers are able to
> communicate, a vote should not result in a live broker stopping itself.
>
>
>
> Did I miss something in the configuration that would prevent that case, or
> is that the expected behavior of the primary quorum voting? If so, is the
> solution to switch to another quorum implementation such as ZooKeeper?
>
>
>
> Regards,
>
>
>
> Gaëtan
>
