Hi Justin,

After some testing, I have come up with some more questions.  One of the 
failure cases we are trying to protect against is the loss of a node in our 
Kubernetes cluster that is hosting both the Artemis broker pod and one of our 
server application pods.  We have clients that connect, via load balancing, to 
our server application pods hosted on different nodes in the Kubernetes 
cluster.  Our client applications use client-side failover: when an application 
server pod is marked down in the MQTT broker, the client connects to another 
application server pod.  We use a single “active” MQTT broker so that every one 
of our application clients and servers has a complete view of the entire 
system.  In my testing of the above case I see that when the standby instance 
becomes “active” and clients connect to it, they receive a retained message 
saying the server application pod is “up”, which is inconsistent with the pod's 
actual state.
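
For reference, the status scheme looks roughly like the sketch below (Eclipse 
Paho Java client; the broker URL, client id, topic name, and payloads are 
simplified placeholders, not our exact ones): each server pod publishes a 
retained “up” message on its status topic and registers a retained “down” last 
will.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class ServerStatusPublisher {

    public static void main(String[] args) throws MqttException {
        // Placeholder broker URL and topic layout, not our real deployment values.
        String brokerUrl = "tcp://artemis-broker:1883";
        String statusTopic = "servers/app-server-1/status";

        MqttClient client = new MqttClient(brokerUrl, "app-server-1", new MemoryPersistence());

        MqttConnectOptions options = new MqttConnectOptions();
        // Last will: the broker publishes a retained "down" if this pod drops off unexpectedly.
        options.setWill(statusTopic, "down".getBytes(), 1, true);
        options.setCleanSession(true);

        client.connect(options);

        // Retained "up": any client that connects later immediately learns this server is available.
        client.publish(statusTopic, "up".getBytes(), 1, true);
    }
}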

The first question is: can Artemis protect against this failure case, and what 
broker configuration would you recommend to do so?

We have tried using a single broker without HA and relying on the Kubernetes 
cluster to restart the broker pod when it detects it is down, but the startup 
times are not consistent enough for our application.  Most of the delay comes 
from the inconsistent time required to create the pod in our Kubernetes 
cluster.  With an HA pair of broker pods, failover consistently happens in less 
than one minute, which is within our application's tolerance.

Our application can handle building up system state as our clients connect to 
the MQTT broker, as happens when the system and broker are first brought up.  
But it does not handle inconsistent state very well.

The second question is: how would we configure the Artemis MQTT broker to 
provide failover without replicating the retained and last-will messages to the 
standby broker instance?  In other words, after a failover occurs we would like 
the system to behave as it does on startup, so that our application can derive 
a consistent view of the system state the same way it does at startup.
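
To make the behavior we are after concrete, the closest client-side workaround 
we can see is sketched below (again Paho Java, with placeholder URIs and topic 
names): after each (re)connect the server pod re-publishes its real status as a 
retained message, overwriting whatever stale value the newly active broker 
holds, and a zero-length retained publish clears a retained entry entirely 
(standard MQTT behavior).  We would prefer a broker-side configuration if one 
exists, though.

import org.eclipse.paho.client.mqttv3.IMqttDeliveryToken;
import org.eclipse.paho.client.mqttv3.MqttCallbackExtended;
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttException;
import org.eclipse.paho.client.mqttv3.MqttMessage;

public class StatusRefresher {

    public static void main(String[] args) throws MqttException {
        // Placeholder URIs for the live/backup broker pair and this pod's status topic.
        String[] brokerUris = { "tcp://artemis-live:1883", "tcp://artemis-backup:1883" };
        String statusTopic = "servers/app-server-1/status";

        MqttClient client = new MqttClient(brokerUris[0], "app-server-1");

        MqttConnectOptions options = new MqttConnectOptions();
        options.setServerURIs(brokerUris);
        options.setAutomaticReconnect(true);
        options.setWill(statusTopic, "down".getBytes(), 1, true);

        client.setCallback(new MqttCallbackExtended() {
            @Override
            public void connectComplete(boolean reconnect, String serverUri) {
                try {
                    // After connecting (including failover to the backup), re-assert the real
                    // current state so any stale retained value on this broker is overwritten.
                    client.publish(statusTopic, "up".getBytes(), 1, true);

                    // A zero-length retained publish clears a retained message entirely,
                    // e.g. for a pod that is known to be gone:
                    // client.publish("servers/app-server-2/status", new byte[0], 1, true);
                } catch (MqttException e) {
                    e.printStackTrace();
                }
            }

            @Override
            public void connectionLost(Throwable cause) { }

            @Override
            public void messageArrived(String topic, MqttMessage message) { }

            @Override
            public void deliveryComplete(IMqttDeliveryToken token) { }
        });

        client.connect(options);
    }
}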

Thanks,
Paul.

From: Justin Bertram <jbert...@apache.org>
Date: Monday, January 22, 2024 at 9:26 AM
To: users@activemq.apache.org <users@activemq.apache.org>
Subject: Re: Trouble with Replication HA Master/Slave config performing failback
Looking at the code, everything seems to be in order. Can you work up a
test case to reproduce the issue you're seeing? Slap it on GitHub, and I'll
take a look.


Justin
