Hi Justin,

After some testing, I have come up with some more questions.

One of the failure use cases we are trying to protect against is the loss of a node in our Kubernetes cluster that is hosting both the Artemis broker pod and one of our server application pods. We have clients that load-balance their connections across our server application pods, which are hosted on different nodes in the Kubernetes cluster. Our client applications use client-side failover: when an application server pod is marked down in the MQTT broker, the client connects to another application server pod. We are using a single “active” MQTT broker so that every one of our application clients and servers has a complete view of the entire system.

In my testing of the above use case, I see that when the standby instance becomes “active” and clients connect to the standby broker instance, they receive a retained message with the state “up”, which is inconsistent with the actual state of the server application pod.
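For context, the up/down status mechanism works roughly like the following sketch (using the Eclipse Paho Java client; the broker URL, topic layout, and client id are illustrative placeholders, not our actual values):

    import org.eclipse.paho.client.mqttv3.MqttClient;
    import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
    import org.eclipse.paho.client.mqttv3.MqttException;
    import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

    public class ServerStatusPublisher {
        public static void main(String[] args) throws MqttException {
            String brokerUrl = "tcp://artemis-broker:1883";      // illustrative broker address
            String statusTopic = "servers/app-server-1/status";  // illustrative topic layout

            MqttClient client = new MqttClient(brokerUrl, "app-server-1", new MemoryPersistence());

            MqttConnectOptions options = new MqttConnectOptions();
            // Last will: the broker publishes a retained "down" if this pod drops off ungracefully.
            options.setWill(statusTopic, "down".getBytes(), 1, true);
            client.connect(options);

            // Retained "up": any client that subscribes later immediately sees this state.
            client.publish(statusTopic, "up".getBytes(), 1, true);
        }
    }

After a failover, clients that connect to the standby still receive the replicated retained “up” for a pod that went down with the failed node, which is the inconsistency described above.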
The first question is: can Artemis protect against this use case, and what broker configuration would you recommend to do so? We have tried using a single broker without HA, relying on the Kubernetes cluster to restart the broker pod when it detects it is down, but the startup times are not consistent enough for our application. Most of the delay comes from the inconsistent time required to create the pod in our Kubernetes cluster. With an HA pair of broker pods, the failover consistently happens in less than one minute, which is within our application's tolerance. Our application can handle building up system state as our clients connect to the MQTT broker, just as it does when the system and broker are first brought up, but it does not handle inconsistent state very well.

The second question is: how would we configure the Artemis MQTT broker to have failover, but without replicating the retained and last-will messages to the standby broker instance? In other words, we would like the system to behave after a failover the way it does on startup, so that our application can derive a consistent state of the system as it does on startup. For reference, a simplified sketch of the HA replication settings we have been testing with is at the bottom of this message, below the quoted reply.

Thanks,
Paul

From: Justin Bertram <jbert...@apache.org>
Date: Monday, January 22, 2024 at 9:26 AM
To: users@activemq.apache.org <users@activemq.apache.org>
Subject: Re: Trouble with Replication HA Master/Slave config performing failback

Looking at the code everything seems to be in order. Can you work up a test-case to reproduce the issue you're seeing? Slap it on GitHub, and I'll take a look.

Justin
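HA replication settings referenced above (a simplified sketch; connector and cluster-connection details are omitted, and the element names follow the master/slave style from the subject line, so the exact layout may differ from our files):

Live (“master”) broker.xml:

    <ha-policy>
       <replication>
          <master>
             <check-for-live-server>true</check-for-live-server>
          </master>
       </replication>
    </ha-policy>

Backup (“slave”) broker.xml:

    <ha-policy>
       <replication>
          <slave>
             <allow-failback>true</allow-failback>
          </slave>
       </replication>
    </ha-policy>

With this pair, failover completes well under a minute, but the retained and last-will state is replicated to the backup along with everything else, which is what produces the stale “up” status after failover.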