I'm trying to wrap my head around your deployment, and I have a few
questions...

  1) Are your client applications connecting to your "application server
pod" or to an ActiveMQ Artemis broker pod or both?
  2) It seems like the failure case you're describing is predicated on both
kinds of pods being on the same node. Is that true? Are both kinds of pods
_always_ deployed on the same node?
  3) What is "client-side failover" in the context of the MQTT client
implementation you're using? Based on your description it sounds like
it's just reconnecting, which is semantically different from "failover"
in my experience.
  4) If the broker pod goes down independently of the application server
pod do you still want to ignore retained messages?

Ideally your application should manage its state without any special
configuration of the broker. Have you considered using last-will messages
for the MQTT client sessions from the application server pods so that if
those sessions die, new retained messages are sent that reflect the
current state? It sounds like something like this would be better than
trying to solve the problem with the broker itself.
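
To make that concrete, here's a rough sketch of the pattern using the
Eclipse Paho Java (MQTT 3.1.1) client. The broker URL, client ID, and
status topic below are just placeholders since I don't know what your
application server pods actually use:

  import org.eclipse.paho.client.mqttv3.MqttClient;
  import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
  import org.eclipse.paho.client.mqttv3.MqttException;

  public class AppServerStatusPublisher {
      public static void main(String[] args) throws MqttException {
          // Placeholder broker URL and client ID
          MqttClient client =
              new MqttClient("tcp://artemis:1883", "app-server-pod-1");

          MqttConnectOptions options = new MqttConnectOptions();
          options.setCleanSession(true);
          // Last-will: if this session dies the broker publishes a retained
          // "down" message, overwriting the previously retained "up"
          options.setWill("status/app-server-pod-1", "down".getBytes(), 1, true);

          client.connect(options);

          // Publish a retained "up" so any subscriber (even one that connects
          // later) sees the current state of this pod
          client.publish("status/app-server-pod-1", "up".getBytes(), 1, true);
      }
  }

With something like this the retained value on each status topic tracks
the liveness of the corresponding session: as soon as the broker detects
the session is gone it publishes the retained "down" message, and newly
connecting clients see the correct state without any special broker
configuration.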

Retained messages are stored in queues with the prefix "$sys.mqtt.retain.",
but those queues are hard-coded to be durable, which means their messages
will always be available on the backup unless persistence for the broker
is completely disabled. I don't believe yet-to-be-sent last-will messages
will ever be on the backup.
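
For reference, persistence is a broker-wide setting in broker.xml, along
these lines (excerpt only):

  <core xmlns="urn:activemq:core">
     <persistence-enabled>false</persistence-enabled>
     <!-- ...the rest of the core configuration... -->
  </core>

Keep in mind that turns off persistence for everything on the broker, not
just the retained message queues, so it's probably not what you want here.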


Justin

On Thu, Feb 1, 2024 at 5:11 PM Shields, Paul Michael <paul.shie...@hpe.com>
wrote:

> Hi Justin,
>
> After some testing, I have come up with some more questions.  One of the
> failure use cases that we are trying to protect against is the loss of a
> node in our Kubernetes cluster that is hosting both the Artemis broker pod
> and one of our server application pods.  We have clients that
> load-balance connections to our server application pods hosted on
> different nodes in the Kubernetes cluster. Our client applications use
> client-side failover: when an application server pod is marked down in
> the MQTT broker, the client connects to another application server pod.
> We are using a single “active” MQTT broker so every one of our
> application clients and servers has a complete view of our entire
> system.  In my testing of the above use case I see that when the standby
> instance becomes “active” and clients connect to the standby broker
> instance, they receive a retained message indicating the application
> server pod is “up”, which is inconsistent with the actual state of the
> pod.
>
> The first question is: can Artemis protect against this use case, and
> what broker configuration would you recommend to do so?
>
> We have tried to use a single broker without HA and rely on the
> Kubernetes cluster to restart the broker pod when it detects it is down.
> But the startup times are not consistent enough for our application.
> Most of the time the issue is the inconsistent time required to create
> the pod in our Kubernetes cluster.  With an HA pair of broker pods, the
> failover consistently happens in less than 1 minute, which is within our
> application's tolerance.
>
> Our application can handle building up system state as our clients
> connect to the MQTT broker, as happens when the system and broker are
> first brought up.  But it does not handle inconsistent state very well.
>
> The second question is: how would we configure the Artemis MQTT broker
> to have failover but without replicating the retained and last-will
> messages to the standby broker instance?  In other words, we would like
> the system to behave after a failover as it does on startup, so that our
> application can derive a consistent state of the system just as it does
> on startup.
>
> Thanks,
> Paul.
>
> From: Justin Bertram <jbert...@apache.org>
> Date: Monday, January 22, 2024 at 9:26 AM
> To: users@activemq.apache.org <users@activemq.apache.org>
> Subject: Re: Trouble with Replication HA Master/Slave config performing
> failback
> Looking at the code everything seems to be in order. Can you work up a
> test-case to reproduce the issue you're seeing? Slap it on GitHub, and I'll
> take a look.
>
>
> Justin
>
