Artemis 2.30 cluster split brain due to sudden credit consumption

Stefano Mazzocchi Thu, 24 Aug 2023 17:49:13 -0700

Hi there,

At $day_job we run an Artemis 2.30 cluster in production: 3 nodes using
JGroups over TCP for broadcast and discovery. Clients connect over MQTT,
and most of the time things work well.


Every couple of days, however, messages stop flowing across nodes. This
causes problems in the rest of our cluster that directly impact our customers.

The smoking gun seems to be this log message:

[org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
releaseOutstanding credits, balance=0, callback=class
org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge

Every time this message appears, messages stop being routed across Artemis
instances and end up piling up in internal queues instead of being
delivered.
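
For anyone who wants to check their own logs for the same pattern, here is
a minimal Python sketch that scans log text for the message quoted above.
The function names are made up for illustration. The regular expression
assumes only the message text shown in this email; it makes no assumption
about the rest of the log line (timestamps, levels, and so on).

```python
import re

# Built from the message text quoted above. \s+ tolerates line wrapping
# added by mail clients or log shippers.
CREDIT_RELEASE = re.compile(
    r"releaseOutstanding credits,\s+balance=(-?\d+),\s+callback=class\s+(\S+)"
)


def find_credit_releases(log_text):
    """Return a (balance, callback_class) tuple for every matching entry."""
    return [
        (int(m.group(1)), m.group(2))
        for m in CREDIT_RELEASE.finditer(log_text)
    ]


def zero_balance_bridge_events(log_text):
    """Keep only entries with balance=0 raised for a ClusterConnectionBridge."""
    return [
        (balance, callback)
        for balance, callback in find_credit_releases(log_text)
        if balance == 0 and callback.endswith("ClusterConnectionBridge")
    ]
```

Running zero_balance_bridge_events() over the text of a broker log returns
one tuple per occurrence, which makes it easy to count them or to line them
up against the times when routing stopped.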

We tried setting "producer-window-size" to -1 in the cluster connector,
but that caused even more problems, so we had to revert it. Our production
environment is therefore running with the default value, which we believe
to be 1 MB.

We have also created a Grafana dashboard to track the "credits" value of
each cluster connector over time. The values oscillate consistently between
roughly 600 KB and 1 MB. The ONLY time a value dips below 600 KB is when it
goes straight to zero; it then bounces right back, but the messages
continue to be stuck in a queue.
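
The pattern we look for on the dashboard can be expressed as a small check
over sampled values. This is only a sketch in Python: the function name,
the (timestamp, credits) sample format, and the 600,000-byte floor are
assumptions for illustration, not something taken from Artemis or Grafana.

```python
def sudden_zero_dips(samples, floor=600_000):
    """Find samples where credits drop to zero directly from the normal band.

    samples: iterable of (timestamp, credits) pairs in time order.
    floor:   assumed lower edge of the normal band, in bytes.
    Returns the timestamps at which credits are <= 0 while the previous
    sample was still at or above the floor.
    """
    dips = []
    previous = None
    for timestamp, credits in samples:
        if previous is not None and credits <= 0 and previous >= floor:
            dips.append(timestamp)
        previous = credits
    return dips
```

A gradual decline below the floor is deliberately not reported; only the
jump from the normal band straight to zero is, since that is the behaviour
described above.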

There is no indication of reconnection or anything else in the logs.

Unfortunately we have been unable to reproduce this with artificial load
tests. It seems to be something very specific to how our production cluster
is operating (in AWS).

Has anyone experienced anything like this before? Do you have any
suggestions on what we could try to prevent this from happening?

Thank you very much in advance for any suggestion you could give us.

-- 
Stefano.
