Hi there, at $day_job we run a 3-node Artemis 2.30 cluster in production, using JGroups over TCP for broadcast and discovery. Clients connect over MQTT, and things generally work well.
Every couple of days, however, messages stop flowing across nodes, which causes problems in the rest of our cluster that directly impact our customers.

The smoking gun seems to be this log message:

    [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl] releaseOutstanding credits, balance=0, callback=class org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge

Every time this message appears, messages stop being routed across Artemis instances and pile up in internal queues instead of being delivered.

We have tried setting "producer-window-size" to -1 in the cluster connector, but that caused even more problems, so we had to revert it. Our production environment is therefore operating with the default value, which we believe to be 1 MB.

We have also created a Grafana dashboard to track the value of the "credits" for each cluster connector over time. They oscillate consistently between roughly 600 KB and 1 MB. The ONLY time the value dips below 600 KB is when it goes straight to zero; it then bounces right back, but the messages remain stuck in the queue. There is no indication of a reconnection or anything else in the logs.

Unfortunately, we have been unable to reproduce this with artificial load tests. It seems to be something very specific to how our production cluster operates (in AWS).

Has anyone experienced anything like this before? Do you have any suggestions on what we could try to prevent this from happening?

Thank you very much in advance for any suggestions you could give us.

-- Stefano.