A couple of questions:

- What high-availability configuration are you using, and at what point does split brain occur?
- Is JGroups with TCP really viable in AWS? I assumed it would be onerous to configure in a cloud environment since it requires a static list of IP addresses (i.e. no dynamic discovery) - see the TCPPING sketch just below this list.
- What metric exactly are you looking at for the cluster-connection's credits?
- Have you considered using the connection router functionality [1] to pin the relevant producers and consumers to the same node so messages don't have to move around the cluster? Moving messages might be neutralizing the benefits of clustering [2]. There's a rough example of what I mean after this list as well.
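For context on the JGroups question: with a plain TCP transport the discovery protocol is typically TCPPING, which needs every member's address up front. Roughly like the sketch below - the addresses, ports and protocol parameters are placeholders, not taken from your setup:

    <config xmlns="urn:org:jgroups"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:org:jgroups http://www.jgroups.org/schema/jgroups.xsd">
        <TCP bind_port="7800"/>
        <!-- static member list: every broker's private IP must be known in advance -->
        <TCPPING initial_hosts="10.0.1.10[7800],10.0.1.11[7800],10.0.1.12[7800]"
                 port_range="0"/>
        <MERGE3/>
        <FD_SOCK/>
        <FD_ALL/>
        <VERIFY_SUSPECT/>
        <pbcast.NAKACK2 use_mcast_xmit="false"/>
        <UNICAST3/>
        <pbcast.STABLE/>
        <pbcast.GMS/>
        <FRAG2/>
    </config>

And on the connection-router idea, I had something like the following in mind in broker.xml: hash the client ID so a given client (and hence its producers and consumers) always lands on the same node. The router name, policy and acceptor parameters here are only illustrative - check [1] for the exact element names supported by your version:

    <connection-routers>
        <connection-router name="pin-by-client-id">
            <!-- hash the MQTT client id so each client is always routed to the same broker -->
            <key-type>CLIENT_ID</key-type>
            <policy name="CONSISTENT_HASH"/>
            <pool>
                <cluster-connection>my-cluster</cluster-connection>
            </pool>
        </connection-router>
    </connection-routers>

    <acceptors>
        <acceptor name="artemis">tcp://0.0.0.0:61616?protocols=CORE,MQTT;router=pin-by-client-id</acceptor>
    </acceptors>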
Justin

[1] https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
[2] https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations

On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org> wrote:

> Hi there,
>
> at $day_job we are running in production an Artemis 2.30 cluster with 3
> nodes using JGroups over TCP for broadcast and discovery. We are using it
> over MQTT and things are working well.
>
> Every couple of days, messages stop flowing across nodes (causing problems
> in the rest of our cluster that directly impact our customers).
>
> The smoking gun seems to be this log message:
>
> [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> releaseOutstanding credits, balance=0, callback=class
> org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
>
> Every time this message appears, messages stop being routed across Artemis
> instances and end up piling up in internal queues instead of being
> delivered.
>
> We have tried setting "producer-window-size" to -1 in the cluster
> connector, but that caused even more problems so we had to revert it. Our
> production environment is therefore running with the default value, which
> we believe to be 1MB.
>
> We have also created a Grafana dashboard to track the value of the
> "credits" for each cluster connector over time, and it oscillates
> consistently between 1MB and 600KB. The ONLY time it dips below 600KB is
> when it goes straight to zero; it then bounces right back, but the
> messages remain stuck in a queue.
>
> There is no indication of reconnection or anything else in the logs.
>
> Unfortunately, we have been unable to reproduce this with artificial load
> tests. It seems to be something very specific to how our production
> cluster is operating (in AWS).
>
> Has anyone experienced anything like this before? Do you have any
> suggestions on what we could try to prevent this from happening?
>
> Thank you very much in advance for any suggestion you could give us.
>
> --
> Stefano.
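Regarding the producer-window-size you tried: for reference, that is set on the cluster-connection itself, along these lines (the connector and discovery-group names are placeholders for whatever your broker.xml actually uses, and 1048576 simply matches the ~1MB default you mentioned):

    <cluster-connections>
        <cluster-connection name="my-cluster">
            <connector-ref>netty-connector</connector-ref>
            <message-load-balancing>ON_DEMAND</message-load-balancing>
            <!-- flow-control window for the cluster bridge's internal producer;
                 1048576 (~1MB) is the default mentioned above, -1 disables flow control -->
            <producer-window-size>1048576</producer-window-size>
            <discovery-group-ref discovery-group-name="my-discovery-group"/>
        </cluster-connection>
    </cluster-connections>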