Couple of questions:

 - What high availability configuration are you using and at what point
does split brain occur?
 - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous to
configure in a cloud environment since it requires a static list of IP
addresses (i.e. no dynamic discovery).
 - What metric exactly are you looking at for the cluster-connection's
credits?
 - Have you considered using the connection router functionality [1] to pin
relevant producers and consumers to the same node to avoid moving messages
around the cluster? Moving messages might be neutralizing the benefits of
clustering [2].
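On the JGroups point, to clarify what I mean by a static list: with TCP transport, discovery is typically done via TCPPING, where every member's address is hard-coded. A minimal sketch of such a stack (the addresses and port are placeholders, and the protocol list is illustrative rather than a tuned production stack):

```xml
<!-- jgroups-tcp.xml sketch: static discovery via TCPPING (placeholder addresses) -->
<config xmlns="urn:org:jgroups">
    <TCP bind_port="7800"/>
    <!-- initial_hosts must list every cluster member up front -->
    <TCPPING initial_hosts="10.0.0.1[7800],10.0.0.2[7800],10.0.0.3[7800]"
             port_range="0"/>
    <MERGE3/>
    <FD_SOCK/>
    <FD_ALL/>
    <VERIFY_SUSPECT/>
    <pbcast.NAKACK2/>
    <UNICAST3/>
    <pbcast.STABLE/>
    <pbcast.GMS/>
    <MFC/>
    <FRAG2/>
</config>
```

For AWS specifically, JGroups also provides S3-based discovery protocols (e.g. S3_PING) that avoid hard-coding member addresses, which is why I'm curious how you handled this.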
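To illustrate the connection-router idea from [1]: you can hash each MQTT client to a single node by client ID so its producer and consumer sessions land on the same broker. A sketch of what that could look like in broker.xml (the router name and cluster-connection name here are placeholders for your own):

```xml
<!-- broker.xml sketch: pin each client to one node by CLIENT_ID -->
<connection-routers>
   <connection-router name="mqtt-router">
      <key-type>CLIENT_ID</key-type>
      <policy name="CONSISTENT_HASH"/>
      <pool>
         <!-- "my-cluster" is a placeholder for your cluster-connection name -->
         <cluster-connection>my-cluster</cluster-connection>
      </pool>
   </connection-router>
</connection-routers>

<!-- the MQTT acceptor then references the router by name -->
<acceptors>
   <acceptor name="mqtt">tcp://0.0.0.0:1883?protocols=MQTT;router=mqtt-router</acceptor>
</acceptors>
```

With producers and consumers for the same client pinned to one node there's far less cross-node message movement for the cluster connection to choke on.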


Justin

[1]
https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
[2]
https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations

On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
wrote:

> Hi there,
>
> at $day_job we are running in production an Artemis 2.30 cluster with 3
> nodes using jgroups over TCP for broadcast and discovery. We are using it
> over MQTT and things are working well.
>
> Every couple of days, messages stop flowing across nodes, causing issues
> in the rest of our cluster that directly impact our customers.
>
> The smoking gun seems to be this log message:
>
>
> [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> releaseOutstanding credits, balance=0, callback=class
>
> org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
>
> Every time this message appears, messages stop being routed across Artemis
> instances and end up piling up in internal queues instead of being
> delivered.
>
> We have tried setting "producer-window-size" to -1 on the cluster
> connection, but that caused even more problems, so we had to revert it.
> Our production environment is therefore running with the default value,
> which we believe to be 1MB.
>
> We have also created a Grafana dashboard to track the value of the
> "credits" for each cluster connection over time, and they oscillate
> consistently between roughly 1MB and 600KB. The ONLY time the value dips
> below 600KB is when it goes straight to zero; it then bounces right back,
> but the messages remain stuck in a queue.
>
> There is no indication of reconnection or anything else in the logs.
>
> Unfortunately we have been unable to reproduce this with artificial load
> tests. It seems to be something very specific to how our production cluster
> is operating (in AWS).
>
> Has anyone experienced anything like this before? Do you have any
> suggestions on what we could try to prevent this from happening?
>
> Thank you very much in advance for any suggestion you could give us.
>
> --
> Stefano.
>