Hi Justin, thanks for your response!

Find my answers inline below.

On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org> wrote:

> Couple of questions:
>
>  - What high availability configuration are you using and at what point
> does split brain occur?
>

We don't have HA enabled. Artemis is used as an asynchronous, ephemeral
control plane sending messages between software modules. If it goes down for a
little while, or some messages are lost, that's acceptable for our needs.

The split brain occurs when that log event is emitted. We have not been
able to identify what is causing that to happen.


>  - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous to
> configure in a cloud environment since it requires a static list of IP
> addresses (i.e. no dynamic discovery).
>

Our cluster uses Kubernetes to manage three Artemis pods, one in each of three
availability zones. We configured broadcast and discovery with JGroups over
TCP because IP multicast is not possible across AZs in AWS.
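
For reference, the relevant part of our broker.xml looks roughly like this (a
trimmed sketch with placeholder connector and file names, not our exact
production config). The referenced jgroups-tcp.xml holds a plain TCP protocol
stack whose initial member list comes from DNS/static host entries instead of
multicast:

  <!-- broker.xml (excerpt), illustrative only -->
  <broadcast-groups>
     <broadcast-group name="bg-group1">
        <jgroups-file>jgroups-tcp.xml</jgroups-file>
        <jgroups-channel>artemis_channel</jgroups-channel>
        <broadcast-period>5000</broadcast-period>
        <connector-ref>netty-connector</connector-ref>
     </broadcast-group>
  </broadcast-groups>

  <discovery-groups>
     <discovery-group name="dg-group1">
        <jgroups-file>jgroups-tcp.xml</jgroups-file>
        <jgroups-channel>artemis_channel</jgroups-channel>
        <refresh-timeout>10000</refresh-timeout>
     </discovery-group>
  </discovery-groups>

  <cluster-connections>
     <cluster-connection name="my-cluster">
        <connector-ref>netty-connector</connector-ref>
        <message-load-balancing>ON_DEMAND</message-load-balancing>
        <discovery-group-ref discovery-group-name="dg-group1"/>
     </cluster-connection>
  </cluster-connections>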


>  - What metric exactly are you looking at for the cluster-connection's
> credits?
>

We are scraping the balance value out of the DEBUG log lines.
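
In case it helps, the DEBUG line that carries the balance value is enabled
with an extra logger in etc/log4j2.properties (a minimal sketch; "credits" is
just an arbitrary logger label):

  # etc/log4j2.properties (excerpt) -- enables the releaseOutstanding DEBUG line
  logger.credits.name=org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl
  logger.credits.level=DEBUG

and we then parse the balance value out of the resulting log lines to feed the
Grafana dashboard.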


>  - Have you considered using the connection router functionality [1] to pin
> relevant producers and consumers to the same node to avoid moving messages
> around the cluster? Moving messages might be neutralizing the benefits of
> clustering [2].
>

We are using Artemis to create an asynchronous, ephemeral control plane
between a few thousand software modules. We designed the system to be
resilient to latency and temporary failures, and we didn't expect our load
(~600 msg/sec) to be enough to justify investing in that kind of broker
affinity. What we did NOT expect is the "wedged" state Artemis gets into, from
which it cannot recover until we physically kill the instance that is
accumulating messages. Our modules are designed to wait and reconnect if
communication with the broker goes down, but they have no way of telling the
difference between a valid connection that is not receiving messages because
there aren't any to receive and a valid connection that is not receiving
messages because they are stuck in transit between brokers.
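
To make that concrete, this is roughly the reconnect behaviour our modules
rely on (sketched here with Eclipse Paho as a stand-in MQTT client; the broker
URL, client id, and topic are placeholders, not our real values):

  // Illustrative only -- Eclipse Paho used as an example MQTT client.
  import org.eclipse.paho.client.mqttv3.MqttClient;
  import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
  import org.eclipse.paho.client.mqttv3.MqttException;

  public class ModuleClientSketch {
      public static void main(String[] args) throws MqttException {
          MqttConnectOptions opts = new MqttConnectOptions();
          opts.setAutomaticReconnect(true); // wait and retry if the broker goes away
          opts.setCleanSession(false);      // keep the subscription across reconnects

          MqttClient client = new MqttClient("tcp://artemis:1883", "module-42");
          client.connect(opts);
          client.subscribe("control/module-42", 1);
          // From here on, a connected-but-silent client looks the same whether
          // there is simply nothing to deliver or messages are stuck in transit
          // between brokers.
      }
  }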

We could limp along indefinitely like this (automating the termination of any
Artemis pod that shows messages accumulating, roughly as sketched below), or we
could abandon the multi-pod Artemis configuration altogether, run a single
broker, and tolerate it going down once in a while (the rest of our system is
designed to withstand that). But before giving up we wanted to understand why
this is happening, whether there is something we can do to prevent it, or
whether it's a bug in Artemis.
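
For completeness, the kind of watchdog we have in mind is nothing more
sophisticated than this (a rough sketch only; the CLI path, credentials,
threshold, and the column layout of the "artemis queue stat" output are
assumptions that would need adjusting, not a script we actually run):

  #!/bin/sh
  # Rough sketch: terminate a pod whose store-and-forward queues are backing up.
  POD="$1"            # e.g. artemis-1
  THRESHOLD=1000      # messages allowed to sit in a store-and-forward queue

  # List queue statistics on the broker inside the pod and keep only the
  # internal store-and-forward queues used by the cluster bridges.
  BACKLOG=$(kubectl exec "$POD" -- \
              /opt/artemis/bin/artemis queue stat \
                --url tcp://localhost:61616 --user admin --password admin \
            | grep '\$\.artemis\.internal\.sf\.' \
            | awk -F'|' -v t="$THRESHOLD" \
                '{ gsub(/ /, "", $5); if ($5 + 0 > t) print $2 }')  # $5 = message count column (may vary)

  # If any bridge queue is holding more than THRESHOLD messages, recycle the
  # pod and let Kubernetes reschedule it.
  if [ -n "$BACKLOG" ]; then
      echo "store-and-forward backlog detected on $POD: $BACKLOG"
      kubectl delete pod "$POD"
  fi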


>
> Justin
>
> [1] https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> [2] https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
>
> On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
> wrote:
>
> > Hi there,
> >
> > at $day_job we are running in production an Artemis 2.30 cluster with 3
> > nodes using jgroups over TCP for broadcast and discovery. We are using it
> > over MQTT and things are working well.
> >
> > Every couple of days, messages stop flowing across nodes (causing
> > negative issues with the rest of our cluster which directly impact our
> > customers).
> >
> > The smoking gun seems to be this log message:
> >
> > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > releaseOutstanding credits, balance=0, callback=class
> > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> >
> > Every time this message appears, messages stop being routed across
> > Artemis instances and end up piling up in internal queues instead of
> > being delivered.
> >
> > We have tried configuring "producer-window-size" to be -1 in the cluster
> > connector but that has caused even more problems so we had to revert it.
> > Our production environment is therefore operating with the default value
> > which we believe to be 1Mb.
> >
> > We have also created a grafana dashboard to look at the value of the
> > "credits" for each cluster connector over time and they oscillate
> > consistently between the "1mb" and 600kb range. The ONLY time it dips
> > below 600kb is when it goes straight to zero and then it bounces right
> > back, but the messages continue to be stuck in a queue.
> >
> > There is no indication of reconnection or anything else in the logs.
> >
> > Unfortunately we have been unable to reproduce this with artificial load
> > tests. It seems to be something very specific to how our production
> > cluster is operating (in AWS).
> >
> > Has anyone experienced anything like this before? Do you have any
> > suggestions on what we could try to prevent this from happening?
> >
> > Thank you very much in advance for any suggestion you could give us.
> >
> > --
> > Stefano.
> >
>


-- 
Stefano.
