> We don't have HA enabled.

In ActiveMQ Artemis the idea of "split-brain" [1] is predicated on an HA
configuration. In short, it's the term for the situation where both a primary
and a backup server are active at the same time, serving the same messages to
clients. Given that you're not using HA, "split-brain" doesn't seem to apply
here.

What specifically do you mean when you use the term "split-brain"? Are you
talking about the situation where 2 active nodes in the cluster are not
communicating properly?

> We configured it using JGroups with TCP because it's not possible to do
> IP multicast across AZs in AWS.

Why not just use the "static" connector configuration available in the
standard broker.xml? Typically folks who configure JGroups use something
more exotic for cloud-based use cases like S3_PING [2] or KUBE_PING [3].
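
If it helps, a "static" cluster-connection sketch in broker.xml might look
something like this (the connector names, addresses, and cluster name are
just placeholders for illustration, not your actual topology):

   <!-- connectors for this node and its peers; addresses are placeholders -->
   <connectors>
      <connector name="node0">tcp://10.0.1.10:61616</connector>
      <connector name="node1">tcp://10.0.2.10:61616</connector>
      <connector name="node2">tcp://10.0.3.10:61616</connector>
   </connectors>

   <cluster-connections>
      <cluster-connection name="my-cluster">
         <!-- the connector this broker advertises about itself -->
         <connector-ref>node0</connector-ref>
         <!-- explicit list of the other cluster members -->
         <static-connectors>
            <connector-ref>node1</connector-ref>
            <connector-ref>node2</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>

Of course, static connectors assume stable, resolvable addresses for each
node (e.g. per-pod Kubernetes services or StatefulSet DNS names), which may
or may not fit your deployment.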

> ...we didn't expect our load (600 msg/sec) to be enough to justify
> investing in this kind of broker affiliation.

Fair enough. I brought up the connection router because many folks seem to
be under the impression that clustering is a silver bullet for more
performance, without understanding its underlying implications.

> What we did NOT expect is this kind of "wedged" behavior in which Artemis
> finds itself and is not able to recover until we physically kill the
> instance that is accumulating messages.

That's certainly problematic and not something I would expect either.
Occasional administrative restarts seem like a viable short-term
work-around, but the goal would be to identify the root cause of the
problem so it can be addressed via either configuration or code (i.e. a bug
fix). At this point I can't say what the root cause is.

You said that you tried using -1 as the producer-window-size on the
cluster-connection and that it caused even more problems. What were those
problems? Did you try any other values greater than the default (i.e.
1048576 bytes, or 1MiB)? If not, could you?
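
For reference, here's a minimal sketch of what I mean, assuming a
cluster-connection named "my-cluster" and a 2MiB window (both just
placeholders):

   <cluster-connection name="my-cluster">
      <connector-ref>node0</connector-ref>
      <!-- flow-control window for the internal bridge producer, in bytes;
           1048576 (1MiB) is the default, -1 disables flow control entirely -->
      <producer-window-size>2097152</producer-window-size>
      <!-- ...discovery-group-ref or static-connectors as you have them now... -->
   </cluster-connection>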

How long had this deployment been running before you saw this issue? Has
anything changed recently? Version 2.30.0 was only recently released. Did
you use another version previously? If so, did you see this problem in the
previous version?

How large are the messages that you are sending?

Instead of restarting the entire broker, have you tried stopping and
starting the cluster connection via the management API? If so, what
happened? If not, could you?
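
In case it's useful, stopping and restarting a cluster connection can be
done through its ClusterConnectionControl MBean, e.g. via the Jolokia
endpoint on the embedded web console. Something along these lines (the
broker name, cluster-connection name, credentials, and port are all
placeholders for whatever your instance actually uses):

   curl -u admin:admin \
     'http://localhost:8161/console/jolokia/exec/org.apache.activemq.artemis:broker="mybroker",component=cluster-connections,name="my-cluster"/stop'

   curl -u admin:admin \
     'http://localhost:8161/console/jolokia/exec/org.apache.activemq.artemis:broker="mybroker",component=cluster-connections,name="my-cluster"/start'

You should also be able to invoke the same stop/start operations from the
web console's MBean view.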

When you attempt to reproduce the issue do you still see the
"releaseOutstanding" log message at some point? In your reproducer
environment have you tried lowering the producer-window-size as a way to
potentially make the error more likely?

> ...we could just abandon the entire concept of multi-pod Artemis
> configuration and just have one and tolerate it going down once in a
> while...

Generally speaking I think this is a viable strategy and one I recommend to
folks often (which goes back to the fact that lots of folks deploy a
cluster without any real need). You could potentially configure HA to
mitigate the risk of the broker going down, although that has caveats of
its own.
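
For completeness, a single primary/backup pair using replication would look
roughly like this in broker.xml (element names here follow the current
documentation; older releases used master/slave):

   <!-- on the primary -->
   <ha-policy>
      <replication>
         <primary/>
      </replication>
   </ha-policy>

   <!-- on the backup -->
   <ha-policy>
      <replication>
         <backup/>
      </replication>
   </ha-policy>

The main caveat is that a replicated pair is exactly where split-brain
becomes a real concern, so you'd want a proper quorum (e.g. the pluggable
quorum voting backed by ZooKeeper) as discussed in the network isolation
docs [1].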


Justin

[1]
https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html
[2] http://www.jgroups.org/javadoc/org/jgroups/protocols/S3_PING.html
[3] http://www.jgroups.org/manual5/index.html#_kube_ping

On Fri, Aug 25, 2023 at 2:22 PM Stefano Mazzocchi <
stefano.mazzoc...@gmail.com> wrote:

> Hi Justin, thx for your response!
>
> Find my answers inline below.
>
> On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> wrote:
>
> > Couple of questions:
> >
> >  - What high availability configuration are you using and at what point
> > does split brain occur?
> >
>
> We don't have HA enabled. Artemis is used as an asynchronous ephemeral
> control plane sending messages between software modules. If it does go down
> for a little while, or some messages are lost, it's ok for our needs.
>
> The split brain occurs when that log event is emitted. We have not been
> able to identify what is causing that to happen.
>
>
> >  - Is JGroups w/TCP really viable in AWS? I assumed it would be onerous
> > to configure in a cloud environment since it requires a static list of IP
> > addresses (i.e. no dynamic discovery).
> >
>
> Our cluster uses kubernetes to manage 3 different artemis "pods" living in
> 3 different availability zones. We configured it using JGroups with TCP
> because it's not possible to do IP multicast across AZs in AWS.
>
>
> >  - What metric exactly are you looking at for the cluster-connection's
> > credits?
> >
>
> We are scraping the balance="" value out of DEBUG logs.
>
>
> >  - Have you considered using the connection router functionality [1] to
> > pin relevant producers and consumers to the same node to avoid moving
> > messages around the cluster? Moving messages might be neutralizing the
> > benefits of clustering [2].
>
> We are using Artemis to create an asynchronous and ephemeral control plane
> between a few thousand software modules, and we designed the system to be
> resilient to latency and temporary failures, so we didn't expect our
> load (600 msg/sec) to be enough to justify investing in this kind of broker
> affiliation. What we did NOT expect is this kind of "wedged" behavior in
> which Artemis finds itself and is not able to recover until we physically
> kill the instance that is accumulating messages. Our modules are designed
> to wait and reconnect if communication to the broker goes down, but they
> have no way of telling the difference between a valid connection that is
> not receiving messages because there aren't any to be received or a valid
> connection that is not receiving messages because they are stuck in transit
> between brokers.
>
> We could limp along indefinitely like this (automating the termination of
> any artemis pod that shows any accumulation of messages) or we could just
> abandon the entire concept of multi-pod Artemis configuration and just have
> one and tolerate it going down once in a while (the rest of our system is
> designed to withstand that) but before giving up we wanted to understand
> why this is happening and if there was something we can do to prevent it.
> (or if it's a bug in Artemis)
>
>
> >
> > Justin
> >
> > [1]
> > https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > [2]
> > https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> >
> > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi <stef...@apache.org>
> > wrote:
> >
> > > Hi there,
> > >
> > > at $day_job we are running in production an Artemis 2.30 cluster with 3
> > > nodes using jgroups over TCP for broadcast and discovery. We are using
> > > it over MQTT and things are working well.
> > >
> > > Every couple of days, messages stop flowing across nodes (causing
> > > negative issues with the rest of our cluster which directly impact our
> > > customers).
> > >
> > > The smoking gun seems to be this log message:
> > >
> > > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > releaseOutstanding credits, balance=0, callback=class
> > > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > >
> > > Every time this message appears, messages stop being routed across
> > > Artemis instances and end up piling up in internal queues instead of being
> > > delivered.
> > >
> > > We have tried configuring "producer-window-size" to be -1 in the
> > > cluster connector but that has caused even more problems so we had to
> > > revert it. Our production environment is therefore operating with the
> > > default value which we believe to be 1Mb.
> > >
> > > We have also created a grafana dashboard to look at the value of the
> > > "credits" for each cluster connector over time and they oscillate
> > > consistently between the "1mb" and 600kb range. The ONLY time it dips
> > > below 600kb is when it goes straight to zero and then it bounces right
> > > back, but the messages continue to be stuck in a queue.
> > >
> > > There is no indication of reconnection or anything else in the logs.
> > >
> > > Unfortunately we have been unable to reproduce this with artificial
> > > load tests. It seems to be something very specific to how our production
> > > cluster is operating (in AWS).
> > >
> > > Has anyone experienced anything like this before? Do you have any
> > > suggestions on what we could try to prevent this from happening?
> > >
> > > Thank you very much in advance for any suggestion you could give us.
> > >
> > > --
> > > Stefano.
> > >
> >
>
>
> --
> Stefano.
>
