A different user reported this issue and provided a test case which I've used to reproduce it, and I see what's happening. I hope to have a fix soon. See ARTEMIS-4453 [1] for more details.
Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-4453

On Mon, Aug 28, 2023 at 5:16 PM Stefano Mazzocchi
<stefano.mazzoc...@gmail.com> wrote:

> On Fri, Aug 25, 2023 at 9:08 PM Justin Bertram <jbert...@apache.org>
> wrote:
>
> > > We don't have HA enabled.
> >
> > In ActiveMQ Artemis the idea of "split-brain" [1] is predicated on an
> > HA configuration. In short, it's the term for what happens when both a
> > primary and a backup server are active at the same time, serving the
> > same messages to clients. Given that you're not using HA, "split-brain"
> > doesn't seem to apply here.
> >
> > What specifically do you mean when you use the term "split-brain"? Are
> > you talking about the situation where 2 active nodes in the cluster are
> > not communicating properly?
>
> I'm sorry, I used the term improperly.
>
> Yes, I'm referring to a situation in which a cluster of 3 brokers gets
> into a state in which the brokers can no longer talk to each other and
> messages don't flow between them.
>
> > > We configured it using JGroups with TCP because it's not possible to
> > > do IP multicast across AZs in AWS.
> >
> > Why not just use the "static" connector configuration offered via the
> > normal configuration? Typically folks who configure JGroups use
> > something more exotic for cloud-based use-cases like S3_PING [2] or
> > KUBE_PING [3].
>
> Yeah, we might want to resort to that, although we originally planned on
> using KUBE_PING but ended up stopping when DNS_PING worked for us.
>
> > > ...we didn't expect our load (600 msg/sec) to be enough to justify
> > > investing in this kind of broker affiliation.
> >
> > Fair enough. I brought up the connection router because many folks seem
> > to be under the impression that clustering is just a silver bullet for
> > more performance without understanding the underlying implications of
> > clustering.
>
> Yeah, we understand that.
>
> > > What we did NOT expect is this kind of "wedged" behavior in which
> > > Artemis finds itself and is not able to recover until we physically
> > > kill the instance that is accumulating messages.
> >
> > That's certainly problematic and not something I would expect either.
> > Occasional administrative restarts seem like a viable short-term
> > work-around, but the goal would be to identify the root cause of the
> > problem so it can either be addressed via configuration or code (i.e. a
> > bug fix). At this point I can't say what the root cause is.
>
> Yes, it's very puzzling.
>
> We are 99% sure the problem happens when this method gets invoked:
>
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L161
>
> There are only two other methods calling this one:
>
> ClientProducerCreditManagerImpl.getCredits()
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L55
>
> and
>
> ClientProducerCreditManagerImpl.returnCredits()
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L105
>
> It seems that the internal address between brokers is treated just the
> same as any other address in terms of flow control, and when the entry is
> removed it ends up being "blocked", but there isn't anything else that
> ever unblocks it. It feels like a bug, honestly. Or it could be that
> whatever causes the unblocking is not invoked because of some
> misconfiguration on our part.
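For readers following along, the producer-window-size discussed here and below is the one set on the cluster-connection in broker.xml. A minimal sketch of where that setting lives; the connector, cluster, discovery-group, and host names are placeholders rather than values from this deployment, and most other settings are omitted:

    <!-- broker.xml (sketch; names and values are illustrative only) -->
    <connectors>
       <connector name="netty-connector">tcp://broker-0.internal:61616</connector>
    </connectors>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <!-- flow-control window for the internal bridge producer; per the
               thread the default is 1048576 bytes (1MiB), and -1 disables
               producer flow control on this connection -->
          <producer-window-size>1048576</producer-window-size>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>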
> > You said that you tried using -1 as the producer-window-size on the
> > cluster-connection and that it caused even more problems. What were
> > those problems?
>
> Our entire cluster went bad because messages weren't going through, so we
> had to quickly revert the configuration. A lot of messages failed to be
> delivered, but we don't know if that was because of a load slam or
> something else.
>
> Our biggest problem is that the only way to reproduce this problem is
> under load in our production environments, which impacts our customers,
> so it's a very slow and risky process to experiment with this.
>
> > Did you try any other values greater than the default (i.e.
> > 1048576 - 1MiB)? If not, could you?
>
> Yes, we could, but see above.
>
> > How long has this deployment been running before you saw this issue?
>
> Well, we just launched our service a few weeks ago.
>
> > Has anything changed recently? Version 2.30.0 was only recently
> > released. Did you use another version previously? If so, did you see
> > this problem in the previous version?
>
> We launched with 2.28.0 and had the same problem. We upgraded to 2.30.0
> hoping it would go away, but it didn't.
>
> > How large are the messages that you are sending?
>
> Pretty small, a few KB tops.
>
> > Instead of restarting the entire broker have you tried stopping and
> > starting the cluster connection via the management API? If so, what
> > happened? If not, could you?
>
> We did not. How would you do this?
>
> > When you attempt to reproduce the issue do you still see the
> > "releaseOutstanding" log message at some point? In your reproducer
> > environment have you tried lowering the producer-window-size as a way
> > to potentially make the error more likely?
>
> Ah, that's a good suggestion. We did not, but we could try to see if that
> helps us discover it in dev.
>
> > > ...we could just abandon the entire concept of multi-pod Artemis
> > > configuration and just have one and tolerate it going down once in a
> > > while...
> >
> > Generally speaking I think this is a viable strategy and one I
> > recommend to folks often (which goes back to the fact that lots of
> > folks deploy a cluster without any real need). You could potentially
> > configure HA to mitigate the risk of the broker going down, although
> > that has caveats of its own.
>
> We just tested today a single Artemis instance and managed to get enough
> load to satisfy our needs, so we will probably go with that for now.
>
> Still, I can't shake the feeling that an intra-broker queue getting
> wedged like that is not a good thing, and I would like to understand why
> it's happening because we might need to cluster in the future.
>
> Thx for all your help and suggestions.
>
> > Justin
> >
> > [1]
> > https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html
> > [2] http://www.jgroups.org/javadoc/org/jgroups/protocols/S3_PING.html
> > [3] http://www.jgroups.org/manual5/index.html#_kube_ping
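The "static" connector configuration Justin suggests earlier in this reply simply lists the other brokers' connectors directly on the cluster-connection instead of relying on a discovery group. A rough broker.xml sketch; the host names and connector names below are hypothetical:

    <!-- broker.xml on one broker (sketch; hosts and names are hypothetical) -->
    <connectors>
       <!-- connector advertised for this broker -->
       <connector name="netty-connector">tcp://broker-0.internal:61616</connector>
       <!-- connectors pointing at the other cluster members -->
       <connector name="broker-1-connector">tcp://broker-1.internal:61616</connector>
       <connector name="broker-2-connector">tcp://broker-2.internal:61616</connector>
    </connectors>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <static-connectors>
             <connector-ref>broker-1-connector</connector-ref>
             <connector-ref>broker-2-connector</connector-ref>
          </static-connectors>
       </cluster-connection>
    </cluster-connections>

Since connectors accept host names, stable DNS names (for example from a Kubernetes headless service) can stand in for the static list of raw IP addresses mentioned elsewhere in the thread.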
> > On Fri, Aug 25, 2023 at 2:22 PM Stefano Mazzocchi
> > <stefano.mazzoc...@gmail.com> wrote:
> >
> > > Hi Justin, thx for your response!
> > >
> > > Find my answers inline below.
> > >
> > > On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> > > wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > - What high availability configuration are you using and at what
> > > > point does split brain occur?
> > >
> > > We don't have HA enabled. Artemis is used as an asynchronous
> > > ephemeral control plane sending messages between software modules.
> > > If it goes down for a little while, or some messages are lost, it's
> > > ok for our needs.
> > >
> > > The split brain occurs when that log event is emitted. We have not
> > > been able to identify what is causing that to happen.
> > >
> > > > - Is JGroups w/TCP really viable in AWS? I assumed it would be
> > > > onerous to configure in a cloud environment since it requires a
> > > > static list of IP addresses (i.e. no dynamic discovery).
> > >
> > > Our cluster uses Kubernetes to manage 3 different Artemis "pods"
> > > living in 3 different availability zones. We configured it using
> > > JGroups with TCP because it's not possible to do IP multicast across
> > > AZs in AWS.
> > >
> > > > - What metric exactly are you looking at for the
> > > > cluster-connection's credits?
> > >
> > > We are scraping the balance="" value out of DEBUG logs.
> > >
> > > > - Have you considered using the connection router functionality
> > > > [1] to pin relevant producers and consumers to the same node to
> > > > avoid moving messages around the cluster? Moving messages might be
> > > > neutralizing the benefits of clustering [2].
> > >
> > > We are using Artemis to create an asynchronous and ephemeral control
> > > plane between a few thousand software modules. We designed the
> > > system to be resilient to latency and temporary failures, and we
> > > didn't expect our load (600 msg/sec) to be enough to justify
> > > investing in this kind of broker affiliation. What we did NOT expect
> > > is this kind of "wedged" behavior in which Artemis finds itself and
> > > is not able to recover until we physically kill the instance that is
> > > accumulating messages. Our modules are designed to wait and
> > > reconnect if communication to the broker goes down, but they have no
> > > way of telling the difference between a valid connection that is not
> > > receiving messages because there aren't any to be received and a
> > > valid connection that is not receiving messages because they are
> > > stuck in transit between brokers.
> > >
> > > We could limp along indefinitely like this (automating the
> > > termination of any Artemis pod that shows any accumulation of
> > > messages), or we could just abandon the entire concept of a
> > > multi-pod Artemis configuration and just have one and tolerate it
> > > going down once in a while (the rest of our system is designed to
> > > withstand that), but before giving up we wanted to understand why
> > > this is happening and if there is something we can do to prevent it
> > > (or if it's a bug in Artemis).
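For reference, the JGroups-over-TCP setup described above is wired into broker.xml through broadcast and discovery groups that point at an external JGroups stack file. A rough sketch; the file, channel, and connector names are placeholders rather than the actual values from this deployment:

    <!-- broker.xml (sketch; file, channel and connector names are placeholders) -->
    <broadcast-groups>
       <broadcast-group name="bg-group1">
          <jgroups-file>jgroups-stack.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <broadcast-period>5000</broadcast-period>
          <connector-ref>netty-connector</connector-ref>
       </broadcast-group>
    </broadcast-groups>

    <discovery-groups>
       <discovery-group name="dg-group1">
          <jgroups-file>jgroups-stack.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <refresh-timeout>10000</refresh-timeout>
       </discovery-group>
    </discovery-groups>

The referenced stack file would then define the JGroups TCP transport plus a discovery protocol (DNS_PING in the setup described here); its contents are not shown in this thread.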
> > > > Justin
> > > >
> > > > [1]
> > > > https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > > > [2]
> > > > https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> > > >
> > > > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi
> > > > <stef...@apache.org> wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > at $day_job we are running in production an Artemis 2.30 cluster
> > > > > with 3 nodes using JGroups over TCP for broadcast and discovery.
> > > > > We are using it over MQTT and things are working well.
> > > > >
> > > > > Every couple of days, messages stop flowing across nodes
> > > > > (causing negative issues with the rest of our cluster which
> > > > > directly impact our customers).
> > > > >
> > > > > The smoking gun seems to be this log message:
> > > > >
> > > > > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > > > releaseOutstanding credits, balance=0, callback=class
> > > > > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > > > >
> > > > > Every time this message appears, messages stop being routed
> > > > > across Artemis instances and end up piling up in internal queues
> > > > > instead of being delivered.
> > > > >
> > > > > We have tried configuring "producer-window-size" to be -1 in the
> > > > > cluster connector but that has caused even more problems, so we
> > > > > had to revert it. Our production environment is therefore
> > > > > operating with the default value, which we believe to be 1MB.
> > > > >
> > > > > We have also created a Grafana dashboard to look at the value of
> > > > > the "credits" for each cluster connector over time, and they
> > > > > oscillate consistently between the 1MB and 600KB range. The ONLY
> > > > > time it dips below 600KB is when it goes straight to zero and
> > > > > then it bounces right back, but the messages continue to be
> > > > > stuck in a queue.
> > > > >
> > > > > There is no indication of reconnection or anything else in the
> > > > > logs.
> > > > >
> > > > > Unfortunately we have been unable to reproduce this with
> > > > > artificial load tests. It seems to be something very specific to
> > > > > how our production cluster is operating (in AWS).
> > > > >
> > > > > Has anyone experienced anything like this before? Do you have
> > > > > any suggestions on what we could try to prevent this from
> > > > > happening?
> > > > >
> > > > > Thank you very much in advance for any suggestion you could give
> > > > > us.
> > > > >
> > > > > --
> > > > > Stefano.
> > >
> > > --
> > > Stefano.
>
> --
> Stefano.