A different user reported this issue and provided a test case which I've used to reproduce it, and I see what's happening. I hope to have a fix soon. See ARTEMIS-4453 [1] for more details.
Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-4453

On Mon, Aug 28, 2023 at 5:16 PM Stefano Mazzocchi
<stefano.mazzoc...@gmail.com> wrote:

> On Fri, Aug 25, 2023 at 9:08 PM Justin Bertram <jbert...@apache.org>
> wrote:
>
> > > We don't have HA enabled.
> >
> > In ActiveMQ Artemis the idea of "split-brain" [1] is predicated on an
> > HA configuration. In short, it's the term for what happens when both a
> > primary and a backup server are active at the same time, serving the
> > same messages to clients. Given that you're not using HA, "split-brain"
> > doesn't seem to apply here.
> >
> > What specifically do you mean when you use the term "split-brain"? Are
> > you talking about the situation where 2 active nodes in the cluster are
> > not communicating properly?
>
> I'm sorry, I used the term improperly.
>
> Yes, I'm referring to a situation in which a cluster of 3 brokers gets
> into a state in which the brokers can no longer talk to each other and
> messages don't flow between them.
>
> > > We configured it using JGroups with TCP because it's not possible to
> > > do IP multicast across AZs in AWS.
> >
> > Why not just use the "static" connector configuration offered via the
> > normal configuration? Typically folks who configure JGroups use
> > something more exotic for cloud-based use-cases like S3_PING [2] or
> > KUBE_PING [3].
>
> Yeah, we might want to resort to that, although we originally planned on
> using KUBE_PING but ended up stopping when DNS_PING worked for us.
>
> > > ...we didn't expect our load (600 msg/sec) to be enough to justify
> > > investing in this kind of broker affiliation.
> >
> > Fair enough. I brought up the connection router because many folks seem
> > to be under the impression that clustering is just a silver bullet for
> > more performance without understanding the underlying implications of
> > clustering.
>
> Yeah, we understand that.
>
> > > What we did NOT expect is this kind of "wedged" behavior in which
> > > Artemis finds itself and is not able to recover until we physically
> > > kill the instance that is accumulating messages.
> >
> > That's certainly problematic and not something I would expect either.
> > Occasional administrative restarts seem like a viable short-term
> > work-around, but the goal would be to identify the root cause of the
> > problem so it can either be addressed via configuration or code (i.e. a
> > bug fix). At this point I can't say what the root cause is.
>
> Yes, it's very puzzling.
>
> We are 99% sure the problem happens when this method gets invoked:
>
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L161
>
> There are only two other methods calling this one:
>
> ClientProducerCreditManagerImpl.getCredits()
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L55
>
> and
>
> ClientProducerCreditManagerImpl.returnCredits()
> https://github.com/apache/activemq-artemis/blob/main/artemis-core-client/src/main/java/org/apache/activemq/artemis/core/client/impl/ClientProducerCreditManagerImpl.java#L105
>
> It seems that the internal address between brokers is treated just the
> same as any other address in terms of flow control, and when the entry is
> removed it ends up being "blocked", but there isn't anything else that
> ever unblocks it. It feels like a bug, honestly. Or it could be that
> whatever causes the unblocking is not invoked because of some
> misconfiguration on our part.
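For readers following along, the producer-window-size discussed here and below is the one set on the cluster-connection in broker.xml. A minimal sketch of where that setting lives; the connector, cluster, discovery-group, and host names are placeholders rather than values from this deployment, and most other settings are omitted:

    <!-- broker.xml (sketch; names and values are illustrative only) -->
    <connectors>
       <connector name="netty-connector">tcp://broker-0.internal:61616</connector>
    </connectors>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <!-- flow-control window for the internal bridge producer; per the
               thread the default is 1048576 bytes (1MiB), and -1 disables
               producer flow control on this connection -->
          <producer-window-size>1048576</producer-window-size>
          <discovery-group-ref discovery-group-name="dg-group1"/>
       </cluster-connection>
    </cluster-connections>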
> > You said that you tried using -1 as the producer-window-size on the
> > cluster-connection and that it caused even more problems. What were
> > those problems?
>
> Our entire cluster went bad because messages weren't going through, so we
> had to quickly revert the configuration. A lot of messages failed to be
> delivered, but we don't know if that was because of a load slam or
> something else.
>
> Our biggest problem is that the only way to reproduce this problem is
> under load in our production environments, which impacts our customers,
> so it's a very slow and risky process to experiment with this.
>
> > Did you try any other values greater than the default (i.e.
> > 1048576 - 1MiB)? If not, could you?
>
> Yes, we could, but see above.
>
> > How long has this deployment been running before you saw this issue?
>
> Well, we just launched our service a few weeks ago.
>
> > Has anything changed recently? Version 2.30.0 was only recently
> > released. Did you use another version previously? If so, did you see
> > this problem in the previous version?
>
> We launched with 2.28.0 and had the same problem. We upgraded to 2.30.0
> hoping it would go away, but it didn't.
>
> > How large are the messages that you are sending?
>
> Pretty small, a few KB tops.
>
> > Instead of restarting the entire broker have you tried stopping and
> > starting the cluster connection via the management API? If so, what
> > happened? If not, could you?
>
> We did not. How would you do this?
>
> > When you attempt to reproduce the issue do you still see the
> > "releaseOutstanding" log message at some point? In your reproducer
> > environment have you tried lowering the producer-window-size as a way
> > to potentially make the error more likely?
>
> Ah, that's a good suggestion. We did not, but we could try to see if that
> helps us discover it in dev.
>
> > > ...we could just abandon the entire concept of multi-pod Artemis
> > > configuration and just have one and tolerate it going down once in a
> > > while...
> >
> > Generally speaking I think this is a viable strategy and one I
> > recommend to folks often (which goes back to the fact that lots of
> > folks deploy a cluster without any real need). You could potentially
> > configure HA to mitigate the risk of the broker going down, although
> > that has caveats of its own.
>
> We just tested today a single Artemis instance and managed to get enough
> load to satisfy our needs, so we will probably go with that for now.
>
> Still, I can't shake the feeling that an intra-broker queue getting
> wedged like that is not a good thing, and I would like to understand why
> it's happening because we might need to cluster in the future.
>
> Thx for all your help and suggestions.
>
> > Justin
> >
> > [1]
> > https://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html
> > [2] http://www.jgroups.org/javadoc/org/jgroups/protocols/S3_PING.html
> > [3] http://www.jgroups.org/manual5/index.html#_kube_ping
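The "static" connector configuration Justin suggests earlier in this reply simply lists the other brokers' connectors directly on the cluster-connection instead of relying on a discovery group. A rough broker.xml sketch; the host names and connector names below are hypothetical:

    <!-- broker.xml on one broker (sketch; hosts and names are hypothetical) -->
    <connectors>
       <!-- connector advertised for this broker -->
       <connector name="netty-connector">tcp://broker-0.internal:61616</connector>
       <!-- connectors pointing at the other cluster members -->
       <connector name="broker-1-connector">tcp://broker-1.internal:61616</connector>
       <connector name="broker-2-connector">tcp://broker-2.internal:61616</connector>
    </connectors>

    <cluster-connections>
       <cluster-connection name="my-cluster">
          <connector-ref>netty-connector</connector-ref>
          <message-load-balancing>ON_DEMAND</message-load-balancing>
          <max-hops>1</max-hops>
          <static-connectors>
             <connector-ref>broker-1-connector</connector-ref>
             <connector-ref>broker-2-connector</connector-ref>
          </static-connectors>
       </cluster-connection>
    </cluster-connections>

Since connectors accept host names, stable DNS names (for example from a Kubernetes headless service) can stand in for the static list of raw IP addresses mentioned elsewhere in the thread.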
> > On Fri, Aug 25, 2023 at 2:22 PM Stefano Mazzocchi
> > <stefano.mazzoc...@gmail.com> wrote:
> >
> > > Hi Justin, thx for your response!
> > >
> > > Find my answers inline below.
> > >
> > > On Thu, Aug 24, 2023 at 8:43 PM Justin Bertram <jbert...@apache.org>
> > > wrote:
> > >
> > > > Couple of questions:
> > > >
> > > > - What high availability configuration are you using and at what
> > > > point does split brain occur?
> > >
> > > We don't have HA enabled. Artemis is used as an asynchronous
> > > ephemeral control plane sending messages between software modules.
> > > If it goes down for a little while, or some messages are lost, it's
> > > ok for our needs.
> > >
> > > The split brain occurs when that log event is emitted. We have not
> > > been able to identify what is causing that to happen.
> > >
> > > > - Is JGroups w/TCP really viable in AWS? I assumed it would be
> > > > onerous to configure in a cloud environment since it requires a
> > > > static list of IP addresses (i.e. no dynamic discovery).
> > >
> > > Our cluster uses Kubernetes to manage 3 different Artemis "pods"
> > > living in 3 different availability zones. We configured it using
> > > JGroups with TCP because it's not possible to do IP multicast across
> > > AZs in AWS.
> > >
> > > > - What metric exactly are you looking at for the
> > > > cluster-connection's credits?
> > >
> > > We are scraping the balance="" value out of DEBUG logs.
> > >
> > > > - Have you considered using the connection router functionality
> > > > [1] to pin relevant producers and consumers to the same node to
> > > > avoid moving messages around the cluster? Moving messages might be
> > > > neutralizing the benefits of clustering [2].
> > >
> > > We are using Artemis to create an asynchronous and ephemeral control
> > > plane between a few thousand software modules. We designed the
> > > system to be resilient to latency and temporary failures, and we
> > > didn't expect our load (600 msg/sec) to be enough to justify
> > > investing in this kind of broker affiliation. What we did NOT expect
> > > is this kind of "wedged" behavior in which Artemis finds itself and
> > > is not able to recover until we physically kill the instance that is
> > > accumulating messages. Our modules are designed to wait and
> > > reconnect if communication to the broker goes down, but they have no
> > > way of telling the difference between a valid connection that is not
> > > receiving messages because there aren't any to be received and a
> > > valid connection that is not receiving messages because they are
> > > stuck in transit between brokers.
> > >
> > > We could limp along indefinitely like this (automating the
> > > termination of any Artemis pod that shows any accumulation of
> > > messages), or we could just abandon the entire concept of a
> > > multi-pod Artemis configuration and just have one and tolerate it
> > > going down once in a while (the rest of our system is designed to
> > > withstand that), but before giving up we wanted to understand why
> > > this is happening and if there is something we can do to prevent it
> > > (or if it's a bug in Artemis).
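For reference, the JGroups-over-TCP setup described above is wired into broker.xml through broadcast and discovery groups that point at an external JGroups stack file. A rough sketch; the file, channel, and connector names are placeholders rather than the actual values from this deployment:

    <!-- broker.xml (sketch; file, channel and connector names are placeholders) -->
    <broadcast-groups>
       <broadcast-group name="bg-group1">
          <jgroups-file>jgroups-stack.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <broadcast-period>5000</broadcast-period>
          <connector-ref>netty-connector</connector-ref>
       </broadcast-group>
    </broadcast-groups>

    <discovery-groups>
       <discovery-group name="dg-group1">
          <jgroups-file>jgroups-stack.xml</jgroups-file>
          <jgroups-channel>activemq_broadcast_channel</jgroups-channel>
          <refresh-timeout>10000</refresh-timeout>
       </discovery-group>
    </discovery-groups>

The referenced stack file would then define the JGroups TCP transport plus a discovery protocol (DNS_PING in the setup described here); its contents are not shown in this thread.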
> > > > Justin
> > > >
> > > > [1]
> > > > https://activemq.apache.org/components/artemis/documentation/latest/connection-routers.html
> > > > [2]
> > > > https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#performance-considerations
> > > >
> > > > On Thu, Aug 24, 2023 at 7:49 PM Stefano Mazzocchi
> > > > <stef...@apache.org> wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > at $day_job we are running in production an Artemis 2.30 cluster
> > > > > with 3 nodes using JGroups over TCP for broadcast and discovery.
> > > > > We are using it over MQTT and things are working well.
> > > > >
> > > > > Every couple of days, messages stop flowing across nodes
> > > > > (causing negative issues with the rest of our cluster which
> > > > > directly impact our customers).
> > > > >
> > > > > The smoking gun seems to be this log message:
> > > > >
> > > > > [org.apache.activemq.artemis.core.client.impl.AsynchronousProducerCreditsImpl]
> > > > > releaseOutstanding credits, balance=0, callback=class
> > > > > org.apache.activemq.artemis.core.server.cluster.impl.ClusterConnectionBridge
> > > > >
> > > > > Every time this message appears, messages stop being routed
> > > > > across Artemis instances and end up piling up in internal queues
> > > > > instead of being delivered.
> > > > >
> > > > > We have tried configuring "producer-window-size" to be -1 in the
> > > > > cluster connector but that has caused even more problems, so we
> > > > > had to revert it. Our production environment is therefore
> > > > > operating with the default value, which we believe to be 1MB.
> > > > >
> > > > > We have also created a Grafana dashboard to look at the value of
> > > > > the "credits" for each cluster connector over time, and they
> > > > > oscillate consistently between the 1MB and 600KB range. The ONLY
> > > > > time it dips below 600KB is when it goes straight to zero and
> > > > > then it bounces right back, but the messages continue to be
> > > > > stuck in a queue.
> > > > >
> > > > > There is no indication of reconnection or anything else in the
> > > > > logs.
> > > > >
> > > > > Unfortunately we have been unable to reproduce this with
> > > > > artificial load tests. It seems to be something very specific to
> > > > > how our production cluster is operating (in AWS).
> > > > >
> > > > > Has anyone experienced anything like this before? Do you have
> > > > > any suggestions on what we could try to prevent this from
> > > > > happening?
> > > > >
> > > > > Thank you very much in advance for any suggestion you could give
> > > > > us.
> > > > >
> > > > > --
> > > > > Stefano.
> > >
> > > --
> > > Stefano.
>
> --
> Stefano.