-- Sorry, my mail client sent this as a direct reply instead of to the
group; resending to the group.
Justin,
Thank you for your reply.
Your questions:
> Are the other working consumers on the same queue as the stalled
> consumer?
No, all other consumers appear to be receiving messages from this same
queue with no problem. This queue is a multicast queue and all 13 nodes
have a producer and a consumer. You can think of it as each node
broadcasting status to all other nodes. There are other queues in use by
the other nodes but this JVM that embeds the broker doesn't really use
them. So the broker JVM is able to send messages to that queue, just not
receive them. And everyone else can both send and receive.
> Can you get metrics from the queue with the stalled consumer? If
> so, what's the value of the MessageCount, ConsumerCount, DeliveringCount,
> & Paused attributes?
This is a good idea. I'll see if I can work a way to get these.
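One possible way to get these, since the broker is embedded in our own JVM, is to read them over JMX from inside that process. Below is a minimal sketch using only the JDK's javax.management classes. It assumes JMX management is enabled on the embedded broker, that the broker registers its MBeans on the platform MBean server, and that it uses the default org.apache.activemq.artemis JMX domain with a subcomponent=queues key on queue MBeans. The class and method names (QueueMetricsDump, readAttributes) are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: dumps selected attributes of every MBean matching a pattern.
final class QueueMetricsDump {

    // The four attributes asked about in the thread.
    public static final String[] ATTRIBUTES =
            {"MessageCount", "ConsumerCount", "DeliveringCount", "Paused"};

    // Assumption: default Artemis JMX domain, queue MBeans carry subcomponent=queues.
    public static final String QUEUE_PATTERN =
            "org.apache.activemq.artemis:subcomponent=queues,*";

    public static Map<String, Map<String, Object>> readAttributes(
            MBeanServer server, String pattern, String... attributes) throws Exception {
        Map<String, Map<String, Object>> result = new LinkedHashMap<>();
        for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
            Map<String, Object> values = new LinkedHashMap<>();
            for (String attribute : attributes) {
                try {
                    values.put(attribute, server.getAttribute(name, attribute));
                } catch (Exception e) {
                    // Record the failure instead of aborting the whole dump.
                    values.put(attribute, "unavailable: " + e);
                }
            }
            result.put(name.getCanonicalName(), values);
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Prints nothing if no matching queue MBeans are registered in this JVM.
        readAttributes(server, QUEUE_PATTERN, ATTRIBUTES)
                .forEach((queue, values) -> System.out.println(queue + " -> " + values));
    }
}
```

Calling readAttributes from a periodic task and logging the result at WARN would get the values into the logs I can already collect, without needing remote JMX access to the machine.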
> Have you acquired any thread dumps from the stalled consumer? If so,
> what did they show?
This is also a good idea. If I can be on the phone with someone while
this is happening, I'll see if I can coach them through capturing one
with the kill -3 mechanism.
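If coaching someone through kill -3 turns out to be impractical, an alternative might be to have the application write its own thread dump into the log. Here is a minimal sketch using only the JDK's ThreadMXBean. The class name InProcessThreadDump is hypothetical, and what triggers it (a timer, an admin command, a "no messages received for N minutes" check) is left open.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: builds a thread dump of the current JVM as a String.
final class InProcessThreadDump {

    public static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details the JVM says it can provide.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" id=").append(info.getThreadId())
               .append(' ').append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                out.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
            out.append('\n');
            // Print every frame; ThreadInfo.toString() cuts the stack short.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("\tat ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

The output is not identical to what kill -3 prints, but it should show which threads are blocked and on what, which is the main thing to look for here.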
> What kinds of clients are you using? What version are they?
All sockets. We are using the Core API client with sockets and all
software on all nodes stays at the exact same version, 2.30 in this case.
One of the challenges with this situation is that it only happens in one
place that I know of, and I don't have direct access to the machine, so
this is more of a forensic analysis. Thank you for your help and thinking
on this.
David.
On 11/7/23 8:58 PM, Justin Bertram wrote:
I've got a few questions:
- Are the other working consumers on the same queue as the stalled
consumer?
- Can you get metrics from the queue with the stalled consumer? If so,
what's the value of the MessageCount, ConsumerCount, DeliveringCount, &
Paused attributes?
- Have you acquired any thread dumps from the stalled consumer? If so,
what did they show?
- What kinds of clients are you using? What version are they?
> Are there other normal reasons that message delivery to a consumer could
> stop?
Typically what I see in this kind of situation is that the consumer is hung
for some reason while attempting to handle a message (e.g. a blocking call
without a timeout to a remote resource like a REST API or something).
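To illustrate the kind of guard that avoids such a hang, here is a minimal sketch that bounds a blocking call made from inside a message handler. It uses only java.util.concurrent; the class name BoundedHandlerCall, the method callWithTimeout, and the fallback-value approach are assumptions for illustration, not anything taken from the application or from Artemis.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a blocking task with an upper bound on wait time.
final class BoundedHandlerCall {

    // Daemon threads so a stuck task cannot keep the JVM alive on shutdown.
    private static final ExecutorService WORKERS = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "handler-worker");
        t.setDaemon(true);
        return t;
    });

    public static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit,
                                        T fallback) throws Exception {
        Future<T> future = WORKERS.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Interrupt the worker and let the handler thread move on.
            future.cancel(true);
            return fallback;
        }
    }
}
```

A handler would wrap its remote call, for example callWithTimeout(() -> fetchStatus(), 5, TimeUnit.SECONDS, null), where fetchStatus is a placeholder for whatever blocking work the handler does. The point is that the thread delivering messages always returns, so one slow call cannot stall delivery indefinitely.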
> What log messages or logging can help me prove one way or another what is
> happening?
It's impossible to say at this point without more knowledge of what
protocol(s) your clients are using.
Justin
On Tue, Nov 7, 2023 at 7:52 PM David Bennion <david.benn...@gmx.com.invalid>
wrote:
Hey all,
I could use a push in the right direction to troubleshoot an issue!
TL;DR
After running really well for a seemingly indeterminate period of time
(from hours to days), message delivery stops to connected consumers that
are located within the same JVM as the Artemis server. Producers in
that same JVM continue uninterrupted. (version Artemis 2.30, will
upgrade to 2.31.2 soon)
Details:
4 JVMs on each of 3 large Linux VMs. Node 1 has an additional JVM that
contains an embedded Artemis broker. All 13 of these JVMs have an open
producer and consumer session in the broker, and persistence is off.
I don't have direct access to the machines where this problem is
occurring to debug, but I can get logs and ultimately apply updates.
Log analysis of application behavior points to cessation of message
delivery to the consumer inside the broker JVM. All other consumers and
producers continue to pass messages through the broker without issue; the
broker is running great.
I set up a similar 3-node environment that I could debug in to attempt
to replicate the problem. I put a breakpoint in my message handler and
then, following the call stack into ClientConsumerImpl, manually called
setMessageHandler(null) to disable the handler on the consumer while the
application was running. The resulting application behavior and logging
on this setup then matched exactly the behavior on the problem machines,
including some pretty distinctive behaviors that the application does.
This really leads me to believe that the message delivery stopped.
So I have no idea WHY the consumer stopped receiving messages. I have
requested the logs for org.apache.activemq be set to INFO to capture
more information from this environment. We normally run them at WARN
level because of the volume of logs. I didn't really see anything
interesting in the logs I did get from the broker (at WARN level). If
there were some kind of network issue, I don't understand how it could
not affect the producers as well -- let alone all the other 12 connected
JVMs?
Are there other normal reasons that message delivery to a consumer could
stop? What log messages or logging can help me prove one way or another
what is happening? The only thing unusual about these machines is that
they have 2 NICs.
Regards,
David.