Re: Not abortable slow consumers / stopped processing of messages in a queue

Tim Bain Fri, 31 Oct 2014 17:45:06 -0700

Marek,

Thanks for the explanation; that helped me to separate the different things
you're seeing.

On Fri, Oct 31, 2014 at 5:21 AM, Marek Dominiak <isdomi...@gmail.com> wrote:

> tbain98 wrote
> > I'm not clear on what behavior you're seeing, because the descriptions
> you
> > give (as I understand them) seem contradictory.  You say that the
> consumer
> > won't abort, but that you've got a 30-minute client-side abort timeout.
> > You say that after the intended abort, you know it didn't work because
> the
> > consumer didn't resume processing messages, but then you say that there
> > weren't any messages to process.  Maybe you're describing multiple
> > independent scenarios with different behavior and I'm just not catching
> > the
> > difference between them, but I'm not at all clear on what you're seeing.
> > Can you give us a from-the-top summary?  No need to give the overview or
> > any config files or log files, just tell us at each step what you expect
> > to
> > happen and what's actually happening (and how you know).
> >
> > Also, your first message was all about aborting slow consumers, while
> your
> > reply sounds like it's concerned entirely with aborting idle consumers.
> > Which one's the problem here?  Also, how do you know that a particular
> > idle
> > consumer isn't being aborted?  The logs tell you the abort is happening;
> > what's telling you it's not?
>
> Hi Tim,
>
> I am sorry for my unclear description. I have mixed two (or more) issues in
> one post.
> *
> First issue* was visible with this config (30 minutes timeout on
> MessageListener due to tx timeout, prefetch size: default 1000, none of
> “abortSlow*ConsumerStrategy” defined, redelivery policy defined on the jms
> connection <not on jms factory> - one redelivery). It looks like the JMS
> processing has stopped at one point: consumer got a HEAVY message, it
> failed
> at least once to consume it, message was still visible via ActiveMQ Web App
> (or JMX), consumer had 1 message to be acked (I saw this via JMX). After
> one
> day this very consumer hasn’t done anything at all - it was idle, it didn’t
> get any more messages dispatched to it while the other consumer got a lot
> of
> them. I was forced to move the message to DLQ and to restart the
> application
> node.
>

It sounds like something's going wrong with rolling back your transaction
here, because a client-aborted transaction isn't supposed to prevent the
client from processing the next message.  Since you say you've extended
DMLC, I'd start looking there for something that's not handled properly,
and if that doesn't turn up anything I'd attach a debugger and step through
the code to see if you can see what's going wrong.  I've never used
transactions so I can't offer any specific advice here, but since you've
got a reproducible test case it should be fairly straightforward to step
through the code and see if something's going wrong.

> Similar behaviour I have got with prefetch.size set to 0. That is when I
> started to look at abort strategies, AbortSlowAckedConsumerStrategy looked
> like the one to use. And after I have configured it I got the other issue
> with a new configuration.
>

I thought the minimum prefetch size was 1, based on what's in the "Using
the Prefetch Policy to configure the limit" section of
http://activemq.apache.org/slow-consumer-handling.html...

> *Second issue:* conf (30 minutes timeout on MessageListener due to tx
> timeout, AbortSlowAckConsumerStrategy set to abort every consumer after
> every 6 minutes abortConnection=false), I had 1 HEAVY message in the queue,
> and it failed due to tx timeout. Now as I understand the
> AbortSlowAckConsumerStrategy should abort the consumer and create a new one
> which should try to consume the message after some time (100 seconds due to
> redelivery policy), but what happened that day was that consumer stayed
> alive with 1 message to be acked back. And the consumer was being kept
> alive
> indefinitely (the same consumer id + logs I have posted). I was forced to
> move this message to the DLQ, but the consumer was still slow, idle with
> no.
> of messages to be acked = 1. After some time I was forced to restart
> application node to get new consumer in place.
>
> As I understand aborting correctly, after consumer has been marked as slow
> and it did finish it’s job (even if it was unsuccessful: rollback) it
> should
> be removed and replaced by a different one. *(This behavior is observable
> for “smaller” messages).*
>

Your description of how an abort happens is not quite right.  When a
consumer is determined to be slow and need aborting, the broker sends a
message to the consumer requesting that it abort itself.  Assuming the
consumer processes this message, this will cause the TCP transport to
disconnect, which will be detected by the failover transport, which will
establish a new TCP connection.  But the reconnection won't happen until
after the client has finished processing the current message, in your case
by rolling back the transaction.  The broker doesn't create connections,
and it can't create a new connection for the consumer; the consumer has to
do that, and it will only do it if it processes the disconnect request from
the broker in the first place.  Your client's logs should show a failover
reconnect when this happens, so if you're not seeing that log line in the
client's logs, your client is probably not processing that disconnect
request; I suspect that's what's going on based on your description of the
client staying connected and having the same connection ID.

In 5.9.0, Tim Bish added the ability to abort a connection from the broker
side by closing the socket by using the abortConnection=true URI option;
his blog post (
http://timbish.blogspot.com/2013/07/coming-in-activemq-59-new-way-to-abort.html)
is the best documentation I'm aware of.  If you found that the voluntary
abort wasn't happening, abortConnection=true *might* let your client detect
that the connection is dead and abort sooner, though I haven't played with
it enough to know for sure.  If you're going to use it, you should probably
upgrade to 5.10.0 to get Gary's fix for
https://issues.apache.org/jira/browse/AMQ-5114.

If that doesn't work, you might be able to manually disconnect and
reconnect from within your consumer code once you've finished rolling back
the transaction, but I've never looked into how you'd actually do that so I
don't have suggestions for how to do it or if it would solve the problem...

> So in the logs I should see a different consumer id after some time, but in
> this case it didn’t happen (without restarting server).
>
> 2014-10-25 00:00:11,455 [host] Scheduler] INFO  AbortSlowConsumerStrategy
> - aborting slow consumer:
> ID:min-p-app02.osl.basefarm.net-36433-1414153506788-1:1:17:7 for
> destination:queue://generateReportQueue
> … more logs every 6 minutes here
> 2014-10-25 01:12:11,641 [host] Scheduler] INFO  AbortSlowConsumerStrategy
> - aborting slow consumer:
> ID:min-p-app02.osl.basefarm.net-36433-1414153506788-1:1:17:7 for
> destination:queue://generateReportQueue
>
> *Both issues looks kind of the same for me now: consumer hasn’t acked back
> to the server after it failed to consume this HEAVY message (one or more
> times) and it stopped to be responsive (no acked sent back, abort strategy
> couldn’t force it to stop, no new messages dispatched to the consumer). *
>

I agree that both look very similar.  One possible explanation for that is
that ActiveMQ doesn't consider your rollback to be a close-out of the
message, so it doesn't think it can go on to the next message (first case)
or re-establish the connection (second message).  That would be the first
place I'd focus on.  Another possibility is that there's a bug in the
handling of some of these when transactions are manually rolled back, so
that's the next place I'd look.

>
> tbain98 wrote
> > 1.  If I've understood correctly, you say your business logic will abort
> > after 30 minutes, independently of any ActiveMQ-initiated abort request.
> > Is that actually happening?  The logs you've posted don't give any
> > indication either way (and you say "the same idle consumer can’t be
> > aborted
> > in a span of 18 hours"), and the behavior you're describing would be more
> > consistent with your clients not aborting than with them aborting but not
> > pulling the next message, though of course both are possible.  So make
> > sure
> > your client's really doing what you think it is.
>
> Yes, that is really happening, in the application logs I can see that the
> transaction is stopped, and then code which listens to the exceptions in
> all
> MessageListeners sends information about the exception to the devs (custom
> error handler in DefaultMessageListenerContainer). Usually just after
> getting this exception consumer starts to consume the next message or waits
> until redelivery policy kicks in to try to consume the message again.
>
> *As you mentioned it, I will try to verify what is actually happening under
> the hood in the DMLC (we have done some small extensions to it in order to
> define redelivery policy per connection, and custom error handling). Maybe
> when processing is too heavy the rollback isn’t called?*
>

Those sound like the right places to look.  But I'd suggest you dig (with a
debugger, ideally) into what happens in the ActiveMQ client code when you
roll back your transaction.  Having logs from your own code saying you're
doing the rollback isn't proof that it's actually being done, so I'd
confirm that it's really happening, all the way through sending a rollback
notification back to the broker.

> tbain98 wrote
> > 3.  Can you confirm (via JConsole in the MBeans tab or some other JMX
> > viewer) that your consumer is still connected to the broker after the
> > abort?  Also, when your client aborts, how is ActiveMQ being told about
> > the
> > failure?  (And what ack mode are you using?)
>
> Via JMX I saw that consumer was still connected to the server
> We use session transacted (without transaction manager but with some Spring
> JMS magic to handle commit/rollback correctly).
>

> “Also, when your client aborts, how is ActiveMQ being told about the
> failure?”
> I am not sure if I understood your question.
>

I think you just answered it: "with some Spring JMS magic to handle
commit/rollback correctly".  I assume you've confirmed that you're invoking
this magic correctly, but if not it should become clear as you step through
the code with a debugger.

> tbain98 wrote
> > 5 & 6.  For you to use the approach I suggested, you'd either have to be
> > OK
> > losing messages when failures occur or you'd have to persist the message
> > to
> > a datastore to retry in the case of a failure.  It sounds like neither of
> > those is appealing, so this may not be an option.
>
> We use persistent store in ActiveMQ, and I hoped this should be enough. I
> am
> trying to track down the bug in the configuration, to avoid unnecessary
> over
> complicated store code/configuration.
>

Message persistence and how you handle message failures are orthogonal
concerns; the former relates to how you ensure the message gets to your
consumers for successful processing, but doesn't make any guarantees about
what you do once it fails.  Your approach to the latter sounds reasonable;
it just sounds like it's not working properly, but it's well thought-out
and should be a valid approach once you figure out what's failing along the
way.  My proposal was that you store the message on the consumer and have
the consumer invoke retries on failure instead of having the broker do it;
either approach will work, but it sounds like you prefer the former, which
is fine as long as you can track down what's not working right about it.

> Thanks again for your help Tim.
>
>
> I will try to see what’s under the hood in the Spring DMLC when heavy
> message has failed due to transaction timeout and if upgrading to 5.10.0
> solves our issues. I will write when I am done with these two.
>
>
> Regards
> Marek Dominiak
>
>
>
> --
> View this message in context:
> http://activemq.2283324.n4.nabble.com/Not-abortable-slow-consumers-stopped-processing-of-messages-in-a-queue-tp4686721p4686838.html
> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
>

Re: Not abortable slow consumers / stopped processing of messages in a queue

Reply via email to