On Sun, Jun 7, 2015 at 9:17 PM, Kevin Burton <bur...@spinn3r.com> wrote:

> I think I’m in a weird edge situation caused by a potential bug / design
> flaw.
>
> I have a java daemon that needs to process tasks as much as possible.  It’s
> a thread per task model with each box having a thread per session and
> consumer.
>
> This is required per activemq/jms:
>
> http://activemq.apache.org/multiple-consumers-on-a-queue.html
>
> The problem comes with prefetch sizing.
>
> Since each consumer has a prefetch window I have to be careful.
>
> Too high and I get slow consumers that prefetch too many messages and just
> sit on them... preventing anything else from using them.
>
> So if I had a 50k prefetch (silly but just for an example) and a queue with
> 50k messages, a consumer will just prefetch them, thereby preventing
> anything else from using them.
>

I can see two distinct potential problems here, and your description doesn't
make clear which one you're hitting:

   1. With a large prefetch buffer, it's possible for one thread to have a
   large number of prefetched tasks while another has none, even if every task
   takes roughly the same amount of time to complete.  No thread is slow per se, but
   because the messages were prefetched lopsidedly, one thread sits idle while
   the other churns through what's on its plate.
   2. With *any* prefetch buffer size, it's possible to have one message
   that takes forever to complete.  Any messages caught behind that one slow
   message are stuck until it finishes.

 Which scenario are you worried about here?

If the latter, the AbortSlowAckConsumerStrategy (
http://timbish.blogspot.com/2013/07/coming-in-activemq-59-new-way-to-abort.html;
sadly the wiki doesn't detail this strategy and Tim's personal blog post is
the best documentation available) is intended to address exactly this: if a
consumer takes an unacceptable amount of time to complete a task, it is
aborted, putting all of its prefetched tasks back into the queue on the
broker to be dispatched to another consumer.  The current consumer will
continue processing the message it's stuck on until it completes (I'm not
sure whether an attempt is made to interrupt the thread or whether it's
simply allowed to run to completion), but I believe that message will also
be dispatched to another consumer, so 1) that message will be processed
more than once and you have to be able to handle that, and 2) you could end
up with N threads (where N is the max redelivery count) all processing the
same message and tying up cores before it finally gets moved to the DLQ and
processing can continue as normal.  So it's not a perfect solution, but it
might address most of what you're concerned about, if indeed this is what
you're concerned about.
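
For reference, hooking that strategy up is just a per-destination policy on
the broker.  Here's a minimal sketch using the broker-side Java API; the
thresholds are placeholders to tune, and I believe the same attributes are
available on the <abortSlowAckConsumerStrategy> element in activemq.xml:

import java.util.Arrays;

import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.broker.region.policy.AbortSlowAckConsumerStrategy;
import org.apache.activemq.broker.region.policy.PolicyEntry;
import org.apache.activemq.broker.region.policy.PolicyMap;

public class SlowAckPolicyExample {
    public static void main(String[] args) throws Exception {
        // Abort any consumer that hasn't acked a dispatched message within
        // 30s (placeholder threshold -- tune it to your longest "normal" task).
        AbortSlowAckConsumerStrategy slowAck = new AbortSlowAckConsumerStrategy();
        slowAck.setMaxTimeSinceLastAck(30000);
        slowAck.setIgnoreIdleConsumers(true);  // don't abort consumers with nothing dispatched
        slowAck.setAbortConnection(false);     // abort just the consumer, not its connection

        PolicyEntry entry = new PolicyEntry();
        entry.setQueue(">");                   // apply to all queues
        entry.setSlowConsumerStrategy(slowAck);

        PolicyMap policyMap = new PolicyMap();
        policyMap.setPolicyEntries(Arrays.asList(entry));

        BrokerService broker = new BrokerService();
        broker.setDestinationPolicy(policyMap);
        broker.start();
        broker.waitUntilStopped();
    }
}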

If the former, you're basically looking to enable work-stealing between
consumers, and I'm not aware of any existing capability to do that.  If you
wanted to implement it, you'd probably want to implement it as a sibling
class to AbortSlowAckConsumerStrategy where SlowAck is the trigger but
StealWork is the action rather than Abort.  To implement the StealWork
action, you'll have to extend the protocol(s) to allow the recall of
certain messages from the prefetch buffer of the slow consumer, and you'll
have to make sure you do it in a way that doesn't deliver messages twice,
that doesn't cause performance issues (pulling messages from the middle of
the prefetch buffer may be expensive with the current data structures), and
that doesn't violate any guarantees the JMS spec might make about the order
messages are processed within the consumers that will receive the
repatriated messages (I'm not familiar enough with the spec to say whether
any such guarantees exist but would be violated by doing this).
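
The blunt way to get most of that effect without a broker change is simply to
shrink the prefetch window so the backlog stays on the broker and only gets
dispatched to consumers that are actually free.  A minimal client-side sketch,
assuming the stock ActiveMQ client and a placeholder broker URL; it trades
extra round-trips for fairness, which is exactly the "too low" problem you
describe below:

import javax.jms.Connection;

import org.apache.activemq.ActiveMQConnectionFactory;

public class SmallPrefetchExample {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://localhost:61616");  // placeholder URL

        // Keep the queue prefetch tiny so no single consumer can hoard the backlog.
        // Equivalent URI form: tcp://localhost:61616?jms.prefetchPolicy.queuePrefetch=1
        factory.getPrefetchPolicy().setQueuePrefetch(1);

        Connection connection = factory.createConnection();
        connection.start();
        // ... create sessions/consumers as usual; each one now holds at most
        // one prefetched message at a time ...
    }
}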

> Too low and I *think* what’s happening is that the ActiveMQ connection
> can’t service queues fast enough to keep them prefetched (warm).  This
> means I get this cyclical performance wave which looks like a sine wave.
> On startup, we prefetch everything, but then CPU spikes processing my
> tasks.  The ActiveMQ connection then gets choked.  Then the tasks finish
> up, but have nothing prefetched so they have to wait for ActiveMQ.  But now
> ActiveMQ has enough CPU locally to get around to prefetching again.
>

I'm a little skeptical that your worker threads could so thoroughly smother
the CPU that the thread doing the prefetching gets starved out
(particularly since I'd expect it to be primarily I/O-bound, so its CPU
usage should be minimal), though I guess if you had as many worker threads
as cores you might be able to burn through all the prefetched messages
before the ActiveMQ thread gets rescheduled.  But I assume that your
workers are doing non-trivial amounts of work and are probably getting
context-switched repeatedly during their processing, which I'd think would
give the ActiveMQ thread plenty of time to do what it needs to.  That is,
unless 1) you've set thread priorities to prioritize your workers over
ActiveMQ, in which case don't do that, 2) your worker threads are somehow
holding onto a lock that the ActiveMQ thread needs, which is possible but
seems unlikely, or 3) you've set up so many consumers (far more than you
have cores) that the 1/(N+1)th share of CPU time the ActiveMQ thread gets
is too little or too infrequent to maintain responsiveness, in which case
you need to scale back your worker thread pool size (which I think means
using fewer consumers per process, based on what you've described).
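
To put a rough number on option 3: I'd cap the sessions/consumers per process
somewhere around the core count rather than one per outstanding task.  A
thread-per-consumer sketch along those lines, with a placeholder broker URL
and queue name:

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;

public class BoundedConsumerPool {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://localhost:61616");  // placeholder URL
        Connection connection = factory.createConnection();

        // One session + consumer per core, not one per task.
        int consumers = Runtime.getRuntime().availableProcessors();
        for (int i = 0; i < consumers; i++) {
            Session session = connection.createSession(false, Session.CLIENT_ACKNOWLEDGE);
            session.createConsumer(session.createQueue("tasks"))     // placeholder queue name
                   .setMessageListener((Message msg) -> {
                       try {
                           // ... do the real work here ...
                           msg.acknowledge();
                       } catch (Exception e) {
                           e.printStackTrace();  // real code: log and let redelivery handle it
                       }
                   });
        }
        connection.start();  // start delivery only after the listeners are registered
        // (a real daemon would block here until shutdown)
    }
}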


> … so one strategy could be (I think) to use prefetch eviction
>
> which I think this is about:
>
> http://activemq.apache.org/slow-consumer-handling.html
>
> am I right? It’s hard to tell though.
>

Slow consumer handling as it's currently implemented (with Abort as the
only available action) is geared towards consumers that go slow due to
external factors (unexpected contention for resources, deadlock,
non-homogeneous nodes where one is vastly underpowered compared to the
others); it assumes that the message was perfectly reasonable and that if
the consumer is slow, it's because something happened in the consumer, so
giving the message to a different consumer should allow the message to be
processed normally.  For topics, it also provides a nice ability to wipe
the slate for a particular consumer and let it start over, avoiding the
risk of messages building up to the point where Producer Flow Control kicks
in; for queues, aborting a consumer doesn't result in any messages being
discarded, so that particular benefit applies only to topics.  As it's
currently implemented, it's not intended to handle situations where the
message itself is the problem (i.e. processing a particular message takes
way longer than the average time to process messages) and the consumer is
perfectly healthy but overwhelmed by the scope of the work to be done (and
all other consumers would be similarly overwhelmed so shuttling the message
over to another consumer wouldn't process the message any faster), though
the enhancement I described above would allow that.


> This would be resolved if JMS could handle threading better.  I wouldn’t
> need so many consumers.  Another idea is that I could use ONE thread for
> all the ActiveMQ message handling and then just dispatch local messages,
> and then ack them in the same thread.  This seems like a pain though.
>
> And this is yet another ActiveMQ complexity I have to swallow in my app…
>
> Hopefully 6.0 can address some of these issues with JMS 2.0 goodness… but I
> haven’t looked into it that much to know for sure.
>
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
>
