OK.  I think I have a handle on what’s happening during queue purges
that cause GC lockups.

Wanted to get your feedback.

I can create a bug for this if you guys think my assessment is accurate, as
I think the fix is fairly reasonable / easy.

I have a unit test which reproduces this now, but I need to do more cleanup
before I can put it into a public GitHub repo for you guys to look at.

## Problem overview.

ActiveMQ supports a feature where it can GC a queue that is inactive, i.e.
no messages and no consumers.

However, there’s a bug where purgeInactiveDestinations in
org.apache.activemq.broker.region.RegionBroker takes the write side of a
read/write lock (inactiveDestinationsPurgeLock) and holds it for the entire
queue GC pass.

Each individual queue GC takes about 100ms with a disk-backed queue and
10ms with a memory-backed (non-persistent) queue. If you have thousands of
them to GC at once, inactiveDestinationsPurgeLock is held the entire time,
which can last from 60 seconds to 5 minutes (3,000 disk-backed queues at
100ms each is already 5 minutes) and is essentially unbounded.

The read side of the same lock is taken in addConsumer and addProducer, so
when a new consumer or producer tries to connect, it is blocked until queue
GC completes.

Existing producers/consumers work JUST fine.

The lock MUST be held around each queue’s GC because if it isn’t there’s a
race: a queue is flagged to be GCd, then a producer comes in and writes a
new message, then the background thread deletes the queue it marked as
GCable even though it now holds the newly produced message.  This would
result in data loss.
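
For illustration only, here’s a simplified sketch of the coarse-grained
pattern as I understand it (not the actual RegionBroker code; everything
except the inactiveDestinationsPurgeLock name is made up):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the current coarse-grained locking; illustrative only.
class CoarseGrainedBroker {
    private final ReentrantReadWriteLock inactiveDestinationsPurgeLock =
            new ReentrantReadWriteLock();

    void purgeInactiveDestinations(Iterable<String> inactiveQueues) {
        inactiveDestinationsPurgeLock.writeLock().lock(); // held for the ENTIRE pass
        try {
            for (String queue : inactiveQueues) {
                gcQueue(queue); // ~100ms each for disk-backed queues
            }
        } finally {
            inactiveDestinationsPurgeLock.writeLock().unlock();
        }
    }

    void addProducer(String queue) {
        // Blocks until the whole purge pass above finishes.
        inactiveDestinationsPurgeLock.readLock().lock();
        try {
            // ... register the producer ...
        } finally {
            inactiveDestinationsPurgeLock.readLock().unlock();
        }
    }

    private void gcQueue(String queue) { /* remove the destination */ }
}
```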

## Confirmed

I have a unit test now that confirms this.  I create 7500 queues, produce
1 message in each, then consume it. I keep all consumers open.  Then I
release all 7500 queues at once.

I then have a consumer/producer pair that I hold open and produce and consume
messages on.  This works fine.

However, I have another pair which creates a new producer each time.  This
will block for 60,000ms multiple times while queue GC is happening in the
background.
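
For reference, here is a rough sketch of the reproduction pattern (not my
actual test; the broker URL and queue names are placeholders, and the broker
would also need a short inactive-destination timeout configured so the purge
actually kicks in):

```java
import java.util.ArrayList;
import java.util.List;
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class PurgeLockRepro {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Touch thousands of queues: one message produced and consumed per queue,
        // keeping every consumer open so the queues stay "active".
        List<MessageConsumer> consumers = new ArrayList<>();
        for (int i = 0; i < 7500; i++) {
            Queue queue = session.createQueue("test.queue." + i);
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("hello"));
            producer.close();
            MessageConsumer consumer = session.createConsumer(queue);
            consumer.receive(5000);
            consumers.add(consumer);
        }

        // Release all 7500 queues at once so they all become GC candidates together.
        for (MessageConsumer c : consumers) {
            c.close();
        }

        // While the purge thread GCs those queues, time how long it takes to
        // create a brand new producer on an unrelated queue.
        long start = System.currentTimeMillis();
        MessageProducer fresh =
                session.createProducer(session.createQueue("unrelated.queue"));
        System.out.println("createProducer took "
                + (System.currentTimeMillis() - start) + "ms");
        fresh.close();
        connection.close();
    }
}
```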

## Proposed solution.

Rework the read/write locks to be one lock per queue.

So instead of using one global lock per broker, we use one lock per queue
name.  This way the locks are FAR more granular and new producers/consumers
won’t block during this time period.

If a queue named ‘foo’ is being GCd and a new producer is created on a
‘bar’ queue, the bar producer will work fine and won’t block on the foo
queue.

This can be accomplished by creating a concurrent hash map with the name of
the queue as the key (or an ActiveMQDestination as the key) which stores
read/write locks as the values. We then use this as the lock backing, so the
purge thread and addProducer/addConsumer all reference the more granular
per-queue lock, as sketched below.
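
A minimal sketch of the idea (illustrative names only, not actual ActiveMQ
code):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Per-destination read/write locks instead of one broker-wide lock.
class DestinationLocks {
    private final ConcurrentMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String destinationName) {
        // computeIfAbsent creates the lock atomically on first use of a key.
        return locks.computeIfAbsent(destinationName, k -> new ReentrantReadWriteLock());
    }

    // Purge thread: write-lock only the queue currently being GCd.
    void gcDestination(String destinationName, Runnable gc) {
        ReentrantReadWriteLock lock = lockFor(destinationName);
        lock.writeLock().lock();
        try {
            gc.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // addProducer/addConsumer: read-lock only the destination being attached to,
    // so a new producer on "bar" never waits for "foo" to finish its GC.
    void withDestination(String destinationName, Runnable action) {
        ReentrantReadWriteLock lock = lockFor(destinationName);
        lock.readLock().lock();
        try {
            action.run();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

One thing the sketch punts on is removing entries from the map once a queue
is gone so the map itself doesn’t grow forever; dropping the entry after a
successful GC would probably be sufficient.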

….

Now, initially I was thinking I would just fix this myself. However, I
might have a workaround for our queue design which uses fewer queues, and I
think this will drop our queue requirement from a few thousand to a few
dozen.  So at that point this won’t be as much of a priority.

However, this is a significant scalability issue with ActiveMQ… one that
doesn’t need to exist.  In our situation I think our performance would be
fine even with 7500 queues once this bug is fixed.

Perhaps it should just exist as an open JIRA that could be fixed at some
time in the future?

I can also get time to clean up a project with a test which demonstrates
this problem.

Kevin

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
