OK. I think I have a handle on what's happening during queue purges that cause GC lockups.
Wanted to get your feedback. I can create a bug for this if you guys think my assessment is accurate, as I think the fix is somewhat reasonable / easy. I have a unit test which duplicates this now, but I need to do more cleanup before I can put it into a public GitHub repo for you guys to look at.

## Problem overview

ActiveMQ supports a feature where it can GC a queue that is inactive, i.e. no messages and no consumers. However, there's a bug where purgeInactiveDestinations in org.apache.activemq.broker.region.RegionBroker takes a read/write lock (inactiveDestinationsPurgeLock) which is held during the entire queue GC pass. Each individual queue GC takes about 100ms with a disk-backed queue and 10ms with a memory-backed (non-persistent) queue. If you have thousands of them to GC at once, inactiveDestinationsPurgeLock is held the entire time, which can last from 60 seconds to 5 minutes (and is essentially unbounded). A read lock is also taken in addConsumer and addProducer, so when a new consumer or producer tries to connect, it is blocked until queue GC completes. Existing producers/consumers work JUST fine.

Some lock MUST be held around each queue GC, because without one there's a race: a queue is flagged to be GCd, then a producer comes in and writes a new message, then the background thread deletes the queue it marked as GCable even though it now contains the newly produced message. This would result in data loss.

## Confirmed

I have a unit test now that confirms this. I create 7500 queues, produce 1 message in each, then consume it, keeping all consumers open. Then I release all 7500 queues at once. I also have a consumer/producer pair which I hold open and produce and consume messages on; this works fine. However, I have another which creates a new producer each time. That one will block for 60,000ms multiple times while queue GC is happening in the background. (A rough skeleton of the test is at the end of this mail.)

## Proposed solution

Rework the read/write locks to be one lock per queue. Instead of using one global lock per broker, we use one lock per queue name. This way the locks are FAR more granular and new producers/consumers won't block during this time period. If a queue named 'foo' is being GCd and a new producer is created on a 'bar' queue, the bar producer will work fine and won't block on the foo queue.

This can be accomplished by creating a concurrent hash map with the name of the queue (or an ActiveMQDestination) as the key and a read/write lock as the value. We then use this as the lock backing, and the purge thread and addProducer/addConsumer all reference the more granular per-queue lock. (A sketch of what I have in mind is also at the end of this mail.)

Now, initially I was thinking I would just fix this myself. However, I might have a workaround for our queue design which uses fewer queues, and I think this will drop our queue requirement from a few thousand to a few dozen, so at that point this won't be as much of a priority. Still, this is a significant scalability issue with ActiveMQ... one that doesn't need to exist. In our situation I think our performance would be fine even with 7500 queues once this bug is fixed. Perhaps it should just exist as an open JIRA that could be fixed at some time in the future? I can also get time to clean up a project with a test which demonstrates this problem.

Kevin

--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
... or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
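For reference, here is roughly the shape of my reproduction. This is a trimmed skeleton, not the actual test I'll publish: the class name, queue names, and the exact policy/timing values are placeholders I picked for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.ActiveMQConnectionFactory;
import org.apache.activemq.broker.BrokerService;
import org.apache.activemq.broker.region.policy.PolicyEntry;
import org.apache.activemq.broker.region.policy.PolicyMap;

// Skeleton of the reproduction: an embedded broker with aggressive destination GC,
// thousands of short-lived queues released at once, then timed createProducer()
// calls that stall while the purge thread holds inactiveDestinationsPurgeLock.
public class QueueGcLockupSketch {

    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setPersistent(false);
        broker.setSchedulePeriodForDestinationPurge(1000); // run the purge task every second

        PolicyEntry policy = new PolicyEntry();
        policy.setGcInactiveDestinations(true);
        policy.setInactiveTimoutBeforeGC(1000); // (sic) ActiveMQ's spelling of "timeout"
        PolicyMap policyMap = new PolicyMap();
        policyMap.setDefaultEntry(policy);
        broker.setDestinationPolicy(policyMap);
        broker.start();

        Connection connection =
                new ActiveMQConnectionFactory("vm://localhost").createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Create thousands of queues, produce 1 message in each, consume it,
        // and keep every consumer open so nothing is eligible for GC yet.
        List<MessageConsumer> consumers = new ArrayList<MessageConsumer>();
        for (int i = 0; i < 7500; i++) {
            Queue queue = session.createQueue("test-" + i);
            MessageProducer producer = session.createProducer(queue);
            producer.send(session.createTextMessage("hello"));
            producer.close();
            MessageConsumer consumer = session.createConsumer(queue);
            consumer.receive(1000);
            consumers.add(consumer);
        }

        // Release all 7500 queues at once so they become GC candidates together.
        for (MessageConsumer consumer : consumers) {
            consumer.close();
        }

        // While the purge runs in the background, time brand-new producers.
        // With the single global lock these can block for tens of seconds.
        for (int i = 0; i < 10; i++) {
            long before = System.currentTimeMillis();
            MessageProducer probe = session.createProducer(session.createQueue("probe-" + i));
            System.out.println("createProducer took " +
                    (System.currentTimeMillis() - before) + " ms");
            probe.close();
            Thread.sleep(1000);
        }

        connection.close();
        broker.stop();
    }
}
```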

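And here is roughly the per-destination lock map I have in mind for the fix. The class and method names are mine and purely illustrative; only inactiveDestinationsPurgeLock is an existing RegionBroker field, and the real change would live inside RegionBroker rather than a standalone class.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Sketch: instead of one broker-wide inactiveDestinationsPurgeLock, each
 * destination name maps to its own read/write lock, so GCing queue 'foo'
 * never blocks a new producer/consumer on queue 'bar'.
 */
public class DestinationPurgeLocks {

    private final ConcurrentMap<String, ReentrantReadWriteLock> locks =
            new ConcurrentHashMap<String, ReentrantReadWriteLock>();

    /** Lazily create the lock for a destination name. */
    private ReentrantReadWriteLock lockFor(String destinationName) {
        ReentrantReadWriteLock lock = locks.get(destinationName);
        if (lock == null) {
            ReentrantReadWriteLock candidate = new ReentrantReadWriteLock();
            ReentrantReadWriteLock existing = locks.putIfAbsent(destinationName, candidate);
            lock = existing != null ? existing : candidate;
        }
        return lock;
    }

    /** addProducer/addConsumer take the read lock for just their destination. */
    public void readLock(String destinationName)   { lockFor(destinationName).readLock().lock(); }
    public void readUnlock(String destinationName) { lockFor(destinationName).readLock().unlock(); }

    /** The purge thread takes the write lock only for the queue it is about to GC. */
    public void writeLock(String destinationName)   { lockFor(destinationName).writeLock().lock(); }
    public void writeUnlock(String destinationName) { lockFor(destinationName).writeLock().unlock(); }
}
```

The purge thread would take the write lock per queue, GC it, and release it before moving to the next queue, while addProducer/addConsumer take the read lock for their own destination only. That still prevents the data-loss race described above for any single queue, but unrelated destinations never wait on each other.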