So I have some working/theoretical code that should resolve this. It acquires a lock *per* ActiveMQDestination, so there is no lock contention across destinations.
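The rough shape is a registry of read/write locks keyed on the destination: the purge thread takes the write lock for just the queue it is removing, and normal operations take the read lock for their own destination only. Something like this (only a sketch; the class and method names are illustrative, not what's actually on my branch):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    import org.apache.activemq.command.ActiveMQDestination;

    // Sketch only: one ReadWriteLock per destination, so the purge thread and
    // producers/consumers contend only when they touch the same destination.
    public class DestinationLockRegistry {

        private final ConcurrentMap<ActiveMQDestination, ReadWriteLock> locks =
            new ConcurrentHashMap<ActiveMQDestination, ReadWriteLock>();

        public ReadWriteLock lockFor(ActiveMQDestination destination) {
            ReadWriteLock lock = locks.get(destination);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(destination, candidate);
                lock = (existing != null) ? existing : candidate;
            }
            return lock;
        }

        // Call once a destination has been fully removed so the map doesn't grow forever.
        public void release(ActiveMQDestination destination) {
            locks.remove(destination);
        }
    }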
But here's where I'm stuck:

    @Override
    public void removeSubscription(ConnectionContext context,
                                   RemoveSubscriptionInfo info) throws Exception {
        inactiveDestinationsPurgeLock.readLock().lock();
        try {
            topicRegion.removeSubscription(context, info);
        } finally {
            inactiveDestinationsPurgeLock.readLock().unlock();
        }
    }

… this is in RegionBroker. There is no ActiveMQDestination involved here, so I'm not sure of the best way to resolve this. Any advice?

On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> Yes. That was my thinking too… that just replacing the CopyOnWriteArraySet with something more performant would solve the issue.
> This would also improve queue creation time as well as queue deletion time.
> What I think I'm going to do in the meantime is:
> - implement a granular lock based on queue name… I am going to use an interface so we can replace the implementation later.
> - implement timing for the purge thread so it tracks how long it takes to remove a queue but also how long the entire loop takes.
> I'll do this on a branch so it should be easy to merge.
>
> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
>> A decent amount of the time is being spent calling remove() on various array-backed collections. Those data structures might be inappropriate for the number of destinations you're running, since array-backed collections tend to have add/remove operations that are O(N); some improvement might come from something as simple as moving to a ConcurrentHashSet instead of a CopyOnWriteArraySet, for example. (Or it might make performance worse because of other aspects of how those collections are used; people other than me would be in a better position to evaluate the full range of performance requirements for those collections.)
>> Scheduler.cancel() also takes an alarming amount of time for what looks like a really simple method (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29).
>>
>> On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>> > Here's a jprofiler view with the advisory support enabled if you're curious.
>> > http://i.imgur.com/I1jesZz.jpg
>> > I'm not familiar enough with the internals of ActiveMQ to have any obvious optimization ideas.
>> > One other idea I had (which would require a ton of refactoring I think) would be to potentially bulk delete all the queues at once.
>> >
>> > On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>> > > And spending some more time in jprofiler, it looks like 20% of this is due to schedulerSupport and the other 80% is due to advisorySupport.
>> > > If I set both to false, the total runtime of my tests drops in half… and the latencies fall from
>> > >     max create producer latency: 10,176 ms
>> > >     max create message on existing producer and consumer: 2 ms
>> > > … to
>> > >     max create producer latency: 1 ms
>> > >     max create message on existing producer and consumer: 1 ms
>> > > and this is without even fixing the purge background lock.
>> > > So the question now is what the heck the advisory support is doing that can result in such massive performance overhead.
>> > > … and I think advisorySupport is enabled by default, so that's problematic as well.
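(For reference, the flags being flipped in that test are just the standard BrokerService setters, so the test broker setup is roughly the following; this is illustrative, not the exact test code.)

    import org.apache.activemq.broker.BrokerService;

    public class NonPersistentTestBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // production broker runs non-persistent too
            broker.setAdvisorySupport(false);  // the ~80% of the overhead seen in the profile
            broker.setSchedulerSupport(false); // the remaining ~20%
            broker.start();
            // ... run the producer/consumer latency tests against this broker ...
            broker.stop();
        }
    }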
>> > > On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>> > >> OK. Loaded up JProfiler and confirmed that it's not LevelDB.
>> > >> This is a non-persistent broker I'm testing on.
>> > >> Looks like it's spending all its time in CopyOnWriteArrayList.remove and Timer.purge…
>> > >> Which is hopeful, because this is ALL due to ActiveMQ internals, and in theory LevelDB should perform well if we improve the performance of ActiveMQ internals and fix this lock bug.
>> > >> Which would rock!
>> > >> It should ALSO make queue creation faster.
>> > >>
>> > >> On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <bur...@spinn3r.com> wrote:
>> > >>>
>> > >>> On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
>> > >>>> So if LevelDB cleanup during removeDestination() is the presumed culprit, can we spin off the LevelDB cleanup work into a separate thread (better: a task object to be serviced by a ThreadPool so you can avoid a fork bomb if we remove many destinations at once) so the call to removeDestination() can return quickly and LevelDB can do its record-keeping in the background without blocking message-processing?
>> > >>>
>> > >>> Would that be possible? If the delete is pending on ActiveMQ, there is a race where a producer could re-create it unless the lock is held.
>> > >>> Though I guess if you dispatched to the GC thread WITH the lock still held you would be OK, but I think if we use the existing purge thread then we're fine.
>> > >>> OK. I think I'm wrong about LevelDB being the issue. To be fair, I wasn't 100% certain before, but I should have specified.
>> > >>> Our current production broker is running with persistent=false… and I just re-ran the tests without disk persistence and it has the same problem.
>> > >>> So the main issue now is why the heck ActiveMQ is taking SO LONG to GC a queue. It's taking about 100 ms, which is an insane amount of time considering this is done all in memory.
>> > >>>
>> > >>> Kevin
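(To put numbers on that ~100 ms per queue: the purge-loop timing I mentioned earlier would be something along these lines. A sketch only; the class and variable names are mine, not the actual patch.)

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.apache.activemq.broker.Broker;
    import org.apache.activemq.broker.ConnectionContext;
    import org.apache.activemq.command.ActiveMQDestination;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch of timing the inactive-destination purge: per-queue removal time
    // plus the duration of the whole pass.
    public class TimedPurge {

        private static final Logger LOG = LoggerFactory.getLogger(TimedPurge.class);

        public void purge(Broker broker, ConnectionContext context,
                          List<ActiveMQDestination> candidates) throws Exception {
            long passStart = System.nanoTime();
            for (ActiveMQDestination destination : candidates) {
                long start = System.nanoTime();
                broker.removeDestination(context, destination, 1000L); // timeout value is arbitrary here
                LOG.info("Removed {} in {} ms", destination,
                         TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
            }
            LOG.info("Purge pass of {} destinations took {} ms", candidates.size(),
                     TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - passStart));
        }
    }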
--
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>