How long do the tests usually take? Mine had been running for 45 minutes before I gave up… I think part of it was that the box I was testing on was virtualized and didn’t have enough resources.
I tried to parallelize the tests (-T 8 with Maven) but I got other errors, so I assume the test ports are fixed (singletons) and can’t be shared across parallel runs.

On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <gary.tu...@gmail.com> wrote:

> if there are any test failures - try to run them individually
> -Dtest=a,b etc. There may be an issue with a full test run, but all of
> the tests that are enabled should work. I know there are some issues
> with jdbc tests that hang or fail due to previous runs not cleaning up,
> but that should be most of it. I got a bunch of full test runs before
> the 5.11 release if that is any help.
>
> On 23 February 2015 at 20:38, Kevin Burton <bur...@spinn3r.com> wrote:
> > OK. This is ready to go and I have a patch branch:
> >
> > https://issues.apache.org/jira/browse/AMQ-5609
> >
> > I’m stuck at the moment though because tests don’t pass. But tests were
> > failing before my changes, so I don’t think it has anything to do with
> > them.
> >
> > On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >
> >> Actually, is the lock even needed here? Why would it be? If we’re
> >> *removing* a subscription, why does it care if we possibly ALSO remove
> >> a separate / isolated queue before/after the subscription is removed?
> >>
> >> I think this is redundant and can be removed. Maybe I’m wrong though.
> >>
> >> I looked at all the callers and none were associated with queues.
> >>
> >> On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>
> >>> So I have some working/theoretical code that should resolve this.
> >>>
> >>> It acquires a lock *per* ActiveMQDestination, so that there is no
> >>> lock contention.
> >>>
> >>> But here’s where I’m stuck:
> >>>
> >>>> @Override
> >>>> public void removeSubscription(ConnectionContext context,
> >>>>         RemoveSubscriptionInfo info) throws Exception {
> >>>>     inactiveDestinationsPurgeLock.readLock().lock();
> >>>>     try {
> >>>>         topicRegion.removeSubscription(context, info);
> >>>>     } finally {
> >>>>         inactiveDestinationsPurgeLock.readLock().unlock();
> >>>>     }
> >>>> }
> >>>
> >>> .. this is in RegionBroker.
> >>>
> >>> There is no ActiveMQDestination involved here, so I’m not sure of the
> >>> best way to resolve this.
> >>>
> >>> Any advice?
> >>>
> >>> On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>>
> >>>> Yes. That was my thinking too… that just replacing the
> >>>> CopyOnWriteArraySet with something more performant would solve the
> >>>> issue.
> >>>>
> >>>> This would also improve queue creation time as well as queue
> >>>> deletion time.
> >>>>
> >>>> What I think I’m going to do in the meantime is:
> >>>>
> >>>> - implement a granular lock based on queue name… I am going to use
> >>>>   an interface so we can replace the implementation later.
> >>>>
> >>>> - implement timing for the purge thread so it tracks how long it
> >>>>   takes to remove a queue but also how long the entire loop takes.
> >>>>
> >>>> I’ll do this on a branch so it should be easy to merge.
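For reference, here is a minimal sketch of the per-destination lock idea discussed above. It is not existing ActiveMQ API: the DestinationLocks class, the use of the destination's physical name as the key, and the broker-wide fallback lock are all assumptions; real code would also need to evict entries once a destination is gone.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical helper: one ReadWriteLock per destination name, so that
    // purging one inactive queue does not block work on unrelated destinations.
    public class DestinationLocks {

        private final ConcurrentMap<String, ReadWriteLock> locks =
                new ConcurrentHashMap<String, ReadWriteLock>();

        // Look up (or lazily create) the lock for the given destination name.
        // putIfAbsent keeps this safe under concurrent callers on Java 7.
        public ReadWriteLock lockFor(String physicalName) {
            ReadWriteLock lock = locks.get(physicalName);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(physicalName, candidate);
                lock = (existing != null) ? existing : candidate;
            }
            return lock;
        }

        // Callers that touch no particular destination (e.g. removeSubscription
        // in RegionBroker above) could fall back to a single broker-wide lock.
        private final ReadWriteLock brokerWide = new ReentrantReadWriteLock();

        public ReadWriteLock brokerWideLock() {
            return brokerWide;
        }
    }

The idea would be that the purge thread takes the write lock for the queue it is removing while normal operations take the read lock for their own destination, so unrelated destinations never contend.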
> >>>> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
> >>>>
> >>>>> A decent amount of the time is being spent calling remove() on
> >>>>> various array-backed collections. Those data structures might be
> >>>>> inappropriate for the number of destinations you're running, since
> >>>>> array-backed collections tend to have add/remove operations that are
> >>>>> O(N); some improvement might come from something as simple as moving
> >>>>> to a ConcurrentHashSet instead of a CopyOnWriteArraySet, for example.
> >>>>> (Or it might make performance worse because of other aspects of how
> >>>>> those collections are used; people other than me would be in a better
> >>>>> position to evaluate the full range of performance requirements for
> >>>>> those collections.)
> >>>>>
> >>>>> Scheduler.cancel() also takes an alarming amount of time for what
> >>>>> looks like a really simple method (
> >>>>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
> >>>>> ).
> >>>>>
> >>>>> On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>>>>
> >>>>> > Here’s a jprofiler view with the advisory support enabled if you’re
> >>>>> > curious:
> >>>>> >
> >>>>> > http://i.imgur.com/I1jesZz.jpg
> >>>>> >
> >>>>> > I’m not familiar enough with the internals of ActiveMQ to have any
> >>>>> > obvious optimization ideas.
> >>>>> >
> >>>>> > One other idea I had (which would require a ton of refactoring, I
> >>>>> > think) would be to potentially bulk delete all the queues at once.
> >>>>> >
> >>>>> > On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>>>> >
> >>>>> > > Spending some more time in jprofiler, it looks like 20% of this
> >>>>> > > is due to schedulerSupport and the other 80% is due to
> >>>>> > > advisorySupport.
> >>>>> > >
> >>>>> > > If I set both to false the total runtime of my tests drops in
> >>>>> > > half… and the latencies fall from
> >>>>> > >
> >>>>> > >> max create producer latency: 10,176 ms
> >>>>> > >> max create message on existing producer and consumer: 2 ms
> >>>>> > >
> >>>>> > > … to
> >>>>> > >
> >>>>> > >> max create producer latency: 1 ms
> >>>>> > >> max create message on existing producer and consumer: 1 ms
> >>>>> > >
> >>>>> > > and this is without even fixing the purge background lock.
> >>>>> > >
> >>>>> > > So the question now is what the heck the advisory support is
> >>>>> > > doing that can result in such massive performance overhead.
> >>>>> > >
> >>>>> > > … and I think advisorySupport is enabled by default, so that’s
> >>>>> > > problematic as well.
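For anyone wanting to reproduce the comparison above: both flags can be turned off on the broker, either programmatically or via advisorySupport="false" / schedulerSupport="false" on the <broker> element in XML. A minimal sketch follows; note that some features (for example, dynamic network-of-brokers discovery and tooling that listens on advisory topics) depend on advisory messages, so disabling them is not free.

    import org.apache.activemq.broker.BrokerService;

    public class NoAdvisoryBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // matches the non-persistent setup described in this thread
            broker.setAdvisorySupport(false);  // no advisory topics/messages on destination churn
            broker.setSchedulerSupport(false); // no scheduler store
            broker.addConnector("tcp://localhost:61616");
            broker.start();
            // ... run the workload / tests against this broker ...
            broker.stop();
        }
    }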
> >>>>> > > On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>>>> > >
> >>>>> > >> OK. Loaded up JProfiler and confirmed that it’s not LevelDB.
> >>>>> > >>
> >>>>> > >> This is a non-persistent broker I’m testing on.
> >>>>> > >>
> >>>>> > >> It looks like it’s spending all its time in
> >>>>> > >> CopyOnWriteArrayList.remove and Timer.purge…
> >>>>> > >>
> >>>>> > >> Which is hopeful, because this is ALL due to ActiveMQ internals,
> >>>>> > >> and in theory LevelDB should perform well if we improve the
> >>>>> > >> performance of ActiveMQ internals and fix this lock bug.
> >>>>> > >>
> >>>>> > >> Which would rock!
> >>>>> > >>
> >>>>> > >> It should ALSO make queue creation faster.
> >>>>> > >>
> >>>>> > >> On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <bur...@spinn3r.com> wrote:
> >>>>> > >>
> >>>>> > >>> On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
> >>>>> > >>>
> >>>>> > >>>> So if LevelDB cleanup during removeDestination() is the
> >>>>> > >>>> presumed culprit, can we spin off the LevelDB cleanup work into
> >>>>> > >>>> a separate thread (better: a task object to be serviced by a
> >>>>> > >>>> ThreadPool so you can avoid a fork bomb if we remove many
> >>>>> > >>>> destinations at once) so the call to removeDestination() can
> >>>>> > >>>> return quickly and LevelDB can do its record-keeping in the
> >>>>> > >>>> background without blocking message-processing?
> >>>>> > >>>
> >>>>> > >>> Would that be possible? If the delete is pending on ActiveMQ
> >>>>> > >>> there is a race where a producer could re-create it unless the
> >>>>> > >>> lock is held.
> >>>>> > >>>
> >>>>> > >>> Though I guess if you dispatched to the GC thread WITH the lock
> >>>>> > >>> still held you would be OK, but I think if we use the existing
> >>>>> > >>> purge thread then we’re fine.
> >>>>> > >>>
> >>>>> > >>> OK. I think I’m wrong about LevelDB being the issue. To be fair,
> >>>>> > >>> I wasn’t 100% certain before, but I should have specified.
> >>>>> > >>>
> >>>>> > >>> Our current production broker is running with persistent=false…
> >>>>> > >>> and I just re-ran the tests without disk persistence and it has
> >>>>> > >>> the same problem.
> >>>>> > >>>
> >>>>> > >>> So the main question now is why the heck ActiveMQ is taking SO
> >>>>> > >>> LONG to GC a queue. It’s taking about 100 ms, which is an insane
> >>>>> > >>> amount of time considering this is all done in memory.
> >>>>> > >>>
> >>>>> > >>> Kevin
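As an illustration of the "task object serviced by a ThreadPool" idea above, here is a hedged sketch. The names (DestinationCleaner, purgeExecutor, unregister, doStorageCleanup) are made up for illustration and are not existing ActiveMQ classes or methods; the point is only the split between synchronous unregistration and background storage cleanup.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Illustrative sketch: hand the slow part of destination removal to a
    // small, bounded thread pool instead of doing it inline.
    public class DestinationCleaner {

        // A fixed pool bounds the work, so removing many destinations at once
        // does not turn into a fork bomb of cleanup threads.
        private final ExecutorService purgeExecutor = Executors.newFixedThreadPool(2);

        public void removeDestination(final String destinationName) {
            // Synchronously unregister the destination so a producer cannot
            // re-create it while the delete is pending (the race discussed
            // above), then push the slow storage cleanup into the background.
            unregister(destinationName);

            purgeExecutor.submit(new Runnable() {
                @Override
                public void run() {
                    doStorageCleanup(destinationName);
                }
            });
        }

        private void unregister(String destinationName) {
            // remove from the in-memory maps while holding whatever lock guards them
        }

        private void doStorageCleanup(String destinationName) {
            // delete persisted records (e.g. LevelDB/KahaDB) without blocking
            // message processing
        }
    }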
--

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>