Kevin, go for it.
Yes, our tests need a lot of love. Actually, most of the tests are not really unit tests but more like integration tests. Helping with the spring (as in the season) cleanup would be immensely appreciated.
Hadrian

On 02/24/2015 02:18 PM, Kevin Burton wrote:
Ah, OK… I assume continuous integration isn't a goal, then. I'm upset if my testing takes more than 20 minutes :-P

I bet this could be taken down to an hour if the tests were parallelized, but of course someone needs to take the time to get that done. I might allocate some of our internal dev resources to getting our build to parallelize across containers. If it's generalizable to other projects, I'll OSS it.

Would be great to get better testing feedback. Right now I think my patches are ready to be merged, but I'd feel more comfortable suggesting it if I knew the tests worked. Should I propose it now, and if there are any bugs we can fix them later?

Kevin

On Tue, Feb 24, 2015 at 10:54 AM, Timothy Bish <tabish...@gmail.com> wrote:

On 02/24/2015 01:22 PM, Kevin Burton wrote:
> Oh geez… seriously though, can you give me an estimate? Assuming 30 seconds each, I'd estimate about 11 hours.

That's probably not too far off; usually I just let them run overnight for the "all" profile.

http://activemq.apache.org/junit-reports.html
https://hudson.apache.org/hudson/job/ActiveMQ/

> Hudson is down for ActiveMQ. Also, is there a disk space requirement for the tests? I have 10GB free or so and I'm getting TONS of these:

I don't think anyone has looked into the minimum free space required for the tests to run.

    java.lang.RuntimeException: Failed to start provided job scheduler store: JobSchedulerStore:activemq-data
        at java.io.RandomAccessFile.seek(Native Method)
        at org.apache.activemq.util.RecoverableRandomAccessFile.seek(RecoverableRandomAccessFile.java:384)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.readPage(PageFile.java:877)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:427)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:377)
        at org.apache.activemq.store.kahadb.disk.index.BTreeIndex.load(BTreeIndex.java:159)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$MetaData.load(JobSchedulerStoreImpl.java:90)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$3.execute(JobSchedulerStoreImpl.java:277)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
        at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl.doStart(JobSchedulerStoreImpl.java:261)
        at org.apache.activemq.util.ServiceSupport.start(ServiceSupport.java:55)
        at org.apache.activemq.broker.BrokerService.setJobSchedulerStore(BrokerService.java:1891)
        at org.apache.activemq.transport.stomp.StompTestSupport.createBroker(StompTestSupport.java:174)
        at org.apache.activemq.transport.stomp.StompTestSupport.startBroker(StompTestSupport.java:112)
        at org.apache.activemq.transport.stomp.StompTestSupport.setUp(StompTestSupport.java:94)
        at org.apache.activemq.transport.stomp.Stomp12Test.setUp(Stomp12Test.java:41)
        at org.apache.activemq.transport.stomp.Stomp12SslAuthTest.setUp(Stomp12SslAuthTest.java:38)

On Tue, Feb 24, 2015 at 10:09 AM, Timothy Bish <tabish...@gmail.com> wrote:

On 02/24/2015 01:01 PM, Kevin Burton wrote:
> How long do the tests usually take?

An eternity + 1hr.

> I'm looking at 45 minutes right now before I gave up… I think part of it was that the box I was testing on was virtualized and didn't have enough resources. I tried to parallelize the tests (-T 8 with Maven) but I got other errors, so I assume the ports are singletons.

They won't be happy if you run them in parallel.
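[A likely culprit for the parallel failures above is that each test binds the same fixed broker port, so concurrent forks collide. A minimal sketch of one common workaround, letting the OS pick a free port per test; the pickFreePort helper is hypothetical, not part of ActiveMQ's test support classes:]

    import java.io.IOException;
    import java.net.ServerSocket;

    import org.apache.activemq.broker.BrokerService;

    public class EphemeralPortBrokerExample {

        // Ask the OS for a currently free TCP port by binding port 0.
        // There is a small race between closing this probe socket and the
        // broker re-binding the port, but that is usually acceptable in tests.
        static int pickFreePort() throws IOException {
            try (ServerSocket socket = new ServerSocket(0)) {
                return socket.getLocalPort();
            }
        }

        public static void main(String[] args) throws Exception {
            int port = pickFreePort();

            BrokerService broker = new BrokerService();
            broker.setPersistent(false);
            broker.addConnector("tcp://localhost:" + port); // unique per forked test JVM
            broker.start();

            System.out.println("broker bound to port " + port);
            broker.stop();
        }
    }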
On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <gary.tu...@gmail.com> wrote:

If there are any test failures, try to run them individually: -Dtest=a,b etc. There may be an issue with a full test run, but all of the tests that are enabled should work. I know there are some issues with JDBC tests that hang or fail due to previous runs not cleaning up, but that should be most of it. I got a bunch of full test runs before the 5.11 release, if that is any help.

On 23 February 2015 at 20:38, Kevin Burton <bur...@spinn3r.com> wrote:

OK. This is ready to go and I have a patch branch: https://issues.apache.org/jira/browse/AMQ-5609

I'm stuck at the moment though because the tests don't pass. But the tests were failing before as well, so I don't think it has anything to do with my changes.

On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <bur...@spinn3r.com> wrote:

Actually, is the lock even needed here? Why would it be? If we're *removing* a subscription, why does it care if we possibly ALSO remove a separate/isolated queue before or after the subscription is removed? I think this is redundant and can be removed. Maybe I'm wrong, though. I looked at all the callers and none were associated with queues.

On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <bur...@spinn3r.com> wrote:

So I have some working/theoretical code that should resolve this. It acquires a lock *per* ActiveMQDestination; this way there is no lock contention. But here's where I'm stuck:

    @Override
    public void removeSubscription(ConnectionContext context, RemoveSubscriptionInfo info) throws Exception {
        inactiveDestinationsPurgeLock.readLock().lock();
        try {
            topicRegion.removeSubscription(context, info);
        } finally {
            inactiveDestinationsPurgeLock.readLock().unlock();
        }
    }

This is in RegionBroker. There is no ActiveMQDestination involved here, so I'm not sure of the best way to resolve this. Any advice?

On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <bur...@spinn3r.com> wrote:

Yes, that was my thinking too… that just replacing the CopyOnWriteArraySet with something more performant would solve the issue. This would also improve queue creation time as well as queue deletion time.

What I think I'm going to do in the meantime is:

- implement a granular lock based on queue name… I am going to use an interface so we can replace the implementation later (see the sketch after this message).
- implement timing for the purge thread so it tracks how long it takes to remove a queue, but also how long the entire loop takes.

I'll do this on a branch so it should be easy to merge.
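[For what it's worth, a minimal sketch of the granular per-queue-name lock described above, kept behind an interface so the implementation can be swapped later. This is only an illustration of the idea, not the code on the AMQ-5609 branch; the names DestinationLocks and PerDestinationLocks are hypothetical. It also sidesteps the open removeSubscription question, since there is no destination to key the lock on in that path:]

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical interface so the locking strategy can be replaced later.
    interface DestinationLocks {
        ReadWriteLock lockFor(String destinationName);
    }

    // One ReadWriteLock per destination name, created lazily. The purge
    // thread takes the write lock for the one destination it is removing;
    // producers/consumers take the read lock, so unrelated destinations
    // never contend with each other. Caveat: this map grows with the number
    // of names seen; a real implementation would evict entries on removal.
    class PerDestinationLocks implements DestinationLocks {
        private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

        @Override
        public ReadWriteLock lockFor(String destinationName) {
            return locks.computeIfAbsent(destinationName, name -> new ReentrantReadWriteLock());
        }
    }

    class PurgeExample {
        private final DestinationLocks locks = new PerDestinationLocks();

        void purgeInactiveDestination(String destinationName) {
            ReadWriteLock lock = locks.lockFor(destinationName);
            lock.writeLock().lock();
            try {
                // remove the inactive destination here; holding the write
                // lock closes the race with a producer re-creating it
            } finally {
                lock.writeLock().unlock();
            }
        }
    }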
On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:

A decent amount of the time is being spent calling remove() on various array-backed collections. Those data structures might be inappropriate for the number of destinations you're running, since array-backed collections tend to have add/remove operations that are O(N); some improvement might come from something as simple as moving to a ConcurrentHashSet instead of a CopyOnWriteArraySet, for example. (Or it might make performance worse because of other aspects of how those collections are used; people other than me would be in a better position to evaluate the full range of performance requirements for those collections.)

Scheduler.cancel() also takes an alarming amount of time for what looks like a really simple method (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29).

On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <bur...@spinn3r.com> wrote:

Here's a JProfiler view with the advisory support enabled, if you're curious: http://i.imgur.com/I1jesZz.jpg

I'm not familiar enough with the internals of ActiveMQ to have any obvious optimization ideas. One other idea I had (which would require a ton of refactoring, I think) would be to bulk delete all the queues at once.

On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <bur...@spinn3r.com> wrote:

And spending some more time in JProfiler: looks like 20% of this is due to schedulerSupport and the other 80% is due to advisorySupport. If I set both to false, the total runtime of my tests drops in half… and the latencies fall from

    max create producer latency: 10,176 ms
    max create message on existing producer and consumer: 2 ms

… to

    max create producer latency: 1 ms
    max create message on existing producer and consumer: 1 ms

and this is without even fixing the purge background lock.

So the question now is what the heck the advisory support is doing that can result in such massive performance overhead. … and I think advisorySupport is enabled by default, so that's problematic as well.

On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <bur...@spinn3r.com> wrote:

OK. Loaded up JProfiler and confirmed that it's not LevelDB. This is a non-persistent broker I'm testing on. Looks like it's spending all its time in CopyOnWriteArrayList.remove and Timer.purge… which is hopeful, because this is ALL due to ActiveMQ internals, and in theory LevelDB should perform well if we improve the performance of ActiveMQ internals and fix this lock bug. Which would rock! It should ALSO make queue creation faster.

On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <bur...@spinn3r.com> wrote:

On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
> So if LevelDB cleanup during removeDestination() is the presumed culprit, can we spin off the LevelDB cleanup work into a separate thread (better: a task object to be serviced by a ThreadPool so you can avoid a fork bomb if we remove many destinations at once) so the call to removeDestination() can return quickly and LevelDB can do its record-keeping in the background without blocking message-processing?

Would that be possible? If the delete is pending on ActiveMQ there is a race where a producer could re-create it unless the lock is held. Though I guess if you dispatched to the GC thread WITH the lock still held you would be OK, but I think if we use the existing purge thread then we're fine.

OK. I think I'm wrong about LevelDB being the issue. To be fair, I wasn't 100% certain before, but I should have specified. Our current production broker is running with persistent=false... and I just re-ran the tests without disk persistence and it has the same problem. So the main issue now is why the heck ActiveMQ is taking SO LONG to GC a queue. It's taking about 100ms, which is an insane amount of time considering this is all done in memory.

Kevin
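[For readers who want to reproduce the schedulerSupport/advisorySupport numbers above: both are plain flags on BrokerService, and advisorySupport is indeed on by default. A minimal sketch of a non-persistent broker with both disabled; the connector URI and the workload are placeholders:]

    import org.apache.activemq.broker.BrokerService;

    public class LeanBrokerExample {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // in-memory only, as in the tests above
            broker.setAdvisorySupport(false);  // skip advisory topics (on by default)
            broker.setSchedulerSupport(false); // skip the job scheduler store
            broker.addConnector("tcp://localhost:61616");
            broker.start();

            // ... run the workload ...

            broker.stop();
        }
    }

[One caveat: advisory topics are what features such as dynamic network-of-brokers discovery rely on, so disabling them is a trade-off rather than a free win.]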
--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

--
Tim Bish
Sr Software Engineer | RedHat Inc.
tim.b...@redhat.com | www.redhat.com
skype: tabish121 | twitter: @tabish121
blog: http://timbish.blogspot.com/