It sounds like your switch fabric might be the issue? Those types of hangs should show pretty frequent kernel alarms.
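
On the consumer reconnect question quoted below: here is a rough sketch of what a failover broker URL could look like, built with plain Java string handling. The `FailoverUrl` helper and the broker host names are made up for illustration. `initialReconnectDelay` and `maxReconnectDelay` are failover transport options, but the values here are guesses, so check them against the documentation for your ActiveMQ version.

```java
// Hypothetical helper (not part of ActiveMQ): assembles a failover: broker URL.
final class FailoverUrl {
    private FailoverUrl() {}

    // brokers: one or more transport URIs, e.g. "tcp://host:61616" (host names are assumptions)
    static String build(long initialDelayMs, long maxDelayMs, String... brokers) {
        if (brokers.length == 0) {
            throw new IllegalArgumentException("at least one broker URI is required");
        }
        // The failover transport takes a comma-separated list of URIs in parentheses,
        // followed by its own options as query parameters.
        return "failover:(" + String.join(",", brokers) + ")"
                + "?initialReconnectDelay=" + initialDelayMs
                + "&maxReconnectDelay=" + maxDelayMs;
    }

    public static void main(String[] args) {
        System.out.println(build(1000, 30000,
                "tcp://broker1.example:61616",
                "tcp://broker2.example:61616"));
    }
}
```

The resulting string would be passed as the broker URL when creating the connection factory, in place of a plain `tcp://` URL. With failover in place the client retries the connection itself instead of waiting forever on a dead socket. Registering a JMS `ExceptionListener` on the connection is another way for a consumer to hear about connection failures.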
On Jun 2, 2013, at 21:10, Christian Posta <christian.po...@gmail.com> wrote:

> You should check out the failover transport to handle reconnecting.
>
> On Sunday, June 2, 2013, fenbers wrote:
>
>> I don't know how to determine the NFS version, but we are running on RHEL 5.5.
>>
>> I have not checked the syslog. Thanks for the tip. I will do that after our morning Operations.
>>
>> We are also very inclined to believe this is an NFS issue, based on behaviors network-wide which have nothing to do with ActiveMQ, e.g., often taking 10 seconds to list just 5 files in an NFS-mounted directory.
>>
>> So, we are creating an action plan this weekend to eliminate as many NFS mount points as possible, and seeing how that helps the situation. The plan needs approval/buy-in from key people, so it may be a couple of weeks before it is implemented. In the meantime, ActiveMQ either shuts itself down or behaves in rather despondent ways, so we find we are having to restart ActiveMQ every 3 or 4 hours (and this frequency is slowly increasing).
>>
>> Once ActiveMQ is rebooted, we find that our producers and our consumers have to be shut down and relaunched in order to reestablish the connection with ActiveMQ. This is a royal pain! However, a producer will throw an exception whenever it tries to send a message through a lost connection, so I catch the exception, close the connection, and reopen it. Thus, my producers are able to reconnect automatically in the event ActiveMQ is restarted.
>>
>> But with the consumers, no exception is thrown while a consumer waits for notifications. It simply waits for a notification that never happens after the connection with ActiveMQ is lost. So what is your recommended method for a consumer to check for a disconnection? (Maybe I should post this question as a separate thread...)
>>
>> Mark
>>
>> On 5/29/2013 3:21 AM, rajdavies [via ActiveMQ] wrote:
>>
>>> Ultimately I'm pretty confident this problem is an NFS problem - and as Johan has already let the cat out of the bag ;) - let me ask the following:
>>>
>>> Which version of NFS 4 are you using and which environment?
>>>
>>> Have you checked the system logs for NFS errors on all the machines running ActiveMQ brokers?
>>>
>>> thanks,
>>>
>>> Rob
>>>
>>> On 29 May 2013, at 00:46, Christian Posta <[hidden email]> wrote:
>>>
>>>> I can make two recommendations.
>>>>
>>>> #1, being the preferred, create a test case that shows this... that will give us the best chance of finding out what's going on... take a look at the following test cases in the activemq source code to give you an idea about how to go about doing it...
>>>>
>>>> http://svn.apache.org/viewvc/activemq/trunk/activemq-unit-tests/src/test/java/org/apache/activemq/usecases/
>>>>
>>>> http://svn.apache.org/viewvc/activemq/trunk/activemq-unit-tests/src/test/java/org/apache/activemq/bugs/
>>>>
>>>> http://svn.apache.org/viewvc/activemq/trunk/activemq-unit-tests/src/test/java/org/apache/activemq/test/JmsTopicSendReceiveTest.java?view=markup
>>>>
>>>> #2, if creating a test case doesn't sound like something you want to get into.. i guess, give us the exact configs of broker, clients, number of consumers, number of topics, message sizes, etc, etc all details and if one of us gets the urge we can try it out on our boxes.
>>>> This will not be nearly as good as #1, and will provide a higher barrier to entry because we spend our spare time doing this and like to spend that time debugging and fixing, and not setting up environments and usecases which may not even show a bug :)
>>>>
>>>> On Tue, May 28, 2013 at 4:34 PM, fenbers <[hidden email]> wrote:
>>>>
>>>>> I'm getting the Sync exception on both local and NFS. Originally, I was only using a local disk, but there wasn't much disk space for the ever growing list of 33MB enumerated .log files that weren't cleaned up. So I reconfigured ActiveMQ to put these db files on an NFS mount. But the sync exceptions occurred either way.
>>>>>
>>>>> I've changed *all* my consumers to AUTO_ACKNOWLEDGE, thinking that maybe an ACKNOWLEDGEment leak was causing the undeleted files. That didn't help... The TRACE level logging points to only two of my 5 topics that accumulate these undeleted db files. So I've concentrated my scrutiny on consumers of these two topics, but have not found anything out of the ordinary.
>>>>>
>>>>> What is puzzling me still is that the frequency of the log file build-up and the frequency of exceptions continue to increase even though the number of messages sent per day by the producers remains nearly constant...
>>>>>
>>>>> Mark
>>>>>
>>>>> On 5/28/2013 6:06 PM, ceposta [via ActiveMQ] wrote:
>>>>>
>>>>>> Sounds like there's multiple issues...
>>>>>>
>>>>>> Your journal files aren't being cleaned up, AND you're getting the Sync exception?
>>>>>>
>>>>>> You get the sync exception on local disk mount? Or just NFS?
>>>>>>
>>>>>> If the journals aren't being cleaned up, are your consumers properly ack'ing messages?
>>>>>>
>>>>>> On Tue, May 28, 2013 at 2:42 PM, fenbers <[hidden email]> wrote:
>>>>>>
>>>>>>> I would LOVE to help you help me! But I have no idea how to go about making a test case. If you could drop some hints in this regard, I might be able to produce one.
>>>>>>>
>>>>>>> My ActiveMQ issues seem to be related to network slowness, which we are diagnosing separately. Or maybe it is the other way around, where ActiveMQ problems are causing network sluggishness. Either way, there seems to be a correlation, except that when network responsiveness improves, ActiveMQ does not.
>>>>>>>
>>>>>>> The problem I'm having with AMQ is progressive, which is even more puzzling, because we are not adding to the number of messages that AMQ has to handle. Today, we were up to 191 undeleted db-NNN.log files in the database directory before I stopped AMQ and deleted them. NNN was up to 451, so 260 files had been cleaned up by AMQ's automatic processes...
>>>>>>>
>>>>>>> Will log files assist you in helping me? I have TRACE level messages turned on, so they are quite large.
>
> --
> *Christian Posta*
> http://www.christianposta.com/blog
> twitter: @christianposta