I added a fix in the Jira AMQ-2774 thread. Eric-AWL
Eric-AWL wrote:
>
> We put a no-Duplex Configuration instead of a Duplex Configuration and it
> seemed to work better.... But today during a network problem (alternately
> on/off) our process doesn't resist ....
>
> We have
> - a thread dump which shows 85 StartLocalBridge Threads waiting for the
>   same latch in the DemandForwardingBridgeSupport.startLocalBridge method:
>
>     protected void startLocalBridge() throws Exception {
>         if (localBridgeStarted.compareAndSet(false, true)) {
>             synchronized (this) {
>                 if (LOG.isTraceEnabled()) {
>                     LOG.trace(configuration.getBrokerName() + " starting local Bridge, localBroker=" + localBroker);
>                 }
>                 remoteBrokerNameKnownLatch.await();
>                 ...
>     }
>
> - 960 CLOSE_WAIT
> - a file descriptor limit
>
> Will the transport.closeAsync=false flag be helpful here ?
>
> Eric-AWL
>
>
> Gary Tully wrote:
>>
>> Hi, as you can see, this is a complicated area of the code. The best
>> approach is to try and produce a test case for your scenario. Take a
>> look at the test: BrokerQueueNetworkWithDisconnectTest in
>> activemq-core. This can simulate network failures and can use
>> multicast (bridgeAllBrokers). Getting a reproducible test case is the
>> best way to validate your changes and protect them into the future.
>>
>> The only other alternative is to keep adding your suggestions to the
>> jira issue (https://issues.apache.org/activemq/browse/AMQ-2774) and
>> with a bit of luck I (or someone else) will have a chance to look at
>> it before 5.4 .
>>
>>
>> On 6 July 2010 12:37, Eric-AWL <eric.vinc...@atosorigin.com> wrote:
>>>
>>> I wonder if it could not have some undesirable effects on both sides of
>>> the duplex connection ....
>>>
>>> perhaps we should test the started AtomicBoolean, in the start() method
>>> after the corresponding "await" and shouldn't execute the end of the
>>> start method ?
>>>
>>>         if (configuration.isDuplex() && duplexInitiatingConnection == null) {
>>>             // initiator side of duplex network
>>>             remoteBrokerNameKnownLatch.await();
>>>         }
>>>
>>>         HERE ??? (if started.get()) { ???
>>>
>>>             try {
>>>                 triggerRemoteStartBridge();
>>>             } catch (IOException e) {
>>>                 LOG.warn("Caught exception from remote start", e);
>>>             }
>>>             NetworkBridgeListener l = this.networkBridgeListener;
>>>             if (l != null) {
>>>                 l.onStart(this);
>>>             }
>>>
>>> It's the first big problem I have with ActiveMQ complex configuration, it
>>> happens when network is faulty (that happens not very often), and I don't
>>> know ActiveMQ source code very well ....
>>>
>>> Who could help me to identify potential effects of this change, before I
>>> try to modify it ? (I can't do that on my production system without some
>>> tests and expert validation)
>>>
>>> Eric-AWL
>>>
>>>
>>> Gary Tully wrote:
>>>>
>>>> that seems reasonable. want to submit a patch against trunk?
>>>>
>>>> On 6 July 2010 12:10, Eric-AWL <eric.vinc...@atosorigin.com> wrote:
>>>>>
>>>>> What could happen if we add
>>>>>
>>>>>     if (configuration.isDuplex() && duplexInitiatingConnection == null) {
>>>>>         // initiator side of duplex network
>>>>>         remoteBrokerNameKnownLatch.countDown();
>>>>>     }
>>>>>
>>>>> into the stop() method of DemandForwardingBridgeSupport class ?
>>>>>
>>>>> Eric-AWL
>>>>>
>>>>>
>>>>> Eric-AWL wrote:
>>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I'm sure that I identified a Latch problem in Multicast Network
>>>>>> Discovery mechanism on Duplex connection
>>>>>>
>>>>>> The multicast notifier thread is blocked.
>>>>>> Here is the trace:
>>>>>>
>>>>>> "Notifier-MulticastDiscoveryAgent-listener:DiscoveryNetworkConnector:NOCSupervisorP5-ADMIN-OUT-IN:BrokerService[SIBBusModule-NOCP5-tpnocp08s-bus]" daemon prio=10 tid=0x0000000044ff2400 nid=0x1389 waiting on condition [0x0000000044c26000..0x0000000044c26b90]
>>>>>>    java.lang.Thread.State: WAITING (parking)
>>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>>         - parking to wait for <0x00002aaab3dd66f0> (a java.util.concurrent.CountDownLatch$Sync)
>>>>>>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
>>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:905)
>>>>>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1217)
>>>>>>         at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:207)
>>>>>>         at org.apache.activemq.network.DemandForwardingBridgeSupport.start(DemandForwardingBridgeSupport.java:231)
>>>>>>         at org.apache.activemq.network.DiscoveryNetworkConnector.onServiceAdd(DiscoveryNetworkConnector.java:114)
>>>>>>         at org.apache.activemq.transport.discovery.multicast.MulticastDiscoveryAgent$2.run(MulticastDiscoveryAgent.java:484)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>>>         at java.lang.Thread.run(Thread.java:619)
>>>>>>
>>>>>> The problem appears when the network goes quickly and alternately
>>>>>> on/off between the two components.
>>>>>> The bridge is created in one direction, but the answer can not be
>>>>>> received.
>>>>>>
>>>>>> The thread is blocked on the CountDownLatch.
>>>>>> Even if multicast frames are received, the component can not
>>>>>> establish a new network connection.
>>>>>>
>>>>>> Here is a corresponding ActiveMQ trace
>>>>>>
>>>>>> When it is OK :
>>>>>> 2010-06-22 22:56:24,500 [-tpnocp08s-bus]] INFO DiscoveryNetworkConnector - Establishing network connection from vm://SIBBusModule-NOCP5-tpnocp08s-bus to tcp://tpnocp11v-bus.vdm.priv.amm.noc:14101?useLocalHost=false
>>>>>> 2010-06-22 22:56:26,083 [nocp08s-bus#160] INFO DemandForwardingBridge - Network connection between vm://SIBBusModule-NOCP5-tpnocp08s-bus#160 and tcp://tpnocp11v-bus.vdm.priv.amm.noc/10.18.126.30:14101(SIBBusSupervisor-tpnocp11v-bus) has been established.
>>>>>>
>>>>>> 2010-06-22 22:57:34,807 [-tpnocp08s-bus]] INFO DemandForwardingBridge - SIBBusModule-NOCP5-tpnocp08s-bus bridge to SIBBusSupervisor-tpnocp11v-bus stopped
>>>>>>
>>>>>> 2010-06-22 22:57:34,811 [-tpnocp08s-bus]] INFO DiscoveryNetworkConnector - Establishing network connection from vm://SIBBusModule-NOCP5-tpnocp08s-bus to tcp://tpnocp11v-bus.vdm.priv.amm.noc:14101?useLocalHost=false
>>>>>> 2010-06-22 22:57:39,064 [nocp08s-bus#162] INFO DemandForwardingBridge - Network connection between vm://SIBBusModule-NOCP5-tpnocp08s-bus#162 and tcp://tpnocp11v-bus.vdm.priv.amm.noc/10.18.126.30:14101(SIBBusSupervisor-tpnocp11v-bus) has been established.
>>>>>>
>>>>>> 2010-06-22 22:58:42,578 [-tpnocp08s-bus]] INFO DemandForwardingBridge - SIBBusModule-NOCP5-tpnocp08s-bus bridge to SIBBusSupervisor-tpnocp11v-bus stopped
>>>>>>
>>>>>> When it is KO : "Unknown"
>>>>>>
>>>>>> 2010-06-22 22:58:42,648 [-tpnocp08s-bus]] INFO DiscoveryNetworkConnector - Establishing network connection from vm://SIBBusModule-NOCP5-tpnocp08s-bus to tcp://tpnocp11v-bus.vdm.priv.amm.noc:14101?useLocalHost=false
>>>>>> 2010-06-22 22:59:18,031 [18.126.30:14101] WARN DemandForwardingBridge - Network connection between vm://SIBBusModule-NOCP5-tpnocp08s-bus#164 and tcp://tpnocp11v-bus.vdm.priv.amm.noc/10.18.126.30:14101 shutdown due to a remote error: java.net.SocketException: Connection reset
>>>>>> 2010-06-22 22:59:18,033 [NetworkBridge ] INFO DemandForwardingBridge - SIBBusModule-NOCP5-tpnocp08s-bus bridge to Unknown stopped
>>>>>>
>>>>>>
>>>>>> Here is the other side corresponding activemq trace
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:55:44,295 [26.190.27:40517] INFO TransportConnection - Created Duplex Bridge back to SIBBusModule-NOCP5-tpnocp08s-bus
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:56:24,438 [26.190.27:40517] INFO DemandForwardingBridge - SIBBusSupervisor-tpnocp11v-bus bridge to SIBBusModule-NOCP5-tpnocp08s-bus stopped
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:56:26,135 [26.190.27:40518] INFO TransportConnection - Created Duplex Bridge back to SIBBusModule-NOCP5-tpnocp08s-bus
>>>>>> activemq-server.log:2010-06-22 22:56:26,135 [ocp11v-bus#1770] INFO DemandForwardingBridge - Network connection between vm://SIBBusSupervisor-tpnocp11v-bus#1770 and tcp:///10.26.190.27:40518(SIBBusModule-NOCP5-tpnocp08s-bus) has been established.
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:57:34,818 [26.190.27:40518] INFO DemandForwardingBridge - SIBBusSupervisor-tpnocp11v-bus bridge to SIBBusModule-NOCP5-tpnocp08s-bus stopped
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:57:39,153 [26.190.27:40519] INFO TransportConnection - Created Duplex Bridge back to SIBBusModule-NOCP5-tpnocp08s-bus
>>>>>> activemq-server.log:2010-06-22 22:57:39,153 [ocp11v-bus#1806] INFO DemandForwardingBridge - Network connection between vm://SIBBusSupervisor-tpnocp11v-bus#1806 and tcp:///10.26.190.27:40519(SIBBusModule-NOCP5-tpnocp08s-bus) has been established.
>>>>>>
>>>>>> activemq-server.log:2010-06-22 22:58:44,328 [26.190.27:40519] INFO DemandForwardingBridge - SIBBusSupervisor-tpnocp11v-bus bridge to SIBBusModule-NOCP5-tpnocp08s-bus stopped
>>>>>>
>>>>>>
>>>>>> Eric-AWL
>>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://old.nabble.com/MultiCast-Discovery-and-refusal-of-connection-tp28827529p29084235.html
>>>>> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
>>>>>
>>>>
>>>> --
>>>> http://blog.garytully.com
>>>>
>>>> Open Source Integration
>>>> http://fusesource.com
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/MultiCast-Discovery-and-refusal-of-connection-tp28827529p29084410.html
>>> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
>>>
>>
>> --
>> http://blog.garytully.com
>>
>> Open Source Integration
>> http://fusesource.com
>>
>

--
View this message in context:
http://old.nabble.com/MultiCast-Discovery-and-refusal-of-connection-tp28827529p29236977.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.
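
For readers following the thread, the two ideas discussed above (counting the latch down in stop(), and checking a flag after await() so the rest of start() is skipped) can be tried in isolation. What follows is only a minimal sketch under stated assumptions: the Bridge class is a hypothetical stand-in and not ActiveMQ's DemandForwardingBridgeSupport, the stopRequested flag and the onRemoteBrokerInfo() and startAndRelease() names are invented for illustration (the thread itself suggests testing the existing "started" AtomicBoolean), and this is not the patch attached to AMQ-2774.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

public class LatchBridgeSketch {

    /** Hypothetical stand-in for a duplex bridge; NOT the ActiveMQ class. */
    static class Bridge {
        private final CountDownLatch remoteBrokerNameKnownLatch = new CountDownLatch(1);
        // Assumed flag for this sketch; the thread proposes checking "started" instead.
        private final AtomicBoolean stopRequested = new AtomicBoolean(false);
        private final AtomicBoolean remoteStartTriggered = new AtomicBoolean(false);

        /** Blocks until the remote broker is known or stop() is called. */
        void start() throws InterruptedException {
            remoteBrokerNameKnownLatch.await();
            if (stopRequested.get()) {
                return; // released by stop(): skip the remainder of start()
            }
            remoteStartTriggered.set(true); // stands in for triggerRemoteStartBridge()
        }

        /** Normal path: the remote broker answered. */
        void onRemoteBrokerInfo() {
            remoteBrokerNameKnownLatch.countDown();
        }

        /** Failure path: set the flag first, then release any thread parked in start(). */
        void stop() {
            stopRequested.set(true);
            remoteBrokerNameKnownLatch.countDown();
        }

        boolean remoteStartTriggered() {
            return remoteStartTriggered.get();
        }
    }

    /**
     * Runs start() on a worker thread, then runs the given release action.
     * Returns true if the worker left start() within two seconds.
     */
    static boolean startAndRelease(Bridge bridge, Runnable release) {
        Thread starter = new Thread(() -> {
            try {
                bridge.start();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        starter.setDaemon(true);
        starter.start();
        release.run();
        try {
            starter.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !starter.isAlive();
    }

    public static void main(String[] args) {
        Bridge stopped = new Bridge();
        boolean released = startAndRelease(stopped, stopped::stop);
        System.out.println("stop(): released=" + released
                + ", remoteStartTriggered=" + stopped.remoteStartTriggered());

        Bridge connected = new Bridge();
        released = startAndRelease(connected, connected::onRemoteBrokerInfo);
        System.out.println("remote info: released=" + released
                + ", remoteStartTriggered=" + connected.remoteStartTriggered());
    }
}
```

Without the countDown() in stop(), the worker thread in the first case would stay parked on the latch, which matches the blocked notifier thread in the dump above. Whether the same change is safe on both sides of a real duplex connection is exactly the open question in the thread, so a reproducible test such as BrokerQueueNetworkWithDisconnectTest remains the way to validate it.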