We have an issue in our Tomcat 7.0.30 clustered production environment on RHEL 5 where Tomcat fails to start our application when other nodes in the cluster are under extremely heavy load. It fails because the BackupManager cannot start the replicated map due to timeouts trying to connect to all the other nodes. The only way to recover from this seems to be shutting down almost all of the nodes and then starting them again. The cluster has 9 nodes, but we have also had the problem with 6 nodes.
Is there a way to ensure the application will start even if the BackupManager cannot connect to the other nodes? Or is there some configuration we can change to prevent the connection failure? E.g. should we set some of the attributes on <Transport> such as timeout, maxRetryAttempts, poolSize, or maxWait? We see many "WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms." messages when we we have relatively heavy load. Should we increase maxThreads on <Receiver> to avoid this message and potentially prevent the connection failures? This is the server.xml: <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8"> <Manager className="org.apache.catalina.ha.session.BackupManager" mapSendOptions="8" expireSessionsOnShutdown="false" notifyListenersOnReplication="true"/> <Channel className="org.apache.catalina.tribes.group.GroupChannel"> <Membership className="org.apache.catalina.tribes.membership.McastService" address="228.0.0.4" port="45564" frequency="500" dropTime="3000"/> <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver" address="auto" port="4001" selectorTimeout="100" maxThreads="6"/> <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter"> <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/> </Sender> <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector" /> <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15In terceptor"/> <Interceptor className="org.apache.catalina.tribes.group.interceptors.ThroughputIntercept or"/> </Channel> <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.htm|.*\.html|.*\.css|.*\ .txt"/> <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" /> <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/> <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/> </Cluster> --------------------- This is log for the tcp://{10, 230, 20, 85} node that failed to startup because it timed out connecting to 6 of the 8 other nodes: Jun 26, 2013 5:29:01 PM org.apache.catalina.tribes.tipis.AbstractReplicatedMap init WARNING: Unable to send map start message. Jun 26, 2013 5:29:01 PM org.apache.catalina.ha.session.BackupManager startInternal SEVERE: Unable to start BackupManager: [/appv5] java.lang.RuntimeException: Unable to start replicated map. at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicat edMap.java:234) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.<init>(AbstractReplic atedMap.java:176) at org.apache.catalina.tribes.tipis.LazyReplicatedMap.<init>(LazyReplicatedMap. java:104) at org.apache.catalina.ha.session.BackupManager.startInternal(BackupManager.jav a:163) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java: 5294) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:9 01) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618) at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:963) at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1600) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja va:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9 08) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001; tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001; tcp://{10, 230, 20, 95}:4001; tcp://{10, 230, 20, 70}:4001; tcp://{10, 230, 20, 89}:4001; at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(Paral lelNioSender.java:109) at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(Po oledParallelSender.java:54) at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(Repl icationTransmitter.java:79) at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoord inator.java:78) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.ThroughputInterceptor.sendMess age(ThroughputInterceptor.java:64) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.MessageDispatchInterceptor.sen dMessage(MessageDispatchInterceptor.java:77) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage (TcpFailureDetector.java:89) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:224) at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:182) at org.apache.catalina.tribes.group.RpcChannel.send(RpcChannel.java:96) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.broadcast(AbstractRep licatedMap.java:315) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicat edMap.java:223) ... 17 more Jun 26, 2013 5:29:01 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Error manager.start() org.apache.catalina.LifecycleException: Failed to start component [org.apache.catalina.ha.session.BackupManager[/appv5]] at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:154) at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java: 5294) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:9 01) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:618) at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:963) at org.apache.catalina.startup.HostConfig$DeployWar.run(HostConfig.java:1600) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.ja va:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:9 08) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.catalina.LifecycleException: Failed to start BackupManager: [/appv5] at org.apache.catalina.ha.session.BackupManager.startInternal(BackupManager.jav a:172) at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150) ... 13 more Caused by: java.lang.RuntimeException: Unable to start replicated map. at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicat edMap.java:234) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.<init>(AbstractReplic atedMap.java:176) at org.apache.catalina.tribes.tipis.LazyReplicatedMap.<init>(LazyReplicatedMap. java:104) at org.apache.catalina.ha.session.BackupManager.startInternal(BackupManager.jav a:163) ... 14 more Caused by: org.apache.catalina.tribes.ChannelException: Operation has timed out(3000 ms.).; Faulty members:tcp://{10, 230, 20, 86}:4001; tcp://{10, 230, 20, 87}:4001; tcp://{10, 230, 20, 94}:4001; tcp://{10, 230, 20, 95}:4001; tcp://{10, 230, 20, 70}:4001; tcp://{10, 230, 20, 89}:4001; at org.apache.catalina.tribes.transport.nio.ParallelNioSender.sendMessage(Paral lelNioSender.java:109) at org.apache.catalina.tribes.transport.nio.PooledParallelSender.sendMessage(Po oledParallelSender.java:54) at org.apache.catalina.tribes.transport.ReplicationTransmitter.sendMessage(Repl icationTransmitter.java:79) at org.apache.catalina.tribes.group.ChannelCoordinator.sendMessage(ChannelCoord inator.java:78) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.ThroughputInterceptor.sendMess age(ThroughputInterceptor.java:64) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.MessageDispatchInterceptor.sen dMessage(MessageDispatchInterceptor.java:77) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.sendMessage (TcpFailureDetector.java:89) at org.apache.catalina.tribes.group.ChannelInterceptorBase.sendMessage(ChannelI nterceptorBase.java:79) at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:224) at org.apache.catalina.tribes.group.GroupChannel.send(GroupChannel.java:182) at org.apache.catalina.tribes.group.RpcChannel.send(RpcChannel.java:96) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.broadcast(AbstractRep licatedMap.java:315) at org.apache.catalina.tribes.tipis.AbstractReplicatedMap.init(AbstractReplicat edMap.java:223) ... 17 more Jun 26, 2013 5:29:01 PM org.apache.catalina.core.StandardContext startInternal SEVERE: Context [/appv5] startup failed due to previous errors Jun 26, 2013 5:29:01 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared INFO: Verification complete. Member already disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, 230, 20, 70}:4001,{10, 230, 20, 70},4001, alive=83064341, securePort=-1, UDP Port=-1, id={42 5 66 -93 46 -96 75 68 -90 -107 -47 -120 1 -126 54 -85 }, payload={}, command={}, domain={}, ]] Jun 26, 2013 5:29:01 PM org.apache.catalina.tribes.group.GroupChannel$HeartbeatThread run SEVERE: Unable to send heartbeat through Tribes interceptor stack. Will try to sleep again. java.util.ConcurrentModificationException at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:372) at java.util.AbstractList$Itr.next(AbstractList.java:343) at org.apache.catalina.tribes.group.GroupChannel.heartbeat(GroupChannel.java:15 8) at org.apache.catalina.tribes.group.GroupChannel$HeartbeatThread.run(GroupChann el.java:690) --------------------- This is the log for one of the other nodes that tcp://{10, 230, 20, 85} was trying to connect to: Jun 26, 2013 5:28:42 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:false):sun.nio.ch.SelectionKeyImpl@3b5a3910 last access:2013-06-26 17:28:33.152 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:28:51 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:true):sun.nio.ch.SelectionKeyImpl@3b5a3910 last access:2013-06-26 17:28:42.42 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:29:09 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:true):sun.nio.ch.SelectionKeyImpl@c0bf55e last access:2013-06-26 17:28:51.394 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:29:19 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:false):sun.nio.ch.SelectionKeyImpl@4adae0bf last access:2013-06-26 17:29:09.031 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:29:28 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10 , 230, 20, 94}:4001,{10, 230, 20, 94},4001, alive=82356886, securePort=-1, UDP Port=-1, id={-124 80 -71 68 110 -111 77 42 -87 -19 40 10 19 58 119 -25 }, payload={}, command={}, domain={}, ]] message. Will verify. Jun 26, 2013 5:29:28 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:true):sun.nio.ch.SelectionKeyImpl@4adae0bf last access:2013-06-26 17:29:19.386 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:29:40 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:true):sun.nio.ch.SelectionKeyImpl@2e693d38 last access:2013-06-26 17:29:37.841 Possible cause: all threads used, perform thread dump Jun 26, 2013 5:29:46 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:false):sun.nio.ch.SelectionKeyImpl@5bc45082 last access:2013-06-26 17:29:43.859 Possible cause: all threads used, perform thread dump -- Patrick