I have two Tomcat 6.0.16 servers set up in a cluster, running on a Windows 2003 VM as a Windows service with Java 1.6.0_10. After 10 to 14 days of running, one of the Tomcat instances starts using 100% of the server CPU.
Through JConsole I can see that the NioReceiver thread is the top CPU-consuming thread, whereas it is normally at the bottom with next to no CPU usage. When I restart the Tomcat6 Windows service everything goes back to normal, but a couple of days later the other server in the cluster needs to be restarted as well.

I searched for similar occurrences, but I was only able to find a problem with the NIO selector when running on Linux, and that was supposed to have been fixed in an earlier build of 1.6.

I used the cluster setup from the Tomcat manual, with the exception of using synchronous replication (channelSendOptions="4"):

<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster"
         channelSendOptions="4">
  <Manager className="org.apache.catalina.ha.session.DeltaManager"
           expireSessionsOnShutdown="false"
           notifyListenersOnReplication="true" />
  <Channel className="org.apache.catalina.tribes.group.GroupChannel">
    <Membership className="org.apache.catalina.tribes.membership.McastService"
                address="228.0.0.4"
                port="45564"
                frequency="500"
                dropTime="3000" />
    <Receiver className="org.apache.catalina.tribes.transport.nio.NioReceiver"
              address="auto"
              port="4000"
              autoBind="100"
              selectorTimeout="5000"
              maxThreads="6" />
    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender" />
    </Sender>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector" />
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor" />
  </Channel>
  <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="" />
  <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve" />
  <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener" />
  <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener" />
</Cluster>

I took a thread dump during the most recent occurrence:

[2010-05-04 07:49:40] [info] "NioReceiver" daemon prio=6 tid=0x54f9b400 nid=0x2e8 runnable [0x5522f000..0x5522fa18]
[2010-05-04 07:49:40] [info]    java.lang.Thread.State: RUNNABLE
[2010-05-04 07:49:40] [info]     at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
[2010-05-04 07:49:40] [info]     at sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(Unknown Source)
[2010-05-04 07:49:40] [info]     at sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(Unknown Source)
[2010-05-04 07:49:40] [info]     at sun.nio.ch.WindowsSelectorImpl.doSelect(Unknown Source)
[2010-05-04 07:49:40] [info]     at sun.nio.ch.SelectorImpl.lockAndDoSelect(Unknown Source)
[2010-05-04 07:49:40] [info]     - locked <0x07563448> (a sun.nio.ch.Util$1)
[2010-05-04 07:49:40] [info]     - locked <0x07563458> (a java.util.Collections$UnmodifiableSet)
[2010-05-04 07:49:40] [info]     - locked <0x075633d0> (a sun.nio.ch.WindowsSelectorImpl)
[2010-05-04 07:49:40] [info]     at sun.nio.ch.SelectorImpl.select(Unknown Source)
[2010-05-04 07:49:40] [info]     at org.apache.catalina.tribes.transport.nio.NioReceiver.listen(NioReceiver.java:243)
[2010-05-04 07:49:40] [info]     at org.apache.catalina.tribes.transport.nio.NioReceiver.run(NioReceiver.java:353)
[2010-05-04 07:49:40] [info]     at java.lang.Thread.run(Unknown Source)

The only other thing I have noticed is that every evening, around the same time, the following messages appear in the catalina log for 5 to 30 minutes:

Apr 28, 2010 6:47:16 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, -116, 111, 42}:4000,{10, -116, 111, 42},4000, alive=155973672,id={78 -71 -19 48 57 82 65 122 -80 52 -24 28 -126 95 77 27 }, payload={}, command={}, domain={}, ]] message. Will verify.
Apr 28, 2010 6:47:16 PM org.apache.catalina.tribes.transport.nio.NioReceiver socketTimeouts
WARNING: Channel key is registered, but has had no interest ops for the last 3000 ms. (cancelled:false):sun.nio.ch.selectionkeyi...@a3ae07 last access:2010-04-28 18:47:10.283

And this is the last message I see every day:

Apr 28, 2010 6:47:29 PM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{10, -116, 111, 42}:4000,{10, -116, 111, 42},4000, alive=156009672,id={78 -71 -19 48 57 82 65 122 -80 52 -24 28 -126 95 77 27 }, payload={}, command={}, domain={}, ]]

I'm trying to track down what in our environment prevents the two instances from communicating during that window, and I'm not sure whether this is what causes the NioReceiver thread to use all the CPU. Any help identifying the cause of the CPU usage increase would be appreciated.
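
In case it helps anyone reproduce the JConsole observation without a GUI attached, below is a minimal sketch of how the per-thread CPU ranking could be read from inside the JVM with the standard java.lang.management API. The class name ThreadCpuRanker and its Entry type are hypothetical and not part of Tomcat or Tribes. It only sees the threads of the JVM it runs in, so it is an assumption that it would be deployed inside the affected Tomcat instance (for example, called from a small diagnostic page).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical diagnostic helper; not part of Tomcat or Tribes.
class ThreadCpuRanker {

    /** A live thread's name and its accumulated CPU time in nanoseconds. */
    static final class Entry {
        final String name;
        final long cpuNanos;

        Entry(String name, long cpuNanos) {
            this.name = name;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns live threads of this JVM, highest accumulated CPU time first. */
    static List<Entry> rank() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Entry> out = new ArrayList<Entry>();
        // CPU timing is optional in the JVM spec; return an empty list if unavailable.
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return out;
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);      // -1 if the thread has died
            ThreadInfo info = mx.getThreadInfo(id);  // null if the thread has died
            if (cpu < 0 || info == null) {
                continue;
            }
            out.add(new Entry(info.getThreadName(), cpu));
        }
        Collections.sort(out, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) {
                // Descending by CPU time.
                return a.cpuNanos < b.cpuNanos ? 1 : (a.cpuNanos > b.cpuNanos ? -1 : 0);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        for (Entry e : rank()) {
            System.out.println((e.cpuNanos / 1000000L) + " ms  " + e.name);
        }
    }
}
```

The values are totals since each thread started, so a thread that is currently spinning shows up most clearly when two samples taken a few seconds apart are compared, rather than from a single reading.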
Thanks,
Ryan