Details of the Tomcat clustering load-testing environment:

Application: A web portal, a pure JSP/Servlet implementation using JDBC (Oracle 10g RAC); OLTP in nature.
Load test tool: JMeter

Clustering setup: 4 nodes
OS: SUSE Enterprise 9 (SP2) on all nodes (kernel: 2.6.5-7.97)
Software: JDK 1.5.0_05, Tomcat 5.5.12

Hardware configuration:
Node #1: Dual Pentium III (Coppermine) 1 GHz, 1 GB RAM
Node #2: Single Intel(R) XEON(TM) CPU 2.00GHz, 1 GB RAM
Node #3: Dual Pentium III (Coppermine) 1 GHz, 1 GB RAM
Node #4: Single Intel(R) XEON(TM) CPU 2.00GHz, 1 GB RAM

Network configuration:
All nodes are behind an Alteon load balancer (response-time based load balancing). Each node has two NICs: one on subnet 10.1.13.0 for the load-balancing network and one on 10.1.11.0 for the private LAN. The private NIC has multicast enabled. All private NICs are connected to a 10/100 Fast Ethernet switch.

Tomcat cluster configuration (same on all nodes):

    <Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster"
             managerClassName="org.apache.catalina.cluster.session.DeltaManager"
             expireSessionsOnShutdown="false"
             useDirtyFlag="true"
             notifyListenersOnReplication="true">

        <Membership className="org.apache.catalina.cluster.mcast.McastService"
                    mcastAddr="228.0.0.4"
                    mcastPort="45564"
                    mcastFrequency="1000"
                    mcastDropTime="35000"
                    mcastBindAddr="auto" />

        <Receiver className="org.apache.catalina.cluster.tcp.ReplicationListener"
                  tcpListenAddress="auto"
                  tcpListenPort="4001"
                  tcpThreadCount="24"/>

        <Sender className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
                replicationMode="pooled"
                autoConnect="true"
                keepAliveTimeout="-1"
                maxPoolSocketLimit="600"
                doTransmitterProcessingStats="true" />

        <Valve className="org.apache.catalina.cluster.tcp.ReplicationValve"
               filter=".*\.gif;.*\.js;.*\.jpg;.*\.png;.*\.htm;.*\.html;.*\.css;.*\.txt;"/>

        <Deployer className="org.apache.catalina.cluster.deploy.FarmWarDeployer"
                  tempDir="/tmp/war-temp/"
                  deployDir="/tmp/war-deploy/"
                  watchDir="/tmp/war-listen/"
                  watchEnabled="false"/>

        <ClusterListener className="org.apache.catalina.cluster.session.ClusterSessionListener"/>
    </Cluster>

Note: the application requires session availability on all the nodes, so I am using "pooled" replication mode.

Tomcat VM parameters (additional switches for VM tuning):

    -XX:+AggressiveHeap -Xms832m -Xmx832m -XX:+UseParallelGC -XX:+PrintGCDetails -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=9

After starting Tomcat on all the nodes, when I run the JMeter scripts with 20-70 concurrent user threads, the entire cluster works fine (almost 0% errors). With a high number of users (more than 200 concurrent user threads), however, Tomcat cluster session replication starts failing consistently and replication messages get lost. Here is what I get in the Tomcat logs on all the nodes (many times over):

    WARNING: Message lost: [10.1.11.95:4,001] type=[org.apache.catalina.cluster.session.SessionMessageImpl], id=[40FC741DB987BF5161C3AEEB32570A8E-1134732225260]
    java.net.SocketException: Broken pipe
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:124)
        at org.apache.catalina.cluster.tcp.DataSender.writeData(DataSender.java:858)
        at org.apache.catalina.cluster.tcp.DataSender.pushMessage(DataSender.java:799)
        at org.apache.catalina.cluster.tcp.DataSender.sendMessage(DataSender.java:623)
        at org.apache.catalina.cluster.tcp.PooledSocketSender.sendMessage(PooledSocketSender.java:128)
        at org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageData(ReplicationTransmitter.java:867)
        at org.apache.catalina.cluster.tcp.ReplicationTransmitter.sendMessageClusterDomain(ReplicationTransmitter.java:460)
        at org.apache.catalina.cluster.tcp.SimpleTcpCluster.sendClusterDomain(SimpleTcpCluster.java:1012)
        at org.apache.catalina.cluster.session.DeltaManager.send(DeltaManager.java:629)
        at org.apache.catalina.cluster.session.DeltaManager.sendCreateSession(DeltaManager.java:617)
        at org.apache.catalina.cluster.session.DeltaManager.createSession(DeltaManager.java:593)
        at org.apache.catalina.cluster.session.DeltaManager.createSession(DeltaManager.java:572)
        .............................
        .............................

I have also seen the following error, less frequently, on two of the nodes (#3 and #4):

    SEVERE: TCP Worker thread in cluster caught 'java.lang.ArrayIndexOutOfBoundsException: 1025' closing channel
    java.lang.ArrayIndexOutOfBoundsException: 1025
        at org.apache.catalina.cluster.io.XByteBuffer.toInt(XByteBuffer.java:231)
        at org.apache.catalina.cluster.io.XByteBuffer.countPackages(XByteBuffer.java:164)
        at org.apache.catalina.cluster.io.ObjectReader.append(ObjectReader.java:87)
        at org.apache.catalina.cluster.tcp.TcpReplicationThread.drainChannel(TcpReplicationThread.java:127)
        at org.apache.catalina.cluster.tcp.TcpReplicationThread.run(TcpReplicationThread.java:69)

With all the above warnings and exceptions, I get the following JMeter results (script run with 200 concurrent threads, 5 iterations, 0 sec ramp-up period):

Rate: 28 req/sec
Error: 9.07 %

The rate is acceptable, but the error rate is very high, and it climbs further as the number of user threads increases. I have run the JMeter script several times while tweaking the cluster configuration, but I cannot figure out what I am doing wrong.

Is "Broken pipe" a real failure and a serious blocker, or can it safely be ignored? The "ArrayIndexOutOfBoundsException" looks like a bug to me; it may already have been reported, but I don't know.

In the current scenario, memory usage stays below 600 MB. My target is to reach 2000 concurrent user threads while keeping errors within 3% and maintaining the same req/sec. Does this mean I have to add more memory (making it 2 GB on each node)? Is there something else I am missing that I need to look at?

Any suggestions, ideas, or tips are most welcome and appreciated.

Thanks,
Yogi