Cluster Not Syncing After Restart

David Rees Wed, 16 Jan 2008 17:26:19 -0800

I'm running a Tomcat 5.5.25 cluster with 2 nodes and the following
cluster configuration in the <Host/> element:


<Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster">
        <Valve className="org.apache.catalina.cluster.tcp.ReplicationValve"
               filter=".*\.gif;.*\.js;.*\.jpg;.*\.png;.*\.htm;.*\.html;.*\.css;.*\.txt;"/>
        <Valve className="org.apache.catalina.cluster.session.JvmRouteBinderValve"
               enabled="true"/>
        <ClusterListener className="org.apache.catalina.cluster.session.ClusterSessionListener"/>
        <ClusterListener className="org.apache.catalina.cluster.session.JvmRouteSessionIDBinderListener"/>
</Cluster>
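
As a side note on the ReplicationValve filter in this configuration, here is a minimal Python sketch of how a semicolon-separated list of regular expressions like this one can be applied to request URIs. It assumes each pattern must match the whole URI; that is an assumption about the valve's behavior, not something stated in this post. The helper name `is_filtered` and the sample URIs are hypothetical.

```python
import re

# Filter string copied from the ReplicationValve configuration.
FILTER = r".*\.gif;.*\.js;.*\.jpg;.*\.png;.*\.htm;.*\.html;.*\.css;.*\.txt;"

# Split on ';' and drop the empty entry left by the trailing separator.
PATTERNS = [re.compile(p) for p in FILTER.split(";") if p]


def is_filtered(uri):
    """Return True if the URI fully matches any filter pattern (assumed semantics)."""
    return any(p.fullmatch(uri) for p in PATTERNS)


if __name__ == "__main__":
    # Hypothetical request URIs, for illustration only.
    for uri in ("/img/logo.gif", "/app/cart.do", "/help/index.html"):
        print(uri, is_filtered(uri))
```

Under that assumption, static resources such as images and stylesheets match the filter, while application requests such as the hypothetical `/app/cart.do` do not.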

Every night, I restart the two nodes of the cluster, one at a time,
using a standard shutdown, with a process monitor starting Tomcat
back up automatically after it has exited (daemontools, for those
interested).

But after restarting the first node, it does not rejoin the cluster.
The logs of node 1 show this (the node was shut down at 04:30):

2008-01-17 04:30:07,212 INFO : SimpleTcpCluster: Cluster is about to start
2008-01-17 04:30:07,317 INFO : SimpleTcpCluster: Add Default
ClusterReceiver at cluster localhost
2008-01-17 04:30:07,326 INFO : SimpleTcpCluster: Add Default
ClusterSender at cluster localhost
2008-01-17 04:30:07,341 INFO : SocketReplicationListener: Open Socket
at [10.1.1.5:8015]
2008-01-17 04:30:07,363 ERROR: ClusterListener: Context manager doesn't exist:
^^^ That line repeats ~25,000 times for a couple seconds
2008-01-17 04:30:10,110 ERROR: ClusterListener: Context manager doesn't exist:
2008-01-17 04:30:11,443 INFO : McastService: membership mbean
registered (Catalina:type=ClusterMembership,host=localhost)
2008-01-17 04:30:11,790 INFO : DeltaManager: Starting clustering manager...:
2008-01-17 04:30:11,790 INFO : DeltaManager: Register manager  to
cluster element Host with name localhost
2008-01-17 04:30:11,790 INFO : DeltaManager: Starting clustering manager at
2008-01-17 04:30:11,794 WARN : DeltaManager: Manager [], requesting
session state from
org.apache.catalina.cluster.mcast.McastMember[tcp://10.1.1.6:8015,catalina,10.1.1.6,8015,
alive=86161577]. This operation will timeout if no session state has
been received within 60 seconds.
2008-01-17 04:30:21,894 ERROR: DeltaManager: Manager []: No session
state send at 17/01/08 04:30 received, timing out after 10,102 ms.
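
Note that the WARN line announces a 60 second timeout, while the ERROR line reports timing out after 10,102 ms. A minimal Python sketch to cross-check that against the log timestamps themselves is shown here; the helper name `elapsed_ms` is hypothetical and the two timestamps are copied from the WARN and ERROR lines.

```python
from datetime import datetime, timedelta

# Timestamp layout used by these log lines (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_ms(start, end):
    """Whole milliseconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0) // timedelta(milliseconds=1)


if __name__ == "__main__":
    requested = "2008-01-17 04:30:11,794"  # WARN: requesting session state
    timed_out = "2008-01-17 04:30:21,894"  # ERROR: no session state received
    print(elapsed_ms(requested, timed_out), "ms")
```

The gap between the two lines is 10,100 ms, which is consistent with the 10,102 ms in the ERROR message and well short of the 60 seconds mentioned in the WARN message.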

While node 1 is shutting down, ~4000 messages like this are generated on node 2:

2008-01-17 04:30:06,233 WARN : FastAsyncSocketSender: Unable to
asynchronously send session with id=[xxxxxxxxx.c-web2-1200438737718] -
message will be ignored.
java.lang.NullPointerException
        at org.apache.catalina.cluster.tcp.DataSender.pushMessage(DataSender.java:1057)
        at org.apache.catalina.cluster.tcp.FastAsyncSocketSender$FastQueueThread.pushQueuedMessages(FastAsyncSocketSender.java:524)
        at org.apache.catalina.cluster.tcp.FastAsyncSocketSender$FastQueueThread.run(FastAsyncSocketSender.java:487)
2008-01-17 04:30:06,235 WARN : FastAsyncSocketSender: Unable to
asynchronously send session with id=[xxxxxxxxx.c-web2-1200438739053] -
message will be ignored.
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
        at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
        at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
        at java.net.Socket.connect(Socket.java:519)
        at org.apache.catalina.cluster.tcp.DataSender.createSocket(DataSender.java:897)
        at org.apache.catalina.cluster.tcp.DataSender.openSocket(DataSender.java:866)
        at org.apache.catalina.cluster.tcp.DataSender.pushMessage(DataSender.java:1009)
        at org.apache.catalina.cluster.tcp.FastAsyncSocketSender$FastQueueThread.pushQueuedMessages(FastAsyncSocketSender.java:524)
        at org.apache.catalina.cluster.tcp.FastAsyncSocketSender$FastQueueThread.run(FastAsyncSocketSender.java:487)

These stop after about 1 second (last one at 04:30:07.340), and then a
fair number of "Send stats" lines are printed on node 2:

2008-01-17 04:30:07,376 INFO : DataSender: Send stats from
[10.1.1.5:8,015], Nr of bytes sent=20,441,242 over 26,000 = 786
bytes/request, processing time 58,669,413 msec, avg processing time
2,257 msec
<!-- More lines similar to the above and below -->
2008-01-17 04:30:10,053 INFO : DataSender: Send stats from
[10.1.1.5:8,015], Nr of bytes sent=39,242,737 over 50,000 = 784
bytes/request, processing time 58,672,059 msec, avg processing time
1,173 msec
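
The figures in these "Send stats" entries can be checked with a little arithmetic. The following is a minimal Python sketch, not part of the original setup: the regular expression is written against the two entries quoted here only, and the names `STATS_RE` and `parse_send_stats` are hypothetical.

```python
import re

# Pattern for the counters in a "Send stats" entry as quoted in this post.
STATS_RE = re.compile(
    r"Nr of bytes sent=([\d,]+) over ([\d,]+) = ([\d,]+) bytes/request, "
    r"processing time ([\d,]+) msec, avg processing time ([\d,]+) msec"
)


def parse_send_stats(text):
    """Extract the five counters from one (possibly line-wrapped) entry."""
    flat = " ".join(text.split())  # undo e-mail line wrapping
    match = STATS_RE.search(flat)
    if match is None:
        raise ValueError("not a Send stats entry")
    keys = ("bytes", "requests", "bytes_per_request", "total_ms", "avg_ms")
    return {k: int(v.replace(",", "")) for k, v in zip(keys, match.groups())}


FIRST = """Send stats from
[10.1.1.5:8,015], Nr of bytes sent=20,441,242 over 26,000 = 786
bytes/request, processing time 58,669,413 msec, avg processing time
2,257 msec"""

LAST = """Send stats from
[10.1.1.5:8,015], Nr of bytes sent=39,242,737 over 50,000 = 784
bytes/request, processing time 58,672,059 msec, avg processing time
1,173 msec"""

if __name__ == "__main__":
    first, last = parse_send_stats(FIRST), parse_send_stats(LAST)
    for stats in (first, last):
        # Recompute the two averages from the raw counters.
        print(stats["bytes"] / stats["requests"],
              stats["total_ms"] / stats["requests"])
    # Change in request count and total processing time between the entries.
    print(last["requests"] - first["requests"],
          last["total_ms"] - first["total_ms"])
```

The recomputed averages agree with the logged values to within one unit. Between the two entries the request counter rises by 24,000 while the total processing time rises by only 2,646 msec, so the large average processing time appears to come from time accumulated before 04:30:07 rather than from these sends.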

When node 2 restarts a couple minutes later, it is able to
successfully rejoin the cluster:

2008-01-17 04:34:05,433 INFO : SimpleTcpCluster: Cluster is about to start
2008-01-17 04:34:05,515 INFO : SimpleTcpCluster: Add Default
ClusterReceiver at cluster localhost
2008-01-17 04:34:05,521 INFO : SimpleTcpCluster: Add Default
ClusterSender at cluster localhost
2008-01-17 04:34:05,531 INFO : SocketReplicationListener: Open Socket
at [10.1.1.6:8015]
2008-01-17 04:34:05,592 INFO : ReplicationTransmitter: Start
ClusterSender at cluster Catalina:type=Cluster,host=localhost with
name Catalina:type=ClusterSender,host=localhost
2008-01-17 04:34:05,593 INFO : SimpleTcpCluster: Add Default
Membership Service at cluster localhost
2008-01-17 04:34:05,601 INFO : McastService: Setting cluster mcast
soTimeout to 1000
2008-01-17 04:34:05,618 INFO : McastService: Sleeping for 4000
milliseconds to establish cluster membership
2008-01-17 04:34:05,989 INFO : SimpleTcpCluster: Replication member
added:org.apache.catalina.cluster.mcast.McastMember[tcp://10.1.1.5:8015,catalina,10.1.1.5,8015,
alive=234556]
2008-01-17 04:34:05,998 INFO : FastAsyncSocketSender: Create sender
[/10.1.1.5:8,015] queue thread to tcp background replication
2008-01-17 04:34:09,634 INFO : McastService: membership mbean
registered (Catalina:type=ClusterMembership,host=localhost)
2008-01-17 04:34:10,029 INFO : DeltaManager: Starting clustering manager...:
2008-01-17 04:34:10,034 INFO : DeltaManager: Register manager  to
cluster element Host with name localhost
2008-01-17 04:34:10,034 INFO : DeltaManager: Starting clustering manager at
2008-01-17 04:34:10,060 WARN : DeltaManager: Manager [], requesting
session state from
org.apache.catalina.cluster.mcast.McastMember[tcp://10.1.1.5:8015,catalina,10.1.1.5,8015,
alive=238564]. This operation will timeout if no session state has
been received within 60 seconds.
2008-01-17 04:34:10,367 INFO : DeltaManager: Manager []; session state
send at 17/01/08 04:34 received in 316 ms.

But when I look at the web application manager on each node, the
number of sessions does not match, and sessions are not being synced
between the nodes.

If I restart node 1 multiple times, it will eventually sync up
(verified by watching the web application manager and checking that
the number of sessions matches), but I am not sure what eventually
lets it sync up.

What I've tried: waiting a minute before starting the node back up,
but that doesn't seem to help.
I haven't tried changing any cluster parameter settings, as I can't
determine whether changing anything would help.

Any ideas?

-Dave
