Are you using EC2?

On 11 May 2012, at 16:13, Pavel Polushkin wrote:

> We use version 1.0.8.
>
> From: David Leimbach [mailto:leim...@gmail.com]
> Sent: Friday, May 11, 2012 18:48
> To: user@cassandra.apache.org
> Subject: Re: Cassandra stucks
>
> What's the version number of Cassandra?
>
> On Fri, May 11, 2012 at 7:38 AM, Pavel Polushkin <ppolush...@enkata.com> wrote:
> Hello,
>
> We ran into a strange problem while testing performance on a Cassandra
> cluster. After some time, all nodes went into the down state for several
> days. Now all nodes are back up except one, which is still down.
>
> Nodetool on the down node throws an exception:
>
> Error connection to remote JMX agent!
> java.io.IOException: Failed to retrieve RMIServer stub: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>     java.net.SocketTimeoutException: Read timed out]
>     at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:340)
>     at javax.management.remote.JMXConnectorFactory.connect(JMXConnectorFactory.java:248)
>     at org.apache.cassandra.tools.NodeProbe.connect(NodeProbe.java:144)
>     at org.apache.cassandra.tools.NodeProbe.<init>(NodeProbe.java:114)
>     at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:623)
> Caused by: javax.naming.CommunicationException [Root exception is java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>     java.net.SocketTimeoutException: Read timed out]
>     at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:101)
>     at com.sun.jndi.toolkit.url.GenericURLContext.lookup(GenericURLContext.java:185)
>     at javax.naming.InitialContext.lookup(InitialContext.java:392)
>     at javax.management.remote.rmi.RMIConnector.findRMIServerJNDI(RMIConnector.java:1888)
>     at javax.management.remote.rmi.RMIConnector.findRMIServer(RMIConnector.java:1858)
>     at javax.management.remote.rmi.RMIConnector.connect(RMIConnector.java:257)
>     ... 4 more
> Caused by: java.rmi.ConnectIOException: error during JRMP connection establishment; nested exception is:
>     java.net.SocketTimeoutException: Read timed out
>     at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:286)
>     at sun.rmi.transport.tcp.TCPChannel.newConnection(TCPChannel.java:184)
>     at sun.rmi.server.UnicastRef.newCall(UnicastRef.java:322)
>     at sun.rmi.registry.RegistryImpl_Stub.lookup(Unknown Source)
>     at com.sun.jndi.rmi.registry.RegistryContext.lookup(RegistryContext.java:97)
>     ... 9 more
> Caused by: java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method)
>     at java.net.SocketInputStream.read(SocketInputStream.java:129)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     at java.io.DataInputStream.readByte(DataInputStream.java:248)
>     at sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:228)
>     ... 13 more
>
> The system log of the down node contains an endless list of messages like
> these:
>
> INFO [GossipStage:1] 2012-05-10 23:18:27,579 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:27,580 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.161 is now dead.
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.165 is now dead.
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.162 is now dead.
> INFO [GossipTasks:1] 2012-05-10 23:18:29,291 Gossiper.java (line 818) InetAddress /172.15.2.163 is now dead.
> INFO [GossipStage:1] 2012-05-10 23:18:29,291 Gossiper.java (line 804) InetAddress /172.15.2.161 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.162 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.163 is now UP
> INFO [GossipStage:1] 2012-05-10 23:18:29,292 Gossiper.java (line 804) InetAddress /172.15.2.165 is now UP
>
> The suspicious thing is that this node has several TCP connections with
> other nodes on port 7000 in the CLOSE_WAIT state:
>
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address            Foreign Address          State
> tcp   869073      0 rcwocas:afs3-fileserver  rcwocas03.enkata.:34274  CLOSE_WAIT
> tcp   463429      0 rcwocas:afs3-fileserver  rcwocas02.enkata.:39654  CLOSE_WAIT
> tcp   873838      0 rcwocas:afs3-fileserver  rcwocas01.enkata.:49486  CLOSE_WAIT
> tcp   860245      0 rcwocas:afs3-fileserver  rcwocas05.enkata.:43028  CLOSE_WAIT
> tcp      112      0 rcwocas:afs3-fileserver  rcwocas02.enkata.:40321  CLOSE_WAIT
> tcp     2124      0 rcwocas:afs3-fileserver  rcwocas03.enkata.:39338  CLOSE_WAIT
> tcp        0      0 rcwocas:afs3-fileserver  rcwocas01.enkata.:56408  ESTABLISHED
> tcp      184      0 rcwocas:afs3-fileserver  rcwocas01.enkata.:48862  CLOSE_WAIT
> tcp   534489      0 rcwocas:afs3-fileserver  rcwocas02.enkata.:35331  ESTABLISHED
> tcp      886      0 rcwocas:afs3-fileserver  rcwocas03.enkata.:56034  CLOSE_WAIT
> tcp        0      0 rcwocas04.Enkata.:48800  rcwocas:afs3-fileserver  ESTABLISHED
> tcp        0      0 rcwocas:afs3-fileserver  rcwocas01.enkata.:51348  ESTABLISHED
> tcp      187      0 rcwocas:afs3-fileserver  rcwocas05.enkata.:45538  CLOSE_WAIT
> tcp      253      0 rcwocas:afs3-fileserver  rcwocas03.enkata.:51359  CLOSE_WAIT
>
> I have also attached a thread dump.
>
> Thanks,
> Pavel
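
For anyone wanting to quantify the CLOSE_WAIT observation quoted above: a socket in CLOSE_WAIT means the remote peer has closed its end and the local process has not yet closed its own, and a non-zero Recv-Q means bytes have arrived that the process has not read. Below is a minimal sketch in Python, not part of the original report, that totals both per peer from netstat text. The function name `summarize_close_wait` and the `SAMPLE` constant are hypothetical; the sample rows are copied from the listing in the quoted message, and the six-column layout is assumed to match it.

```python
from collections import defaultdict


def summarize_close_wait(netstat_text):
    """Return {peer_host: (connection_count, unread_bytes)} for CLOSE_WAIT sockets.

    Assumes rows shaped like the quoted listing:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    summary = defaultdict(lambda: [0, 0])
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip banner, header, and anything that is not a six-column tcp row.
        if len(fields) != 6 or not fields[0].startswith("tcp"):
            continue
        _proto, recv_q, _send_q, _local, foreign, state = fields
        if state != "CLOSE_WAIT":
            continue
        # Drop the trailing ":port" to group connections by peer host.
        peer_host = foreign.rsplit(":", 1)[0]
        summary[peer_host][0] += 1
        summary[peer_host][1] += int(recv_q)
    return {host: tuple(totals) for host, totals in summary.items()}


# Hypothetical sample: four rows taken from the listing in the quoted message.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 869073 0 rcwocas:afs3-fileserver rcwocas03.enkata.:34274 CLOSE_WAIT
tcp 463429 0 rcwocas:afs3-fileserver rcwocas02.enkata.:39654 CLOSE_WAIT
tcp 0 0 rcwocas:afs3-fileserver rcwocas01.enkata.:56408 ESTABLISHED
tcp 112 0 rcwocas:afs3-fileserver rcwocas02.enkata.:40321 CLOSE_WAIT
"""

if __name__ == "__main__":
    for host, (count, unread) in sorted(summarize_close_wait(SAMPLE).items()):
        print(f"{host}: {count} CLOSE_WAIT, {unread} bytes unread")
```

Run against the full netstat output, a large unread total per peer would be consistent with the process on the down node no longer reading from its inter-node sockets, though the thread dump is what would confirm that.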