Hello,

I have isolated one of our data centers to simulate a rolling restart upgrade from C* 1.1.10 to 1.2.10. We replayed our production traffic to the C* nodes while the upgrade was in progress and observed an increased number of read timeouts.
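For context, the rough per-node procedure was the following (the service/package commands below are only placeholders for my environment, not an exact transcript):

    nodetool drain                 # flush memtables, stop accepting writes on this node
    sudo service cassandra stop    # node now shows as DOWN in "nodetool ring" on the other nodes
    # install the 1.2.10 binaries and merge any cassandra.yaml changes here
    sudo service cassandra start   # node rejoins the ring
    nodetool ring                  # verify all nodes report the upgraded node as UP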
I executed nodetool drain before upgrading each node, and during the upgrade "nodetool ring" showed that node as DOWN, as expected. After each upgrade, all nodes showed the upgraded node as UP again, so apparently all nodes were communicating fine.

I manually tried to insert data into and read data back from both the newly upgraded nodes and the old nodes, and the behavior was very unstable: sometimes it worked, sometimes it didn't (TimedOutException), so I don't think it was a network problem. The number of read timeouts diminished as the number of upgraded nodes increased, until it stabilized.

The logs were periodically showing the following messages:

    INFO [HANDSHAKE-/10.176.249.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 399) Handshaking version with /10.176.249.XX
    INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 408) Cannot handshake version with /10.176.182.YY
    INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 399) Handshaking version with /10.176.182.YY
    INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,510 OutboundTcpConnection.java (line 408) Cannot handshake version with /10.188.13.ZZ
    INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,511 OutboundTcpConnection.java (line 399) Handshaking version with /10.188.13.ZZ

    DEBUG [WRITE-/54.215.70.YY] 2013-10-03 18:01:50,237 OutboundTcpConnection.java (line 338) Target max version is -2147483648; no version information yet, will retry
    TRACE [HANDSHAKE-/10.177.14.XX] 2013-10-03 18:01:50,237 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.177.14.XX
    java.nio.channels.AsynchronousCloseException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:272)
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:176)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
        at java.io.InputStream.read(InputStream.java:82)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:64)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)

Another observation is that the number of completed compaction tasks decreased as the number of upgraded nodes increased. I don't know if that is related to the increased number of read timeouts or just a coincidence.

The timeout configuration is the default (10000 ms).

Two similar issues were reported, but without satisfactory responses:
- http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
- https://issues.apache.org/jira/browse/CASSANDRA-5740

Is this expected behavior, or is there something that might be going wrong during the upgrade? Has anyone faced similar issues? Any help would be very much appreciated.

Thanks,
Paulo