Hello,

I have isolated one of our data centers to simulate a rolling restart
upgrade from C* 1.1.10 to 1.2.10. We replayed our production traffic to the
C* nodes during the upgrade and observed an increased number of read
timeouts throughout the process.

I executed nodetool drain before upgrading each node, and during the
upgrade "nodetool ring" showed that node as DOWN, as expected. After
each upgrade, all nodes showed the upgraded node as UP, so apparently
all nodes were communicating fine.
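
For concreteness, the per-node procedure was roughly the sketch below (a
Python outline with placeholder service/package commands; the exact stop,
upgrade, and start steps depend on how Cassandra is installed, so treat it
as an illustration rather than the literal script we ran):

import subprocess

def upgrade_node():
    # Flush memtables and stop accepting traffic before shutting down.
    subprocess.check_call(["nodetool", "drain"])
    # Stop, upgrade, and restart Cassandra (placeholder commands; adapt to
    # your service manager and packaging).
    subprocess.check_call(["sudo", "service", "cassandra", "stop"])
    subprocess.check_call(["sudo", "apt-get", "install", "-y", "cassandra=1.2.10"])
    subprocess.check_call(["sudo", "service", "cassandra", "start"])
    # Verify that the rest of the ring sees the node as Up again.
    print(subprocess.check_output(["nodetool", "ring"]))

if __name__ == "__main__":
    upgrade_node()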

I manually tried to insert data into, and retrieve it from, both the newly
upgraded nodes and the old nodes, and the behavior was very unstable:
sometimes it worked, sometimes it didn't (TimedOutException), so I don't
think it was a network problem.
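
The manual check was along these lines (a minimal pycassa sketch; the
keyspace, column family, and host below are placeholders, and pycassa may
surface the timeout as MaximumRetryException after its internal retries):

from pycassa.pool import ConnectionPool, MaximumRetryException
from pycassa.columnfamily import ColumnFamily
from pycassa.cassandra.ttypes import TimedOutException

# Placeholder keyspace/column family/host; point at the node under test.
pool = ConnectionPool('TestKS', server_list=['10.176.249.XX:9160'])
cf = ColumnFamily(pool, 'TestCF')

try:
    cf.insert('probe-key', {'col': 'value'})  # simple write
    print(cf.get('probe-key'))                # read it back
except (TimedOutException, MaximumRetryException) as e:
    # This is the intermittent failure we saw during the upgrade.
    print('timed out: %r' % e)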

The number of read timeouts diminished as the number of upgraded nodes
increased, until it stabilized. The logs periodically showed the following
messages:

 INFO [HANDSHAKE-/10.176.249.XX] 2013-10-03 17:36:16,948 OutboundTcpConnection.java (line 399) Handshaking version with /10.176.249.XX
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 408) Cannot handshake version with /10.176.182.YY
 INFO [HANDSHAKE-/10.176.182.YY] 2013-10-03 17:36:17,280 OutboundTcpConnection.java (line 399) Handshaking version with /10.176.182.YY
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,510 OutboundTcpConnection.java (line 408) Cannot handshake version with /10.188.13.ZZ
 INFO [HANDSHAKE-/10.188.13.ZZ] 2013-10-03 17:36:17,511 OutboundTcpConnection.java (line 399) Handshaking version with /10.188.13.ZZ
DEBUG [WRITE-/54.215.70.YY] 2013-10-03 18:01:50,237 OutboundTcpConnection.java (line 338) Target max version is -2147483648; no version information yet, will retry
TRACE [HANDSHAKE-/10.177.14.XX] 2013-10-03 18:01:50,237 OutboundTcpConnection.java (line 406) Cannot handshake version with /10.177.14.XX
java.nio.channels.AsynchronousCloseException
        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:185)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:272)
        at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:176)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86)
        at java.io.InputStream.read(InputStream.java:82)
        at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:64)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.cassandra.net.OutboundTcpConnection$1.run(OutboundTcpConnection.java:400)

Another fact is that the number of completed compaction tasks decreased as
the number of upgraded nodes increased. I don't know if that's related to
the increased number of read timeouts or just a coincidence. The timeout
configuration is the default (10000ms).

Two similar issues were reported, but without satisfactory responses:

- http://stackoverflow.com/questions/15355115/rolling-upgrade-for-cassandra-1-0-9-cluster-to-1-2-1
- https://issues.apache.org/jira/browse/CASSANDRA-5740

Is this expected behavior, or could something be going wrong during the
upgrade? Has anyone faced similar issues?

Any help would be very much appreciated.

Thanks,

Paulo
