Hello,

I seem to have gotten my cluster into a bit of a strange state.
Pardon the rather verbose email, but there is a fair amount of
background.  I'm running a 3-node Cassandra 2.0.1 cluster.  This
particular cluster is used only intermittently for dev/testing and
does not see particularly heavy use; it's mostly a catch-all cluster
for environments which don't have a dedicated cluster to themselves.
I noticed today that one of the nodes had died because nodetool repair
was failing due to a down replica.  I ran nodetool status and, sure
enough, one of my nodes showed up as down.

When I looked on the actual box, the cassandra process was up and
running and everything in the logs looked sensible.  The most
suspicious thing I saw was one CMS garbage collection per hour, each
taking ~250 ms.  Nonetheless, the node was not responding, so I
restarted it.  So far so good: everything started up, and my ~30
column families across ~6 keyspaces all initialized.  The node then
completed handshakes with my other two nodes and reported them both as
up.  Here is where things get strange.  According to the logs on the
other two nodes, the third node has come back up and all is well.
However, on the third node, I see a wall of the following in the logs
(IP addresses masked):

 INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
 INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN
 INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:25,655
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
 INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
 INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:26,657
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
 INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
 INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:27,660
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
 INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [HANDSHAKE-/x.x.x.221] 2013-10-17 20:22:28,254
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.221
 INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
 INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
 INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
 INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:28,661
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
 INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
 INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
 INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN

Additionally, client requests to the cluster at consistency QUORUM
start failing (saying 2 responses were required but only 1 replica
responded).  Yet according to nodetool status, all the nodes are up.
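For what it's worth, that error message lines up with how Cassandra
derives the quorum size from the replication factor.  A minimal sketch
of the arithmetic (assuming RF=3 here, which the "2 required" in the
error implies; the RF isn't stated above):

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read/write to succeed."""
    return replication_factor // 2 + 1

# With RF=3, QUORUM needs 2 replicas.  If the flapping node leaves only
# one replica reachable for a token range, QUORUM requests on that
# range fail even though nodetool status claims everything is up.
print(quorum(3))  # 2
print(quorum(5))  # 3
```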

This is clearly not good.  I take down the problem node.  Nodetool
reports it down and QUORUM client reads/writes start working again.
In an attempt to get the cluster back into a good state, I delete all
the data on the problem node and then bring it back up.  The other two
nodes log a changed host ID for the IP of the node I wiped and then
handshake with it.  The problem node also comes up, but reads/writes
start failing again with the same error.

I decide to take the problem node down again.  However, this time,
even after the process is dead, nodetool and the other two nodes
report that my third node is still up, and requests to the cluster
continue to fail.  Running nodetool status against either of the live
nodes shows that all nodes are up.  Running nodetool status against
the dead node fails (unsurprisingly, since Cassandra is not even
running).
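To make the disagreement between the nodes' views easier to see, the
nodetool status output from each node can be compared mechanically.  A
rough sketch, assuming the usual nodetool status layout (the UN/DN
status codes and column order are from Cassandra's standard output,
not from anything quoted above):

```python
def parse_status(output: str) -> dict:
    """Map each node address to 'U' (up) or 'D' (down) from nodetool status output."""
    states = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows begin with a two-letter code: U/D for up/down,
        # then N/L/J/M for normal/leaving/joining/moving.
        if len(parts) >= 2 and len(parts[0]) == 2 and parts[0][0] in "UD":
            states[parts[1]] = parts[0][0]
    return states

# Hypothetical output captured from one node (addresses are made up):
sample = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load     Tokens  Owns   Host ID  Rack
UN  10.0.0.1   1.1 MB   256     33.3%  aaaa     rack1
DN  10.0.0.3   1.0 MB   256     33.4%  cccc     rack1
"""
print(parse_status(sample))
```

Running this against the capture from each of the three nodes and
diffing the resulting dicts would show exactly which node disagrees
about whom, which is the sort of evidence that's useful when chasing a
gossip problem like this.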

With that background out of the way, I have two questions.

1) What on earth just happened?

2) How do I fix my cluster?

Thanks!
