Hello, I seem to have gotten my cluster into a bit of a strange state. Pardon the rather verbose email, but there is a fair amount of background. I'm running a 3-node Cassandra 2.0.1 cluster. This particular cluster is used only intermittently for dev/testing and does not see particularly heavy use; it's mostly a catch-all cluster for environments that don't have a dedicated cluster of their own. I noticed today that one of the nodes had died: nodetool repair was failing because of a down replica. I ran nodetool status and, sure enough, one of my nodes showed up as down.
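For reference, these are roughly the commands involved (the hostname and keyspace are placeholders, not my real ones):

    nodetool -h <node-host> repair <keyspace>
    nodetool -h <node-host> status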
When I looked on the actual box, the Cassandra process was up and running and everything in the logs looked sensible. The most suspicious thing I saw was one CMS garbage collection per hour, each taking ~250 ms. Nonetheless, the node was not responding, so I restarted it. So far so good: everything starts up, and my ~30 column families across ~6 keyspaces all initialize. The node then handshakes with my other two nodes and reports them both as up.

Here is where things get strange. According to the logs on the other two nodes, the third node has come back up and all is well. However, on the third node itself, I see a wall of the following in the logs (IP addresses masked):

    INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
    INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
    INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:25,655 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
    INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
    INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:26,657 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
    INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
    INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:27,660 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
    INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [HANDSHAKE-/x.x.x.221] 2013-10-17 20:22:28,254 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.221
    INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
    INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
    INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
    INFO [HANDSHAKE-/x.x.x.222] 2013-10-17 20:22:28,661 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
    INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
    INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
    INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN

Additionally, client requests to the cluster at consistency QUORUM start failing (saying 2 responses were required but only 1 replica responded). According to nodetool status, all the nodes are up. This is clearly not good. I take down the problem node; nodetool reports it down and QUORUM client reads/writes start working again. In an attempt to get the cluster back into a good state, I delete all the data on the problem node (rough commands below) and then bring it back up. The other two nodes log a changed host ID for the IP of the node I wiped and then handshake with it.
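For what it's worth, wiping the node was roughly the following (the paths are the defaults for a packaged install, so treat this as a sketch rather than my exact commands):

    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
    sudo service cassandra start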
The problem node also comes up, but reads/writes start failing again with the same error, so I decide to take the problem node down again. However, this time, even after the process is dead, nodetool and the other two nodes report that my third node is still up, and requests to the cluster continue to fail. Running nodetool status against either of the live nodes shows all nodes as up; running it against the dead node fails (unsurprisingly, since Cassandra is not even running).

With that background out of the way, I have two questions:

1) What on earth just happened?
2) How do I fix my cluster?

Thanks!
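P.S. In case the exact checks matter, this is roughly how I'm verifying the dead-but-reported-up state (the host name is a placeholder):

    # On the problem node: no Cassandra process is running
    ps aux | grep -i '[c]assandra'
    # From either live node: the dead node is still listed as up (UN)
    nodetool -h <live-node-host> status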