When debugging gossip-related problems (is this node really
down/dead/in some weird state?) you might have better luck looking at
`nodetool gossipinfo`. The "everything shows UN even though the
cluster is clearly unhealthy" behaviour might be
https://issues.apache.org/jira/browse/CASSANDRA-5913
I'm not sure exactly what happened in your case. I'm also confused
about why an IP would change on restart.
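If it helps, `nodetool gossipinfo` dumps each endpoint's gossip state
as the local node sees it. Roughly like this (field names from memory,
the exact output varies by version, and the values are placeholders):

  $ nodetool gossipinfo
  /x.x.x.222
    generation:<epoch when that node's gossip state was created>
    heartbeat:<counter that should keep climbing while gossip is healthy>
    STATUS:NORMAL,<token>
    LOAD:...
    SCHEMA:<schema uuid>

A heartbeat that stops advancing, or a generation/schema that disagrees
between nodes, is usually a better clue about what gossip actually
thinks than the UN/DN column in nodetool status.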
On 10/17/2013 06:12 PM, Philip Persad wrote:
Hello,
I seem to have gotten my cluster into a bit of a strange state.
Pardon the rather verbose email, but there is a fair amount of
background. I'm running a 3-node Cassandra 2.0.1 cluster. This
particular cluster is used only intermittently for dev/testing
and does not see particularly heavy use; it's mostly a catch-all
cluster for environments which don't have a dedicated cluster to
themselves. I noticed today that one of the nodes had died because
nodetool repair was failing due to a down replica. I ran nodetool
status and, sure enough, one of my nodes showed up as down.
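(For reference, the status output looked roughly like this; I'm
paraphrasing from memory, the addresses are masked, and the other
columns are trimmed. UN = Up/Normal, DN = Down/Normal:

  $ nodetool status
  Datacenter: ...
  --  Address      Load  Tokens  Owns  Host ID  Rack
  UN  x.x.x.221    ...
  UN  x.x.x.222    ...
  DN  x.x.x.2xx    ...          <- the problem node
)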
When I looked on the actual box, the Cassandra process was up and
running and everything in the logs looked sensible. The most
suspicious thing I saw was one CMS garbage collection per hour, each
taking ~250 ms. Nonetheless, the node was not responding, so I
restarted it. So far so good: everything starts up, and my ~30
column families across ~6 keyspaces all initialize. The node
then handshakes with my other two nodes and reports them both as up.
Here is where things get strange. According to the logs on the other
two nodes, the third node has come back up and all is well. However,
on the third node, I see a wall of the following in the logs (IP
addresses masked):
INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:25,655
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:26,657
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:27,660
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [HANDSHAKE-/10.21.5.221] 2013-10-17 20:22:28,254
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.221
INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:28,661
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN
Additionally, client requests to the cluster at consistency QUORUM
start failing (saying 2 responses were required but only 1 replica
responded). According to nodetool status, all the nodes are up.
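(For context on the numbers in that error: QUORUM needs
floor(RF/2) + 1 replicas to respond, so "2 required" means these
keyspaces have a replication factor of 2 or 3. The failures are easy
to see from cqlsh; the keyspace/table names below are just
placeholders:

  cqlsh> CONSISTENCY QUORUM;
  Consistency level set to QUORUM.
  cqlsh> SELECT * FROM my_keyspace.my_table LIMIT 1;
  <fails: 2 responses were required but only 1 replica responded>
)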
This is clearly not good. I take down the problem node. Nodetool
reports it down and QUORUM client reads/writes start working again.
In an attempt to get the cluster back into a good state, I delete all
the data on the problem node and then bring it back up. The other two
nodes log a changed host ID for the IP of the node I wiped and then
handshake with it. The problem node also comes up, but reads/writes
start failing again with the same error.
I decide to take the problem node down again. However, this time, even
after the process is dead, nodetool and the other two nodes report
that my third node is still up and requests to the cluster continue to
fail. Running nodetool status against either of the live nodes shows
that all nodes are up. Running nodetool status against the dead node
fails (unsurprisingly since Cassandra is not even running).
With that background out of the way, I have two questions.
1) What on earth just happened?
2) How do I fix my cluster?
Thanks!