We have a 6-node Cassandra 1.2.10 cluster running on AWS with
NetworkTopologyStrategy, a replication factor of 3, and the EC2Snitch. Each
AWS availability zone contains 2 nodes.
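
Our keyspace is defined along these lines (a simplified sketch via the
DataStax Python driver; the keyspace name 'app' and region 'us-east' stand
in for our real ones, and with the EC2Snitch the datacenter name is the
EC2 region while each availability zone acts as a rack):

from cassandra.cluster import Cluster

# Connect through any node in the cluster (placeholder address).
cluster = Cluster(['10.0.12.178'])
session = cluster.connect()

# With EC2Snitch the datacenter is the EC2 region and each availability
# zone is a rack, so RF=3 puts one replica in each of our three AZs.
session.execute("""
    CREATE KEYSPACE app
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
""")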

When we read or write data at consistency level QUORUM while
decommissioning a node, we get "May not be enough replicas present to
handle consistency level".

This doesn't make sense: we are only taking one node down, and with an RF
of 3 a quorum read/write should still have enough live replicas (2) to
succeed even with that node gone.
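
To spell out the arithmetic, quorum = floor(RF/2) + 1 = floor(3/2) + 1 = 2,
so a QUORUM request needs only 2 of the 3 replicas alive. The failing
requests look roughly like this (a simplified sketch; the table and values
are made up, and cassandra.Unavailable is the driver-side form of the "May
not be enough replicas" error):

from cassandra import ConsistencyLevel, Unavailable
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['10.0.12.178'])   # placeholder contact point
session = cluster.connect('app')

# RF=3, so the coordinator needs 3 // 2 + 1 = 2 live replicas.
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

try:
    session.execute(write, (1, 'hello'))
except Unavailable as exc:
    # Raised when the coordinator thinks fewer than 2 replicas are up,
    # i.e. "May not be enough replicas present to handle consistency level".
    print("alive=%d required=%d" % (exc.alive_replicas,
                                    exc.required_replicas))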

Looking at the Cassandra log on a server that we are not decommissioning,
we see the following during the decommission of the other node.

 INFO [GossipTasks:1] 2013-10-21 15:18:10,695 Gossiper.java (line 803) InetAddress /10.0.22.142 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:10,696 Gossiper.java (line 803) InetAddress /10.0.32.159 is now DOWN
 INFO [HANDSHAKE-/10.0.22.142] 2013-10-21 15:18:10,862 OutboundTcpConnection.java (line 399) Handshaking version with /10.0.22.142
 INFO [GossipTasks:1] 2013-10-21 15:18:11,696 Gossiper.java (line 803) InetAddress /10.0.12.178 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,697 Gossiper.java (line 803) InetAddress /10.0.22.106 is now DOWN
 INFO [GossipTasks:1] 2013-10-21 15:18:11,698 Gossiper.java (line 803) InetAddress /10.0.32.248 is now DOWN

Eventually we see a message like this for each of those nodes:

 INFO [GossipStage:3] 2013-10-21 15:18:19,429 Gossiper.java (line 789) InetAddress /10.0.32.248 is now UP

So eventually the remaining nodes in the cluster come back to life.

While these nodes are marked down I can see why we get the "May not be
enough replicas..." message, because at that point essentially everything
is down.

My question is: *why does gossip mark the nodes we aren't decommissioning
as DOWN in the first place?*

-- 
John Pyeatt
Singlewire Software, LLC
www.singlewire.com
------------------
608.661.1184
john.pye...@singlewire.com
