We are running a 6-node cluster on Amazon EC2 (2 nodes in each
availability zone). The EC2 instances are m1.large and we have 256 vnodes
on each node.

We are using Ec2Snitch, NetworkTopologyStrategy and a replication factor of
3.
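
For context, this is roughly how the cluster is configured (the keyspace
and data-centre names below are placeholders, not our real ones):

  # cassandra.yaml (per node)
  num_tokens: 256
  endpoint_snitch: Ec2Snitch

  -- keyspace replication (CQL); 'us-east' stands in for our region/DC name
  CREATE KEYSPACE our_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3};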

When we decommission one node, reads and writes suddenly start to fail. We
are seeing "Not enough replicas" error messages, which doesn't make sense:
we are doing QUORUM reads/writes, and with a replication factor of 3 there
should still be 2 copies of each piece of data in the cluster after a
single node leaves.
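
For reference, the arithmetic we are relying on here (standard QUORUM math):

  QUORUM = floor(RF / 2) + 1 = floor(3 / 2) + 1 = 2 replicas required
  live replicas after one node leaves = RF - 1 = 2

so QUORUM reads and writes should still be able to succeed.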

Digging deeper into the logs, we see that phi_convict_threshold is being
exceeded, so all nodes in the cluster are marked down for a period of
approximately 10 seconds.

*Why on earth is the decommissioning of one node causing all the nodes to
be marked down?*

We have two ways to work around this (cassandra.yaml sketches below),
though we think we have also found the root cause of the problem.
1) If we set phi_convict_threshold to 12 or higher, the nodes never get
marked down.
2) If we set the vnodes to 16 or lower, we never see them get marked down.
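
For concreteness, these are the two cassandra.yaml settings we are touching
(the comments are our own notes, not defaults):

  # Workaround 1: raise the failure-detector conviction threshold
  phi_convict_threshold: 12

  # Workaround 2: fewer vnodes per node (as far as we know num_tokens is
  # only honoured when a node first bootstraps, so this means rebuilding
  # nodes rather than editing them in place)
  num_tokens: 16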

*Is either of these workarounds dangerous, and is one better than the other?*


The root cause appears to be that calculatePendingRanges() in
StorageService.java is extremely expensive and runs in the same thread pool
(GossipTasks) as the Gossiper.java code. calculatePendingRanges() runs
during node state changes (e.g. decommissioning), and while it runs it
appears to hog the single thread in the GossipTasks pool, causing nodes to
get marked down by FailureDetector.java.
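
To illustrate the mechanism we suspect, here is a toy sketch (plain Java,
not Cassandra code; the names are only stand-ins) of what happens when a
long task is dropped onto a single-threaded scheduled executor like the
GossipTasks pool:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A single-threaded scheduled executor (like the one-thread GossipTasks
// pool) runs a periodic "failure detector" round. When an expensive one-off
// task (standing in for calculatePendingRanges()) is submitted to the same
// pool, the periodic task is starved and its inter-arrival gap blows up;
// that same effect pushes phi past phi_convict_threshold.
public class GossipStarvationDemo {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService gossipTasks =
                Executors.newSingleThreadScheduledExecutor();

        final long[] lastRun = { System.currentTimeMillis() };

        // Periodic task every second, like the gossiper's failure-detection round.
        gossipTasks.scheduleWithFixedDelay(() -> {
            long now = System.currentTimeMillis();
            System.out.println("heartbeat check, gap = " + (now - lastRun[0]) + " ms");
            lastRun[0] = now;
        }, 1, 1, TimeUnit.SECONDS);

        Thread.sleep(3000);

        // Expensive task submitted to the same single-threaded pool
        // (standing in for calculatePendingRanges() during a decommission).
        gossipTasks.submit(() -> {
            System.out.println("expensive task starts");
            try {
                Thread.sleep(10_000);   // hog the only thread for ~10 seconds
            } catch (InterruptedException ignored) { }
            System.out.println("expensive task ends");
        });

        Thread.sleep(15_000);
        gossipTasks.shutdownNow();
    }
}

The roughly 10-second gap the periodic task reports once the expensive task
finishes matches the window during which every peer's heartbeat looks
stale, which is exactly when phi would climb past phi_convict_threshold for
all nodes at once.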



-- 
John Pyeatt
Singlewire Software, LLC
www.singlewire.com
------------------
608.661.1184
john.pye...@singlewire.com
