I just experienced the same thing on our 28-node m1.xlarge C* 1.2.11 cluster. phi_convict_threshold is at the default: 8. I will try increasing it to 12, since 12 seems to be the recommended value :)
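For reference, both knobs live in cassandra.yaml and are read at node start, so a rolling restart is needed after changing them. A minimal excerpt (assuming the stock 1.2 file layout; values per Aaron's advice in the quoted message below):

    # cassandra.yaml (excerpt)
    # Failure-detector sensitivity. The default of 8 tends to be too low on
    # EC2, where GC pauses and noisy-neighbour I/O stretch heartbeat gaps.
    phi_convict_threshold: 12

    # Virtual nodes per host. Leave at 256 rather than shrinking it to hide
    # the flapping.
    num_tokens: 256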
That's still weird to see all nodes marked down at once. I had never seen this before using vnodes...

Alain

2013/10/28 Aaron Morton <aa...@thelastpickle.com>

> > (2 nodes in each availability zone)
> How many AZ's?
>
> > The ec2 instances are m1.large
> I strongly recommend using m1.xlarge with ephemeral disks or a higher-spec
> machine. m1.large is not up to the task.
>
> > Why on earth is the decommissioning of one node causing all the nodes to
> > be marked down?
> Decommissioning a node causes it to stream its data to the remaining
> nodes, which results in them performing compaction. I would guess the
> low-powered m1.large nodes could not handle the incoming traffic and
> compaction. This probably resulted in GC problems (check the logs), which
> caused them to be marked as down.
>
> > 1) If we set the phi_convict_threshold to 12 or higher the nodes never
> > get marked down.
> 12 is a good number on AWS.
>
> > 2) Or if we set the vnodes to 16 or lower we never see them get marked
> > down.
> I would leave this at 256. Fewer vnodes may mean slightly less overhead in
> repair, but the ultimate cause is the choice of HW.
>
> > Is either of these solutions dangerous or better than the other?
> Change the phi and move to m1.xlarge by doing a lift-and-shift. Stop one
> node at a time and copy all its data and config to a new node.
>
> > The ultimate cause of the problem appears to be that
> > calculatePendingRanges() in StorageService.java is an extremely
> > expensive process
> We don't see issues like this other than on low-powered nodes.
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 26/10/2013, at 6:14 am, John Pyeatt <john.pye...@singlewire.com> wrote:
>
> > We are running a 6-node cluster in the Amazon cloud (2 nodes in each
> > availability zone). The ec2 instances are m1.large and we have 256
> > vnodes on each node.
> >
> > We are using Ec2Snitch, NetworkTopologyStrategy and a replication
> > factor of 3.
> >
> > When we decommission one node, reads and writes suddenly start to fail.
> > We are seeing "Not Enough Replicas" error messages, which doesn't make
> > sense: even with QUORUM reads/writes, there should still be 2 copies of
> > each piece of data in the cluster.
> >
> > Digging deep into the logs, we see that the phi_convict_threshold is
> > being exceeded, so all nodes in the cluster are being marked down for a
> > period of approximately 10 seconds.
> >
> > Why on earth is the decommissioning of one node causing all the nodes
> > to be marked down?
> >
> > We have two ways to work around this, though we think we have found the
> > ultimate cause of the problem.
> > 1) If we set the phi_convict_threshold to 12 or higher the nodes never
> > get marked down.
> > 2) Or if we set the vnodes to 16 or lower we never see them get marked
> > down.
> >
> > Is either of these solutions dangerous or better than the other?
> >
> > The ultimate cause of the problem appears to be that
> > calculatePendingRanges() in StorageService.java is an extremely
> > expensive process and runs in the same thread pool (GossipTasks) as the
> > Gossiper.java code. calculatePendingRanges() runs during node state
> > changes (e.g. decommissioning). During this time it appears to hog the
> > single thread in the GossipTasks pool, causing nodes to be marked down
> > by FailureDetector.java (a sketch of this pattern follows at the end of
> > this message).
> >
> > --
> > John Pyeatt
> > Singlewire Software, LLC
> > www.singlewire.com
> > ------------------
> > 608.661.1184
> > john.pye...@singlewire.com
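To make the GossipTasks starvation John describes concrete, here is a minimal, hypothetical Java sketch (the class and printed messages are invented for illustration; this is not Cassandra's actual Gossiper, StorageService or FailureDetector code). A single-threaded scheduled executor runs a cheap once-a-second check next to one expensive task; while the expensive task holds the thread, the periodic check cannot run:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Hypothetical illustration only -- not Cassandra source. It shows how a
    // long-running task submitted to a single-threaded scheduled executor
    // (the shape of the GossipTasks pool) delays a periodic check scheduled
    // on the same thread.
    public class GossipStarvationSketch {
        public static void main(String[] args) throws InterruptedException {
            final ScheduledExecutorService gossipTasks =
                    Executors.newSingleThreadScheduledExecutor();

            // Cheap periodic job, standing in for the failure-detector pass
            // that decides whether peers are still up.
            gossipTasks.scheduleWithFixedDelay(new Runnable() {
                public void run() {
                    System.out.println("failure-detector check at "
                            + System.currentTimeMillis());
                }
            }, 0, 1, TimeUnit.SECONDS);

            // Expensive one-off job, standing in for the pending-range
            // calculation triggered by a decommission. It occupies the only
            // thread for about ten seconds.
            gossipTasks.schedule(new Runnable() {
                public void run() {
                    System.out.println("pending-range calculation started");
                    try {
                        Thread.sleep(10000); // simulate ~10s of work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    System.out.println("pending-range calculation finished");
                }
            }, 3, TimeUnit.SECONDS);

            Thread.sleep(20000);
            gossipTasks.shutdownNow();
        }
    }

Running it, the once-a-second output disappears for roughly ten seconds while the long task occupies the thread, which lines up with the ~10-second window of nodes being marked down described above.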