I just experienced the same thing on our 28-node m1.xlarge C* 1.2.11 cluster.

phi_convict_threshold is at its default of 8. I will try increasing it to 12,
as 12 seems to be the recommended value :)
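
For reference, this is what I plan to change in cassandra.yaml (just a sketch
of my own config; 8 and 256 are the 1.2 defaults, and the values follow the
advice quoted below):

    # cassandra.yaml
    # Failure detector sensitivity; default is 8, 12 is the value
    # recommended below for EC2.
    phi_convict_threshold: 12

    # Leaving vnodes at the default, as advised below.
    num_tokens: 256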

That's still weird to see all nodes marked down at once. I had never
experienced this before we started using vnodes...

Alain

2013/10/28 Aaron Morton <aa...@thelastpickle.com>

> >  (2 nodes in each availability zone)
> How many AZs?
>
> > The EC2 instances are m1.large
> I strongly recommend using m1.xlarge with ephemeral disks or a higher-spec
> machine. m1.large is not up to the task.
>
> > Why on earth is the decommissioning of one node causing all the nodes to
> be marked down?
> decommissioning a node causes it to stream its data to the remaining
> nodes, which results in them performing compaction. I would guess the
> low-powered m1.large nodes could not handle the incoming traffic and
> compaction. This probably resulted in GC problems (check the logs), which
> caused them to be marked as down.
>
> > 1) If we set the phi_convict_threshold to 12 or higher, the nodes never
> get marked down.
> 12 is a good number on AWS.
>
> > 2) If we set the number of vnodes to 16 or lower, we never see them get
> marked down.
> I would leave this at 256.
> Fewer vnodes may result in slightly less overhead in repair, but the
> ultimate cause is the choice of hardware.
>
> > Is either of these solutions dangerous or better than the other?
> Change the phi and move to m1.xlarge by doing a lift-and-shift: stop one
> node at a time and copy all its data and config to a new node.
>
> > The ultimate cause of the problem appears to be that
> calculatePendingRanges() in StorageService.java is an extremely expensive
> process
>
> We don’t see issues like this other than on low-powered nodes.
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 26/10/2013, at 6:14 am, John Pyeatt <john.pye...@singlewire.com> wrote:
>
> > We are running a 6-node cluster in the Amazon cloud (2 nodes in each
> availability zone). The EC2 instances are m1.large and we have 256 vnodes
> on each node.
> >
> > We are using Ec2Snitch, NetworkTopologyStrategy and a replication factor
> of 3.
> >
> > When we decommission one node, reads and writes suddenly start to fail.
> We are seeing "Not Enough Replicas" error messages, which doesn't make
> sense: even for QUORUM reads/writes there should still be 2 copies of each
> piece of data in the cluster, enough to satisfy the quorum.
> >
> > Digging deeper into the logs, we see that the phi_convict_threshold is
> being exceeded, so all nodes in the cluster are being marked down for a
> period of approximately 10 seconds.
> >
> > Why on earth is the decommissioning of one node causing all the nodes to
> be marked down?
> >
> > We have two ways to work around this, though we think we have found the
> ultimate cause of the problem.
> > 1) If we set the phi_convict_threshold to 12 or higher, the nodes never
> get marked down.
> > 2) If we set the number of vnodes to 16 or lower, we never see them get
> marked down.
> >
> > Is either of these solutions dangerous or better than the other?
> >
> >
> > The ultimate cause of the problem appears to be that
> calculatePendingRanges() in StorageService.java is an extremely expensive
> process and runs in the same thread pool (GossipTasks) as the
> Gossiper.java code. calculatePendingRanges() runs during node state changes
> (e.g. decommissioning). During this time it appears to hog the single
> thread in the GossipTasks thread pool, causing nodes to get marked down by
> FailureDetector.java.
> >
> >
> >
> > --
> > John Pyeatt
> > Singlewire Software, LLC
> > www.singlewire.com
> > ------------------
> > 608.661.1184
> > john.pye...@singlewire.com
>
>
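
By the way, to make John's last point concrete: below is a minimal,
stand-alone sketch (plain JDK, *not* Cassandra code; the class name and
timings are invented for illustration) of the contention pattern he
describes. A single-threaded scheduled executor runs both a periodic tick,
standing in for the gossip round that feeds the failure detector, and one
long task, standing in for calculatePendingRanges(). While the long task
holds the only thread, the tick stops firing, which is exactly when peers
start to look dead.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class GossipTasksSketch {
    public static void main(String[] args) throws InterruptedException {
        // One thread, like the single-threaded "GossipTasks" pool.
        final ScheduledExecutorService gossipTasks =
                Executors.newSingleThreadScheduledExecutor();

        // Periodic task standing in for the once-a-second gossip round
        // that feeds heartbeats to the failure detector.
        gossipTasks.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                System.out.println("gossip round");
            }
        }, 0, 1, TimeUnit.SECONDS);

        // Long task submitted to the same pool, standing in for
        // calculatePendingRanges() triggered by a topology change.
        gossipTasks.schedule(new Runnable() {
            public void run() {
                System.out.println("pending range calculation starts");
                try {
                    Thread.sleep(10000); // ~10s of work, like the ~10s outage
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                System.out.println("pending range calculation done");
            }
        }, 3, TimeUnit.SECONDS);

        Thread.sleep(20000);
        gossipTasks.shutdownNow();
        // The output shows a ~10 second gap with no "gossip round" lines:
        // while the long task holds the only thread, no heartbeats are
        // processed, phi climbs past the threshold, and every peer looks
        // "down".
    }
}

Run it and you can see the gap in the output. Raising phi_convict_threshold
only makes the detector tolerate a longer gap; it does not remove the gap
itself.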
