Hi,

Do you see any hung processes, such as repairs, on those 3 nodes? What does "nodetool netstats" show?

Thanks,
Sai

On Tue, Apr 19, 2016 at 8:24 AM, Erik Forsberg <forsb...@opera.com> wrote:

> Hi!
>
> I have this problem where 3 of my 84 nodes misbehave with too long GC
> times, leading to them being marked as DN.
>
> This happens when I load data to them using CQL from a hadoop job, so
> quite a lot of inserts at a time. The CQL loading job is using
> TokenAwarePolicy with fallback to DCAwareRoundRobinPolicy. Cassandra Java
> driver version 2.1.7.1 is in use.
>
> My other observation is that around the time the GC starts to work like
> crazy, there is a lot of outbound network traffic from the troublesome
> nodes. If a healthy node has around 25 Mbit/s in, 25 Mbit/s out, an
> unhealthy one sees 25 Mbit/s in, 200 Mbit/s out.
>
> So, something is iffy with these 3 nodes, but I have some trouble finding
> out exactly what makes them differ.
>
> This is Cassandra 2.0.13 (yes, old) using vnodes. Keyspace is using
> NetworkTopologyStrategy with replication 2, in one datacenter.
>
> One thing I know I'm doing wrong is that I have a slightly differing
> number of hosts in each of my 6 chassis (one of them has 15 nodes, one
> has 13, the remaining have 14). Could what I'm seeing here be the effect
> of that?
>
> Other ideas on what could be wrong? Some kind of vnode imbalance? How can
> I diagnose that? What metrics should I be looking at?
>
> Thanks,
> \EF