What's your replication factor? Can you check nodetool tpstats and nodetool netstats to see whether those two nodes are receiving more mutations than the rest of the cluster?
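In case it helps, here is a rough sketch of one way to compare the MutationStage counters across nodes. It assumes the usual six-column thread pool layout that nodetool tpstats prints (pool name, Active, Pending, Completed, Blocked, All time blocked) and that nodetool is on the PATH and can reach each host. The function names and the sample figures are made up for illustration. Completed counts accumulate since each node started, so for nodes with different uptimes it is probably better to take two samples and compare the differences.

```python
import subprocess


def fetch_tpstats(host):
    """Run `nodetool -h <host> tpstats` and return its text output.

    Assumes nodetool is installed locally and JMX on the host is reachable.
    """
    result = subprocess.run(
        ["nodetool", "-h", host, "tpstats"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_tpstats(text):
    """Parse thread pool rows into {pool_name: {column: value}}.

    Only rows with a name followed by five integer columns are kept, so the
    header line and the dropped-message section are skipped.
    """
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 6 and all(p.isdigit() for p in parts[1:]):
            pools[parts[0]] = {
                "active": int(parts[1]),
                "pending": int(parts[2]),
                "completed": int(parts[3]),
                "blocked": int(parts[4]),
                "all_time_blocked": int(parts[5]),
            }
    return pools


def ratio_to_mean(completed_by_node):
    """Return each node's count divided by the mean across all nodes."""
    mean = sum(completed_by_node.values()) / len(completed_by_node)
    return {node: count / mean for node, count in completed_by_node.items()}


# Hypothetical sample output, for illustration only (not from a real cluster).
SAMPLE = """\
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0         123456         0                 0
MutationStage                     0         0        9876543         0                 0

Message type           Dropped
MUTATION                     0
"""

sample_pools = parse_tpstats(SAMPLE)
```

Something like ratio_to_mean({h: parse_tpstats(fetch_tpstats(h))["MutationStage"]["completed"] for h in hosts}) would then show whether the two problem nodes stand out. Nonzero pending or blocked values on MutationStage on just those nodes would also be worth noting.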
Sent from my iPhone

On Jul 16, 2013, at 3:18 PM, Jure Koren <[email protected]> wrote:

> Hi C* user list,
>
> I have a curious recurring problem with Cassandra 1.2 and what seems like a
> GC issue.
>
> The cluster looks somewhat well balanced, all nodes are running HotSpot JVM
> 1.6.0_31-b04 and cassandra 1.2.3.
>
> Address    Rack   Status  State   Load      Owns
> 10.2.3.6   RAC6   Up      Normal  15.13 GB  12.71%
> 10.2.3.5   RAC5   Up      Normal  16.87 GB  13.57%
> 10.2.3.8   RAC8   Up      Normal  13.27 GB  13.71%
> 10.2.3.1   RAC1   Up      Normal  16.46 GB  14.08%
> 10.2.3.7   RAC7   Up      Normal  11.59 GB  14.34%
> 10.2.3.2   RAC2   Up      Normal  23.15 GB  15.12%
> 10.2.3.4   RAC4   Up      Normal  16.52 GB  16.47%
>
> Every now and then (roughly once a month, currently), two nodes (always the
> same two) need to be restarted after they start eating all available CPU
> cycles and read and write latencies increase dramatically. Restart fixes this
> every time.
>
> The only metric that significantly deviates from the average for all nodes
> shows GC doing something: http://bou.si/rest/parnew.png
>
> Is there a way to debug this? After searching online it appears that nobody has
> really solved this problem and I have no idea what could cause such behaviour
> in just two particular cluster nodes.
>
> I'm now thinking of decommissioning the problematic nodes and bootstrapping
> them anew, but can't decide if this could possibly help.
>
> Thanks in advance for any insight anyone might offer,
>
> --
> Jure Koren, DevOps
> http://www.zemanta.com/
