Dive into the logs and look for messages from the GCInspector; it logs ParNew 
and CMS collections that take longer than 200 ms. To get further insight, 
consider enabling full GC logging (see cassandra-env.sh) on one of the problem 
nodes. 
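
For reference, full GC logging usually means enabling JVM options along these 
lines in cassandra-env.sh (a rough sketch; the exact set of flags and the log 
path /var/log/cassandra/gc.log are assumptions, adjust for your install):

    # HotSpot GC logging options, normally appended to JVM_OPTS in cassandra-env.sh
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
    JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"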

Looking at your graphs you are getting about 2 ParNew collections a second, 
each running around 130 ms, so the server is pausing for roughly 260 ms per 
second to do ParNew, which is not great. 

CMS activity can also eat up CPU, especially if it's not able to drain the 
tenured heap.
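
If you want to see whether the tenured (old) generation is staying full, 
watching it with jstat on the Cassandra process is a quick way (a sketch; in 
-gcutil output the O column is old gen occupancy as a percentage):

    # sample heap occupancy every 5 seconds, 10 samples
    jstat -gcutil <cassandra-pid> 5000 10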

ParNew activity is more a measure of the throughput on the node. Can you 
correlate the problems with application load? Do they happen at regular 
intervals? Can you correlate them with repair or compaction processes?
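
If it is repair or compaction related, these are the usual things to check 
while the CPU is pegged (just a sketch, run against one of the problem nodes):

    # pending and active compactions
    nodetool -h <node> compactionstats
    # thread pool backlogs (look at pending counts, e.g. MutationStage, ReadStage)
    nodetool -h <node> tpstats
    # streaming activity from a running repair
    nodetool -h <node> netstats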

Hope that helps 

-----------------
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 17/07/2013, at 12:14 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> What's your replication factor? Can you check tp stats and net stats to see 
> if you are getting more mutations on these nodes ?
> 
> Sent from my iPhone
> 
> On Jul 16, 2013, at 3:18 PM, Jure Koren <jure.ko...@zemanta.com> wrote:
> 
>> Hi C* user list,
>> 
>> I have a curious recurring problem with Cassandra 1.2 and what seems like a 
>> GC issue.
>> 
>> The cluster looks somewhat well balanced, all nodes are running HotSpot JVM 
>> 1.6.0_31-b04 and cassandra 1.2.3.
>> 
>> Address Rack Status State Load Owns
>> 10.2.3.6 RAC6 Up Normal 15.13 GB 12.71%
>> 10.2.3.5 RAC5 Up Normal 16.87 GB 13.57%
>> 10.2.3.8 RAC8 Up Normal 13.27 GB 13.71%
>> 10.2.3.1 RAC1 Up Normal 16.46 GB 14.08%
>> 10.2.3.7 RAC7 Up Normal 11.59 GB 14.34%
>> 10.2.3.2 RAC2 Up Normal 23.15 GB 15.12%
>> 10.2.3.4 RAC4 Up Normal 16.52 GB 16.47%
>> 
>> Every now and then (roughly once a month, currently), two nodes (always the 
>> same two) need to be restarted after they start eating all available CPU 
>> cycles and read and write latencies increase dramatically. Restart fixes 
>> this every time.
>> 
>> The only metric that significantly deviates from the average for all nodes 
>> shows GC doing something: http://bou.si/rest/parnew.png
>> 
>> Is there a way to debug this? After searching online it appears as if nobody 
>> has really solved this problem, and I have no idea what could cause such 
>> behaviour in just two particular cluster nodes.
>> 
>> I'm now thinking of decommissioning the problematic nodes and bootstrapping 
>> them anew, but can't decide if this could possibly help.
>> 
>> Thanks in advance for any insight anyone might offer,
>> 
>> --
>> Jure Koren, DevOps
>> http://www.zemanta.com/
