Thanks for your comments. The application is indeed suffering from a freezing 
Cassandra node. Queries take longer than 10 seconds whenever a full garbage 
collection occurs.

Here is an example from the logs. I have a three-node cluster. At some point, 
one of the other nodes logs the following:

21:53:35,986 InetAddress /172.16.107.46 is now dead.

On node "172.16.107.46", I see the following:

21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 
319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 
6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), 
[CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 
sys=0.03, real=10.98 secs] 
21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 
used; max is 8550678528
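
To put numbers on the log above, the old-generation occupancy before and after the CMS collection can be computed from the "[CMS: before->after(capacity), secs]" segment. This is only a minimal sketch in Python, assuming the log format shown here; the regex and the function name cms_occupancy are made up for illustration and are not part of Cassandra or the JVM.

```python
import re

# Hypothetical helper: extracts "before->after(capacity), secs" from the
# CMS segment of a HotSpot GC log entry like the one quoted above.
CMS_RE = re.compile(r"\[CMS: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]")

def cms_occupancy(log_text):
    """Return (before_fraction, after_fraction, pause_secs) for the old generation."""
    # Mail clients wrap long log lines, so collapse whitespace runs first.
    flat = re.sub(r"\s+", " ", log_text)
    m = CMS_RE.search(flat)
    if m is None:
        return None
    before, after, capacity = (int(g) for g in m.groups()[:3])
    return before / capacity, after / capacity, float(m.group(4))

# The relevant part of the log entry, wrapped as it appears in this mail.
log = """[ParNew (promotion failed): 
319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 
6000844K->3298251K(8005248K), 10.8526193 secs]"""

before, after, pause = cms_occupancy(log)
print(f"old gen before: {before:.1%}, after: {after:.1%}, pause: {pause:.1f}s")
```

For this entry the old generation was roughly 75% full when the promotion failed and roughly 41% full afterwards, which is the figure to compare against the configured CMSInitiatingOccupancyFraction.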

I have not yet tested the "-XX:+DisableExplicitGC" switch.

Is decreasing the CMSInitiatingOccupancyFraction setting the right thing to do?

Thanks!

Rene

-----Original Message-----
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday, 20 December 2011 6:38
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

I should add: If you are indeed actually pausing due to "promotion
failed" or "concurrent mode failure" (which you will see in the GC log
if you enable it with the options I suggested), the first thing I
would try to mitigate is:

* Decrease the occupancy trigger (search for "occupancy") of CMS to a
lower percentage, making the concurrent mark phase start earlier.
* Increase heap size significantly (probably not necessary based on
your graph, but for good measure).

If it then goes away, report back and we can perhaps figure out
details. There are other things that can be done.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
