Thanks for your comments. The application is indeed suffering from a freezing Cassandra node. Queries take longer than 10 seconds whenever a full garbage collection occurs.
Here is an example from the logs. I have a three-node cluster. At some point I see the following log on one node:

21:53:35,986 InetAddress /172.16.107.46 is now dead.

On node "172.16.107.46", I see the following:

21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]
21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528

I have not yet tested the "-XX:+DisableExplicitGC" switch. Is decreasing the CMSInitiatingOccupancyFraction setting the right thing to do?

Thanks!
Rene

-----Original Message-----
From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller
Sent: Tuesday 20 December 2011 6:38
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

I should add: If you are indeed actually pausing due to "promotion failed" or "concurrent mode failure" (which you will see in the GC log if you enable it with the options I suggested), the first thing I would try to mitigate is:

* Decrease the occupancy trigger (search for "occupancy") of CMS to a lower percentage, making the concurrent mark phase start earlier.
* Increase heap size significantly (probably not necessary based on your graph, but for good measure).

If it then goes away, report back and we can perhaps figure out details. There are other things that can be done.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
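
A note on reading the "promotion failed" line from the GC log in this thread: the `[CMS: before->after(capacity), secs]` part gives the old-generation occupancy at the moment the failure happened, which is the figure to compare against the configured CMSInitiatingOccupancyFraction. Below is a minimal Python sketch of that arithmetic. The function name `old_gen_stats` and the regular expression are my own illustration, not part of Cassandra or any JVM tooling, and the sketch assumes the ParNew/CMS log format exactly as it appears in the line quoted in this thread.

```python
import re

# GC log line copied verbatim from the message in this thread.
GC_LINE = (
    "21:53:27.192+0100: 1335393.834: [GC 1335393.834: "
    "[ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]"
    "1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] "
    "6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], "
    "10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]"
)

# Hypothetical pattern: matches only the old-generation "[CMS: ...]" section,
# not the "[CMS Perm : ...]" section, because of the colon placement.
CMS_RE = re.compile(r"\[CMS: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]")


def old_gen_stats(line):
    """Return old-gen occupancy (percent) before/after and the CMS pause."""
    m = CMS_RE.search(line)
    if m is None:
        return None  # line is not a CMS full-collection record
    before_kb, after_kb, capacity_kb = (int(m.group(i)) for i in (1, 2, 3))
    return {
        "occupancy_before_pct": 100.0 * before_kb / capacity_kb,
        "occupancy_after_pct": 100.0 * after_kb / capacity_kb,
        "cms_pause_secs": float(m.group(4)),
    }


stats = old_gen_stats(GC_LINE)
print(stats)
```

For the line above, this gives an old generation that was roughly 75% full when the promotion failure hit, dropping to about 41% after the 10.85 second collection.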