I am having some reliability problems in my Cassandra cluster which I am almost certain are due to GC. I was about to start delving into the guts of the problem by turning on GC logging, but I have never done any serious Java GC tuning before (time to learn, I guess). As a first step, however, I was hoping to gain some insight into the GC settings shipped with Cassandra 0.7. I realize it's a pretty complicated problem, but I was specifically interested in knowing about:

-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75

Why are these set the way they are? What specifically was used to determine these settings? Was it purely experimental, or was there a specific undesirable behavior that adding these settings corrected?

From my various web wanderings, I read the survivor ratio and tenuring threshold settings as "Cassandra creates mostly long-lived objects, with objects being promoted very quickly from the young generation to the old generation". Furthermore, the CMSInitiatingOccupancyFraction of 75 (up from a JVM default of 68) means "start GC in the old generation later", presumably to allow Cassandra to use more of the old generation heap without needlessly trying to free up used space (?). Please correct me if I am misinterpreting these settings.

One of the issues I have been having is extreme node instability when running a major compaction. After 20-30 seconds of operation, the node spends 30+ seconds in (what I believe to be) GC. I have tried halving all memtable thresholds to reduce overall heap memory usage, but that has not seemed to help with the instability.
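
To make my reading of SurvivorRatio concrete, here is a minimal Python sketch of the arithmetic as I understand it: HotSpot defines SurvivorRatio as the size of eden divided by the size of one survivor space, and there are two survivor spaces, so each survivor space is roughly young / (SurvivorRatio + 2). The 800 MB young generation and the function name young_gen_layout are assumptions made up for illustration, not values from my configuration, and the real JVM rounds these sizes to its own alignment.

```python
MB = 1024 * 1024


def young_gen_layout(young_bytes, survivor_ratio):
    """Approximate (eden, one_survivor_space) sizes in bytes.

    Assumes SurvivorRatio = eden / one survivor space, with two
    survivor spaces, and ignores the JVM's alignment rounding.
    """
    survivor = young_bytes // (survivor_ratio + 2)
    eden = young_bytes - 2 * survivor
    return eden, survivor


# Hypothetical young generation size, chosen only for illustration.
young = 800 * MB

for ratio in (8, 16):
    eden, survivor = young_gen_layout(young, ratio)
    print(f"SurvivorRatio={ratio}: eden ~{eden / MB:.0f} MB, "
          f"each survivor space ~{survivor / MB:.0f} MB")
```

If this definition is right, a larger SurvivorRatio makes each survivor space smaller rather than larger, which matters when deciding in which direction to move the setting.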

After one of these blips, I often see log entries as follows:

INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is 12783583232
INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is 12783583232
INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232

Given that the roughly 3 GB of garbage collected via ConcurrentMarkSweep was generated in less than 30 seconds, one of the first things I was going to try was increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold (to 5) to try to keep more objects in the young generation, where they would be cleaned up faster. As a more general approach to solving my problem, I was also going to reduce the CMSInitiatingOccupancyFraction to 65.

Does this seem reasonable? Obviously, the best answer is to just try it, but I hesitate to start playing with settings when I have only the vaguest notions of what they do and little concept of why they are there in the first place.

Thanks for any help.
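
P.S. In case it helps anyone compare numbers, below is a small Python sketch that pulls the figures out of GCInspector lines like the ones quoted above and prints the heap occupancy left after each collection. The regular expression is based only on the three lines in this message, so it is an assumption about the format rather than a definitive parser, and the names GcEvent and parse_gc_events are made up for illustration.

```python
import re
from typing import NamedTuple

# Matches the message part of the GCInspector lines quoted in this post.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, "
    r"(?P<reclaimed>\d+) reclaimed leaving (?P<used>\d+) used; "
    r"max is (?P<max>\d+)"
)


class GcEvent(NamedTuple):
    collector: str
    duration_ms: int
    reclaimed: int   # bytes freed by this collection
    used: int        # bytes still in use after the collection
    max_heap: int    # maximum heap size in bytes

    @property
    def occupancy(self) -> float:
        """Fraction of the whole heap still in use after the collection."""
        return self.used / self.max_heap


def parse_gc_events(log_text):
    """Return one GcEvent per GCInspector line found in log_text."""
    return [
        GcEvent(m["collector"], int(m["ms"]), int(m["reclaimed"]),
                int(m["used"]), int(m["max"]))
        for m in GC_LINE.finditer(log_text)
    ]


SAMPLE = """\
INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is 12783583232
INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is 12783583232
INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232
"""

for event in parse_gc_events(SAMPLE):
    print(f"{event.collector:<20} {event.duration_ms:>6} ms  "
          f"reclaimed {event.reclaimed / 2**30:5.2f} GiB  "
          f"heap after GC {event.occupancy:6.1%}")
```

The occupancy here is measured against the whole heap, which is not the same as the old-generation occupancy that CMSInitiatingOccupancyFraction refers to, so I would treat it only as a rough indicator of how much live data remains after each collection.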