Dan,

Please kindly attach your:

1) java -version
2) full command line settings, heap sizes
3) gc log from one of the nodes via:

-XX:+PrintTenuringDistribution \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xloggc:/var/log/cassandra/gc.log \
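(If it helps, an equivalent sketch for conf/cassandra-env.sh - assuming the stock 0.7 layout where extra flags are appended to JVM_OPTS:)

# append verbose GC logging flags; the log path is just an example
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"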
4) number of cores on your system. How busy is the system?
5) Any workload specifics for your particular use case?

While some of this is workload specific: if you are seeing too-frequent & very long CMS collection times:

C1) Upping MaxTenuringThreshold to 5/10/15 will reduce the frequent promotion that the current setting (=1) makes essential.
C2) Increasing the young generation via -Xmn512m/1g will help induce more ParNew activity.
C3) If you have enough cores to handle a multi-threaded ParNew - I'd also add -XX:ParallelGCThreads=4 (or 8), depending on your situation.

Reducing CMSInitiatingOccupancyFraction (CMIOF) & other thresholds will trigger CMS at a lower old-generation occupancy (counteracting some of the measures above), but it might help offset concurrent mode failures or promotion failures in case you are seeing them in the logs.
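To make those concrete, here is a rough sketch of (C1)-(C3) as cassandra-env.sh lines, assuming you are appending to JVM_OPTS as above - the exact values are only starting points to verify against your gc.log, not tested recommendations:

# (C1) let objects survive more ParNew cycles before promotion;
#      this overrides the shipped MaxTenuringThreshold=1, so edit the
#      existing line rather than adding a duplicate flag
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"
# (C2) a larger young generation means more garbage dies cheaply in ParNew;
#      if your cassandra-env.sh already sets the new-gen size, change it there
JVM_OPTS="$JVM_OPTS -Xmn1g"
# (C3) only if you have spare cores for the parallel young-gen collector
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=4"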
Anyways that is a simplistic analysis: I'd try changes (C1), (C2), (C3) & then revisit further tuning as necessary.

thanks,
Sri

On Mon, Jan 17, 2011 at 5:03 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

> I am having some reliability problems in my Cassandra cluster which I am
> almost certain are due to GC. I was about to start delving into the guts of
> the problem by turning on GC logging, but I have never done any serious Java
> GC tuning before (time to learn, I guess). As a first step, however, I was
> hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
> I realize it's a pretty complicated problem, but I was specifically interested
> in knowing about:
>
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
>
> Why are these set the way they are? What specifically was used to determine
> these settings? Was it purely experimental, or was there a specific,
> undesirable behavior that adding these settings corrected? From my various
> web wanderings, I read the survivor ratio and tenuring threshold settings as
> "Cassandra creates mostly long lived objects, with objects being promoted
> very quickly from the young generation to the old generation". Furthermore,
> the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> "start GC in the old generation later", presumably to allow Cassandra to use
> more of the old generation heap without needlessly trying to free up used
> space (?). Please correct me if I am misinterpreting these settings.
>
> One of the issues I have been having is extreme node instability when
> running a major compaction. After 20-30 seconds of operation, the node
> spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> all memtable thresholds to reduce overall heap memory usage, but that has not
> seemed to help with the instability. After one of these blips, I often see
> log entries as follows:
>
> INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is 12783583232
> INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is 12783583232
> INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232
>
> Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
> generated in < 30 seconds, one of the first things I was going to try was
> increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold
> (to 5) to try to keep more objects in the young generation and therefore
> cleaned up faster. As a more general approach to solving my problem, I was
> also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
> seem reasonable? Obviously, the best answer is to just try it, but I hesitate
> to start playing with settings when I have only the vaguest notions of what
> they do and little concept of why they are there in the first place.
>
> Thanks for any help