Dan,

Please kindly attach your:
1) java -version
2) full command-line settings, including heap sizes.
3) GC log from one of the nodes, enabled via:

-XX:+PrintTenuringDistribution \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xloggc:/var/log/cassandra/gc.log

4) number of cores on your system. How busy is the system?
5) Any workload specifics for your particular use case?
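
If it helps, items 1, 2 and 4 can be collected in one go with something like
the sketch below. The file name gc-report.txt is just a placeholder I picked,
and the commands assume a Linux-like node; adjust as needed.

```shell
# Sketch only: gc-report.txt is an arbitrary name, and the commands
# assume a Linux-like node with the usual ps/getconf/uptime tools.
out=gc-report.txt
{
  echo "== java -version =="
  java -version 2>&1 || echo "java not found on PATH"
  echo "== cores =="
  getconf _NPROCESSORS_ONLN 2>/dev/null || echo "unknown"
  echo "== load =="
  uptime 2>/dev/null || echo "uptime not available"
  echo "== JVM command lines (includes heap flags) =="
  # '[j]ava' keeps grep from matching its own process entry
  ps -eo args 2>/dev/null | grep '[j]ava' || echo "no running java process found"
} > "$out"
```

Attach the resulting file along with the GC log.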

While some of this is workload-specific, if you are seeing CMS collections that
are too frequent and very long:

C1) Raising -XX:MaxTenuringThreshold (to 5, 10, or 15) will reduce the
frequent promotion into the old generation that the current setting of 1 forces.

C2) Increasing the young generation (-Xmn512m or -Xmn1g) will let ParNew do
more of the collecting.

C3) If you have enough cores to handle multi-threaded ParNew collection, I'd
also add -XX:ParallelGCThreads=4 (or 8), depending on your situation.

Reducing CMSInitiatingOccupancyFraction (and other thresholds) will trigger CMS
at a lower occupancy (counteracting some of the measures above), but might help
avoid concurrent mode failures or promotion failures if you are seeing them in
the logs.
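
Put together, the suggestions might look roughly like this. This is a sketch
that assumes your options are built up in a JVM_OPTS variable the way
conf/cassandra-env.sh does it; the specific values are starting points from
this thread, not tuned recommendations.

```shell
# Sketch only; values are starting points, not tuned for any workload.
# JVM_OPTS is normally already set by cassandra-env.sh; it is defaulted
# here only so the snippet runs on its own.
: "${JVM_OPTS:=}"

# (C1) Edit the existing MaxTenuringThreshold=1 line rather than adding a second one.
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"
# (C2) Larger young generation.
JVM_OPTS="$JVM_OPTS -Xmn1g"
# (C3) Numeric option, so no '+' after -XX: ; size it to the cores you can spare.
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=4"

# Only if the GC log shows concurrent mode or promotion failures:
# JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=65"
```

I'd change one thing at a time and compare GC logs between runs.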

Anyway, that is a simplistic analysis: I'd try changes (C1), (C2), and (C3),
then revisit further tuning as necessary.
thanks,
Sri

On Mon, Jan 17, 2011 at 5:03 PM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

> I am having some reliability problems in my Cassandra cluster which I am
> almost certain are due to GC. I was about to start delving into the guts of
> the problem by turning on GC logging but I have never done any serious Java
> GC tuning before (time to learn I guess). As a first step however, I was
> hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
> I realize it's a pretty complicated problem but I was specifically interested
> in knowing about:
>
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
>
> Why are these set the way they are? What specifically was used to determine
> these settings? Was it purely experimental or was there a specific,
> undesirable behavior adding these settings corrected for? From my various
> web wanderings, I read the survivor ratio and tenuring threshold settings as
> "Cassandra creates mostly long lived objects, with objects being promoted
> very quickly from the young generation to the old generation". Furthermore,
> the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> "start gc in the old generation later", presumably to allow Cassandra to use
> more of the old generation heap without needlessly trying to free up used
> space (?). Please correct me if I am misinterpreting these settings.
>
> One of the issues I have been having is extreme node instability when
> running a major compaction. After 20-30 seconds of operation, the node
> spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> all memtable thresholds to reduce overall heap memory usage but that has not
> seemed to help with the instability. After one of these blips, I often see
> log entries as follows:
>
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line
> 133) GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max
> is 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line
> 133) GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max
> is 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line
> 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
> 9224048472 used; max is 12783583232
>
> Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
> generated in < 30 seconds, one of the first things I was going to try was
> increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold
> (to 5) to try and keep more objects in the young generation and therefore
> cleaned up faster. As a more general approach to solving my problem, I was
> also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
> seem reasonable? Obviously, the best answer is to just try it but I hesitate
> to start playing with settings when I have only the vaguest notions of what they
> do and little concept of why they are there in the first place.
>
> Thanks for any help
>
