2011/4/7 Jonathan Ellis <jbel...@gmail.com>

> Hypothesis: it's probably the flush causing the CMS, not the snapshot
> linking.
>
> Confirmation possibility #1: Add a logger.warn to
> CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
> called, but let's rule it out.
>
> Confirmation possibility #2: Force some flushes w/o snapshot.
>
> Either way: "concurrent mode failure" is the easy GC problem.
> Hopefully you really are seeing mostly that -- this means the JVM
> didn't start CMS early enough, so it ran out of space before it could
> finish the concurrent collection, so it falls back to stop-the-world.
> The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction
> and (possibly) increasing heap capacity if your heap is simply too
> full too much of the time.
>
> You can also mitigate it by increasing the phi threshold for the
> failure detector, so the node doing the GC doesn't mark everyone else
> as dead.
>
> (Eventually your heap will fragment and you will see STW collections
> due to "promotion failed," but you should see that much less
> frequently. GC tuning to reduce fragmentation may be possible based on
> your workload, but that's out of scope here and in any case the "real"
> fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)
>
>
Jonathan, do you have plans to backport this to the 0.7 branch? (It's very
hard to tune CMS, and for people who are new to Java the task becomes much
harder.)
