2011/4/7 Jonathan Ellis <jbel...@gmail.com>
> Hypothesis: it's probably the flush causing the CMS, not the snapshot
> linking.
>
> Confirmation possibility #1: Add a logger.warn to
> CLibrary.createHardLinkWithExec -- with JNA enabled it shouldn't be
> called, but let's rule it out.
>
> Confirmation possibility #2: Force some flushes w/o snapshot.
>
> Either way: "concurrent mode failure" is the easy GC problem.
> Hopefully you really are seeing mostly that -- this means the JVM
> didn't start CMS early enough, so it ran out of space before it could
> finish the concurrent collection, so it falls back to stop-the-world.
> The fix is a combination of reducing XX:CMSInitiatingOccupancyFraction
> and (possibly) increasing heap capacity if your heap is simply too
> full too much of the time.
>
> You can also mitigate it by increasing the phi threshold for the
> failure detector, so the node doing the GC doesn't mark everyone else
> as dead.
>
> (Eventually your heap will fragment and you will see STW collections
> due to "promotion failed," but you should see that much less
> frequently. GC tuning to reduce fragmentation may be possible based on
> your workload, but that's out of scope here and in any case the "real"
> fix for that is https://issues.apache.org/jira/browse/CASSANDRA-2252.)
>

Jonathan, do you have plans to backport this to the 0.7 branch? (It's
very hard to tune CMS, and for people who are new to Java the task
becomes much harder.)
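
To make the quoted reasoning about "concurrent mode failure" concrete, here is a rough back-of-the-envelope sketch in Python. It is not a model of what the JVM actually does; it only illustrates why starting CMS at a lower occupancy leaves more time for the concurrent cycle to finish. The function names and all the numbers (old generation size, fill rate, cycle duration) are made up for illustration and would have to come from your own GC logs.

```python
def cms_headroom_seconds(old_gen_mb, initiating_fraction, fill_rate_mb_per_s):
    """Rough time until the old generation is full once CMS kicks in.

    old_gen_mb           -- old generation capacity in MB (hypothetical value)
    initiating_fraction  -- occupancy at which CMS starts, as 0.0..1.0
    fill_rate_mb_per_s   -- assumed rate at which objects are promoted
    """
    free_mb = old_gen_mb * (1.0 - initiating_fraction)
    return free_mb / fill_rate_mb_per_s


def would_hit_concurrent_mode_failure(old_gen_mb, initiating_fraction,
                                      fill_rate_mb_per_s, cms_cycle_s):
    """True if the old gen would fill up before the concurrent cycle ends."""
    headroom = cms_headroom_seconds(old_gen_mb, initiating_fraction,
                                    fill_rate_mb_per_s)
    return headroom < cms_cycle_s


if __name__ == "__main__":
    # Hypothetical workload: 4000 MB old gen, 50 MB/s promoted,
    # and a concurrent cycle that takes about 8 seconds.
    for fraction in (0.92, 0.75):
        fails = would_hit_concurrent_mode_failure(4000, fraction, 50, 8)
        print(f"initiating fraction {fraction}: "
              f"{'falls back to stop-the-world' if fails else 'finishes in time'}")
```

Under these assumed numbers, starting CMS at 92% occupancy leaves too little room for an 8 second cycle, while starting at 75% leaves about 20 seconds. Note that the real flag takes an integer percentage (for example -XX:CMSInitiatingOccupancyFraction=75), and the JVM treats it only as a hint unless -XX:+UseCMSInitiatingOccupancyOnly is also set.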