> We did indeed have a problem with our GC settings. The survivor ratio was
> too low. After changing that things are better but we are still seeing GC
> that takes 5-10 seconds, which is enough for the node to drop out of the
> cluster briefly.

Pauses that long still indicate full GCs. What is your write activity like? Do you know whether you are legitimately growing the heap so quickly that the concurrent marking in CMS cannot keep up?

What is the free heap ratio (according to the logs produced with -XX:+PrintGC/-XX:+PrintGCDetails) after a concurrent mark-sweep has finished? If the heap is very full even after a mark/sweep, you likely need a bigger heap, or smaller cache sizes, memtable flush thresholds, etc.

On the other hand, if you have a very significant amount of free space in the heap after a mark/sweep, the problem may rather be that CMS is just kicking in too late. If so, you can experiment with the -XX:+UseCMSInitiatingOccupancyOnly and -XX:CMSInitiatingOccupancyFraction=XXX options. If you're willing to temporarily accept that CMS is running continuously (due to an aggressive initiating occupancy fraction), that should at least tell you whether you can in fact avoid the fallbacks to full GC. If you can, then look at more proper tuning...

-- 
/ Peter Schuller aka scode
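
P.S. A rough sketch of what that experiment might look like, assuming the JVM options are collected in a JVM_OPTS shell variable in your startup script (the variable name and the file it lives in are assumptions about your setup). The value 50 is only an illustrative, deliberately aggressive fraction, not a recommendation; -XX:+UseConcMarkSweepGC is included on the assumption that CMS is already your old-generation collector.

```shell
# Assumption: JVM options are passed via a JVM_OPTS variable in the startup script.
JVM_OPTS="${JVM_OPTS:-}"

# Log GC activity so heap occupancy after each concurrent mark-sweep is visible.
JVM_OPTS="$JVM_OPTS -XX:+PrintGC -XX:+PrintGCDetails"

# Assumed to be in use already: CMS as the old-generation collector.
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"

# Start CMS at a fixed occupancy instead of letting the JVM pick the trigger.
# 50 is an arbitrary aggressive value for the experiment only.
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=50"

export JVM_OPTS
```

If the long pauses disappear with a setting like this, CMS was starting too late, and the fraction can then be raised step by step until the fallbacks reappear.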