We have a 60-node C* cluster running 2.2.7, with about 20GB of RAM allocated to each C* node. We're aware of the recommended 8GB limit to keep GCs low, but our memory has been creeping up, (probably) related to this bug.
Here's what we're seeing: with a low level of writes, everything generally looks good. The trouble starts when we need to catch up and do a TON of writes in a small time window. Then C* nodes start dropping like flies.

Some of them just GC frequently and are able to recover. When they GC like this we see GC pauses in the 30-second range, which cause them to stop gossiping for a while, and they drop out of the cluster. This happens as a flurry around the cluster, so we're not always able to catch which ones are doing it before they recover. However, if we have 3 down, we mostly have a locked-up cluster. Writes don't complete and our app essentially locks up.

SOME of the boxes never recover. I'm in this state now. We have 3-5 nodes that are in GC storms which they won't recover from.

I reconfigured the GC settings to enable jstat. I was able to catch it while it was happening:

    root@util0067 ~ # sudo -u cassandra jstat -gcutil 4235 2500
      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
      0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332

... as you can see, the box is legitimately out of memory. S1 and E are completely full and O is at almost 95%.

I'm not sure where to go from here. I think 20GB for our workload is more than reasonable. 90% of the time the nodes are well below 10GB of RAM used. While I was watching this box I was seeing 30% RAM used until it decided to climb to 100%.

Any advice on what to do next? I don't see anything obvious in the logs to signal a problem. I attached all the command line arguments we use.
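For anyone who wants to watch for this state automatically, here is a minimal Python sketch that parses `jstat -gcutil` text like the capture above and flags samples where eden and old gen are both nearly full. The function names and the 90% threshold are my own assumptions, not anything from Cassandra or the JDK, and FGCT divided by FGC is only a rough average per counted full-GC event.

```python
# Two rows copied from the jstat capture in this post.
SAMPLE = """\
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
  0.00 100.00 100.00  94.76  97.60  93.06  10435 1686.191   471 1139.142 2825.332
"""


def parse_gcutil(text):
    """Turn `jstat -gcutil` output into a list of {column: float} dicts."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, (float(v) for v in row))) for row in data]


def heap_exhausted(samples, threshold=90.0):
    """True if eden (E) and old gen (O) are at or above `threshold` percent
    in every sample. The 90% default is an arbitrary assumption."""
    return bool(samples) and all(
        s["E"] >= threshold and s["O"] >= threshold for s in samples
    )


def avg_full_gc_seconds(sample):
    """Rough average seconds per counted full-GC event (FGCT / FGC)."""
    return sample["FGCT"] / sample["FGC"] if sample["FGC"] else 0.0


if __name__ == "__main__":
    samples = parse_gcutil(SAMPLE)
    print("heap exhausted:", heap_exhausted(samples))
    print("avg full-GC seconds:", round(avg_full_gc_seconds(samples[-1]), 2))
```

In practice the text would come from piping `jstat -gcutil <pid> <interval>` into the script rather than from a hard-coded sample.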
Note that I think the cassandra-env.sh script puts the arguments in there twice.

-ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms20000M -Xmx20000M -Xmn4096M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1 -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms20000M -Xmx20000M -Xmn4096M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:CompileCommandFile=/etc/cassandra/hotspot_compiler -XX:CMSWaitDuration=10000 -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways -XX:CMSWaitDuration=10000 -XX:+UseCondCardMark -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1 -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.rmi.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Djava.library.path=/usr/share/cassandra/lib/sigar-bin -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid

--
We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers!
Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile <https://plus.google.com/102718274791889610666/posts>