I woke up this morning to find all 4 of the Cassandra instances in my
cluster reporting that they were down.  I quickly restarted them all, and
everything seems fine.  I'm doing a postmortem now, and it appears they all
OOM'd at roughly the same time.  Nothing was reported in any Cassandra log,
but I found entries in /var/log/kern showing that java died of OOM(*).  On
Amazon, I'm using large instances for Cassandra, and they have no swap (as
recommended), so I have ~8GB of RAM.  Should I use a different max memory
setting?  I'm using a stock RPM from Riptano/DataStax.  If I run "ps -aux" I
get:

/usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
-Dcom.sun.management.jmxremote.port=8080
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
-Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
:/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
org.apache.cassandra.thrift.CassandraDaemon
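
To put rough numbers on the settings above, here is a minimal sketch that
pulls the -Xms/-Xmx/-Xmn/-Xss values out of a java command line and shows how
much RAM is left outside the heap.  The function names are made up for
illustration, and the 8192 MB total is an assumption taken from my "~8GB"
figure, not a measured value.  Keep in mind the JVM process also uses memory
beyond -Xmx (thread stacks, native allocations), so the heap size is a lower
bound on what java actually consumes.

```python
import re

# Multipliers that convert a JVM size suffix to megabytes.
_UNITS_IN_MB = {"k": 1 / 1024, "m": 1, "g": 1024}


def jvm_size_mb(value):
    """Convert a JVM size such as '3843M' or '128k' to megabytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG])", value)
    if match is None:
        raise ValueError(f"unrecognized JVM size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS_IN_MB[unit.lower()]


def memory_budget(cmdline, total_ram_mb):
    """Return the JVM sizing flags found in cmdline, in MB, plus the RAM
    remaining outside the maximum heap (hypothetical helper)."""
    sizes = {}
    for flag in ("Xms", "Xmx", "Xmn", "Xss"):
        found = re.search(rf"-{flag}(\S+)", cmdline)
        if found:
            sizes[flag] = jvm_size_mb(found.group(1))
    if "Xmx" not in sizes:
        raise ValueError("no -Xmx flag found in command line")
    sizes["outside_heap_mb"] = total_ram_mb - sizes["Xmx"]
    return sizes


# Flags copied from the ps output above; 8192 MB is an assumed total.
flags = "-Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k"
budget = memory_budget(flags, total_ram_mb=8192)
for name, megabytes in budget.items():
    print(f"{name}: {megabytes} MB")
```

With these inputs the heap is 3843 MB, leaving about 4349 MB for everything
else on the box, including the non-heap part of the java process itself.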

(*) Also, why would they all OOM so close to each other?  Bad luck?  Or once
the first node went down, is there an increased chance of the rest going
down too?
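
To compare the timing across nodes, something like the following sketch
could pull the OOM-related lines out of each node's kernel log so the
timestamps can be lined up.  The exact wording of kernel OOM messages varies
between kernel versions, so the pattern below is an assumption, and the
function names are made up for illustration.

```python
import re

# Assumed markers for OOM-killer activity; wording differs across kernels.
OOM_PATTERN = re.compile(r"out of memory|oom-killer", re.IGNORECASE)


def find_oom_lines(lines, process_name=None):
    """Return log lines that look like OOM-killer events.

    If process_name is given, keep only lines that also mention it.
    """
    hits = []
    for line in lines:
        if not OOM_PATTERN.search(line):
            continue
        if process_name and process_name not in line:
            continue
        hits.append(line.rstrip("\n"))
    return hits


def scan_kernel_log(path, process_name=None):
    """Scan one log file, e.g. scan_kernel_log("/var/log/kern", "java")."""
    with open(path, errors="replace") as log:
        return find_oom_lines(log, process_name)
```

Running scan_kernel_log on each of the 4 nodes and sorting the results by
timestamp would show which node died first and how far apart the others
followed.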

I'm still on 0.7.4, which was the latest release when I put Cassandra into
production.  In addition to (or instead of?) fixing the memory settings, I'm
guessing I should upgrade.

will
