We have a new 6-node cluster running 0.6.13 (due to some client-side issues we need to stay on 0.6.x for the time being) that we are injecting data into, and we have run into issues with nodes going down and then quickly back up in the ring. All nodes are affected, and we have ruled out the network layer.
The flapping seems related to GC or memtable flushes. We had things stable, but after a series of data migrations we saw some swapping, so we tuned the max heap down. This helped with the swapping, but the flapping persists. The systems have 6 cores and 24 GB of RAM; max heap is at 12G. We are using the parallel GC collector for throughput.

Our run file for starting cassandra looks like this:

exec 2>&1
ulimit -n 262144
cd /opt/cassandra-0.6.13
exec chpst -u cassandra java \
  -ea \
  -Xms4G \
  -Xmx12G \
  -XX:TargetSurvivorRatio=90 \
  -XX:+PrintGCDetails \
  -XX:+AggressiveOpts \
  -XX:+UseParallelGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=128 \
  -XX:MaxTenuringThreshold=0 \
  -Djava.rmi.server.hostname=10.20.3.155 \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcassandra-foreground=yes \
  -Dstorage-config=/etc/cassandra \
  -cp '/etc/cassandra:/opt/cassandra-0.6.13/lib/*' \
  org.apache.cassandra.thrift.CassandraDaemon <&-

Our storage conf looks like this for the memory/disk settings:

  <!--======================================================================-->
  <!-- Memory, Disk, and Performance -->
  <!--======================================================================-->
  <DiskAccessMode>mmap</DiskAccessMode>
  <RowWarningThresholdInMB>4</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>64</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>16</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>12</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>
  <DoConsistencyChecksBoolean>true</DoConsistencyChecksBoolean>
</Storage>

Any thoughts on this would be really interesting.

--
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: j...@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE