We have a new 6-node cluster running 0.6.13 (due to some client-side issues
we need to stay on 0.6.x for the time being) that we are loading data into,
and we have run into issues with nodes going down and then quickly back up in
the ring.  All nodes are affected and we have ruled out the network layer.

The flapping seems related to GC or memtable flushes.  We had things stable,
but after a series of data migrations we saw some swapping, so we tuned the
max heap down.  This helped with the swapping, but the flapping still
persists.

The systems have 6 cores and 24 GB of RAM; max heap is at 12G.  We are using
the Parallel GC collector for throughput.

Our run file for starting cassandra looks like this:

exec 2>&1

ulimit -n 262144

cd /opt/cassandra-0.6.13

exec chpst -u cassandra java \
  -ea \
  -Xms4G \
  -Xmx12G \
  -XX:TargetSurvivorRatio=90 \
  -XX:+PrintGCDetails \
  -XX:+AggressiveOpts \
  -XX:+UseParallelGC \
  -XX:+CMSParallelRemarkEnabled \
  -XX:SurvivorRatio=128 \
  -XX:MaxTenuringThreshold=0 \
  -Djava.rmi.server.hostname=10.20.3.155 \
  -Dcom.sun.management.jmxremote.port=8080 \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcassandra-foreground=yes \
  -Dstorage-config=/etc/cassandra \
  -cp '/etc/cassandra:/opt/cassandra-0.6.13/lib/*' \
  org.apache.cassandra.thrift.CassandraDaemon <&-
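
Since the run file already passes -XX:+PrintGCDetails, one way to check whether the flapping lines up with GC would be to pull the long pauses out of that output and compare them with the times the nodes drop out of the ring. Below is a rough sketch of such a filter. It assumes each collection record ends its summary with ", <seconds> secs]" and that full collections contain the text "Full GC", which is what this collector normally prints; the function name long_pauses and the 1-second default threshold are made up for illustration.

```python
import re

# Pause duration at the end of a collection summary, e.g. ", 0.0123456 secs]".
# Assumption: the first such match on a line is the total pause for that record.
PAUSE_RE = re.compile(r", (\d+\.\d+) secs\]")


def long_pauses(lines, threshold_secs=1.0):
    """Return (line_number, seconds, is_full_gc) for pauses >= threshold_secs."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = PAUSE_RE.search(line)
        if match is None:
            continue  # not a GC record (or a format this sketch does not know)
        secs = float(match.group(1))
        if secs >= threshold_secs:
            hits.append((lineno, secs, "Full GC" in line))
    return hits
```

It could be fed the captured stdout of the service, for example long_pauses(open(path)) where path is wherever the log ends up on a given node. Pauses of several seconds would be long enough to look suspicious next to the down/up events.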

Our storage conf looks like this for the memory/disk settings:


  <!--======================================================================-->
  <!-- Memory, Disk, and Performance                                        -->
  <!--======================================================================-->
  <DiskAccessMode>mmap</DiskAccessMode>
  <RowWarningThresholdInMB>4</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>

  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>64</FlushIndexBufferSizeInMB>

  <ColumnIndexSizeInKB>16</ColumnIndexSizeInKB>

  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>

  <ConcurrentReads>12</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>

  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>

  <GCGraceSeconds>864000</GCGraceSeconds>

  <DoConsistencyChecksBoolean>true</DoConsistencyChecksBoolean>
</Storage>
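
To put the memtable numbers above in perspective, here is a small back-of-the-envelope sketch of which flush threshold would be reached first for a given average write size. It assumes each operation is a single column write of roughly uniform size and that a memtable flushes as soon as either MemtableThroughputInMB or MemtableOperationsInMillions is exceeded; the function name first_flush_trigger is made up for illustration, and real accounting inside Cassandra may differ.

```python
def first_flush_trigger(avg_write_bytes, throughput_mb=64, operations_millions=0.3):
    """Estimate which memtable threshold is hit first, and after how many writes.

    Defaults mirror the MemtableThroughputInMB and MemtableOperationsInMillions
    values from the config above. Assumes uniform write sizes (a simplification).
    """
    ops_limit = round(operations_millions * 1_000_000)
    bytes_limit = throughput_mb * 1024 * 1024
    ops_to_fill = bytes_limit / avg_write_bytes  # writes needed to reach the size limit
    if ops_to_fill < ops_limit:
        return "throughput", ops_to_fill
    return "operations", ops_limit
```

Under these assumptions the break-even point is about 224 bytes per write (64 MB / 300,000): smaller writes would hit the operations limit first, larger ones the throughput limit.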

Any thoughts on this would be really interesting.

-- 
Jake Maizel
Head of Network Operations
Soundcloud

Mail & GTalk: j...@soundcloud.com
Skype: jakecloud

Rosenthaler strasse 13, 101 19, Berlin, DE
