I'm running the nodes with a JVM heap size of 6GB, and here are the related options from my storage-conf.xml. As mentioned in the first email, I left everything at the default value. I briefly googled around for "Cassandra performance tuning" etc. but haven't found a definitive guide, so any help with tuning these parameters is greatly appreciated!
<DiskAccessMode>auto</DiskAccessMode>
<RowWarningThresholdInMB>512</RowWarningThresholdInMB>
<SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
<FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
<FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
<ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
<MemtableThroughputInMB>64</MemtableThroughputInMB>
<BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
<MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
<MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
<ConcurrentReads>8</ConcurrentReads>
<ConcurrentWrites>64</ConcurrentWrites>
<CommitLogSync>periodic</CommitLogSync>
<CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
<GCGraceSeconds>864000</GCGraceSeconds>

-- Ilya

On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman <shulm...@gmail.com> wrote:
> You are running out of memory on your nodes. Before the final crash
> your nodes are probably slow due to GC. What is your memtable size?
> What cache options did you configure?
>
> On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov <ivmay...@gmail.com> wrote:
>> Hi all,
>>
>> I've just started experimenting with Cassandra to get a feel for the
>> system. I've set up a test cluster, and to get a ballpark idea of its
>> performance I wrote a simple tool that loads some toy data into the
>> system. Surprisingly, I am able to "overwhelm" my 4-node cluster with
>> writes from a single client. I'm trying to figure out whether this is
>> a problem with my setup, whether I'm hitting bugs in the Cassandra
>> codebase, or whether this is intended behavior. Sorry this email is
>> kind of long; here is the TL;DR version:
>>
>> While writing to Cassandra from a single client machine, I am able to
>> get the cluster into a bad state where nodes randomly disconnect from
>> each other, write performance plummets, and sometimes nodes even
>> crash. Further, the nodes do not recover as long as the writes
>> continue (even at a much lower rate), and sometimes do not recover at
>> all unless I restart them. I can get this to happen simply by
>> throwing data at the cluster fast enough, and I'm wondering if this
>> is a known issue or if I need to tweak my setup.
>>
>> Now, the details.
>>
>> First, a little bit about the setup: a 4-node cluster of identical
>> machines, running cassandra-0.6.0-rc1 with the fixes for
>> CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched in.
>>
>> Node specs:
>> 8-core Intel Xeon e5...@2.00ghz
>> 8GB RAM
>> 1Gbit ethernet
>> Red Hat Linux, kernel 2.6.18
>> JVM 1.6.0_19, 64-bit
>> 1TB spinning disk houses both commitlog and data directories (which I
>> know is not ideal).
>>
>> The client machine is on the same local network and has very similar
>> specs.
>>
>> The Cassandra nodes are started with the following JVM options:
>>
>> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
>> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
>>
>> I'm using default settings for all of the tunable options at the
>> bottom of storage-conf.xml. I also selected my initial tokens to
>> evenly partition the key space when the cluster was bootstrapped
>> (tokens computed as in the snippet below), and I am using the
>> RandomPartitioner.
>>
>> Now, about the test. Basically I am trying to get an idea of just how
>> fast I can make this thing go. I am writing ~250M data records into
>> the cluster, replicated 3x, using Ran Tavory's Hector client (Java),
>> writing with ConsistencyLevel.ZERO and FailoverPolicy.FAIL_FAST. The
>> client is using 32 threads, with 8 threads talking to each of the 4
>> nodes in the cluster.
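For reference, "evenly partitioned" initial tokens for RandomPartitioner just split its token space, [0, 2^127), into equal slices: node i of n gets token i * 2^127 / n. A throwaway snippet along those lines, with the node count as the only input:

    import java.math.BigInteger;

    // Prints evenly spaced InitialToken values for RandomPartitioner,
    // whose token space is [0, 2^127): node i of n gets i * 2^127 / n.
    public class InitialTokens {
        public static void main(String[] args) {
            final int n = 4;  // cluster size
            final BigInteger space = BigInteger.ONE.shiftLeft(127);  // 2^127
            for (int i = 0; i < n; i++) {
                System.out.println(space.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(n)));
            }
        }
    }

Each printed value goes into the corresponding node's <InitialToken> element in storage-conf.xml before that node is first started.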
>> Records are identified by a numeric id, and I'm writing them in
>> batches of up to 10k records per row, with each record in its own
>> column. The row key identifies the bucket into which records fall:
>> records with ids 0-9999 are written to row "0", 10000-19999 are
>> written to row "10000", etc. Each record is a JSON object with
>> ~10-20 fields.
>>
>> Records: { // Column Family
>>   0 : { // row key for the start of the bucket; buckets span a
>>         // range of up to 10000 records
>>     1 : "{ /* some JSON */ }", // Column for record with id=1
>>     3 : "{ /* some more JSON */ }", // Column for record with id=3
>>     ...
>>     9999 : "{ /* ... */ }"
>>   },
>>   10000 : { // row key for the start of the next bucket
>>     10001 : ...,
>>     10004 : ...
>>   },
>>   ...
>> }
>>
>> I am reading the data out of a local, sorted file on the client, so I
>> only write a row to Cassandra once all records for that row have been
>> read, and each row is written to exactly once. I'm using a
>> producer-consumer queue to pump data from the input reader thread to
>> the output writer threads. I found that I have to throttle the reader
>> thread heavily in order to get good behavior. If I make the reader
>> sleep for 7 seconds every 1M records, everything is fine: the data
>> loads in about an hour, half of which is spent by the reader thread
>> sleeping. In between the sleeps I see ~40-50 MB/s throughput on the
>> client's network interface, and each batch of 1M records takes ~7-8
>> seconds to write.
>>
>> Now, if I remove the 7-second sleeps on the client side, things get
>> bad after the first ~8M records are written to the cluster. Write
>> throughput drops to <5 MB/s. I start seeing messages about nodes
>> disconnecting and reconnecting in Cassandra's system.log, as well as
>> lots of GC messages:
>>
>> ...
>> INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179) InetAddress /10.15.38.88 is now dead.
>> INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving 1035998648 used; max is 1211170816
>> INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving 1066120952 used; max is 1211170816
>> INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
>> INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
>> INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving 1086023832 used; max is 1211170816
>> INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179) InetAddress /10.15.38.242 is now dead.
>> INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
>> INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
>> INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving 1051620856 used; max is 1211170816
>> ...
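One detail stands out in those GC lines: GCInspector reports "max is 1211170816", which is roughly 1.1 GiB, not the 6 GB requested via JVM_OPTS above. That suggests the -Xms/-Xmx flags may not actually be reaching the JVM (for example, if the startup script sets its own options). A quick sanity check using nothing but the JDK; run it with the same flags the nodes are supposed to get:

    // Prints the heap ceiling the JVM actually started with. If
    // -Xmx6144m were in effect, this should report ~6 GiB, not ~1.1 GiB.
    public class HeapCheck {
        public static void main(String[] args) {
            long max = Runtime.getRuntime().maxMemory();
            System.out.printf("maxMemory = %d bytes (%.2f GiB)%n",
                              max, max / (1024.0 * 1024 * 1024));
        }
    }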
>>
>> Finally followed by this and some/all nodes going down:
>>
>> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
>> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
>>     at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
>>     at java.util.concurrent.FutureTask.get(Unknown Source)
>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>>     at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>>     at java.lang.Thread.run(Unknown Source)
>> Caused by: java.lang.OutOfMemoryError: Java heap space
>>     at java.util.Arrays.copyOf(Unknown Source)
>>     at java.io.ByteArrayOutputStream.write(Unknown Source)
>>     at java.io.DataOutputStream.write(Unknown Source)
>>     at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
>>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
>>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
>>     at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>     at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>>     at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>>     at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
>>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>>     at java.util.concurrent.FutureTask.run(Unknown Source)
>>     ... 3 more
>>
>> At first I thought that with ConsistencyLevel.ZERO I must be doing
>> async writes so Cassandra can't push back on the client threads (by
>> blocking them), thus the server is getting overwhelmed. But, I would
>> expect it to start dropping data and not crash in that case (after
>> all, I did say ZERO so I can't expect any reliability, right?).
>> However, I see similar slowdown / node dropout behavior when I set
>> the consistency level to ONE. Does Cassandra push back on writers
>> under heavy load? Is there some magic setting I need to tune to have
>> it not fall over? Do I just need a bigger cluster? Thanks in advance,
>>
>> -- Ilya
>>
>> P.S. I realize that it's still handling a LOT of data with just 4
>> nodes, and in practice nobody would run a system that gets 125k
>> writes per second on top of a 4 node cluster. I was just surprised
>> that I could make Cassandra fall over at all using a single client
>> that's pumping data at 40-50 MB/s.
>>
>
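On the push-back question: with ConsistencyLevel.ZERO the write is acknowledged before it is applied anywhere, so the cluster has no way to slow a fast client down, and the throttling has to happen client-side. A bounded queue between the reader and the writer threads gives that for free and replaces hand-tuned sleeps: once the writers fall behind, put() blocks the reader until they catch up. A minimal sketch, assuming records travel as batches of strings; writeBatch() is a placeholder for the actual Hector insert call:

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThrottledLoader {
        // Small, bounded buffer between the reader and the writer threads.
        // When it fills up, put() blocks and the reader slows to write speed.
        private final BlockingQueue<List<String>> queue =
                new ArrayBlockingQueue<List<String>>(64);

        /** Called by the reader thread; blocks when writers can't keep up. */
        public void submit(List<String> batch) throws InterruptedException {
            queue.put(batch);
        }

        /** One of these runs in each writer thread (e.g. 8 per node). */
        public Runnable writer() {
            return new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            writeBatch(queue.take());
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();  // exit cleanly
                    }
                }
            };
        }

        // Placeholder for the real Hector/Thrift batch insert.
        private void writeBatch(List<String> batch) { /* ... */ }
    }

Writing at ConsistencyLevel.ONE adds some natural pacing, since each call then blocks until one replica acks, but as the report above shows that alone does not keep memtable flushes and compaction from falling behind, so some client-side throttle is still worth keeping.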