Hi all,

I've just started experimenting with Cassandra to get a feel for the
system. I've set up a test cluster and to get a ballpark idea of its
performance I wrote a simple tool to load some toy data into the
system. Surprisingly, I am able to "overwhelm" my 4-node cluster with
writes from a single client. I'm trying to figure out if this is a
problem with my setup, if I'm hitting bugs in the Cassandra codebase,
or if this is intended behavior. Sorry this email is kind of long; here
is the TL;DR version:

While writing to Cassandra from a single client machine, I am able to get the
cluster into a bad state, where nodes are randomly disconnecting from
each other, write performance plummets, and sometimes nodes even
crash. Further, the nodes do not recover as long as the writes
continue (even at a much lower rate), and sometimes do not recover at
all unless I restart them. I can get this to happen simply by throwing
data at the cluster fast enough, and I'm wondering if this is a known
issue or if I need to tweak my setup.

Now, the details.

First, a little bit about the setup:

4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
in. Node specs:
8-core Intel Xeon E5... @ 2.00 GHz
8 GB RAM
1 Gbit Ethernet
Red Hat Linux 2.6.18
JVM 1.6.0_19 64-bit
1TB spinning disk houses both commitlog and data directories (which I
know is not ideal).
The client machine is on the same local network and has very similar specs.

The cassandra nodes are started with the following JVM options:

./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
-XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"

I'm using default settings for all of the tunable stuff at the bottom
of storage-conf.xml. I also selected my initial tokens to evenly
partition the key space when the cluster was bootstrapped. I am using
the RandomPartitioner.

Now, about the test. Basically I am trying to get an idea of just how
fast I can make this thing go. I am writing ~250M data records into
the cluster, replicated at 3x, using Ran Tavory's Hector client
(Java), writing with ConsistencyLevel.ZERO and
FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
threads talking to each of the 4 nodes in the cluster. Records are
identified by a numeric id, and I'm writing them in batches of up to
10k records per row, with each record in its own column. The row key
identifies the bucket into which records fall. So, records with ids 0
- 9999 are written to row "0", 10000 - 19999 are written to row
"10000", etc. Each record is a JSON object with ~10-20 fields.

Records: {  // Column Family
  0 : {  // row key for the start of the bucket; buckets span a range
         // of up to 10000 records
    1 : "{ /* some JSON */ }",  // Column for record with id=1
    3 : "{ /* some more JSON */ }",  // Column for record with id=3
    ...
    9999 : "{ /* ... */ }"
  },
  10000 : {  // row key for the start of the next bucket
    10001 : ...,
    10004 : ...,
    ...
  }
}
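In case the bucketing isn't clear: the row key is just the record id
rounded down to the nearest 10000, and because the input is sorted I can
flush a row as soon as an id from the next bucket shows up. A rough
sketch (the class and method names are mine, and the actual Hector batch
insert is omitted):

import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the batching logic: one row per 10k-wide bucket, one column
// per record. The input is sorted by id, so a row can be flushed as soon
// as an id from the next bucket is seen.
class BucketBatcher {
    static final long BUCKET_SIZE = 10000;

    private long currentBucket = -1;
    private final Map<Long, String> columns = new LinkedHashMap<Long, String>();

    void add(long recordId, String json) {
        long bucket = (recordId / BUCKET_SIZE) * BUCKET_SIZE;  // row key, e.g. 12345 -> 10000
        if (bucket != currentBucket && !columns.isEmpty()) {
            flush();                                           // previous row is complete
        }
        currentBucket = bucket;
        columns.put(recordId, json);                           // column name = record id, value = JSON blob
    }

    void flush() {
        // hand the completed row off to the writer threads here
        // (e.g. queue.put(...)); Hector call omitted
        columns.clear();
    }
}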

I am reading the data out of a local, sorted file on the client, so I
only write a row to Cassandra once all records for that row have been
read, and each row is written to exactly once. I'm using a
producer-consumer queue to pump data from the input reader thread to
the output writer threads. I found that I have to throttle the reader
thread heavily in order to get good behavior. So, if I make the reader
sleep for 7 seconds every 1M records, everything is fine - the data
loads in about an hour, half of which is spent by the reader thread
sleeping. In between the sleeps, I see ~40-50 MB/s of throughput on the
client's network interface, and it takes ~7-8 seconds to write each
batch of 1M records.
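The throttling itself is nothing fancy; roughly the reader side below,
with the queue capacity, class name, and stubbed-out file parsing made
up for illustration:

import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the reader-side throttling. The reader pushes completed rows
// onto a bounded queue that the 32 writer threads drain; every ~1M
// records it sleeps for 7 seconds.
class ThrottledReader implements Runnable {
    static final int RECORDS_PER_PAUSE = 1000000;
    static final long PAUSE_MS = 7000;

    private final BlockingQueue<List<String>> queue =
            new ArrayBlockingQueue<List<String>>(64);

    public void run() {
        int sinceLastPause = 0;
        List<String> row;
        while ((row = readNextRow()) != null) {      // stub: parses the sorted input file
            try {
                queue.put(row);                       // blocks when the writers fall behind
                sinceLastPause += row.size();
                if (sinceLastPause >= RECORDS_PER_PAUSE) {
                    Thread.sleep(PAUSE_MS);           // the 7-second pause per 1M records
                    sinceLastPause = 0;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    private List<String> readNextRow() { return null; }  // file parsing omitted
}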

Now, if I remove the 7-second sleeps on the client side, things get
bad after the first ~8M records are written to the cluster. Write
throughput drops to <5 MB/s. I start seeing messages about nodes
disconnecting and reconnecting in Cassandra's system.log, as well as
lots of GC messages:

...
 INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179) InetAddress /10.15.38.88 is now dead.
 INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving 1035998648 used; max is 1211170816
 INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving 1066120952 used; max is 1211170816
 INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
 INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
 INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving 1086023832 used; max is 1211170816
 INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179) InetAddress /10.15.38.242 is now dead.
 INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
 INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
 INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving 1051620856 used; max is 1211170816
...

This is finally followed by the error below, with some or all of the
nodes going down:

ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
        at java.util.concurrent.FutureTask.get(Unknown Source)
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
        at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Unknown Source)
        at java.io.ByteArrayOutputStream.write(Unknown Source)
        at java.io.DataOutputStream.write(Unknown Source)
        at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        ... 3 more

At first I thought that with ConsistencyLevel.ZERO the writes must be
asynchronous, so Cassandra can't push back on the client threads by
blocking them, and the servers get overwhelmed. But in that case I would
expect it to start dropping data rather than crash (after all, I did say
ZERO, so I can't expect any reliability, right?).
However, I see similar slowdown / node dropout behavior when I set the
consistency level to ONE. Does Cassandra push back on writers under
heavy load? Is there some magic setting I need to tune to have it not
fall over? Do I just need a bigger cluster? Thanks in advance,

-- Ilya

P.S. I realize that it's still handling a LOT of data with just 4
nodes, and in practice nobody would run a system that gets 125k writes
per second on top of a 4-node cluster. I was just surprised that I
could make Cassandra fall over at all using a single client that's
pumping data at 40-50 MB/s.
