Do you see one of the disks used by Cassandra fill up when a node crashes?
On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov <ivmay...@gmail.com> wrote:
> I'm running the nodes with a JVM heap size of 6GB, and here are the
> related options from my storage-conf.xml. As mentioned in the first
> email, I left everything at the default values. I briefly googled
> around for "Cassandra performance tuning" etc., but haven't found a
> definitive guide ... any help with tuning these parameters is greatly
> appreciated!
>
> <DiskAccessMode>auto</DiskAccessMode>
> <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
> <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
> <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
> <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
> <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
> <MemtableThroughputInMB>64</MemtableThroughputInMB>
> <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
> <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
> <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
> <ConcurrentReads>8</ConcurrentReads>
> <ConcurrentWrites>64</ConcurrentWrites>
> <CommitLogSync>periodic</CommitLogSync>
> <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
> <GCGraceSeconds>864000</GCGraceSeconds>
>
> -- Ilya
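(A side note on the memtable settings quoted above, relevant to Boris's question below: in 0.6 a memtable is flushed when any one of the throughput, operation-count, or age thresholds is crossed, and the throughput threshold counts serialized bytes, so the live JVM footprint of a memtable is typically several times larger. A minimal sketch of that trigger logic, using descriptive names rather than actual Cassandra identifiers:)

    // Illustrative only -- descriptive names, not actual Cassandra identifiers.
    public class MemtableThresholds {
        static final long MAX_BYTES  = 64L * 1024L * 1024L; // MemtableThroughputInMB = 64
        static final long MAX_OPS    = 300000L;             // MemtableOperationsInMillions = 0.3
        static final long MAX_AGE_MS = 60L * 60L * 1000L;   // MemtableFlushAfterMinutes = 60

        // A memtable is queued for flush once ANY of the three limits is crossed.
        static boolean shouldFlush(long serializedBytes, long opCount, long ageMs) {
            return serializedBytes >= MAX_BYTES
                || opCount >= MAX_OPS
                || ageMs >= MAX_AGE_MS;
        }
    }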
> On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman <shulm...@gmail.com> wrote:
> > You are running out of memory on your nodes. Before the final crash
> > your nodes are probably slow due to GC. What is your memtable size?
> > What cache options did you configure?
> >
> > On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov <ivmay...@gmail.com> wrote:
> >> Hi all,
> >>
> >> I've just started experimenting with Cassandra to get a feel for the
> >> system. I've set up a test cluster, and to get a ballpark idea of its
> >> performance I wrote a simple tool to load some toy data into it.
> >> Surprisingly, I am able to overwhelm my 4-node cluster with writes
> >> from a single client. I'm trying to figure out whether this is a
> >> problem with my setup, whether I'm hitting bugs in the Cassandra
> >> codebase, or whether this is intended behavior. Sorry this email is
> >> kind of long; here is the TL;DR version:
> >>
> >> While writing to Cassandra from a single client machine, I am able to
> >> get the cluster into a bad state, where nodes randomly disconnect
> >> from each other, write performance plummets, and sometimes nodes even
> >> crash. Further, the nodes do not recover as long as the writes
> >> continue (even at a much lower rate), and sometimes do not recover at
> >> all unless I restart them. I can get this to happen simply by
> >> throwing data at the cluster fast enough, and I'm wondering whether
> >> this is a known issue or whether I need to tweak my setup.
> >>
> >> Now, the details.
> >>
> >> First, a little bit about the setup:
> >>
> >> 4-node cluster of identical machines, running cassandra-0.6.0-rc1
> >> with the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936
> >> patched in. Node specs:
> >> 8-core Intel Xeon e5...@2.00ghz
> >> 8GB RAM
> >> 1Gbit ethernet
> >> Red Hat Linux 2.6.18
> >> JVM 1.6.0_19 64-bit
> >> 1TB spinning disk housing both the commitlog and data directories
> >> (which I know is not ideal).
> >> The client machine is on the same local network and has very similar
> >> specs.
> >>
> >> The Cassandra nodes are started with the following JVM options:
> >>
> >> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
> >> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
> >>
> >> I'm using the default settings for all of the tunable options at the
> >> bottom of storage-conf.xml. I also selected my initial tokens to
> >> evenly partition the key space when the cluster was bootstrapped. I
> >> am using the RandomPartitioner.
> >>
> >> Now, about the test. Basically I am trying to get an idea of just how
> >> fast I can make this thing go. I am writing ~250M data records into
> >> the cluster, replicated 3x, using Ran Tavory's Hector client (Java),
> >> writing with ConsistencyLevel.ZERO and FailoverPolicy.FAIL_FAST. The
> >> client is using 32 threads, with 8 threads talking to each of the 4
> >> nodes in the cluster. Records are identified by a numeric id, and I'm
> >> writing them in batches of up to 10k records per row, with each
> >> record in its own column. The row key identifies the bucket into
> >> which records fall. So, records with ids 0 - 9999 are written to row
> >> "0", 10000 - 19999 are written to row "10000", etc. Each record is a
> >> JSON object with ~10-20 fields.
> >>
> >> Records: { // Column Family
> >>   0 : { // row key for the start of the bucket; buckets span a range
> >>         // of up to 10000 records
> >>     1 : "{ /* some JSON */ }",      // column for record with id=1
> >>     3 : "{ /* some more JSON */ }", // column for record with id=3
> >>     ...
> >>     9999 : "{ /* ... */ }"
> >>   },
> >>   10000 : { // row key for the start of the next bucket
> >>     10001 : ...,
> >>     10004 : ...
> >>   }
> >> }
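(To make the bucketing concrete, here is a minimal sketch of the row-key computation the layout above implies; the class and helper names are hypothetical:)

    public class Buckets {
        // Records with ids 0-9999 go to row "0", 10000-19999 to row "10000", etc.
        static String bucketRowKey(long recordId) {
            long bucketStart = (recordId / 10000L) * 10000L;
            return String.valueOf(bucketStart);
        }

        public static void main(String[] args) {
            System.out.println(bucketRowKey(3));     // prints "0"
            System.out.println(bucketRowKey(10004)); // prints "10000"
        }
    }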
> >>
> >> I am reading the data out of a local, sorted file on the client, so I
> >> only write a row to Cassandra once all records for that row have been
> >> read, and each row is written exactly once. I'm using a
> >> producer-consumer queue to pump data from the input reader thread to
> >> the output writer threads. I found that I have to throttle the reader
> >> thread heavily in order to get good behavior: if I make the reader
> >> sleep for 7 seconds after every 1M records, everything is fine - the
> >> data loads in about an hour, half of which is spent by the reader
> >> thread sleeping. While the reader is awake, I see ~40-50 MB/s
> >> throughput on the client's network interface, and it takes ~7-8
> >> seconds to write each batch of 1M records.
> >>
> >> Now, if I remove the 7-second sleeps on the client side, things go
> >> bad after the first ~8M records are written to the cluster. Write
> >> throughput drops to <5 MB/s. I start seeing messages about nodes
> >> disconnecting and reconnecting in Cassandra's system.log, as well as
> >> lots of GC messages:
> >>
> >> ...
> >> INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179) InetAddress /10.15.38.88 is now dead.
> >> INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving 1035998648 used; max is 1211170816
> >> INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving 1066120952 used; max is 1211170816
> >> INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
> >> INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
> >> INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving 1086023832 used; max is 1211170816
> >> INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179) InetAddress /10.15.38.242 is now dead.
> >> INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179) InetAddress /10.15.38.55 is now dead.
> >> INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568) InetAddress /10.15.38.55 is now UP
> >> INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving 1051620856 used; max is 1211170816
> >> ...
> >>
> >> Finally followed by this, and some/all nodes going down:
> >>
> >> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask
> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space
> >>     at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
> >>     at java.util.concurrent.FutureTask.get(Unknown Source)
> >>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
> >>     at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> >>     at java.lang.Thread.run(Unknown Source)
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>     at java.util.Arrays.copyOf(Unknown Source)
> >>     at java.io.ByteArrayOutputStream.write(Unknown Source)
> >>     at java.io.DataOutputStream.write(Unknown Source)
> >>     at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
> >>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
> >>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
> >>     at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
> >>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
> >>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
> >>     at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
> >>     at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
> >>     at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
> >>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> >>     at java.util.concurrent.FutureTask.run(Unknown Source)
> >>     ... 3 more
> >>
> >> At first I thought that with ConsistencyLevel.ZERO I must be doing
> >> async writes, so Cassandra can't push back on the client threads (by
> >> blocking them), and thus the server is getting overwhelmed. But I
> >> would expect it to start dropping data rather than crash in that case
> >> (after all, I did say ZERO, so I can't expect any reliability,
> >> right?). However, I see similar slowdown / node-dropout behavior when
> >> I set the consistency level to ONE.
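(Two editorial observations on the numbers above. First, the GCInspector lines report "max is 1211170816", about 1.2GB, far below the 6GB requested with -Xmx6144m earlier in the thread; this suggests the JVM_OPTS passed on the command line may not have taken effect on the nodes. Second, a quick back-of-envelope check of the write rate, which is consistent with the 125k/sec figure in the P.S. below:)

    public class RateCheck {
        public static void main(String[] args) {
            // Numbers taken from the thread: ~1M records per 7-8 s batch,
            // ~250M records loaded in ~1 hour with half the time spent sleeping.
            double perBatch = 1e6 / 7.5;         // ~133k records/s while writing
            double overall  = 250e6 / (30 * 60); // ~139k records/s of active writing
            // With 3x replication, every client write becomes three replica
            // writes, so the 4-node cluster absorbs roughly 400k writes/s.
            double clusterWrites = perBatch * 3;
            System.out.printf("%.0f %.0f %.0f%n", perBatch, overall, clusterWrites);
        }
    }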
> >>
> >> Does Cassandra push back on writers under heavy load? Is there some
> >> magic setting I need to tune to keep it from falling over? Do I just
> >> need a bigger cluster? Thanks in advance,
> >>
> >> -- Ilya
> >>
> >> P.S. I realize that this is still a LOT of data for just 4 nodes to
> >> handle, and in practice nobody would run a system that gets 125k
> >> writes per second on top of a 4-node cluster. I was just surprised
> >> that I could make Cassandra fall over at all using a single client
> >> that's pumping data at 40-50 MB/s.
> >>
> >
>
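(On the client-throttling question raised above: with ConsistencyLevel.ZERO the Thrift calls return before the server has done any work, so nothing in the write path naturally blocks the client. A common client-side alternative to timed sleeps is a bounded producer-consumer queue, so the reader stalls automatically whenever the writers fall behind; note this only provides real backpressure when the write call itself blocks, i.e. ConsistencyLevel.ONE or higher. A minimal sketch, where readNextBatch and writeBatchToCassandra are hypothetical stand-ins for the real file reader and Hector writer:)

    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Sketch only: a bounded queue between the reader and the writer threads
    // replaces hand-tuned sleeps with automatic backpressure. With ZERO the
    // write calls return immediately, so the queue never fills and provides
    // no throttling; with ONE or higher the writers block and the reader is
    // paced accordingly.
    public class ThrottledLoader {
        // Capacity bounds how far the reader may run ahead of the writers.
        private final BlockingQueue<List<String>> queue =
                new ArrayBlockingQueue<List<String>>(16);

        void readerLoop() throws InterruptedException {
            List<String> batch;
            while ((batch = readNextBatch()) != null) {
                queue.put(batch); // blocks while the queue is full
            }
        }

        void writerLoop() throws InterruptedException {
            while (true) {
                List<String> batch = queue.take(); // blocks while the queue is empty
                writeBatchToCassandra(batch);      // hypothetical Hector call
            }
        }

        // Hypothetical stand-ins for the real file reader and Hector writer.
        private List<String> readNextBatch() { return null; }
        private void writeBatchToCassandra(List<String> batch) { }
    }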