There seems to have been a fair amount of discussion on memory-related
issues, so I apologize if this exact situation has come up before.


I am currently in the process of load testing a metrics platform I have
written which uses Cassandra, and I have run into some very troubling issues.
The application is writing quite heavily, about 1000-2000 updates (columns)
per second using batch mutates of 20 columns each. This is divided between
creating new rows and adding columns to a fairly limited number of existing
index rows (<30). Nearly all of these updates are read within 10 seconds, and
none contain any significant amount of data (generally much less than 100
bytes of data that I specify). Initially the test hums along nicely, but
after some amount of time (1-2 hours) Cassandra crashes with an out-of-memory
error. Unfortunately I have not had the opportunity to watch the test as it
crashes, but it has happened in 2/2 tests.
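
For context, the write/read pattern looks roughly like the sketch below. This
is a simplified, from-memory approximation of the Pelops calls I am using:
the pool, column family and row key names are made up for illustration, and
the exact method signatures may differ slightly between Pelops versions.

    import java.util.List;

    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.scale7.cassandra.pelops.Mutator;
    import org.scale7.cassandra.pelops.Pelops;
    import org.scale7.cassandra.pelops.Selector;

    public class MetricsSketch {

        // One batch mutate of ~20 small columns (each well under 100 bytes),
        // sent at consistency level ONE. "metricsPool" and "Metrics" are
        // placeholder names, not the real ones from my app.
        void writeBatch(String rowKey, long[] values) {
            Mutator mutator = Pelops.createMutator("metricsPool");
            for (int i = 0; i < values.length; i++) {
                mutator.writeColumn("Metrics", rowKey,
                        mutator.newColumn("metric" + i, Long.toString(values[i])));
            }
            mutator.execute(ConsistencyLevel.ONE);
        }

        // The same columns are read back within about 10 seconds at level ALL.
        List<Column> readBack(String rowKey) {
            Selector selector = Pelops.createSelector("metricsPool");
            return selector.getColumnsFromRow("Metrics", rowKey, false,
                    ConsistencyLevel.ALL);
        }
    }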


This is quite annoying, but the absolutely TERRIFYING behaviour is that when
I restart Cassandra, it starts replaying the commit logs, then crashes with
an out-of-memory error again. Restart a second time, crash with OOM; it
seems to get through about 3/4 of the commit logs. Just to be absolutely
explicit, I am not trying to insert or read at this point, just recover the
previous updates. Unless somebody can suggest a way to recover the commit
logs, I have effectively lost my data. The only way I have found to recover
is to wipe the data directories. It does not matter right now given that it
is only a test, but this behaviour is completely unacceptable for a
production system.


Here is some information about the system which is probably relevant. Let me
know if any additional details about my application would help sort out this
issue:

- Cassandra 0.7 beta2

- DB machine: EC2 m1.large with the commit log directory on an EBS volume
and the data directory on ephemeral storage.

- OS: Ubuntu Server 10.04

- With the exception of changing JMX settings, no memory or JVM changes were
made to the options in cassandra-env.sh.

- In cassandra.yaml, I reduced binary_memtable_throughput_in_mb to 100 in my
second test to try to follow the heap memory calculation formula (see the
rough calculation after this list); I have 8 column families.

- I am using the Sun JVM, specifically "build 1.6.0_20-b02".

- The app is written in Java and I am using the latest Pelops library; I am
sending updates at consistency level ONE and reading them at level ALL.
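
For what it is worth, the back-of-the-envelope heap calculation I was trying
to follow works out roughly as follows (this is my reading of the memtable
tuning guidance, so treat the exact formula as approximate):

    memtable throughput * 3 * number of column families + ~1 GB overhead
      = 100 MB * 3 * 8 + ~1 GB
      = ~3.4 GB of heap

which I expected to fit comfortably within the roughly 7.5 GB of memory on an
m1.large.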


I have been fairly impressed with Cassandra overall and, given that I am
using a beta version, I don't expect fully polished behaviour. What is
unacceptable, and quite frankly nearly unbelievable, is the fact that
Cassandra can't seem to recover from the error and I am losing data.


Dan Hendry
