There seems to have been a fair amount of discussion on memory-related issues, so I apologize if this exact situation has come up before.
I am currently load testing a metrics platform I have written on top of Cassandra, and I have run into some very troubling issues. The application writes quite heavily: about 1000-2000 column updates per second, issued as batch mutates of 20 columns each. These are split between creating new rows and adding columns to a fairly small number (<30) of existing index rows. Nearly all of these updates are read back within 10 seconds, and none contain much data (generally well under 100 bytes per value that I specify). A rough sketch of the write pattern is included at the end of this message.

Initially the test hums along nicely, but after some amount of time (1-2 hours) Cassandra crashes with an out of memory error. Unfortunately I have not yet been able to watch a test at the moment it crashes, but it has happened in 2 out of 2 tests. That is quite annoying, but the absolutely TERRIFYING behaviour is what happens when I restart Cassandra: it starts replaying the commit logs, then crashes with an out of memory error again. Restart a second time, and it crashes with OOM again; it seems to get through about 3/4 of the commit logs. To be absolutely explicit, I am not trying to insert or read anything at this point, just recover the previous updates. Unless somebody can suggest a way to recover the commit logs, I have effectively lost my data. The only way I have found to recover is to wipe the data directories. That does not matter much right now, given that this is only a test, but this behaviour would be completely unacceptable in a production system.

Here is the information about the system that is probably relevant; let me know if any additional details about my application would help sort out this issue:

- Cassandra 0.7 beta 2.
- DB machine: EC2 m1.large, with the commit log directory on an EBS volume and the data directory on ephemeral storage.
- OS: Ubuntu Server 10.04.
- With the exception of changing JMX settings, no memory or JVM options were changed in cassandra-env.sh.
- In cassandra.yaml, I reduced binary_memtable_throughput_in_mb to 100 for my second test to try to follow the heap memory calculation formula; I have 8 column families (my reading of that formula is also at the end of this message).
- I am using the Sun JVM, specifically build 1.6.0_20-b02.
- The application is written in Java and uses the latest Pelops library. I send updates at consistency level ONE and read them at level ALL.

I have been fairly impressed with Cassandra overall, and given that I am using a beta version I don't expect fully polished behaviour. What is unacceptable, and quite frankly nearly unbelievable, is that Cassandra can't seem to recover from the error and I am losing data.

Dan Hendry
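
As mentioned above, here is a rough sketch of the write pattern. This is not my actual code: the real application goes through Pelops, and the keyspace, column family, and key names below are made up purely for illustration. It just shows the shape of the work, one Thrift batch_mutate with about 20 small columns against a row, written at consistency level ONE:

    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.util.*;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;

    public class WriteSketch {
        private static final Charset UTF8 = Charset.forName("UTF-8");

        public static void main(String[] args) throws Exception {
            // Connect over framed Thrift (the transport Pelops uses as well).
            TFramedTransport transport = new TFramedTransport(
                    new org.apache.thrift.transport.TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();
            client.set_keyspace("Metrics"); // hypothetical keyspace name

            // One batch of ~20 small columns against a single row.
            List<Mutation> mutations = new ArrayList<Mutation>();
            long ts = System.currentTimeMillis() * 1000;
            for (int i = 0; i < 20; i++) {
                Column col = new Column(
                        ByteBuffer.wrap(("metric-" + i).getBytes(UTF8)),
                        ByteBuffer.wrap("v".getBytes(UTF8)), // values are well under 100 bytes
                        ts);
                ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                cosc.setColumn(col);
                Mutation m = new Mutation();
                m.setColumn_or_supercolumn(cosc);
                mutations.add(m);
            }

            Map<String, List<Mutation>> byCf = new HashMap<String, List<Mutation>>();
            byCf.put("MetricData", mutations); // hypothetical column family name
            Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            mutationMap.put(ByteBuffer.wrap("row-key".getBytes(UTF8)), byCf);

            // Writes go in at ONE; the application reads the same rows back at ALL
            // within about 10 seconds.
            client.batch_mutate(mutationMap, ConsistencyLevel.ONE);

            transport.close();
        }
    }

The real code does this at a sustained 50-100 batches per second, split between new rows and the small set of index rows.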
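And for completeness, the heap sizing rule of thumb I was trying to follow (as I understand it from the wiki, so I may well have it wrong) works out roughly as follows for 8 column families at a 100 MB memtable threshold each:

    memtable throughput (MB) * 3 * number of hot column families + ~1 GB for internal structures
    = 100 MB * 3 * 8 + 1 GB
    = roughly 3.4 GB of heap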