Although I didn't get an answer on this, it's worth noting that removing the compactions_in_progress folder resolved the issue.
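For anyone hitting the same thing, here is a rough sketch of that workaround in Python. It moves the folder aside rather than deleting it, so it can be put back if needed. It assumes the node is stopped and the data directory is /var/lib/cassandra/data, as in the paths quoted below. The function name and the backup location are made up for illustration, and I have only seen this help on 2.1.6.

```python
import shutil
import time
from pathlib import Path


def move_compactions_in_progress(system_dir, backup_root):
    """Move every compactions_in_progress-* directory out of system_dir.

    Hypothetical helper: run it only while Cassandra is stopped.
    Returns the list of new locations of the moved directories.
    """
    system_dir = Path(system_dir)
    # One timestamped backup directory per run, created only if needed.
    backup_dir = Path(backup_root) / time.strftime("cip-backup-%Y%m%d-%H%M%S")
    moved = []
    for table_dir in sorted(system_dir.glob("compactions_in_progress-*")):
        if not table_dir.is_dir():
            continue
        backup_dir.mkdir(parents=True, exist_ok=True)
        target = backup_dir / table_dir.name
        shutil.move(str(table_dir), str(target))
        moved.append(target)
    return moved
```

Calling move_compactions_in_progress("/var/lib/cassandra/data/system", "/var/tmp") before starting the node would leave the other system tables untouched; /var/tmp is only an example destination.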
From: Walsh, Stephen
Sent: 17 September 2015 16:37
To: 'user@cassandra.apache.org' <user@cassandra.apache.org>
Subject: RE: Cassandra shutdown during large number of compactions - now fails to start with OOM Exception

Some more info,

Looking at the Java memory dump file, I see about 400 SSTableScanners, one for each of our column families. Each is about 200MB in size. And (from what I can see) all of them are reading from a "compactions_in_progress-ka-000000-Data.db" file:

dfile org.apache.cassandra.io.compress.CompressedRandomAccessReader
path = "/var/lib/cassandra/data/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-ka-71661-Data.db" 131840 104

Steve

From: Walsh, Stephen
Sent: 17 September 2015 15:33
To: user@cassandra.apache.org
Subject: Cassandra shutdown during large number of compactions - now fails to start with OOM Exception

Hey all,

I was hoping someone had a similar issue. We're using 2.1.6 and shut down a testbed in AWS thinking we were finished with it. We started it back up today and saw that only 2 of 4 nodes came up.

It seems there was a lot of compaction happening at the time it was shut down. Cassandra tries to start up and we get an OutOfMemory exception:

INFO 13:45:57 Initializing system.range_xfers
INFO 13:45:57 Initializing system.schema_keyspaces
INFO 13:45:57 Opening /var/lib/cassandra/data/system/schema_keyspaces-b0f2235744583cdb9631c43e59ce3676/system-schema_keyspaces-ka-21807 (19418 bytes)
java.lang.OutOfMemoryError: Java heap space
Dumping heap to /var/log/cassandra/java_pid3011.hprof ...
Heap dump file created [7751760805 bytes in 52.439 secs]
ERROR 13:47:11 Exception encountered during startup
java.lang.OutOfMemoryError: Java heap space

It's not related to the key_cache; we removed this and the issue is still present. So we believe it's retrying all the compactions that were in progress when it went down.
We've modified the heap size to be half of the system's RAM (8GB in this case). At the moment the only workaround we have is to empty the data / saved_cache / commit_log folders and let the node re-sync with the other nodes.

Has anyone seen this before, and what did they do to solve it? Can we remove unfinished compactions?

Steve