We're getting fairly reproducible OOMs on a 2-node cluster running
Cassandra 1.2.11, typically under heavy read load. A sample of the
stack traces is at
https://gist.github.com/KlausBrunner/7820902 - they all fail
somewhere below table.getRow(), though I don't know whether that's
part of query processing, compaction, or something else.

- The main CFs contain some 100k rows, none of them particularly wide.
- Heap dumps invariably show a single huge byte array (~1.6 GiB,
associated with the OOM'ing thread) hogging > 80% of the Java heap.
The array seems to contain all/many rows of one CF. (How we've been
inspecting the dumps is sketched after this list.)
- We're moderately certain there's no "killer query" with a huge
result set involved here, but we can't see exactly what triggers this.
- We've tried switching to LeveledCompaction, to no avail (the ALTER
TABLE we ran is quoted after this list).
- Xms/Xmx are both set to about 4 GB (see the cassandra-env.sh lines
after this list).
- The logs show the usual signs of panic ("flushing memtables")
before actually OOMing. This scenario seems to occur often or even
always right after a compaction, but the evidence isn't conclusive.
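
For reference, this is roughly how we've been looking at the heap
(pid and dump path below are placeholders, not our actual values):

    # quick histogram of the top heap consumers, by class
    jmap -histo <pid> | head -n 20
    # full dump for offline analysis; its dominator tree is where the
    # single ~1.6 GiB byte[] shows up
    jmap -dump:format=b,file=/tmp/cassandra.hprof <pid>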
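
The compaction change was essentially the following, run from cqlsh
(keyspace/table names changed here):

    -- per-CF switch from the SizeTiered default to Leveled compaction
    ALTER TABLE myks.mycf
      WITH compaction = { 'class' : 'LeveledCompactionStrategy' };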
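
And the heap settings in conf/cassandra-env.sh (the HEAP_NEWSIZE value
below is illustrative; the relevant part is the ~4 GB MAX_HEAP_SIZE):

    MAX_HEAP_SIZE="4G"
    HEAP_NEWSIZE="800M"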

I'm somewhat worried that Cassandra would read so much data into a
single contiguous byte[] at any point. Could this be related to
compaction? Any ideas what we could do about this?

Thanks

Klaus
