Can you monitor Cassandra-level metrics like the ones in http://github.com/jbellis/cassandra-munin-plugins?
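(As far as I know those plugins just read Cassandra's JMX beans, so even without munin you can poll the same counters directly. Here is an untested sketch; the JMX port (8080), the bean names, and the "PendingTasks" attribute are assumptions for a default 0.6 install, so check them in jconsole first.)

    // Untested sketch: poll a few Cassandra thread-pool beans over JMX.
    // Port and bean/attribute names are assumptions for a default 0.6 setup.
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class StagePoller {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Thread-pool stages are exposed under org.apache.cassandra.concurrent
                String[] stages = { "ROW-READ-STAGE", "ROW-MUTATION-STAGE",
                                    "MESSAGE-DESERIALIZER-POOL" };
                for (String stage : stages) {
                    ObjectName bean = new ObjectName(
                            "org.apache.cassandra.concurrent:type=" + stage);
                    System.out.println(stage + " pending: "
                            + mbs.getAttribute(bean, "PendingTasks"));
                }
            } finally {
                connector.close();
            }
        }
    }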
The usual culprit is compaction, but your compacted row size is small, so nothing else really comes to mind. (You should check the system keyspace too, though; hinted handoff (HH) rows can get large.)

On Fri, May 21, 2010 at 2:36 PM, Ran Tavory <ran...@gmail.com> wrote:
> I see some OOM on one of the hosts in the cluster, and I wonder if there's
> a formula that would help me calculate the required memory setting given
> parameters x, y, z...
>
> In short, I need advice on:
> 1. How to set up proper heap space, and which parameters I should look at
> when doing so.
> 2. Help setting up an alert policy and defining counter measures or SOS
> steps an admin can take to prevent further degradation of service when
> alerts fire.
>
> The OOM happens at the row mutation stage, after extensive GC activity
> (log tail below).
> The server has 16G of physical RAM and a 4G Java heap; no other significant
> processes run on the same server. I actually upped the Java heap to 8G, but
> it OOMed again...
> Most of my settings are the defaults, with a few keyspaces and a few CFs in
> each KS. Here's the cfstats output for the largest and most heavily used CF
> (reads/writes are currently stopped, but the data is there).
>
> Keyspace: outbrain_kvdb
>         Read Count: 3392
>         Read Latency: 160.33135908018866 ms.
>         Write Count: 2005839
>         Write Latency: 0.029233923061621595 ms.
>         Pending Tasks: 0
>                 Column Family: KvImpressions
>                 SSTable count: 8
>                 Space used (live): 21923629878
>                 Space used (total): 21923629878
>                 Memtable Columns Count: 69440
>                 Memtable Data Size: 9719364
>                 Memtable Switch Count: 26
>                 Read Count: 3392
>                 Read Latency: NaN ms.
>                 Write Count: 1998821
>                 Write Latency: 0.018 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 200000
>                 Key cache size: 11661
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 302
>                 Compacted row maximum size: 22387
>                 Compacted row mean size: 641
>
> I'm also attaching a few graphs of "the incident"; I hope they help. From
> the graphs it looks like:
> 1. The message deserializer pool is behind, so it may be taking too much
> memory. If the graphs are correct, it gets as high as 10M pending before
> the crash.
> 2. ROW-READ-STAGE has a high number of pending tasks (4k), so first of all
> this isn't good for performance whether or not it caused the OOM, and
> second, it may also have taken up heap space and contributed to the crash.
>
> Thanks!
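On the "is there a formula" question above: I don't think there is an exact one, but a rough back-of-the-envelope estimate sometimes used is memtable throughput x 3 x the number of column families actually taking writes, plus an allowance for Cassandra internals and the key/row caches. A sketch is below; all the numbers in it are example values, not values pulled from your config, and the 64 MB threshold is only the default I recall for MemtableThroughputInMB in storage-conf.xml.

    // Rough back-of-the-envelope heap estimate; a sketch, not an exact formula.
    // All inputs are example values; plug in your own storage-conf settings.
    public class HeapEstimate {
        public static void main(String[] args) {
            int memtableThroughputMB = 64; // memtable flush threshold (example)
            int hotColumnFamilies = 3;     // CFs actually taking writes (example)
            int internalsMB = 1024;        // rough allowance for Cassandra internals
            int keyCacheMB = 50;           // depends on keys cached and key size

            // Each hot CF can hold roughly 3x its memtable threshold in heap:
            // the live memtable, one being flushed, and per-object overhead.
            int estimateMB = 3 * memtableThroughputMB * hotColumnFamilies
                    + internalsMB + keyCacheMB;
            System.out.println("rough heap estimate: " + estimateMB + " MB");
        }
    }

With those example numbers the estimate comes out to roughly 1.6G; the point is just to sanity-check that your memtable and cache settings fit comfortably inside whatever -Xmx you give the JVM.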
> INFO [GC inspection] 2010-05-21 00:53:25,885 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used; max is 4431216640
> INFO [GC inspection] 2010-05-21 00:53:44,605 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9672 ms, 673400 reclaimed leaving 4312337208 used; max is 4431216640
> INFO [GC inspection] 2010-05-21 00:54:23,110 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9150 ms, 402072 reclaimed leaving 4312609776 used; max is 4431216640
> ERROR [ROW-MUTATION-STAGE:19] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[ROW-MUTATION-STAGE:19,5,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [Thread-10] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[Thread-10,5,main]
> java.lang.OutOfMemoryError: Java heap space
> ERROR [CACHETABLE-TIMER-2] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[CACHETABLE-TIMER-2,5,main]
> java.lang.OutOfMemoryError: Java heap space

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com