Is there another solution besides adding capacity? How does ConcurrentReads (default 8) affect this? If I expect a similar number of reads and writes, should I set ConcurrentReads equal to ConcurrentWrites (default 32)?
Thanks.
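For reference, this is roughly how the two settings appear in storage-conf.xml. The values below are just the defaults mentioned above, and the exact element names and layout may vary between versions, so treat this as a sketch rather than our actual config:

  <!-- Excerpt (0.6-era layout, stock defaults). As I understand it, ConcurrentReads
       caps how many reads ROW-READ-STAGE runs in parallel, and ConcurrentWrites
       does the same for ROW-MUTATION-STAGE. -->
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>

My (possibly wrong) understanding of why the defaults differ: a write only appends to the commitlog and updates the memtable, while a read may have to hit several SSTables on disk, so ConcurrentReads is usually sized to what the disks can serve concurrently rather than simply matched to ConcurrentWrites. There is also a rough JMX alerting sketch at the very end of this mail, after the quoted thread.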
On Sun, May 23, 2010 at 5:43 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Looks like reads are backing up, which in turn is making deserialize back up.
>
> On Sun, May 23, 2010 at 4:25 AM, Ran Tavory <ran...@gmail.com> wrote:
> > Here's tpstats on a server with traffic that I think will get an OOM shortly.
> > We have 4k pending reads and 123k pending at MESSAGE-DESERIALIZER-POOL.
> > Is there something I can do to prevent that? (other than adding RAM...)
> >
> > Pool Name                    Active   Pending   Completed
> > FILEUTILS-DELETE-POOL             0         0          55
> > STREAM-STAGE                      0         0           6
> > RESPONSE-STAGE                    0         0           0
> > ROW-READ-STAGE                    8      4088     7537229
> > LB-OPERATIONS                     0         0           0
> > MESSAGE-DESERIALIZER-POOL         1    123799    22198459
> > GMFD                              0         0      471827
> > LB-TARGET                         0         0           0
> > CONSISTENCY-MANAGER               0         0           0
> > ROW-MUTATION-STAGE                0         0    14142351
> > MESSAGE-STREAMING-POOL            0         0          16
> > LOAD-BALANCER-STAGE               0         0           0
> > FLUSH-SORTER-POOL                 0         0           0
> > MEMTABLE-POST-FLUSHER             0         0         128
> > FLUSH-WRITER-POOL                 0         0         128
> > AE-SERVICE-STAGE                  1         1           8
> > HINTED-HANDOFF-POOL               0         0          10
> >
> > On Sat, May 22, 2010 at 11:05 PM, Ran Tavory <ran...@gmail.com> wrote:
> >> The message deserializer has 10m pending tasks before the OOM. What do you
> >> think makes the message deserializer blow up? I suspect that when it goes up
> >> to 10m pending tasks they may consume a lot of memory, though I don't know
> >> how much memory a single task actually takes up. Is there a setting I need
> >> to tweak? (Or am I barking up the wrong tree?)
> >> I'll add the counters from http://github.com/jbellis/cassandra-munin-plugins,
> >> but I already have most of them monitored, so I attached graphs of the ones
> >> that seemed the most suspicious in the previous email.
> >> The system keyspace and HH CF don't look too bad, I think; here they are:
> >>
> >> Keyspace: system
> >>   Read Count: 154
> >>   Read Latency: 0.875012987012987 ms.
> >>   Write Count: 9
> >>   Write Latency: 0.20055555555555554 ms.
> >>   Pending Tasks: 0
> >>     Column Family: LocationInfo
> >>     SSTable count: 1
> >>     Space used (live): 2714
> >>     Space used (total): 2714
> >>     Memtable Columns Count: 0
> >>     Memtable Data Size: 0
> >>     Memtable Switch Count: 3
> >>     Read Count: 2
> >>     Read Latency: NaN ms.
> >>     Write Count: 9
> >>     Write Latency: 0.011 ms.
> >>     Pending Tasks: 0
> >>     Key cache capacity: 1
> >>     Key cache size: 1
> >>     Key cache hit rate: NaN
> >>     Row cache: disabled
> >>     Compacted row minimum size: 203
> >>     Compacted row maximum size: 397
> >>     Compacted row mean size: 300
> >>
> >>     Column Family: HintsColumnFamily
> >>     SSTable count: 1
> >>     Space used (live): 1457
> >>     Space used (total): 4371
> >>     Memtable Columns Count: 0
> >>     Memtable Data Size: 0
> >>     Memtable Switch Count: 0
> >>     Read Count: 152
> >>     Read Latency: 0.369 ms.
> >>     Write Count: 0
> >>     Write Latency: NaN ms.
> >>     Pending Tasks: 0
> >>     Key cache capacity: 1
> >>     Key cache size: 1
> >>     Key cache hit rate: 0.07142857142857142
> >>     Row cache: disabled
> >>     Compacted row minimum size: 829
> >>     Compacted row maximum size: 829
> >>     Compacted row mean size: 829
> >>
> >> On Sat, May 22, 2010 at 4:14 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>> Can you monitor Cassandra-level metrics like the ones in
> >>> http://github.com/jbellis/cassandra-munin-plugins ?
> >>>
> >>> The usual culprit is compaction, but your compacted row size is
> >>> small. Nothing else really comes to mind.
> >>>
> >>> (You should check the system keyspace too, though; HH rows can get large.)
> >>>
> >>> On Fri, May 21, 2010 at 2:36 PM, Ran Tavory <ran...@gmail.com> wrote:
> >>> > I see some OOMs on one of the hosts in the cluster, and I wonder if there's
> >>> > a formula that'll help me calculate the required memory setting given the
> >>> > parameters x, y, z...
> >>> > In short, I need advice on:
> >>> > 1. How to set up proper heap space and which parameters I should look at
> >>> > when doing so.
> >>> > 2. Help setting up an alert policy and defining countermeasures or SOS
> >>> > steps an admin can take to prevent further degradation of service when
> >>> > alerts fire.
> >>> > The OOM is at the row mutation stage and it happens after extensive GC
> >>> > activity (log tail below).
> >>> > The server has 16G physical RAM and a 4G Java heap. No other significant
> >>> > processes run on the same server. I actually upped the Java heap to 8G,
> >>> > but it OOMed again...
> >>> > Most of my settings are the defaults, with a few keyspaces and a few CFs
> >>> > in each KS. Here's the output of cfstats for the largest and most heavily
> >>> > used CF (currently reads/writes are stopped but the data is there).
> >>> >
> >>> > Keyspace: outbrain_kvdb
> >>> >   Read Count: 3392
> >>> >   Read Latency: 160.33135908018866 ms.
> >>> >   Write Count: 2005839
> >>> >   Write Latency: 0.029233923061621595 ms.
> >>> >   Pending Tasks: 0
> >>> >     Column Family: KvImpressions
> >>> >     SSTable count: 8
> >>> >     Space used (live): 21923629878
> >>> >     Space used (total): 21923629878
> >>> >     Memtable Columns Count: 69440
> >>> >     Memtable Data Size: 9719364
> >>> >     Memtable Switch Count: 26
> >>> >     Read Count: 3392
> >>> >     Read Latency: NaN ms.
> >>> >     Write Count: 1998821
> >>> >     Write Latency: 0.018 ms.
> >>> >     Pending Tasks: 0
> >>> >     Key cache capacity: 200000
> >>> >     Key cache size: 11661
> >>> >     Key cache hit rate: NaN
> >>> >     Row cache: disabled
> >>> >     Compacted row minimum size: 302
> >>> >     Compacted row maximum size: 22387
> >>> >     Compacted row mean size: 641
> >>> >
> >>> > I'm also attaching a few graphs of "the incidents"; I hope they help.
> >>> > From the graphs it looks like:
> >>> > 1. The message deserializer pool is behind, so it may be taking too much
> >>> > memory. If the graphs are correct, it gets as high as 10m pending before
> >>> > the crash.
> >>> > 2. row-read-stage has a high number of pending tasks (4k), so first of
> >>> > all this isn't good for performance whether it caused the OOM or not,
> >>> > and second, this may also have taken up heap space and caused the crash.
> >>> > Thanks!
> >>> > INFO [GC inspection] 2010-05-21 00:53:25,885 GCInspector.java (line 110) GC
> >>> > for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used;
> >>> > max is 4431216640
> >>> > INFO [GC inspection] 2010-05-21 00:53:44,605 GCInspector.java (line 110) GC
> >>> > for ConcurrentMarkSweep: 9672 ms, 673400 reclaimed leaving 4312337208 used;
> >>> > max is 4431216640
> >>> > INFO [GC inspection] 2010-05-21 00:54:23,110 GCInspector.java (line 110) GC
> >>> > for ConcurrentMarkSweep: 9150 ms, 402072 reclaimed leaving 4312609776 used;
> >>> > max is 4431216640
> >>> > ERROR [ROW-MUTATION-STAGE:19] 2010-05-21 01:55:37,951 CassandraDaemon.java
> >>> > (line 88) Fatal exception in thread Thread[ROW-MUTATION-STAGE:19,5,main]
> >>> > java.lang.OutOfMemoryError: Java heap space
> >>> > ERROR [Thread-10] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88)
> >>> > Fatal exception in thread Thread[Thread-10,5,main]
> >>> > java.lang.OutOfMemoryError: Java heap space
> >>> > ERROR [CACHETABLE-TIMER-2] 2010-05-21 01:55:37,951 CassandraDaemon.java
> >>> > (line 88) Fatal exception in thread Thread[CACHETABLE-TIMER-2,5,main]
> >>> > java.lang.OutOfMemoryError: Java heap space
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> Project Chair, Apache Cassandra
> >>> co-founder of Riptano, the source for professional Cassandra support
> >>> http://riptano.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
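P.S. On the alert-policy part of my first mail (quoted above): as a stopgap I plan to poll the same pending-task counters that nodetool tpstats prints, over JMX, and alert once any stage's backlog crosses a threshold. Rough sketch below; the MBean domain (org.apache.cassandra.concurrent), the PendingTasks attribute name and the default JMX port 8080 are assumptions on my side for 0.6, so please correct me if any of that is off:

  import java.util.Set;

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class PendingTasksAlert {
      // Rough per-stage threshold; a backlog this deep usually means trouble is building.
      private static final long PENDING_THRESHOLD = 1000;

      public static void main(String[] args) throws Exception {
          String host = args.length > 0 ? args[0] : "localhost";
          // Assumed 0.6 default JMX port (8080); adjust to whatever cassandra.in.sh sets.
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getMBeanServerConnection();
              // Assumed: the stages tpstats shows (ROW-READ-STAGE,
              // MESSAGE-DESERIALIZER-POOL, ...) are registered under this JMX domain.
              Set<ObjectName> stages = mbs.queryNames(
                      new ObjectName("org.apache.cassandra.concurrent:*"), null);
              for (ObjectName stage : stages) {
                  long pending =
                          ((Number) mbs.getAttribute(stage, "PendingTasks")).longValue();
                  if (pending > PENDING_THRESHOLD) {
                      // Wire this into nagios/munin/mail instead of stdout.
                      System.out.println("ALERT: " + stage.getKeyProperty("type")
                              + " has " + pending + " pending tasks");
                  }
              }
          } finally {
              connector.close();
          }
      }
  }

The threshold is arbitrary; the point is to get alerted while pending counts are still in the hundreds, rather than discovering 4k pending reads and 123k pending deserializer tasks once the node is already deep in GC.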