Here's tpstats output from a server under traffic that I think will OOM shortly.
We have 4k pending reads and 123k pending tasks in MESSAGE-DESERIALIZER-POOL.

Is there something I can do to prevent that? (other than adding RAM...)

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0             55
STREAM-STAGE                      0         0              6
RESPONSE-STAGE                    0         0              0
ROW-READ-STAGE                    8      4088        7537229
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1    123799       22198459
GMFD                              0         0         471827
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                0         0       14142351
MESSAGE-STREAMING-POOL            0         0             16
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            128
FLUSH-WRITER-POOL                 0         0            128
AE-SERVICE-STAGE                  1         1              8
HINTED-HANDOFF-POOL               0         0             10
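
For monitoring, I'm polling the pending counts over JMX with something like
the sketch below. The localhost:8080 JMX endpoint and the
org.apache.cassandra.concurrent:type=<STAGE> MBean names are what I see in
jconsole on 0.6, so treat both as assumptions and verify them on your build:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed: Cassandra 0.6's default JMX port; adjust to your setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
        String[] stages = {"ROW-READ-STAGE", "MESSAGE-DESERIALIZER-POOL"};
        long threshold = 10000; // alert long before the queue hits millions
        for (String stage : stages) {
            // Assumed MBean naming; confirm the exact names with jconsole.
            ObjectName name = new ObjectName(
                    "org.apache.cassandra.concurrent:type=" + stage);
            long pending = ((Number) mbs.getAttribute(name, "PendingTasks")).longValue();
            System.out.printf("%-28s pending=%d%s%n", stage, pending,
                    pending > threshold ? "  ALERT" : "");
        }
        jmxc.close();
    }
}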


On Sat, May 22, 2010 at 11:05 PM, Ran Tavory <ran...@gmail.com> wrote:

> The message deserializer has 10m pending tasks before the OOM. What do you
> think makes the message deserializer blow up? I suspect that when it climbs
> to 10m pending tasks they consume a lot of memory, though I don't know how
> much memory a single task actually takes up. Is there a setting I need to
> tweak? (or am I barking up the wrong tree?)
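>
> Back-of-envelope with assumed numbers (I haven't measured the real per-task
> footprint, so treat this as a sketch, not a measurement):
>
> // All figures assumed for illustration only.
> long pendingTasks = 10000000L; // pending count seen before the OOM
> long bytesPerTask = 400;       // guess: serialized message + object overhead
> double gb = pendingTasks * bytesPerTask / 1e9;
> System.out.printf("~%.1f GB held by the deserializer queue%n", gb);
> // ~4.0 GB, i.e. the queue alone could fill our 4 GB heap
>
> If that's even roughly right, bounding the queue or throttling clients
> matters more than any single tuning knob.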
>
> I'll add the counters from
> http://github.com/jbellis/cassandra-munin-plugins but I already have most
> of them monitored, so I attached the graphs of the ones that seemed the most
> suspicious in the previous email.
>
> The system keyspace and the HH CF don't look too bad, I think; here they are:
>
> Keyspace: system
>         Read Count: 154
>         Read Latency: 0.875012987012987 ms.
>         Write Count: 9
>         Write Latency: 0.20055555555555554 ms.
>         Pending Tasks: 0
>                 Column Family: LocationInfo
>                 SSTable count: 1
>                 Space used (live): 2714
>                 Space used (total): 2714
>                 Memtable Columns Count: 0
>                 Memtable Data Size: 0
>                 Memtable Switch Count: 3
>                 Read Count: 2
>                 Read Latency: NaN ms.
>                 Write Count: 9
>                 Write Latency: 0.011 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 1
>                 Key cache size: 1
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 203
>                 Compacted row maximum size: 397
>                 Compacted row mean size: 300
>
>                 Column Family: HintsColumnFamily
>                 SSTable count: 1
>                 Space used (live): 1457
>                 Space used (total): 4371
>                 Memtable Columns Count: 0
>                 Memtable Data Size: 0
>                 Memtable Switch Count: 0
>                 Read Count: 152
>                 Read Latency: 0.369 ms.
>                 Write Count: 0
>                 Write Latency: NaN ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 1
>                 Key cache size: 1
>                 Key cache hit rate: 0.07142857142857142
>                 Row cache: disabled
>                 Compacted row minimum size: 829
>                 Compacted row maximum size: 829
>                 Compacted row mean size: 829
>
>
>
>
>
> On Sat, May 22, 2010 at 4:14 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Can you monitor cassandra-level metrics like the ones in
>> http://github.com/jbellis/cassandra-munin-plugins ?
>>
>> The usual culprit is compaction, but your compacted row size is small.
>> Nothing else really comes to mind.
>>
>> (you should check the system keyspace too, though; HH rows can get large)
>>
>> On Fri, May 21, 2010 at 2:36 PM, Ran Tavory <ran...@gmail.com> wrote:
>> > I see occasional OOMs on one of the hosts in the cluster, and I wonder if
>> > there's a formula that will help me calculate the required memory setting
>> > given parameters x, y, z...
>> > In short, I need advice on:
>> > 1. How to set a proper heap size, and which parameters to look at when
>> > doing so (see the sketch after this list).
>> > 2. How to set up an alert policy, and how to define countermeasures or SOS
>> > steps an admin can take to prevent further degradation of service when
>> > alerts fire.
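>> >
>> > For (1), a rule of thumb I've seen on the wiki (please correct me if it's
>> > outdated) is roughly memtable throughput x 3 x number of hot CFs, plus
>> > about 1GB of internal overhead, plus whatever the caches hold. As a
>> > sketch, with every input assumed for illustration:
>> >
>> > // Plug in your own storage-conf.xml values; these are made up.
>> > int memtableThroughputMB = 64;   // MemtableThroughputInMB per hot CF
>> > int hotColumnFamilies = 4;       // CFs taking sustained writes
>> > long cacheBytes = 200000L * 100; // assumed ~100 bytes per cached key
>> > long heap = (long) memtableThroughputMB * 3 * hotColumnFamilies * 1048576
>> >         + (1L << 30)             // ~1 GB internal overhead
>> >         + cacheBytes;
>> > System.out.printf("suggested minimum heap: ~%.1f GB%n", heap / 1e9);
>> > // ~1.9 GB for these inputs; real clusters need headroom on top.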
>> > The OOM is at the row mutation stage, and it happens after extensive GC
>> > activity (log tail below).
>> > The server has 16G of physical RAM and a 4G Java heap. No other
>> > significant processes run on the same server. I actually upped the Java
>> > heap to 8G, but it OOMed again...
>> > Most of my settings are the defaults, with a few keyspaces and a few CFs
>> > in each KS. Here's the output of cfstats for the largest and most heavily
>> > used CF. (Currently reads/writes are stopped, but the data is there.)
>> > Keyspace: outbrain_kvdb
>> >         Read Count: 3392
>> >         Read Latency: 160.33135908018866 ms.
>> >         Write Count: 2005839
>> >         Write Latency: 0.029233923061621595 ms.
>> >         Pending Tasks: 0
>> >                 Column Family: KvImpressions
>> >                 SSTable count: 8
>> >                 Space used (live): 21923629878
>> >                 Space used (total): 21923629878
>> >                 Memtable Columns Count: 69440
>> >                 Memtable Data Size: 9719364
>> >                 Memtable Switch Count: 26
>> >                 Read Count: 3392
>> >                 Read Latency: NaN ms.
>> >                 Write Count: 1998821
>> >                 Write Latency: 0.018 ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 200000
>> >                 Key cache size: 11661
>> >                 Key cache hit rate: NaN
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 302
>> >                 Compacted row maximum size: 22387
>> >                 Compacted row mean size: 641
>> > I'm also attaching a few graphs of "the incident"; I hope they help. From
>> > the graphs it looks like:
>> > 1. The message deserializer pool is behind, so it may be taking too much
>> > memory. If the graphs are correct, it gets as high as 10m pending tasks
>> > before the crash.
>> > 2. ROW-READ-STAGE has a high number of pending reads (4k), so first of
>> > all this is bad for performance whether or not it caused the OOM, and
>> > second, it may also have taken up heap space and contributed to the crash
>> > (a client-side throttle sketch follows below).
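>> >
>> > Since the server doesn't seem to bound these queues, one stopgap could be
>> > throttling in-flight reads on our side. A minimal client-side sketch; the
>> > class and the doRead() call are hypothetical, not from any real client
>> > library:
>> >
>> > import java.util.concurrent.Semaphore;
>> >
>> > // Caps concurrent reads so a slow node builds backpressure in the
>> > // client instead of a 4k-deep ROW-READ-STAGE queue on the server.
>> > class ThrottledReader {
>> >     private final Semaphore permits = new Semaphore(64); // assumed cap
>> >
>> >     byte[] read(String key) throws InterruptedException {
>> >         permits.acquire();
>> >         try {
>> >             return doRead(key); // the real Thrift/client call goes here
>> >         } finally {
>> >             permits.release();
>> >         }
>> >     }
>> >
>> >     private byte[] doRead(String key) { return new byte[0]; } // placeholder
>> > }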
>> > Thanks!
>> >  INFO [GC inspection] 2010-05-21 00:53:25,885 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used; max is 4431216640
>> >  INFO [GC inspection] 2010-05-21 00:53:44,605 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9672 ms, 673400 reclaimed leaving 4312337208 used; max is 4431216640
>> >  INFO [GC inspection] 2010-05-21 00:54:23,110 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9150 ms, 402072 reclaimed leaving 4312609776 used; max is 4431216640
>> > ERROR [ROW-MUTATION-STAGE:19] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[ROW-MUTATION-STAGE:19,5,main]
>> > java.lang.OutOfMemoryError: Java heap space
>> > ERROR [Thread-10] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[Thread-10,5,main]
>> > java.lang.OutOfMemoryError: Java heap space
>> > ERROR [CACHETABLE-TIMER-2] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[CACHETABLE-TIMER-2,5,main]
>> > java.lang.OutOfMemoryError: Java heap space
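>> >
>> > Doing the arithmetic on those GCInspector lines:
>> >
>> > // Numbers copied from the last CMS line above.
>> > long used = 4312609776L;  // heap used after the collection
>> > long max = 4431216640L;   // max heap
>> > long reclaimed = 402072;  // bytes freed by a ~9-second CMS pause
>> > System.out.printf("heap %.1f%% full, %.2f MB reclaimed per cycle%n",
>> >         100.0 * used / max, reclaimed / 1048576.0);
>> > // ~97.3% full with under half a MB reclaimed per cycle: the heap is
>> > // effectively exhausted well before the OutOfMemoryError fires.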
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>
