Looks like reads are backing up, which in turn is making the deserializer back up.
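
A quick way to see that relationship is to watch the two pending counts directly. Below is a minimal JMX polling sketch, not taken from this thread: it assumes the pools listed by nodetool tpstats are exposed as MBeans under org.apache.cassandra.concurrent with a PendingTasks attribute, and that JMX listens on the 0.6-era default port 8080 (adjust host/port for your cluster).

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Polls the same pending-task counts that `nodetool tpstats` prints.
    // Assumed MBean layout: org.apache.cassandra.concurrent:type=<POOL-NAME>
    // with a numeric PendingTasks attribute; assumed JMX port: 8080.
    public class PendingTasksProbe {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                String[] pools = { "ROW-READ-STAGE", "MESSAGE-DESERIALIZER-POOL",
                                   "ROW-MUTATION-STAGE" };
                for (String pool : pools) {
                    ObjectName name = new ObjectName(
                            "org.apache.cassandra.concurrent:type=" + pool);
                    Number pending = (Number) mbs.getAttribute(name, "PendingTasks");
                    System.out.printf("%-28s pending=%d%n", pool, pending.longValue());
                }
            } finally {
                connector.close();
            }
        }
    }

Polled once a minute, ROW-READ-STAGE climbing first and MESSAGE-DESERIALIZER-POOL following it would confirm that reads are the bottleneck rather than deserialization itself.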

On Sun, May 23, 2010 at 4:25 AM, Ran Tavory <ran...@gmail.com> wrote:
> Here's tpstats on a server with traffic that I think will get an OOM shortly.
> We have 4k pending reads and 123k pending at MESSAGE-DESERIALIZER-POOL.
> Is there something I can do to prevent that? (Other than adding RAM...)
>
> Pool Name                   Active    Pending    Completed
> FILEUTILS-DELETE-POOL            0          0           55
> STREAM-STAGE                     0          0            6
> RESPONSE-STAGE                   0          0            0
> ROW-READ-STAGE                   8       4088      7537229
> LB-OPERATIONS                    0          0            0
> MESSAGE-DESERIALIZER-POOL        1     123799     22198459
> GMFD                             0          0       471827
> LB-TARGET                        0          0            0
> CONSISTENCY-MANAGER              0          0            0
> ROW-MUTATION-STAGE               0          0     14142351
> MESSAGE-STREAMING-POOL           0          0           16
> LOAD-BALANCER-STAGE              0          0            0
> FLUSH-SORTER-POOL                0          0            0
> MEMTABLE-POST-FLUSHER            0          0          128
> FLUSH-WRITER-POOL                0          0          128
> AE-SERVICE-STAGE                 1          1            8
> HINTED-HANDOFF-POOL              0          0           10
>
> On Sat, May 22, 2010 at 11:05 PM, Ran Tavory <ran...@gmail.com> wrote:
>>
>> The message deserializer has 10m pending tasks before the OOM. What do you
>> think makes the message deserializer blow up? I suspect that when it gets up
>> to 10m pending tasks they may consume a lot of memory, though I don't know
>> how much memory a single task actually takes up. Is there a setting I need
>> to tweak? (Or am I barking up the wrong tree?)
>> I'll add the counters from http://github.com/jbellis/cassandra-munin-plugins,
>> but I already have most of them monitored, so I attached the graphs of the
>> ones that seemed the most suspicious in the previous email.
>> The system keyspace and HH CF don't look too bad, I think; here they are:
>>
>> Keyspace: system
>>     Read Count: 154
>>     Read Latency: 0.875012987012987 ms.
>>     Write Count: 9
>>     Write Latency: 0.20055555555555554 ms.
>>     Pending Tasks: 0
>>         Column Family: LocationInfo
>>         SSTable count: 1
>>         Space used (live): 2714
>>         Space used (total): 2714
>>         Memtable Columns Count: 0
>>         Memtable Data Size: 0
>>         Memtable Switch Count: 3
>>         Read Count: 2
>>         Read Latency: NaN ms.
>>         Write Count: 9
>>         Write Latency: 0.011 ms.
>>         Pending Tasks: 0
>>         Key cache capacity: 1
>>         Key cache size: 1
>>         Key cache hit rate: NaN
>>         Row cache: disabled
>>         Compacted row minimum size: 203
>>         Compacted row maximum size: 397
>>         Compacted row mean size: 300
>>
>>         Column Family: HintsColumnFamily
>>         SSTable count: 1
>>         Space used (live): 1457
>>         Space used (total): 4371
>>         Memtable Columns Count: 0
>>         Memtable Data Size: 0
>>         Memtable Switch Count: 0
>>         Read Count: 152
>>         Read Latency: 0.369 ms.
>>         Write Count: 0
>>         Write Latency: NaN ms.
>>         Pending Tasks: 0
>>         Key cache capacity: 1
>>         Key cache size: 1
>>         Key cache hit rate: 0.07142857142857142
>>         Row cache: disabled
>>         Compacted row minimum size: 829
>>         Compacted row maximum size: 829
>>         Compacted row mean size: 829
>>
>> On Sat, May 22, 2010 at 4:14 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> Can you monitor cassandra-level metrics like the ones in
>>> http://github.com/jbellis/cassandra-munin-plugins ?
>>>
>>> The usual culprit is compaction, but your compacted row size is small.
>>> Nothing else really comes to mind.
>>>
>>> (You should check the system keyspace too, though; HH rows can get large.)
>>>
>>> On Fri, May 21, 2010 at 2:36 PM, Ran Tavory <ran...@gmail.com> wrote:
>>> > I see some OOMs on one of the hosts in the cluster, and I wonder if
>>> > there's a formula that'll help me calculate the required memory setting
>>> > given the parameters x, y, z...
>>> > In short, I need advice on:
>>> > 1. How to set up proper heap space, and which parameters I should look at
>>> > when doing so.
>>> > 2. Help setting up an alert policy and defining some countermeasures or
>>> > SOS steps an admin can take to prevent further degradation of service
>>> > when alerts fire.
>>> > The OOM is at the row mutation stage and it happens after extensive GC
>>> > activity (log tail below).
>>> > The server has 16G of physical RAM and a 4G Java heap. No other
>>> > significant processes run on the same server. I actually upped the Java
>>> > heap to 8G, but it OOMed again...
>>> > Most of my settings are the defaults, with a few keyspaces and a few CFs
>>> > in each KS. Here's the output of cfstats for the largest and most heavily
>>> > used CF (currently reads/writes are stopped, but the data is there).
>>> >
>>> > Keyspace: outbrain_kvdb
>>> >     Read Count: 3392
>>> >     Read Latency: 160.33135908018866 ms.
>>> >     Write Count: 2005839
>>> >     Write Latency: 0.029233923061621595 ms.
>>> >     Pending Tasks: 0
>>> >         Column Family: KvImpressions
>>> >         SSTable count: 8
>>> >         Space used (live): 21923629878
>>> >         Space used (total): 21923629878
>>> >         Memtable Columns Count: 69440
>>> >         Memtable Data Size: 9719364
>>> >         Memtable Switch Count: 26
>>> >         Read Count: 3392
>>> >         Read Latency: NaN ms.
>>> >         Write Count: 1998821
>>> >         Write Latency: 0.018 ms.
>>> >         Pending Tasks: 0
>>> >         Key cache capacity: 200000
>>> >         Key cache size: 11661
>>> >         Key cache hit rate: NaN
>>> >         Row cache: disabled
>>> >         Compacted row minimum size: 302
>>> >         Compacted row maximum size: 22387
>>> >         Compacted row mean size: 641
>>> >
>>> > I'm also attaching a few graphs of "the incident"; I hope they help.
>>> > From the graphs it looks like:
>>> > 1. The message deserializer pool is behind, so it may be taking up too
>>> > much memory. If the graphs are correct, it gets as high as 10m pending
>>> > before the crash.
>>> > 2. The row-read-stage has a high number of pending tasks (4k), so first
>>> > of all this isn't good for performance whether it caused the OOM or not,
>>> > and second, it may also have taken up heap space and caused the crash.
>>> > Thanks!
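
A rough back-of-envelope suggests how a deserializer backlog of that size could account for the heap on its own. The per-task byte counts below are illustrative assumptions, not measurements from this cluster:

    // Rough estimate of memory held by a MESSAGE-DESERIALIZER-POOL backlog.
    // The per-task figures are assumed for illustration, not measured.
    public class PendingMemoryEstimate {
        public static void main(String[] args) {
            long pendingTasks = 10000000L;      // ~10m pending, per the graphs
            long perTaskOverheadBytes = 100;    // assumed queue entry + task object overhead
            long perMessageBodyBytes = 300;     // assumed average serialized message size
            long totalBytes = pendingTasks * (perTaskOverheadBytes + perMessageBodyBytes);
            System.out.printf("~%.1f GB held by the pending queue alone%n",
                    totalBytes / (1024.0 * 1024 * 1024));
        }
    }

With those assumptions the queue alone holds roughly 3.7 GB, which is most of a 4G heap before memtables, key caches, and the 4k in-flight reads are counted.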
>>> >
>>> >  INFO [GC inspection] 2010-05-21 00:53:25,885 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed leaving 4312064504 used; max is 4431216640
>>> >  INFO [GC inspection] 2010-05-21 00:53:44,605 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9672 ms, 673400 reclaimed leaving 4312337208 used; max is 4431216640
>>> >  INFO [GC inspection] 2010-05-21 00:54:23,110 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 9150 ms, 402072 reclaimed leaving 4312609776 used; max is 4431216640
>>> > ERROR [ROW-MUTATION-STAGE:19] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[ROW-MUTATION-STAGE:19,5,main]
>>> > java.lang.OutOfMemoryError: Java heap space
>>> > ERROR [Thread-10] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[Thread-10,5,main]
>>> > java.lang.OutOfMemoryError: Java heap space
>>> > ERROR [CACHETABLE-TIMER-2] 2010-05-21 01:55:37,951 CassandraDaemon.java (line 88) Fatal exception in thread Thread[CACHETABLE-TIMER-2,5,main]
>>> > java.lang.OutOfMemoryError: Java heap space
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>>
>

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
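
The GC lines quoted above already point at the root cause: each ConcurrentMarkSweep cycle takes around 10 seconds, reclaims less than 1 MB, and leaves roughly 97% of the maximum heap in use (4312064504 of 4431216640 bytes), so the live set simply no longer fits and the OutOfMemoryError follows. A small sketch of turning that pattern into an alert, assuming only the GCInspector line format shown in this log:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Flags ConcurrentMarkSweep collections that leave the heap nearly full,
    // the pattern that precedes the OOM in the log above. The line format is
    // assumed to match the GCInspector output quoted in this thread.
    public class GcLogCheck {
        private static final Pattern CMS_LINE = Pattern.compile(
                "GC for ConcurrentMarkSweep: (\\d+) ms, (\\d+) reclaimed leaving (\\d+) used; max is (\\d+)");

        public static void main(String[] args) {
            String sample = "GC for ConcurrentMarkSweep: 10819 ms, 939992 reclaimed "
                    + "leaving 4312064504 used; max is 4431216640";
            Matcher m = CMS_LINE.matcher(sample);
            if (m.find()) {
                long durationMs = Long.parseLong(m.group(1));
                long reclaimed = Long.parseLong(m.group(2));
                long usedAfter = Long.parseLong(m.group(3));
                long max = Long.parseLong(m.group(4));
                double occupancy = (double) usedAfter / max;
                System.out.printf("CMS took %d ms, reclaimed %d bytes, heap %.1f%% full afterwards%n",
                        durationMs, reclaimed, occupancy * 100);
                if (occupancy > 0.90) {
                    System.out.println("ALERT: live set no longer fits in the heap");
                }
            }
        }
    }

For the first quoted line this prints an occupancy of about 97.3%, well past any reasonable alert threshold.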