Three-digit (or larger) pending counts in both ROW-READ-STAGE (RRS) and
ROW-MUTATION-STAGE (RMS) are a danger sign. It looks like you are
I/O-bound on reads, and possibly on writes as well. (Is the commitlog
not on a separate disk?)
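One quick way to answer the commitlog question on a node is to compare the device IDs of the commitlog and data directories. This is a minimal sketch. The commitlog path comes from the log output later in this thread. The data path is an assumption (a common default) and should be read from storage-conf.xml on the real node. The check is one-sided. Equal device IDs mean the two directories share a filesystem. Different IDs only mean different filesystems, and those could still be partitions of one physical disk.

```python
import os

# Paths are assumptions: the commitlog path appears in the logs in this
# thread; the data path is a guessed default, so check storage-conf.xml.
COMMITLOG_DIR = "/var/lib/cassandra/commitlog"
DATA_DIR = "/var/lib/cassandra/data"


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths live on the same filesystem (same st_dev).

    True means commitlog appends and data-file reads compete for the same
    device. False only proves different filesystems, not different disks.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    if os.path.isdir(COMMITLOG_DIR) and os.path.isdir(DATA_DIR):
        if same_device(COMMITLOG_DIR, DATA_DIR):
            print("commitlog shares a filesystem with the data directory")
        else:
            print("commitlog is on a different filesystem")
    else:
        print("directories not found; adjust the paths for this node")
```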

On Mon, Aug 9, 2010 at 10:53 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Mon, Aug 9, 2010 at 8:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> What does tpstats, or other JMX monitoring of the o.a.c.concurrent stages, show?
>>
>> On Mon, Aug 9, 2010 at 4:50 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>> I have a 16-node 0.6.3 cluster, and two of its nodes are giving me
>>> major headaches.
>>>
>>> Address       Status  Load       Range                                     Ring
>>> 10.71.71.56   Up      58.19 GB   108271662202116783829255556910108067277   |   ^
>>> 10.71.71.61   Down    67.77 GB   123739042516704895804863493611552076888   v   |
>>> 10.71.71.66   Up      43.51 GB   127605887595351923798765477786913079296   |   ^
>>> 10.71.71.59   Down    90.22 GB   139206422831293007780471430312996086499   v   |
>>> 10.71.71.65   Up      22.97 GB   148873535527910577765226390751398592512   |   ^
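Each node in the ring owns the range from its predecessor's token (exclusive) up to its own token (inclusive). The sketch below computes each node's share of the ring from the tokens in this excerpt. It assumes RandomPartitioner with a token space of 0 to 2**127, which is an assumption because the partitioner isn't stated in the thread. Only 5 of the 16 nodes appear, so the share of the first listed node (.56) can't be computed. Load also counts replicas of neighbouring ranges and uncompacted SSTables, so it need not track ownership closely.

```python
# Assumption: RandomPartitioner, whose tokens lie in [0, 2**127).
RING_SIZE = 2 ** 127

# (address, token) pairs copied from the ring excerpt above, in ring order.
RING_EXCERPT = [
    ("10.71.71.56", 108271662202116783829255556910108067277),
    ("10.71.71.61", 123739042516704895804863493611552076888),
    ("10.71.71.66", 127605887595351923798765477786913079296),
    ("10.71.71.59", 139206422831293007780471430312996086499),
    ("10.71.71.65", 148873535527910577765226390751398592512),
]


def ownership(ring):
    """Map every node except the first to the fraction of the ring it owns.

    A node owns (previous token, own token], so its share is the gap to its
    predecessor, taken modulo the token space to handle wrap-around.
    """
    shares = {}
    for (_, prev_token), (addr, token) in zip(ring, ring[1:]):
        shares[addr] = ((token - prev_token) % RING_SIZE) / RING_SIZE
    return shares


if __name__ == "__main__":
    for addr, share in ownership(RING_EXCERPT).items():
        print(f"{addr}: {share:.2%} of the ring")
```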
>>>
>>> The symptoms: nodes .61 and .59 have huge commit log directories
>>> (6 GB and up). They keep growing, along with memory usage; eventually
>>> the logs start showing GCInspector messages about long collections,
>>> and then the nodes go OOM:
>>>
>>>  INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
>>>  INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
>>>  INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896
>>>  INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896
>>>  INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896
>>>  INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896
>>>  INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896
>>>  INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896
>>>  INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896
>>>  INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896
>>> java.lang.OutOfMemoryError: Java heap space
>>> Dumping heap to java_pid10913.hprof ...
>>>  INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
>>>  INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
>>>  INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200 reclaimed leaving 9334753480 used; max is 9773776896
>>>  INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.
>>>
>>> Heap dump file created [12730501093 bytes in 253.445 secs]
>>> ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
>>> ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
>>>  INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896
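In the GC lines above, the "used" figure after each ParNew rises by roughly the same amount every time. That means objects are surviving into the old generation faster than anything frees them. The minimal sketch below parses those lines and projects how many more young collections fit before "used" reaches "max". The regex matches the wording of these 0.6 log lines only and is an assumption for other releases. The linear projection is crude, since a CMS cycle can free old-generation garbage, but the ConcurrentMarkSweep lines above reclaimed almost nothing.

```python
import re

# Matches the GCInspector lines quoted above (wording assumed from this log).
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, (?P<reclaimed>\d+) reclaimed "
    r"leaving (?P<used>\d+) used; max is (?P<max>\d+)"
)

# The nine ParNew lines from the log above, one per young collection.
SAMPLE_LOG = """\
 INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
 INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896
 INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896
 INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896
 INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896
 INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896
 INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896
 INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896
 INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896
"""


def parse_gc_lines(lines):
    """Yield (collector, pause_ms, used_bytes, max_bytes) for each GC line."""
    for line in lines:
        m = GC_LINE.search(line)
        if m:
            yield m["collector"], int(m["ms"]), int(m["used"]), int(m["max"])


def heap_trend(lines):
    """Return (used/max after the last GC, mean growth in bytes per GC,
    estimated collections left before used reaches max)."""
    samples = list(parse_gc_lines(lines))
    if len(samples) < 2:
        raise ValueError("need at least two GC lines to measure a trend")
    used = [s[2] for s in samples]
    heap_max = samples[-1][3]
    growth = (used[-1] - used[0]) / (len(used) - 1)
    headroom = heap_max - used[-1]
    collections_left = headroom / growth if growth > 0 else float("inf")
    return used[-1] / heap_max, growth, collections_left


if __name__ == "__main__":
    fraction, growth, left = heap_trend(SAMPLE_LOG.splitlines())
    print(f"heap {fraction:.1%} full after the last ParNew")
    print(f"growing about {growth / 2**20:.0f} MiB per collection")
    print(f"about {left:.1f} collections of headroom at that rate")
```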
>>>
>>> Does anyone have any ideas what is going on?
>>>
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
> Hey guys, thanks for the help. I had lowered my -Xmx from 12 GB to
> 10 GB because I saw:
>
> [r...@cdbsd09 ~]# /usr/local/cassandra/bin/nodetool --host 10.71.71.59 --port 8585 info
> 123739042516704895804863493611552076888
> Load             : 68.91 GB
> Generation No    : 1281407425
> Uptime (seconds) : 1459
> Heap Memory (MB) : 6476.70 / 12261.00
>
> This was happening:
> [r...@cdbsd11 ~]# /usr/local/cassandra/bin/nodetool --host cdbsd09.hadoop.pvt --port 8585 tpstats
> Pool Name                    Active   Pending      Completed
> STREAM-STAGE                      0         0              0
> RESPONSE-STAGE                    0         0          16478
> ROW-READ-STAGE                   64      4014          18190
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         0         0          60290
> GMFD                              0         0            385
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           7526
> ROW-MUTATION-STAGE               64       908         182612
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0              8
> FLUSH-WRITER-POOL                 0         0              8
> AE-SERVICE-STAGE                  0         0              0
> HINTED-HANDOFF-POOL               1         9              6
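The "three-digit pending" rule of thumb from the top of this thread is easy to automate against tpstats output. This is a minimal sketch that assumes the four-column layout shown above (pool name, Active, Pending, Completed). Releases that print additional columns would need the length check adjusted. The threshold of 100 comes straight from that advice. The sample is a subset of the rows above.

```python
# Threshold from the advice above: three-digit pending counts are a danger sign.
PENDING_ALERT = 100

# A subset of the tpstats rows quoted above.
TPSTATS_SAMPLE = """\
Pool Name                    Active   Pending      Completed
RESPONSE-STAGE                    0         0          16478
ROW-READ-STAGE                   64      4014          18190
MESSAGE-DESERIALIZER-POOL         0         0          60290
ROW-MUTATION-STAGE               64       908         182612
HINTED-HANDOFF-POOL               1         9              6
"""


def parse_tpstats(text):
    """Parse tpstats output into {pool: (active, pending, completed)}."""
    pools = {}
    for line in text.splitlines():
        parts = line.split()
        # Data rows are a pool name followed by exactly three integers;
        # the header row has five words and is skipped.
        if len(parts) == 4 and all(p.isdigit() for p in parts[1:]):
            pools[parts[0]] = tuple(int(p) for p in parts[1:])
    return pools


def backed_up(pools, threshold=PENDING_ALERT):
    """Return pools whose pending count is at or above threshold, worst first."""
    hot = [(name, stats[1]) for name, stats in pools.items() if stats[1] >= threshold]
    return [name for name, _ in sorted(hot, key=lambda item: item[1], reverse=True)]


if __name__ == "__main__":
    pools = parse_tpstats(TPSTATS_SAMPLE)
    for name in backed_up(pools):
        active, pending, _ = pools[name]
        print(f"{name}: {active} active, {pending} pending")
```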
>
> After raising the level, I realized I was maxing out the heap. The
> other nodes run fine with a 9 GB -Xmx, but apparently these two nodes
> cannot.
>
> Thanks again.
> Edward
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
