Having 3-digit pending counts in both RRS (ROW-READ-STAGE) and RMS (ROW-MUTATION-STAGE) is a danger sign. It looks like you are I/O-bound on reads, and possibly on writes as well. (Is the commitlog not on a separate disk?)
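If you want to confirm that before changing anything, something like the following is a quick check. (A rough sketch -- the device output you care about and the /usr/local/cassandra path are assumptions based on this thread's install layout; adjust for your boxes.)

  # Watch per-device utilization: a data disk pinned near 100 %util with a
  # deep queue (avgqu-sz) while reads dominate means you are read-bound.
  iostat -x 5

  # In 0.6 the commitlog location is set in conf/storage-conf.xml; it should
  # point at its own spindle, separate from the DataFileDirectory volumes.
  grep -E -A 1 'CommitLogDirectory|DataFileDirectory' /usr/local/cassandra/conf/storage-conf.xml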
On Mon, Aug 9, 2010 at 10:53 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> On Mon, Aug 9, 2010 at 8:20 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> What does tpstats or other JMX monitoring of the o.a.c.concurrent
>> stages show?
>>
>> On Mon, Aug 9, 2010 at 4:50 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>> I have a 16-node 0.6.3 cluster, and two nodes are giving me major
>>> headaches.
>>>
>>> 10.71.71.56  Up    58.19 GB  108271662202116783829255556910108067277  |   ^
>>> 10.71.71.61  Down  67.77 GB  123739042516704895804863493611552076888  v   |
>>> 10.71.71.66  Up    43.51 GB  127605887595351923798765477786913079296  |   ^
>>> 10.71.71.59  Down  90.22 GB  139206422831293007780471430312996086499  v   |
>>> 10.71.71.65  Up    22.97 GB  148873535527910577765226390751398592512  |   ^
>>>
>>> The symptom I am seeing is that nodes 61 and 59 have huge (6 GB+)
>>> commitlog directories. They keep growing, along with memory usage;
>>> eventually the logs start showing GCInspection messages and then the
>>> nodes go OOM:
>>>
>>> INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
>>> INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
>>> INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896
>>> INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896
>>> INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896
>>> INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896
>>> INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896
>>> INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896
>>> INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896
>>> INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896
>>> java.lang.OutOfMemoryError: Java heap space
>>> Dumping heap to java_pid10913.hprof ...
>>> INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
>>> INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
>>> INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200 reclaimed leaving 9334753480 used; max is 9773776896
>>> INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.
>>>
>>> Heap dump file created [12730501093 bytes in 253.445 secs]
>>> ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
>>> ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
>>> INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896
>>>
>>> Does anyone have any ideas what is going on?
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
> Hey guys, thanks for the help.
> I had lowered my Xmx from 12GB to 10GB because I saw:
>
> [r...@cdbsd09 ~]# /usr/local/cassandra/bin/nodetool --host 10.71.71.59 --port 8585 info
> 123739042516704895804863493611552076888
> Load             : 68.91 GB
> Generation No    : 1281407425
> Uptime (seconds) : 1459
> Heap Memory (MB) : 6476.70 / 12261.00
>
> This was happening:
>
> [r...@cdbsd11 ~]# /usr/local/cassandra/bin/nodetool --host cdbsd09.hadoop.pvt --port 8585 tpstats
> Pool Name                    Active   Pending   Completed
> STREAM-STAGE                      0         0           0
> RESPONSE-STAGE                    0         0       16478
> ROW-READ-STAGE                   64      4014       18190
> LB-OPERATIONS                     0         0           0
> MESSAGE-DESERIALIZER-POOL         0         0       60290
> GMFD                              0         0         385
> LB-TARGET                         0         0           0
> CONSISTENCY-MANAGER               0         0        7526
> ROW-MUTATION-STAGE               64       908      182612
> MESSAGE-STREAMING-POOL            0         0           0
> LOAD-BALANCER-STAGE               0         0           0
> FLUSH-SORTER-POOL                 0         0           0
> MEMTABLE-POST-FLUSHER             0         0           8
> FLUSH-WRITER-POOL                 0         0           8
> AE-SERVICE-STAGE                  0         0           0
> HINTED-HANDOFF-POOL               1         9           6
>
> After raising the level, I realized I was maxing out the heap. The other
> nodes are running fine with Xmx 9GB, but I guess these nodes cannot.
>
> Thanks again.
> Edward

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
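For anyone who hits the same OOM pattern later: the heap change Edward describes is a one-line edit. A minimal sketch, assuming the stock 0.6 layout where the heap flags live in the JVM_OPTS block of bin/cassandra.in.sh (the 10G figure is just the value from this thread, not a recommendation -- size it to what the node can actually afford):

  # bin/cassandra.in.sh -- pin min and max heap to the same value so the
  # JVM does not spend time resizing the heap under load
  JVM_OPTS="$JVM_OPTS -Xms10G -Xmx10G"

  # after a restart, verify what the running JVM actually got
  /usr/local/cassandra/bin/nodetool --host 10.71.71.59 --port 8585 info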