Sorry, correction, we're using 0.8.1.

On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
> Hi,
>
> We're noticing that when one node gets hot (very high cpu usage) because of
> 'nodetool repair', the whole cluster's performance becomes really bad.
>
> We're using 0.8.1 with random partition. We have 6 nodes with RF 5. Our
> repair is scheduled to run once a week, spread across whole cluster. I do get
> suggestion from Jonothan that 0.8.0 has some bug on the repair, but wondering
> why one hot node would slow down the whole cluster.
>
> We saw this symptom yesterday on one node, and today on the adjacent node.
> Most probably it'll happen on the next one tomorrow.
>
> We do see lots of (~200) RejectedExecutionException 3 hours before the repair
> job, and also in the middle of the repair job, not sure whether they're
> related. Full stack is attached in the end.
>
> Do we have any suggestion/hint?
>
> Thanks,
> Hefeng
>
>
> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>     at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>     at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>     at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>     at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>     at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>     at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>     at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>     at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>     at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>     at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>     at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
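
For anyone reading the traces: both go through ThreadPoolExecutor.execute -> reject -> a rejection handler, which is the path the JDK takes when a task is handed to an executor that has already been shut down. Below is a minimal standalone sketch of that behaviour using only java.util.concurrent. The class name, the handler, and the "queue full" branch are made up for illustration; this is not Cassandra's DebuggableThreadPoolExecutor code, and it only reuses the message text seen in the log.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdown {
    // Hypothetical handler (an assumption, not Cassandra's implementation):
    // it reports a shut-down pool with the same message text as the log above.
    static final RejectedExecutionHandler HANDLER = (task, executor) -> {
        if (executor.isShutdown()) {
            throw new RejectedExecutionException("ThreadPoolExecutor has shut down");
        }
        throw new RejectedExecutionException("queue full");
    };

    // Shuts a pool down, then submits a task to it and returns either
    // "accepted" or the message of the resulting rejection.
    static String submitAfterShutdown() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), HANDLER);
        pool.shutdown(); // no new tasks are accepted from here on
        try {
            pool.execute(() -> { });
            return "accepted";
        } catch (RejectedExecutionException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(submitAfterShutdown());
    }
}
```

If that reading is right, the executors behind StorageProxy.insertLocal and MessagingService.receive were already shut down when those writes and messages arrived, which may or may not be related to the repair itself.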