wrt the Exception: something has shut down the Mutation thread pool. The only things I can see in the code that do this are nodetool drain and running the embedded server. If it was drain you should see an INFO level message "Node is drained" somewhere in the log. Could either of these things be happening?
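A quick way to check the drain possibility is to grep the log for that message, e.g. (assuming the default packaged log location, adjust the path for your install):

    grep "Node is drained" /var/log/cassandra/system.log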
wrt the slow down:

- what CL are you using for reads and writes? What does the ring look like?
- have a look at nodetool tpstats to see which stage is backing up
- ensure you have the dynamic snitch enabled
- what setting do you have for dynamic_snitch_badness_threshold in the yaml?
- have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole, at dumpTimings() and the scores

(Example commands for these checks are at the end of this mail.)

Basically slower nodes should be used less. But there are reasons they may not be, so let's work out which requests are running slow and whether the Dynamic Snitch is doing the right thing.

I would look at that error first, it seems odd.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:

> Just wondering, would it help if we shortened the rpc_timeout_in_ms (currently
> using 30,000), so that when one node gets hot and responds slowly, the others
> will just treat it as down and move on?
>
> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
>
>> Sorry, correction, we're using 0.8.1.
>>
>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>>
>>> Hi,
>>>
>>> We're noticing that when one node gets hot (very high CPU usage) because of
>>> 'nodetool repair', the whole cluster's performance becomes really bad.
>>>
>>> We're using 0.8.1 with the random partitioner. We have 6 nodes with RF 5. Our
>>> repair is scheduled to run once a week, spread across the whole cluster. I did
>>> get a suggestion from Jonathan that 0.8.0 has a bug in repair, but I'm
>>> wondering why one hot node would slow down the whole cluster.
>>>
>>> We saw this symptom yesterday on one node, and today on the adjacent node.
>>> Most probably it'll happen on the next one tomorrow.
>>>
>>> We do see lots of (~200) RejectedExecutionExceptions 3 hours before the
>>> repair job, and also in the middle of the repair job; not sure whether
>>> they're related. The full stack trace is attached at the end.
>>>
>>> Do we have any suggestions/hints?
>>>
>>> Thanks,
>>> Hefeng
>>>
>>>
>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>     at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>>     at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>>     at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>>     at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>>     at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>>     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>>     at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>>     at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>>     at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>>     at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>>     at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>>     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>     at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>
>
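For the slow-down checks above, these are the kinds of commands I mean (node address and yaml path are placeholders, adjust for your install):

    # see which thread pool stages have pending or blocked tasks
    nodetool -h <node> tpstats

    # confirm the dynamic snitch is on and check the badness threshold
    grep dynamic_snitch /etc/cassandra/cassandra.yaml

Then connect JConsole to the node's JMX port and find the DynamicEndpointSnitch MBean to look at dumpTimings() and the per-endpoint scores.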