Thanks Aaron for the response. We're not running drain on the node, and that message doesn't appear in the log.

We're using LOCAL_QUORUM CL, with the following snitch settings in cassandra.yaml:

  endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
  dynamic_snitch: false
  dynamic_snitch_badness_threshold: 0.0

We also have another 3-node DC for Brisk; maybe that's the problem?
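For context, the PropertyFileSnitch topology is laid out roughly like the sketch below. The IPs and DC/rack names are made-up placeholders for illustration, not our actual cassandra-topology.properties:

  # cassandra-topology.properties (illustrative sketch, placeholder addresses)
  # 6-node Cassandra DC (RF 5)
  10.0.1.1=Cassandra:RAC1
  10.0.1.2=Cassandra:RAC1
  10.0.1.3=Cassandra:RAC1
  10.0.1.4=Cassandra:RAC2
  10.0.1.5=Cassandra:RAC2
  10.0.1.6=Cassandra:RAC2
  # separate 3-node Brisk DC
  10.0.2.1=Brisk:RAC1
  10.0.2.2=Brisk:RAC1
  10.0.2.3=Brisk:RAC1
  # anything not listed above
  default=Cassandra:RAC1

As far as we understand, with LOCAL_QUORUM the coordinator should only wait on replicas in its own DC, so the Brisk DC shouldn't be in the request path, but we wanted to spell the layout out in case we've got something wrong there.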
On Aug 17, 2011, at 4:27 PM, aaron morton wrote:

> wrt the Exception: something has shut down the Mutation thread pool. The only
> thing I can see in the code that does this is nodetool drain and running the
> embedded server. If it was drain you should see an INFO level message "Node
> is drained" somewhere. Could either of these things be happening?
>
> wrt the slow down:
> - what CL are you using for reads and writes? What does the ring look like?
> - have a look at tpstats to see what stage is backing up
> - ensure you have the dynamic snitch enabled
> - what setting do you have for dynamic_snitch_badness_threshold in the yaml
> - have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole,
>   at dumpTimings() and the scores
>
> Basically slower nodes should be used less. But there are reasons they may
> not be, so let's work out what requests are running slow and whether the
> Dynamic Snitch is doing the right thing. I would look at that error first;
> it seems odd.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:
>
>> Just wondering, would it help if we shortened rpc_timeout_in_ms (currently
>> 30,000), so that when one node gets hot and responds slowly, the others just
>> treat it as down and move forward?
>>
>> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
>>
>>> Sorry, correction, we're using 0.8.1.
>>>
>>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>>>
>>>> Hi,
>>>>
>>>> We're noticing that when one node gets hot (very high CPU usage) because
>>>> of 'nodetool repair', the whole cluster's performance becomes really bad.
>>>>
>>>> We're using 0.8.1 with the random partitioner. We have 6 nodes with RF 5.
>>>> Our repair is scheduled to run once a week, spread across the whole
>>>> cluster. I did get a suggestion from Jonathan that 0.8.0 has a repair bug,
>>>> but I'm wondering why one hot node would slow down the whole cluster.
>>>>
>>>> We saw this symptom yesterday on one node, and today on the adjacent node.
>>>> Most probably it'll happen on the next one tomorrow.
>>>>
>>>> We do see lots of (~200) RejectedExecutionExceptions 3 hours before the
>>>> repair job, and also in the middle of it; not sure whether they're
>>>> related. The full stack trace is attached at the end.
>>>>
>>>> Do you have any suggestions/hints?
>>>>
>>>> Thanks,
>>>> Hefeng
>>>>
>>>>
>>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>>     at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>>>     at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>>>     at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>>>     at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>>>     at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>>>     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>>>     at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>>>     at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>>>     at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>>>     at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>>>     at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>>>     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>     at java.lang.Thread.run(Thread.java:619)
>>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>>     at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>>
>>
>
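Re Aaron's tpstats / dynamic snitch suggestions above, this is roughly what we plan to run on each node. The host name is a placeholder, and the JMX MBean name is our guess from the 0.8 source, so treat it as unverified; we'd also need to turn dynamic_snitch back on before any scores show up.

  # see which thread pool stages are backing up on a given node
  nodetool -h cass-node1 tpstats

  # then, in JConsole against the same node, look at the
  # org.apache.cassandra.db:type=DynamicEndpointSnitch MBean
  # and check the Scores attribute and the dumpTimings(host)
  # operation Aaron mentioned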