Hi,

We're noticing that when one node gets hot (very high CPU usage) because of 'nodetool repair', the performance of the whole cluster becomes really bad.
We're using 0.8.0 with RandomPartitioner. We have 6 nodes with RF 5. Our repair is scheduled to run once a week, spread across the whole cluster. Jonathan did suggest that 0.8.0 has a bug in repair, but I'm wondering why one hot node would slow down the whole cluster.

We saw this symptom yesterday on one node, and today on the adjacent node. Most probably it will happen on the next one tomorrow.

We also see lots of (~200) RejectedExecutionException errors 3 hours before the repair job, and also in the middle of the repair job. I'm not sure whether they're related. The full stack trace is attached at the end.

Does anyone have any suggestions or hints?

Thanks,
Hefeng

ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
    at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
    at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
    at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
    at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
    at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
    at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
    at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
    at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
    at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
    at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)

ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
    at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
    at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)