One hot node slows down whole cluster

Hefeng Yuan Wed, 17 Aug 2011 11:24:54 -0700

Hi,

We're noticing that when one node gets hot (very high CPU usage) because of 
'nodetool repair', the whole cluster's performance becomes really bad.


We're using 0.8.0 with the RandomPartitioner. We have 6 nodes with RF 5. Our 
repair is scheduled to run once a week, spread across the whole cluster. I did 
get a suggestion from Jonathan that 0.8.0 has some bugs in repair, but I'm 
wondering why one hot node would slow down the whole cluster.
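
For context, here is a rough sketch of how much of the ring a single node takes part in with 6 nodes and RF 5. It assumes evenly spaced tokens and simple ring-order replica placement (range i stored on nodes i, i+1, ..., i+RF-1), which is an assumption, since the placement strategy is not stated here. The class and method names are made up for illustration.

```java
public class ReplicaShare {
    // Counts how many of the `nodes` primary ranges have `node` among their
    // replicas, ASSUMING range i is stored on nodes i, i+1, ..., i+rf-1 (mod nodes).
    static int rangesReplicatedBy(int node, int nodes, int rf) {
        int count = 0;
        for (int range = 0; range < nodes; range++) {
            for (int k = 0; k < rf; k++) {
                if ((range + k) % nodes == node) {
                    count++;
                    break; // count each range at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int nodes = 6, rf = 5; // values from this cluster
        int ranges = rangesReplicatedBy(0, nodes, rf);
        System.out.printf("node 0 is a replica for %d of %d ranges%n", ranges, nodes);
    }
}
```

Under those assumptions every node is a replica for 5 of the 6 ranges, so a single busy node is involved in most reads and writes. Whether that actually delays requests depends on the consistency level in use, which is not covered here.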

We saw this symptom yesterday on one node, and today on the adjacent node. Most 
probably it'll happen on the next one tomorrow.

We also see many (~200) RejectedExecutionExceptions, both 3 hours before the 
repair job and in the middle of it; we're not sure whether they're related. 
The full stack traces are attached at the end.
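
As a point of reference, the message "ThreadPoolExecutor has shut down" in the traces suggests that work was handed to an executor after it had been shut down. The minimal sketch below reproduces that general behaviour with a plain JDK thread pool. It does not use Cassandra's DebuggableThreadPoolExecutor, and the class and method names are made up for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class RejectedAfterShutdown {
    // Returns true if a task handed to an already shut-down pool is rejected.
    static boolean isRejectedAfterShutdown() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.shutdown(); // no new tasks are accepted from here on
        try {
            pool.execute(() -> { /* never runs */ });
            return false;
        } catch (RejectedExecutionException e) {
            // The JDK's default rejection policy throws on a shut-down pool.
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected after shutdown: " + isRejectedAfterShutdown());
    }
}
```

This only shows how the exception type arises in general; it says nothing about why the pool on our node was shut down.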

Does anyone have any suggestions or hints?

Thanks,
Hefeng


ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
        at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
        at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
        at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
        at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
        at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
        at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
        at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
        at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
        at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
        at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
        at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
        at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
