wrt the Exception: something has shut down the Mutation thread pool. The only things I can see in the code that do this are nodetool drain and running the embedded server. If it was drain you should see an INFO level message "Node is drained" somewhere in the log. Could either of these things be happening?
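A quick way to check the drain possibility is to grep the log for that message, e.g. (assuming the default packaged log location, adjust the path for your install):

    grep "Node is drained" /var/log/cassandra/system.log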
wrt the slow down:

- what CL are you using for reads and writes? What does the ring look like?
- have a look at nodetool tpstats to see which stage is backing up
- ensure you have the dynamic snitch enabled
- what setting do you have for dynamic_snitch_badness_threshold in the yaml?
- have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole, at dumpTimings() and the scores

(Example commands for these checks are at the end of this mail.)

Basically slower nodes should be used less. But there are reasons they may not be, so let's work out which requests are running slow and whether the Dynamic Snitch is doing the right thing.

I would look at that error first, it seems odd.

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:

> Just wondering, would it help if we shortened the rpc_timeout_in_ms (currently
> using 30,000), so that when one node gets hot and responds slowly, the others
> will just treat it as down and move on?
>
> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
>
>> Sorry, correction, we're using 0.8.1.
>>
>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>>
>>> Hi,
>>>
>>> We're noticing that when one node gets hot (very high CPU usage) because of
>>> 'nodetool repair', the whole cluster's performance becomes really bad.
>>>
>>> We're using 0.8.1 with the random partitioner. We have 6 nodes with RF 5. Our
>>> repair is scheduled to run once a week, spread across the whole cluster. I did
>>> get a suggestion from Jonathan that 0.8.0 has a bug in repair, but I'm
>>> wondering why one hot node would slow down the whole cluster.
>>>
>>> We saw this symptom yesterday on one node, and today on the adjacent node.
>>> Most probably it'll happen on the next one tomorrow.
>>>
>>> We do see lots of (~200) RejectedExecutionExceptions 3 hours before the
>>> repair job, and also in the middle of the repair job; not sure whether
>>> they're related. The full stack trace is attached at the end.
>>>
>>> Do we have any suggestions/hints?
>>>
>>> Thanks,
>>> Hefeng
>>>
>>>
>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>     at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>>     at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>>     at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>>     at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>>     at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>>     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>>     at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>>     at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>>     at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>>     at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>>     at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>>     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>     at java.lang.Thread.run(Thread.java:619)
>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>     at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>
>
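For the slow-down checks above, these are the kinds of commands I mean (node address and yaml path are placeholders, adjust for your install):

    # see which thread pool stages have pending or blocked tasks
    nodetool -h <node> tpstats

    # confirm the dynamic snitch is on and check the badness threshold
    grep dynamic_snitch /etc/cassandra/cassandra.yaml

Then connect JConsole to the node's JMX port and find the DynamicEndpointSnitch MBean to look at dumpTimings() and the per-endpoint scores.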