Thanks Aaron for the response. We're not running drain on the node, and that message doesn't appear in the log.

We're using LOCAL_QUORUM CL, with the following snitch settings in cassandra.yaml:

  endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
  dynamic_snitch: false
  dynamic_snitch_badness_threshold: 0.0

We also have another 3-node DC for Brisk; maybe that's the problem?
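For context, the PropertyFileSnitch topology is laid out roughly like the sketch below. The IPs and DC/rack names are made-up placeholders for illustration, not our actual cassandra-topology.properties:

  # cassandra-topology.properties (illustrative sketch, placeholder addresses)
  # 6-node Cassandra DC (RF 5)
  10.0.1.1=Cassandra:RAC1
  10.0.1.2=Cassandra:RAC1
  10.0.1.3=Cassandra:RAC1
  10.0.1.4=Cassandra:RAC2
  10.0.1.5=Cassandra:RAC2
  10.0.1.6=Cassandra:RAC2
  # separate 3-node Brisk DC
  10.0.2.1=Brisk:RAC1
  10.0.2.2=Brisk:RAC1
  10.0.2.3=Brisk:RAC1
  # anything not listed above
  default=Cassandra:RAC1

As far as we understand, with LOCAL_QUORUM the coordinator should only wait on replicas in its own DC, so the Brisk DC shouldn't be in the request path, but we wanted to spell the layout out in case we've got something wrong there.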
On Aug 17, 2011, at 4:27 PM, aaron morton wrote:

> wrt the Exception: something has shut down the Mutation thread pool. The only
> thing I can see in the code that does this is nodetool drain and running the
> embedded server. If it was drain you should see an INFO level message "Node
> is drained" somewhere. Could either of these things be happening?
>
> wrt the slow down:
> - what CL are you using for reads and writes? What does the ring look like?
> - have a look at tpstats to see what stage is backing up
> - ensure you have the dynamic snitch enabled
> - what setting do you have for dynamic_snitch_badness_threshold in the yaml
> - have a look at the o.a.c.DynamicEndpointSnitch info in JMX / JConsole,
>   at dumpTimings() and the scores
>
> Basically slower nodes should be used less. But there are reasons they may
> not be, so let's work out what requests are running slow and whether the
> Dynamic Snitch is doing the right thing. I would look at that error first;
> it seems odd.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 18/08/2011, at 6:52 AM, Hefeng Yuan wrote:
>
>> Just wondering, would it help if we shortened rpc_timeout_in_ms (currently
>> 30,000), so that when one node gets hot and responds slowly, the others just
>> treat it as down and move forward?
>>
>> On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote:
>>
>>> Sorry, correction, we're using 0.8.1.
>>>
>>> On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote:
>>>
>>>> Hi,
>>>>
>>>> We're noticing that when one node gets hot (very high CPU usage) because
>>>> of 'nodetool repair', the whole cluster's performance becomes really bad.
>>>>
>>>> We're using 0.8.1 with the random partitioner. We have 6 nodes with RF 5.
>>>> Our repair is scheduled to run once a week, spread across the whole
>>>> cluster. I did get a suggestion from Jonathan that 0.8.0 has a repair bug,
>>>> but I'm wondering why one hot node would slow down the whole cluster.
>>>>
>>>> We saw this symptom yesterday on one node, and today on the adjacent node.
>>>> Most probably it'll happen on the next one tomorrow.
>>>>
>>>> We do see lots of (~200) RejectedExecutionExceptions 3 hours before the
>>>> repair job, and also in the middle of it; not sure whether they're
>>>> related. The full stack trace is attached at the end.
>>>>
>>>> Do you have any suggestions/hints?
>>>>
>>>> Thanks,
>>>> Hefeng
>>>>
>>>>
>>>> ERROR [pool-2-thread-3097] 2011-08-17 08:42:38,118 Cassandra.java (line 3462) Internal error processing batch_mutate
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>>     at org.apache.cassandra.service.StorageProxy.insertLocal(StorageProxy.java:360)
>>>>     at org.apache.cassandra.service.StorageProxy.sendToHintedEndpoints(StorageProxy.java:241)
>>>>     at org.apache.cassandra.service.StorageProxy.access$000(StorageProxy.java:62)
>>>>     at org.apache.cassandra.service.StorageProxy$1.apply(StorageProxy.java:99)
>>>>     at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:210)
>>>>     at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:154)
>>>>     at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:560)
>>>>     at org.apache.cassandra.thrift.CassandraServer.internal_batch_mutate(CassandraServer.java:511)
>>>>     at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:519)
>>>>     at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3454)
>>>>     at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>>>>     at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>>     at java.lang.Thread.run(Thread.java:619)
>>>> ERROR [Thread-137480] 2011-08-17 08:42:38,121 AbstractCassandraDaemon.java (line 113) Fatal exception in thread Thread[Thread-137480,5,main]
>>>> java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
>>>>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:73)
>>>>     at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:767)
>>>>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:658)
>>>>     at org.apache.cassandra.net.MessagingService.receive(MessagingService.java:444)
>>>>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:117)
>>>
>>
>
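Re Aaron's tpstats / dynamic snitch suggestions above, this is roughly what we plan to run on each node. The host name is a placeholder, and the JMX MBean name is our guess from the 0.8 source, so treat it as unverified; we'd also need to turn dynamic_snitch back on before any scores show up.

  # see which thread pool stages are backing up on a given node
  nodetool -h cass-node1 tpstats

  # then, in JConsole against the same node, look at the
  # org.apache.cassandra.db:type=DynamicEndpointSnitch MBean
  # and check the Scores attribute and the dumpTimings(host)
  # operation Aaron mentioned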