Re: One hot node slows down whole cluster

2011-08-17 Thread Hefeng Yuan
Thanks Aaron for the response. We're not running drain on the node, and that message does not appear in the log. We used LOCAL_QUORUM CL, with these settings:
endpoint_snitch: org.apache.cassandra.locator.PropertyFileSnitch
dynamic_snitch: false
dynamic_snitch_badness_threshold: 0.0
Because we have another 3-node DC for Brisk

Re: One hot node slows down whole cluster

2011-08-17 Thread aaron morton
Regarding the exception: something has shut down the Mutation thread pool. The only things I can see in the code that do this are nodetool drain and running the embedded server. If it was drain you should see an INFO level message "Node is drained" somewhere. Could either of these things be happening? w

Re: One hot node slows down whole cluster

2011-08-17 Thread Hefeng Yuan
Just wondering, would it help if we shortened rpc_timeout_in_ms (currently 30,000), so that when one node gets hot and responds slowly, the others will just treat it as down and move on? On Aug 17, 2011, at 11:35 AM, Hefeng Yuan wrote: > Sorry, correction, we're using 0.8.1. > > On A

Re: One hot node slows down whole cluster

2011-08-17 Thread Hefeng Yuan
Sorry, correction, we're using 0.8.1. On Aug 17, 2011, at 11:24 AM, Hefeng Yuan wrote: > Hi, > > We're noticing that when one node gets hot (very high cpu usage) because of > 'nodetool repair', the whole cluster's performance becomes really bad. > > We're using 0.8.1 with random partition. We

One hot node slows down whole cluster

2011-08-17 Thread Hefeng Yuan
Hi, We're noticing that when one node gets hot (very high CPU usage) because of 'nodetool repair', the whole cluster's performance becomes really bad. We're using 0.8.0 with the random partitioner. We have 6 nodes with RF 5. Our repair is scheduled to run once a week, spread across the whole cluster. I d
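
For readers following the thread, here is a rough sketch of the arithmetic behind "6 nodes with RF 5" and the LOCAL_QUORUM CL mentioned in a later reply. It assumes the usual quorum rule (a majority of replicas), and, as a simplification not stated in the thread, evenly spaced tokens with each range replicated to RF consecutive nodes on the ring. The function names are hypothetical and only for illustration.

```python
from fractions import Fraction


def quorum_size(replication_factor: int) -> int:
    """Number of replicas that must respond for a quorum (majority of RF)."""
    return replication_factor // 2 + 1


def fraction_of_ranges_on_node(nodes: int, replication_factor: int) -> Fraction:
    """Share of token ranges for which one given node is a replica.

    Assumption: evenly spaced tokens, each range stored on RF consecutive
    nodes of the ring, so every node holds RF out of `nodes` ranges.
    """
    return Fraction(min(replication_factor, nodes), nodes)


if __name__ == "__main__":
    rf, nodes = 5, 6  # values taken from the original post
    print("quorum size:", quorum_size(rf))
    print("share of ranges on any one node:", fraction_of_ranges_on_node(nodes, rf))
```

Under these assumptions a quorum needs 3 of the 5 replicas, and any single node is a replica for 5/6 of the ranges, so a node that is busy with repair is a candidate replica for most requests.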