Hi Praveen,
> We are not removing multiple nodes at the same time. All dead nodes are
> from the same AZ so there were no errors when the nodes were down, as expected
> (because we use QUORUM)

Do you use at least 3 distinct AZs? If so, you should indeed be fine regarding data integrity, and repair should then work for you. If you have fewer than 3 AZs, then you are in trouble...

About the unreachable errors, I believe they can be caused by the overload due to the missing nodes. The pressure on the remaining nodes might be too strong.

> However, as soon as I started removing nodes one by one, every time we
> see lots of timeout and unavailable exceptions, which doesn’t make any sense
> because I am just removing a node that doesn’t even exist.

This probably added even more load: if you are using vnodes, all the remaining nodes probably started streaming data to each other at the speed given by "nodetool getstreamthroughput". AWS network isn't that good, and is probably saturated.

Also, do you have phi_convict_threshold configured to a high value, at least 10 or 12? This would avoid nodes being marked down that often.

What does "nodetool tpstats" output? You might also monitor resources and see what happens (my guess is you should focus on iowait, disk usage and network, and keep an eye on CPU too).

A quick fix would probably be to strongly throttle streaming on all the nodes and see if it helps:

nodetool setstreamthroughput 2

If this works, you could incrementally increase it, monitor, find the right tuning, and make it permanent in cassandra.yaml.
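Concretely, a rough sketch of what that could look like, run on each node. The values here are only examples to start from, and the yaml property names (stream_throughput_outbound_megabits_per_sec, phi_convict_threshold) are the standard ones but worth double-checking against your 2.0.9 cassandra.yaml:

# check the current streaming throttle on this node
nodetool getstreamthroughput

# throttle streaming hard (value in Mb/s); run on every node
nodetool setstreamthroughput 2

# watch for pending/blocked stages and dropped messages while streaming runs
nodetool tpstats

# once you find a value the cluster sustains, make it permanent in cassandra.yaml:
#   stream_throughput_outbound_megabits_per_sec: 2
#   phi_convict_threshold: 12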
I opened a ticket a while ago about that issue: https://issues.apache.org/jira/browse/CASSANDRA-9509

I hope this will help you get back to a healthy state, allowing you a fast upgrade ;-).

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 22:17 GMT+01:00 Peddi, Praveen <pe...@amazon.com>:

> Hi Robert,
> Thanks for your response.
>
> Replication factor is 3.
>
> We are in the process of upgrading to 2.2.4. We have had too many
> performance issues with later versions of Cassandra (I have asked for
> help related to that in the forum). We are close to similar performance
> now and will hopefully upgrade in the next few weeks. Lots of testing to
> do :(.
>
> We are not removing multiple nodes at the same time. All dead nodes are
> from the same AZ so there were no errors when the nodes were down, as expected
> (because we use QUORUM). However, as soon as I started removing nodes one
> by one, every time we see lots of timeout and unavailable exceptions,
> which doesn’t make any sense because I am just removing a node that doesn’t
> even exist.
>
> From: Robert Coli <rc...@eventbrite.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Wednesday, March 2, 2016 at 2:52 PM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: Re: Removing Node causes bunch of HostUnavailableException
>
> On Wed, Mar 2, 2016 at 8:10 AM, Peddi, Praveen <pe...@amazon.com> wrote:
>
>> We have a few dead nodes in the cluster (Amazon ASG removed them thinking
>> there was an issue with their health). Now we are trying to remove those
>> dead nodes from the cluster so that other nodes can take over. As soon as I
>> execute nodetool removenode <ID>, we see lots of HostUnavailableExceptions
>> on both reads and writes. What I am not able to understand is, these are
>> dead nodes and don’t even physically exist. Why would the removenode command
>> cause any outage of nodes in Cassandra when we had no errors whatsoever
>> before removing them? I could not really find a jira ticket for this.
>
> What is your replication factor?
>
> Also, 2.0.9 is meaningfully old at this point, consider upgrading ASAP.
>
> Also, removing multiple nodes with removenode means your consistency is
> pretty hosed. Repair ASAP, but there are potential cases where repair won't
> help.
>
> =Rob
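As a side note on Rob's point about repairing after removenode, a minimal sketch of that sequence, with <host-id> standing in for the Host ID shown by "nodetool status":

# identify the Host ID of each dead (DN) node
nodetool status

# remove one dead node at a time and watch its progress
nodetool removenode <host-id>
nodetool removenode status

# once the ring is stable again, repair the primary range on every node
nodetool repair -pr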