Just thought of a solution that might actually work even better: you could try replacing one node at a time (instead of removing them). I believe this would decrease the number of streams significantly.

https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
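If it helps, the linked procedure boils down to bootstrapping a brand new node with the replace_address flag pointing at the dead one. A rough sketch from memory (double-check it against the doc above; the placeholder IP is hypothetical):

    # On the NEW node, before its very first start, in cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<dead_node_ip>"

    # The new node must not be in its own seed list, then:
    sudo service cassandra start
    # It streams the dead node's ranges and takes over its tokens.

Do one replacement at a time and wait for each to finish before starting the next.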
Good luck,

-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 23:57 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Praveen,
>
>> We are not removing multiple nodes at the same time. All dead nodes are
>> from the same AZ so there were no errors when the nodes were down, as
>> expected (because we use QUORUM).
>
> Do you use at least 3 distinct AZs? If so, you should indeed be fine
> regarding data integrity, and repair should then work for you. If you
> have fewer than 3 AZs, then you are in trouble...
>
> About the unreachable errors, I believe they can be caused by the
> overload from the missing nodes. Pressure on the remaining nodes might
> be too strong.
>
>> However, as soon as I started removing nodes one by one, every time
>> we see lots of timeout and unavailable exceptions, which doesn't make
>> any sense because I am just removing a node that doesn't even exist.
>
> This probably added even more load: if you are using vnodes, all the
> remaining nodes probably started streaming data to each other at the
> speed of "nodetool getstreamthroughput". AWS network isn't that good,
> and is probably saturated. Also, do you have phi_convict_threshold
> configured to a high value, at least 10 or 12? That would keep nodes
> from being marked down so often.
>
> What does "nodetool tpstats" output?
>
> You might also monitor resources and see what happens (my guess is you
> should focus on iowait, disk usage and network; keep an eye on cpu too).
>
> A quick fix would probably be to throttle the network hard on all the
> nodes and see if it helps:
>
> nodetool setstreamthroughput 2
>
> If this works, you could incrementally increase it, monitor, find the
> right setting and put it in cassandra.yaml.
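> Both knobs live in cassandra.yaml; for reference, something like this
> (illustrative values, tune them for your cluster):
>
>     # cassandra.yaml
>     phi_convict_threshold: 12  # default is 8; higher = slower to mark nodes down
>     stream_throughput_outbound_megabits_per_sec: 2  # persists the throttle across restarts
>
> ("nodetool setstreamthroughput" changes the throttle live; the yaml
> entry makes it stick after a restart.)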
>
> I opened a ticket a while ago about that issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9509
>
> I hope this will help you get back to a healthy state, allowing you a
> fast upgrade ;-).
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 22:17 GMT+01:00 Peddi, Praveen <pe...@amazon.com>:
>
>> Hi Robert,
>> Thanks for your response.
>>
>> Replication factor is 3.
>>
>> We are in the process of upgrading to 2.2.4. We have had too many
>> performance issues with later versions of Cassandra (I have asked for
>> help related to that in the forum). We are close to getting to similar
>> performance now and will hopefully upgrade in the next few weeks. Lots
>> of testing to do :(.
>>
>> We are not removing multiple nodes at the same time. All dead nodes
>> are from the same AZ so there were no errors when the nodes were down,
>> as expected (because we use QUORUM). However, as soon as I started
>> removing nodes one by one, every time we see lots of timeout and
>> unavailable exceptions, which doesn't make any sense because I am just
>> removing a node that doesn't even exist.
>>
>> From: Robert Coli <rc...@eventbrite.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, March 2, 2016 at 2:52 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Removing Node causes bunch of HostUnavailableException
>>
>> On Wed, Mar 2, 2016 at 8:10 AM, Peddi, Praveen <pe...@amazon.com> wrote:
>>
>>> We have a few dead nodes in the cluster (Amazon ASG removed them
>>> thinking there was an issue with health). Now we are trying to remove
>>> those dead nodes from the cluster so that other nodes can take over.
>>> As soon as I execute nodetool removenode <ID>, we see lots of
>>> HostUnavailableExceptions, both on reads and writes. What I am not
>>> able to understand is, these are dead nodes and don't even physically
>>> exist. Why would the removenode command cause any outage of nodes in
>>> Cassandra when we had no errors whatsoever before removing them? I
>>> could not find a JIRA ticket for this.
>>
>> What is your replication factor?
>>
>> Also, 2.0.9 is meaningfully old at this point; consider upgrading ASAP.
>>
>> Also, removing multiple nodes with removenode means your consistency
>> is pretty hosed. Repair ASAP, but there are potential cases where
>> repair won't help.
>>
>> =Rob
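
PS: if you do stick with removenode rather than replacing, the
removenode-then-repair path would look something like this (a sketch;
one node at a time, and the Host ID placeholder is hypothetical):

    nodetool status               # dead nodes show up as DN; note their Host IDs
    nodetool removenode <host-id> # wait for it to fully complete
    nodetool repair               # then repair on each remaining node, as Robert advises

Keep "nodetool setstreamthroughput" low while the ranges stream, as above.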