Just thought of a solution that might actually work even better.

You could try replacing the dead nodes one at a time (instead of removing them).

I believe this would significantly reduce the number of streams.

https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
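
As a rough sketch (assuming 10.0.0.1 is the dead node's former IP, a
hypothetical value here), you would start the replacement node with the
replace_address JVM option, e.g. by adding this to cassandra-env.sh on the
new node:

    # stream the dead node's ranges to this node on first start
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.1"

The new node then takes over the dead node's token ranges directly,
instead of all the remaining nodes streaming to each other as removenode
does.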

Good luck,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-02 23:57 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Praveen,
>
>
>> We are not removing multiple nodes at the same time. All dead nodes are
>> from the same AZ, so there were no errors when the nodes were down, as
>> expected (because we use QUORUM).
>
>
> Do you use at least 3 distinct AZs? If so, you should indeed be fine
> regarding data integrity, and repair should then work for you. If you have
> fewer than 3 AZs, then you are in trouble...
>
> About the unreachable errors, I believe they can be caused by the overload
> from the missing nodes: the pressure on the remaining nodes might be too
> high.
>
>> However, as soon as I started removing nodes one by one, every time we
>> see a lot of timeout and unavailable exceptions, which doesn't make any
>> sense because I am just removing a node that doesn't even exist.
>>
>
> This probably added even more load: if you are using vnodes, all the
> remaining nodes probably started streaming data to each other at the
> speed allowed by "nodetool getstreamthroughput". The AWS network isn't
> that good and is probably saturated. Also, have you set
> phi_convict_threshold to a high value, at least 10 or 12? This would keep
> nodes from being marked down so often.
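>
> As a sketch, in cassandra.yaml on each node (12 is just an example value;
> the default is 8):
>
>     phi_convict_threshold: 12
>
> A rolling restart is needed for this change to take effect.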
>
> What does "nodetool tpstats" output?
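>
> For example, on each node:
>
>     nodetool tpstats
>
> and look mainly at the Pending and Blocked columns (especially for
> ReadStage, MutationStage and FlushWriter): anything consistently non-zero
> there points at an overloaded node.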
>
> Also, you might want to monitor resources and see what happens (my guess
> is you should focus on iowait, disk usage and network; keep an eye on CPU
> too).
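>
> For instance, with standard Linux tools (any equivalent monitoring works):
>
>     iostat -x 5   # iowait and per-disk utilisation
>     df -h         # disk usage
>     sar -n DEV 5  # network throughput per interface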
>
> A quick fix would probably be to throttle streaming hard on all the
> nodes and see if it helps:
>
> nodetool setstreamthroughput 2
>
> If this works, you could increase it incrementally while monitoring, find
> the right value, and then make it permanent in cassandra.yaml.
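>
> As a sketch, assuming the value you settle on is 20 Mb/s (an example
> figure), apply it live on every node with:
>
>     nodetool setstreamthroughput 20
>
> and persist it in cassandra.yaml:
>
>     stream_throughput_outbound_megabits_per_sec: 20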
>
> I opened a ticket a while ago about that issue:
> https://issues.apache.org/jira/browse/CASSANDRA-9509
>
> I hope this helps you get back to a healthy state, allowing you a fast
> upgrade ;-).
>
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-03-02 22:17 GMT+01:00 Peddi, Praveen <pe...@amazon.com>:
>
>> Hi Robert,
>> Thanks for your response.
>>
>> Replication factor is 3.
>>
>> We are in the process of upgrading to 2.2.4. We have had too many
>> performance issues with later versions of Cassandra (I have asked for
>> help related to that in the forum). We are close to getting similar
>> performance now and hopefully will upgrade in the next few weeks. Lots of
>> testing to do :(.
>>
>> We are not removing multiple nodes at the same time. All dead nodes are
>> from the same AZ, so there were no errors when the nodes were down, as
>> expected (because we use QUORUM). However, as soon as I started removing
>> nodes one by one, every time we see a lot of timeout and unavailable
>> exceptions, which doesn't make any sense because I am just removing a
>> node that doesn't even exist.
>>
>> From: Robert Coli <rc...@eventbrite.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, March 2, 2016 at 2:52 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Removing Node causes bunch of HostUnavailableException
>>
>> On Wed, Mar 2, 2016 at 8:10 AM, Peddi, Praveen <pe...@amazon.com> wrote:
>>
>>> We have a few dead nodes in the cluster (Amazon ASG removed them,
>>> thinking there was a health issue). Now we are trying to remove those
>>> dead nodes from the cluster so that other nodes can take over. As soon
>>> as I execute nodetool removenode <ID>, we see lots of
>>> HostUnavailableExceptions on both reads and writes. What I am not able
>>> to understand is, these are dead nodes that don't even physically exist.
>>> Why would the removenode command cause any outage of nodes in Cassandra
>>> when we had no errors whatsoever before removing them? I could not
>>> really find a JIRA ticket for this.
>>>
>>
>> What is your replication factor?
>>
>> Also, 2.0.9 is meaningfully old at this point, consider upgrading ASAP.
>>
>> Also, removing multiple nodes with removenode means your consistency is
>> pretty hosed. Repair ASAP, but there are potential cases where repair won't
>> help.
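>>
>> For instance (a sketch; run it sequentially on each node once the
>> cluster is stable again):
>>
>>     nodetool repair -pr
>>
>> so that each token range is repaired exactly once across the cluster.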
>>
>> =Rob
>>
>
>
