Hi folks, hopefully a quick one: we are running a 12-node cluster (2.1.15) in AWS with Ec2Snitch. It's all in one region but spread across 3 availability zones, and it was nicely balanced with 4 nodes in each.
But after a couple of failures and subsequent re-provisioning into the wrong AZs, we now have a cluster with:

- 5 nodes in AZ A
- 5 nodes in AZ B
- 2 nodes in AZ C

Not sure why, but when we add a third node in AZ C it fails to stream after getting all the way to completion, with no apparent error in the logs. I've looked at a couple of bugs referring to scrubbing and possible OOMs caused by metadata being written at the end of streaming (sorry, I don't have the ticket handy). I'm worried I won't be able to do much with these two nodes, since their disk usage is high and they are under a lot of load given how few of them there are for this rack.

Rather than troubleshoot this further, what I was thinking of doing was:

- drop the replication factor on our keyspace to two (hopefully reducing load on the two remaining nodes)
- run repairs/cleanup across the cluster
- then shoot the two nodes in the 'c' rack
- run repairs/cleanup across the cluster

(There's a rough sketch of the commands I had in mind at the end of this mail.)

Would this work with minimal/no disruption? Should I update their "rack" beforehand or afterwards? What else am I not thinking about? My main goal at the moment is to get the cluster back into a clean, consistent state that allows nodes to bootstrap properly.

Thanks for your help in advance.
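For reference, here's a rough sketch of the commands I had in mind, assuming the keyspace is called "my_keyspace" and that Ec2Snitch reports the data center as "us-east" (both placeholders for our real names):

    -- drop the replication factor to two (run once, from cqlsh)
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 2};

    # repair and clean up, one node at a time across the cluster
    nodetool repair -pr my_keyspace
    nodetool cleanup my_keyspace

    # retire the two AZ C nodes (run on each of those nodes in turn)
    nodetool decommission

    # then repair/cleanup across the remaining nodes again
    nodetool repair -pr my_keyspace
    nodetool cleanup my_keyspace

I've assumed decommission for the "shoot the nodes" step so their data streams off before they go away, but that's exactly the part I'm unsure about given the streaming problems we've already seen in this AZ.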