Thanks Tyler!!
I understand that we need to consider a node as lost when it has been down for
longer than gc grace, and bootstrap it. My question is more about the JIRA
https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290
where an intentional decision was taken to abort the repair if a single replica
is down. Specifically, I need to understand the reasoning behind aborting the
repair instead of proceeding with the available replicas. As it is related to a
specific fix, I thought that the developers involved in that decision could
better explain the reasoning, so I posted it on the dev list first.
Consider a scenario where I have a 20 node cluster, RF=5, QUORUM reads and
writes, and a gc grace period of 20 days. The cluster is fault tolerant and can
afford 2 replicas being down (QUORUM with RF=5 only needs 3 of the 5 replicas).
Suddenly, one node goes down due to a hardware issue. It has now been 10 days
since the node went down, none of the 19 remaining nodes have been repaired in
that time, and it is decision time. I am not sure how soon the issue will be
fixed; it may be fixed 8 days before gc grace expires, so I shouldn't remove
the node early and add it back, as that would cause unnecessary streaming. At
the same time, if I don't remove the failed node, the health of the entire
system is in question and it becomes a panic situation: no data has been
repaired in the last 10 days, gc grace is approaching, and I need sufficient
time to repair the 19 remaining nodes.
What looked like a fault tolerant system that can afford 2 node failures
required urgent attention and manual decision making when a single node went
down. Why can't we just go ahead and repair the remaining replicas if some
replicas are down? If the failed node comes back up before gc grace expires, we
would run repair to fix the inconsistencies; otherwise we would discard its
data and bootstrap it. I think that would be a really robust, fault tolerant
system.
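
To make the two paths concrete, here is a rough operational sketch of what I
have in mind. The exact commands and the replace-address flag are my
assumptions about current nodetool/Cassandra behaviour, not something specified
in the JIRA:

    # Case 1: the failed node recovers before gc grace expires
    # bring it back up and repair it against the live replicas
    nodetool repair -pr

    # Case 2: the node cannot be recovered before gc grace expires
    # remove it from the ring and bootstrap a replacement with empty data
    nodetool removenode <host-id-of-dead-node>
    # ...or start a replacement node with the JVM option (assumed flag name):
    #   -Dcassandra.replace_address_first_boot=<ip-of-dead-node>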

Thanks
Anuj

On Tue, 19 Jan 2016 at 9:44 pm, Tyler Hobbs <ty...@datastax.com> wrote:

On Fri, Jan 15, 2016 at 12:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
wrote:

> Increase the gc grace period temporarily. Then we should have capacity
> planning to accommodate the extra storage needed for the extra gc grace
> that may be needed in case of node failure scenarios.


I would do this.  Nodes that are down for longer than gc_grace_seconds
should not re-enter the cluster, because they may contain data that has
been deleted and whose tombstones have already been purged (repairing
doesn't change this).  Bringing them back up will result in "zombie" data.
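
To illustrate what "zombie" data means here, and what temporarily raising
gc_grace_seconds looks like (the keyspace/table name and the 20-day value
below are just placeholders):

    # timeline of zombie data (illustrative):
    #   t0: DELETE executed; tombstones written to replicas A and B (replica C is down)
    #   t0 + gc_grace_seconds: compaction purges the tombstones on A and B
    #   later: C rejoins, still holding the old pre-delete row
    #   the next repair streams that row back to A and B, resurrecting deleted data
    #
    # buying more time before tombstones are purged (1728000 seconds = 20 days):
    cqlsh -e "ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 1728000;"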

Also, I do think that the user mailing list is a better place for the first
round of this conversation.

-- 
Tyler Hobbs
DataStax <http://datastax.com/>