Thanks Tyler!!

I understand that we need to consider a node as lost when it has been down for longer than gc grace, and bootstrap it. My question is more about the JIRA https://issues.apache.org/jira/plugins/servlet/mobile#issue/CASSANDRA-2290, where an intentional decision was taken to abort the repair if a single replica is down. Specifically, I need to understand the reasoning behind aborting the repair instead of proceeding with the available replicas. As it is related to a specific fix, I thought the developers involved in that decision could best explain the reasoning, so I posted it on the dev list first.

Consider a scenario where I have a 20 node cluster, RF=5, Read/Write Quorum, gc grace period = 20 days. My cluster is fault tolerant and can afford 2 node failures. Suddenly, one node goes down due to some hardware issue. It's been 10 days since the node went down, none of the other 19 nodes have been repaired in that time, and now it's decision time. I am not sure how soon the issue will be fixed, maybe 8 days before gc grace expires, so I shouldn't remove the node early and add it back, as that would cause unnecessary streaming. At the same time, if I don't remove the failed node, the health of my entire system is in question and it becomes a panic situation, since no data has been repaired in the last 10 days and gc grace is approaching. I need sufficient time to repair 19 nodes. What looked like a fault tolerant system that can afford 2 node failures required urgent attention and manual decision making when a single node went down.

Why can't we just go ahead and repair the remaining replicas when some replicas are down? If the failed node comes up before the gc grace period, we would run repair to fix inconsistencies; otherwise we would discard its data and bootstrap it. I think that would be a really robust, fault tolerant system.
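For illustration, the two paths I am weighing might look roughly like the following with nodetool; the keyspace name "myks" and the host ID are placeholders, not details from this thread:

    # Path 1: the failed node returns before gc grace expires, so repair it
    # against the surviving replicas (run on the returned node)
    nodetool repair myks

    # Path 2: gc grace expires before the node returns, so remove it and
    # bootstrap a fresh replacement node
    nodetool removenode <host-id-of-dead-node>
    # then start a new, empty node with auto_bootstrap enabled so it streams
    # its data from the live replicas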
Thanks
Anuj

On Tue, 19 Jan, 2016 at 9:44 pm, Tyler Hobbs <ty...@datastax.com> wrote:

On Fri, Jan 15, 2016 at 12:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

> Increase the gc grace period temporarily. Then we should have capacity
> planning to accommodate the extra storage needed for the extra gc grace
> that may be needed in case of node failure scenarios.

I would do this. Nodes that are down for longer than gc_grace_seconds should not re-enter the cluster, because they may contain data that has been deleted and whose tombstone has already been purged (repairing doesn't change this). Bringing them back up will result in "zombie" data.

Also, I do think that the user mailing list is a better place for the first round of this conversation.

--
Tyler Hobbs
DataStax <http://datastax.com/>
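As a rough sketch of the "increase the gc grace period temporarily" option discussed in the quoted thread (the keyspace/table name is a placeholder, and the value is simply 20 days expressed in seconds):

    # temporarily raise gc_grace_seconds from the default 10 days (864000)
    # to 20 days while the node is down
    cqlsh -e "ALTER TABLE myks.mytable WITH gc_grace_seconds = 1728000;"
    # revert once the failed node has been repaired or replaced
    cqlsh -e "ALTER TABLE myks.mytable WITH gc_grace_seconds = 864000;"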