Hi,
I need to understand the use case for join_ring=false after a node outage.
As per https://issues.apache.org/jira/browse/CASSANDRA-6961, you would want
join_ring=false when you have to repair a node before bringing it back after a
considerable outage. The problem I see with join_ring=false is that, unlike
autobootstrap, the node will NOT accept writes while you are running repair on
it. If a node was down for 5 hours and you bring it back with join_ring=false,
repair it for 7 hours and then make it join the ring, it will STILL have missed
writes, because during the 7 hours that repair was running, writes went only to
the other nodes. So if you want reads served by the restored node at CL ONE to
return consistent data after the node has joined, you won't get that, as writes
were missed while the node was being repaired. And if you work with read/write
CL=QUORUM, you would get the desired consistency anyway, even if you bring the
node back without join_ring=false. So how would join_ring=false provide any
additional consistency in this case?
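To make the arithmetic above concrete, here is a minimal sketch of the timeline. It only models the assumptions stated in this post (the node takes no writes while it runs with join_ring=false, and repair only covers data written before repair started); it is not a description of Cassandra internals. The function name and the calendar date are hypothetical.

```python
from datetime import datetime, timedelta

def missed_write_window(outage_start, outage_hours, repair_hours):
    """Return (start, end) of the writes the node still lacks when it joins.

    Assumption (from the post, not verified against Cassandra): the node
    accepts no writes during repair, and repair only brings it up to date
    as of the moment repair began.
    """
    repair_start = outage_start + timedelta(hours=outage_hours)
    join_time = repair_start + timedelta(hours=repair_hours)
    return repair_start, join_time

# Hypothetical date; only the durations (5 h outage, 7 h repair) come from the post.
start, end = missed_write_window(datetime(2024, 1, 1, 8, 0), 5, 7)
missed_hours = (end - start).total_seconds() / 3600
print(f"Writes missed from {start:%H:%M} to {end:%H:%M} ({missed_hours:.0f} hours)")
```

Under these assumptions the node joins the ring with a gap as long as the repair itself, which is the concern raised above for reads at CL ONE.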
I can see join_ring=false being useful only when I am restoring from a snapshot
or bootstrapping and there are dropped mutations in my cluster which are not
fixed by hinted handoff.
For example: 3 nodes A, B, C working at read/write CL QUORUM. Hinted Handoff=3
hrs.
10 AM: Snapshot taken on all 3 nodes.
11 AM: Node B goes down for 4 hours.
3 PM: Node B comes up but data is not repaired. So, 1 hr of dropped mutations
(2-3 PM) not fixed via Hinted Handoff.
5 PM: Node A crashes.
6 PM: Node A restored from 10 AM Snapshot, Node A started with
join_ring=false, repaired and then joined the cluster.
In the above restore-from-snapshot example, updates from 2-3 PM were outside
the hinted handoff window of 3 hours. Thus, node B won't get those updates.
Node A's data for 2-3 PM is already lost. So the 2-3 PM updates are on only one
replica, i.e. node C, and the minimum consistency needed is QUORUM, so
join_ring=false would help. But this is a very specific use case.
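The snapshot example can be checked with a small sketch. It computes the part of node B's outage that falls outside the hint window and counts how many replicas hold writes from that gap. The replication factor of 3 is an assumption (one replica per node), and the helper name and calendar date are hypothetical; the times and the 3-hour window come from the example.

```python
from datetime import datetime, timedelta

def unhinted_gap(down_start, down_hours, hint_window_hours):
    """Return (start, end) of the outage not covered by hints, or None."""
    up_time = down_start + timedelta(hours=down_hours)
    hints_end = down_start + timedelta(hours=hint_window_hours)
    return (hints_end, up_time) if hints_end < up_time else None

day = datetime(2024, 1, 1)  # hypothetical date
# Node B goes down at 11 AM for 4 hours; hints are kept for 3 hours.
gap = unhinted_gap(day.replace(hour=11), down_hours=4, hint_window_hours=3)

# Which replicas hold the writes made during the gap (2-3 PM)?
holds_gap_writes = {
    "A": False,  # restored from the 10 AM snapshot, so 2-3 PM data is gone
    "B": False,  # was down, and hints for this hour were never stored
    "C": True,   # the only replica that received these writes
}
replication_factor = 3  # assumption: each of the 3 nodes is a replica
quorum = replication_factor // 2 + 1
replicas_with_data = sum(holds_gap_writes.values())

print(f"Unhinted gap: {gap[0]:%H:%M} to {gap[1]:%H:%M}")
print(f"Replicas holding gap writes: {replicas_with_data}, quorum: {quorum}")
```

With one replica holding the data and a quorum of two, a QUORUM read that lands on A and B would miss the 2-3 PM updates unless A is repaired before it serves reads.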
Thanks
Anuj