recovering from network partition

Thorsten von Eicken Mon, 30 Jan 2012 09:52:08 -0800

I'm trying to work through various failure modes to figure out the
proper operating procedure and proper client coding practices. I'm a
little unclear about what happens when a network partition gets
repaired. Take the following scenario:
 - cluster with 5 nodes: A through E; RF = 3; read_cf = 1; write_cf = 1
 - network partition divides A-C off from D-E
 - operation continues on both sides, obviously some data is unavailable
   from D-E
 - hinted handoffs accumulate
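
To make the scenario concrete, here is a minimal sketch of which ranges
each side can still serve. It assumes the five nodes sit on the ring in
the order A..E with one range each, and that replicas are simply the
owner plus the next RF-1 nodes clockwise. Those placement rules and the
function names are my assumptions for illustration, not something taken
from a real cluster.

```python
# Toy model of the scenario: 5 nodes, RF = 3, consistency level 1.
# Assumption: ring order is A..E and replicas are the range owner plus
# the next RF-1 nodes clockwise (simple placement, one token per node).
NODES = ["A", "B", "C", "D", "E"]
RF = 3


def replicas(primary_index, nodes=NODES, rf=RF):
    """Return the replica nodes for the range owned by nodes[primary_index]."""
    return [nodes[(primary_index + i) % len(nodes)] for i in range(rf)]


def reachable_ranges(side, consistency=1):
    """Map each range (named by its owner) to True if `side` holds
    at least `consistency` replicas of it."""
    result = {}
    for i, primary in enumerate(NODES):
        live = [n for n in replicas(i) if n in side]
        result[primary] = len(live) >= consistency
    return result


side_abc = {"A", "B", "C"}
side_de = {"D", "E"}
availability_abc = reachable_ranges(side_abc)
availability_de = reachable_ranges(side_de)

for i, primary in enumerate(NODES):
    print(primary, replicas(i), availability_abc[primary], availability_de[primary])
```

Under these assumptions the A-C side can serve every range, while the
D-E side cannot serve the range whose replicas are A, B and C.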


Now the network partition is repaired. The question I have is what is
the sequencing of events, in particular between processing HH and
forwarding read requests across the former partition. I'm hoping that
there is a time period to process HH *before* nodes forward requests.
E.g. it would be really good for A not to forward read requests to D
until D is done with HH processing. Otherwise, clients of A may see a
discontinuity: data that was available during the partition goes away
and then comes back.
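
The discontinuity I am worried about can be sketched with a toy model.
This is not how Cassandra is implemented; the Node class and the helper
functions are hypothetical, timestamps and read repair are ignored, and
whether the real system orders hint replay before read forwarding is
exactly what I am asking.

```python
# Toy model: a write at consistency level 1 during the partition, then
# a read forwarded to a replica that has not yet received its hint.
class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}   # key -> value stored on this replica
        self.hints = []  # (target node, key, value) held for unreachable replicas


def write_cl_one(coordinator, replica_nodes, reachable, key, value):
    """Write to every reachable replica; keep a hint for the others."""
    acks = 0
    for replica in replica_nodes:
        if replica.name in reachable:
            replica.data[key] = value
            acks += 1
        else:
            coordinator.hints.append((replica, key, value))
    if acks < 1:
        raise RuntimeError("no replica reachable, write fails at CL 1")
    return acks


def replay_hints(coordinator):
    """Deliver all stored hints to their target replicas."""
    for target, key, value in coordinator.hints:
        target.data[key] = value
    coordinator.hints.clear()


def read_cl_one(replica, key):
    """Read from a single replica, as a CL 1 read would."""
    return replica.data.get(key)


a, c, d, e = Node("A"), Node("C"), Node("D"), Node("E")

# During the partition: key "k" has replicas C, D, E; A coordinates.
write_cl_one(a, [c, d, e], reachable={"A", "B", "C"}, key="k", value="v1")
during_partition = read_cl_one(c, "k")   # served by C on the A-C side

# Partition heals; A forwards the read to D before hints are replayed.
before_replay = read_cl_one(d, "k")

replay_hints(a)
after_replay = read_cl_one(d, "k")

print(during_partition, before_replay, after_replay)
```

In this model the client of A sees the value, then nothing, then the
value again, which is the sequence I would like to avoid.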

Is there a manual or wiki section that discusses some of this and I just
missed it?
