Hi,
I've started to look through the Riak sources, and I've been wondering how
the system behaves in certain failure scenarios.
In particular, it seems to me that it's quite easy to get into a state where
the client thinks a PUT request failed, but the object was in fact written to
storage and will be replicated to n_val nodes eventually.
For example, assume I have a cluster with three nodes, A, B, C.
I use r=pr=w=pw=dw=quorum=2 and an n_val of 3.
All nodes have a copy of object O with key K and vector clock V.
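
For concreteness, here is roughly how I picture the setup and the
read-modify-write described below, run from a shell attached to one of the
nodes. This is only a sketch: the bucket name, the value, and the
options-list call forms are my assumptions from reading the source.

  %% n_val is a bucket property; the quorum values go per request:
  riak_core_bucket:set_bucket(<<"B">>, [{n_val, 3}]),
  {ok, C} = riak:local_client(),
  %% read O under the read quorum and modify it locally ...
  {ok, O} = C:get(<<"B">>, <<"K">>, [{r, 2}, {pr, 2}]),
  O1 = riak_object:update_value(O, <<"new value">>),
  %% ... then attempt to write the modified copy O':
  C:put(O1, [{w, 2}, {pw, 2}, {dw, 2}]).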

A client reads O, modifies it locally, then attempts to write its modified
copy O'. Let's say the client talks to node A. riak_client:put/2 will cause a
riak_kv_put_fsm to be started. In execute_local/1, we tell the local vnode to
coordinate this PUT, which will cause it to eventually call
riak_kv_vnode:prepare_put/2, which increments the object's vector clock to V',
and perform_put/3, which writes O' to disk with vector clock V'. The vnode
replies {dw, ...} to the FSM, which has been idling in waiting_local_node/2.
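
The key point is the descent relation between the two clocks. Sketched with
riak_core's vclock module (node_a stands in for whatever actor id the
coordinating vnode actually uses):

  %% V is the clock all three replicas hold for O; prepare_put/2 bumps it
  %% to V' on the coordinator.
  V  = vclock:increment(node_a, vclock:fresh()),
  V1 = vclock:increment(node_a, V),   %% this is V'
  true  = vclock:descends(V1, V),     %% O' dominates O ...
  false = vclock:descends(V, V1).     %% ... but not the other way around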
If A now crashes, the client will get an {error, timeout}. If A then recovers
and the client does a new GET for key K, it may see either O or O' (if the
vnode on A is one of the first two vnodes to return a response, its version of
the object will be used as the result, since V' descends from V).
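
From the client's side, the whole sequence would look roughly like this
(again a sketch; the timeout option and its value are my assumptions):

  %% the PUT times out because A crashed after its local write:
  {error, timeout} = C:put(O1, [{w, 2}, {pw, 2}, {dw, 2}, {timeout, 5000}]),
  %% ... A recovers; a fresh GET may return either version:
  {ok, Seen} = C:get(<<"B">>, <<"K">>, [{r, 2}, {pr, 2}]).
  %% Seen is O' if A's vnode answers among the first two replies, O otherwise.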

Is this actually possible? Also, how would read-repair work in this scenario?
If all three nodes are up, I guess Riak could detect the inconsistency, but if
e.g. C was down at the time, would O' be replicated to B and then later to C
as well?

cheers,
-jakob

