Hi, I've started to look through the Riak sources, and I've been wondering how the system behaves in certain failure scenarios. In particular, it seems to me that it's quite easy to get into a state where the client thinks a PUT request failed, but the object was in fact written to storage and will be replicated to n_val nodes eventually.
For example, assume I have a cluster with three nodes: A, B, and C. I use r=pr=w=pw=dw=2 (quorum) and an n_val of 3. All nodes have a copy of object O with key K and vector clock V. A client reads O, modifies it locally, then attempts to write its modified copy O'. Say the client talks to node A. riak_client:put/2 causes a riak_kv_put_fsm to be started. In execute_local/1, we tell the local vnode to coordinate the PUT, which eventually calls riak_kv_vnode:prepare_put/2, incrementing the object's vector clock to V', and perform_put/3, which writes O' to disk with vector clock V'. The vnode replies {dw, ...} to the FSM, which has been idling in waiting_local_vnode/2.

If A now crashes at this point, the client gets an {error, timeout}. If A then recovers and the client does a new GET for key K, it may see either O or O': if the vnode on A is one of the first two vnodes to return a response, its version of the object will be used as the result, since V' descends from V. Is this actually possible?

Also, how would read repair work in this scenario? If all three nodes are up, I guess Riak could detect the inconsistency, but if e.g. C was down at the time, would O' be replicated to B and then later to C as well?

cheers,
-jakob

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
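For what it's worth, the resolution step I'm asking about can be sketched with a toy model of vector clocks. This is not Riak's actual vclock code (that lives in riak_core's vclock.erl, in Erlang); it's just a minimal Python illustration, under the assumption that a clock is a map of actor -> counter and that "descends" means every counter in one clock is >= the corresponding counter in the other:

```python
# Illustrative sketch only (not Riak's implementation): vector clocks
# modeled as dicts of actor -> counter, with "descends" semantics used
# to pick a winner when replicas disagree on a GET.

def descends(a, b):
    """True if clock a descends from (dominates or equals) clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())

def resolve(obj_a, obj_b):
    """Return the winning (clock, value) pair, or both as siblings."""
    clock_a, _ = obj_a
    clock_b, _ = obj_b
    if descends(clock_a, clock_b):
        return [obj_a]
    if descends(clock_b, clock_a):
        return [obj_b]
    return [obj_a, obj_b]  # concurrent clocks: keep both as siblings

# The scenario from the mail: all replicas start with O at clock V;
# the coordinating vnode on A increments its own entry, producing V',
# and stores O' before A crashes.
V = {"A": 1}         # clock on B and C (and A, before the PUT)
V_prime = {"A": 2}   # clock after A's vnode increments it

O = (V, "O")
O_prime = (V_prime, "O'")

# If A's response is among the first r=2, the merge picks O', because
# V' descends from V:
print(resolve(O_prime, O))   # -> [({'A': 2}, "O'")]
# If only B and C answer first, both return O at V, and the client
# sees the old value O.
```

So (if my reading of the code is right) which version the client sees really does depend on whether A's vnode is among the first r responders, which is exactly the non-determinism I'd like confirmed.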