Hi list,

After some discussion internally we've agreed that setting PR=R, PW=W=DW and PR+PW > N is insufficient to guarantee reading your writes. With PR=quorum and PW=quorum (for N=3 that means PR=R=PW=W=DW=2), there is at least one case where you would not be *guaranteed* to read your write.
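For reference, the reason PR+PW > N looks sufficient is simple counting over the primaries: any PW primaries that accepted the write and any PR primaries that serve the read must share at least one node. A minimal Python sketch of that counting argument follows. The function name is made up for illustration and is not a Riak API, and the sketch deliberately ignores fallbacks and reply ordering, which is exactly what the counting argument misses.

```python
from itertools import combinations


def quorums_always_overlap(n, pw, pr):
    """True if every set of pw primaries shares a node with every set of pr primaries."""
    primaries = range(n)
    return all(
        set(w) & set(r)  # an empty intersection is falsy
        for w in combinations(primaries, pw)
        for r in combinations(primaries, pr)
    )


print(quorums_always_overlap(3, 2, 2))  # PR + PW = 4 > N = 3
print(quorums_always_overlap(3, 1, 2))  # PR + PW = 3, not > N
```

The first call prints True and the second prints False, so on paper PR=PW=2 with N=3 always gives the read a primary that saw the write.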
The failing case goes like this:

Write to a preference list containing primary1, primary2, primary3. The write succeeds on primary1 and primary2 and fails on primary3 (say it is out of disk space).
primary2 goes down.
Read from a preference list containing primary1, fallback2, primary3. fallback2 and primary3 answer first, fulfilling R and returning stale data.

The current best option I can think of is to set PR=R=PW=W=DW=N to ensure all primary replicas and only primary replicas are involved. Any failure will make those objects unavailable, but that may be better than nothing for some use cases.

I've added a task for us to investigate whether applying the PR/PW constraint on the replies received from the vnode is sufficient to make it more useful. However, with the way that handoff works there are likely to be windows of inconsistency with PR/PW set to anything other than N.

Riak is at its heart an eventually consistent database; however, investigating alternate approaches to stronger consistency is a very interesting research topic for us going into 2012.

Best regards, Jon

On Tue, Jan 10, 2012 at 12:15 PM, Andrew Thompson <and...@hijacked.us> wrote:

> Thomas,
>
> I just replicated your setup (at least for the PR gets) and you can
> indeed violate PR/PW when you pause a node on a VM. The reason this
> happens is that riak's check for PR/PW simply looks at the ring's
> preflist for a partition and checks that the required number of
> partitions for that preflist are marked as primaries.
>
> Now, when you pause a VM you interrupt any TCP connections that node has
> open, just like if you unplugged the network cable, but not like if the
> OS shut down or riak itself crashed. In those cases a FIN packet is sent
> so that the other erlang nodes notice that their persistent connections
> to that node have been reset; they will then reassign ownership of the
> partitions owned by that downed node and PR/PW will start to fail.
>
> However, since FIN packets are not generated when you pause the VM, it
> takes a few moments for the erlang network heartbeat stuff to notice
> that the node is down, so the preflists aren't recalculated. This is the
> window where you see the mysterious behaviour.
>
> Now, this is arguably a bug, although fixing it might be challenging.
> I've filed https://issues.basho.com/show_bug.cgi?id=1318 to track this.
>
> I don't have a workaround that I can think of offhand, unfortunately.
>
> Andrew
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>

-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmered...@basho.com
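
P.S. The stale-read case in this message may be easier to follow as a toy simulation. This is a rough Python sketch under stated assumptions, not how Riak implements anything: all names are hypothetical, "v1" stands for whatever older value primary3 still holds, and fallback2 is assumed to hold nothing for the key. It contrasts the up-front PR check on the preflist with a check on the replies actually used.

```python
# Toy model of the scenario; none of these names are Riak APIs.
N, PR, R = 3, 2, 2

# State after the write: primary1 and primary2 stored v2; the write failed
# on primary3, which still holds the older v1 (assumed for illustration).
stored = {"primary1": "v2", "primary2": "v2", "primary3": "v1", "fallback2": None}

# primary2 is down, so fallback2 takes its place in the read preflist.
read_preflist = ["primary1", "fallback2", "primary3"]


def is_primary(node):
    return node.startswith("primary")


def pr_met_by_preflist(preflist, pr):
    """Up-front check: enough primaries are present in the preflist."""
    return sum(is_primary(n) for n in preflist) >= pr


def pr_met_by_replies(replies, pr):
    """Alternative check: enough of the replies used came from primaries."""
    return sum(is_primary(n) for n in replies) >= pr


def read(reply_order, r):
    """Return the nodes and values of the first r replies."""
    first = reply_order[:r]
    return first, [stored[n] for n in first]


# fallback2 and primary3 answer before primary1.
replies, values = read(["fallback2", "primary3", "primary1"], R)

print(pr_met_by_preflist(read_preflist, PR))  # preflist holds two primaries
print(values)                                 # what the read actually saw
print(pr_met_by_replies(replies, PR))         # only one reply is from a primary
```

In this one scenario the preflist check passes while the two replies that satisfy R contain no copy of v2, and a check on the replies would have rejected the read. The sketch says nothing about handoff, so it does not show that a reply-based check is sufficient in general.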