Hi list,

After some internal discussion, we've agreed that setting PR=R, PW=W=DW
and PR+PW > N is insufficient to guarantee reading your writes.
With PR=quorum and PW=quorum (for N=3, that means PR=R=PW=W=DW=2),
there is at least one case where you are not *guaranteed* to read
your write.

It goes like this:
  1. Write to a preference list containing primary1, primary2, primary3.
     The write succeeds on primary1 and primary2 and fails on primary3
     (say, out of disk space).
  2. primary2 goes down.
  3. Read from a preference list containing primary1, fallback2, primary3.
     fallback2 and primary3 answer first, fulfilling R and returning stale
     data.
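
To make the sequence concrete, here is a rough Python model of it. This is
only a sketch, not Riak code: the value labels "v1"/"v2", the assumption that
fallback2 answers "not found" (None), and the helper names pr_check and read
are all made up for illustration.

```python
# Hypothetical model of the scenario above; none of this is Riak's API.
N = 3
R = PR = W = PW = DW = 2  # quorum settings for N=3

# Step 1: the write of "v2" lands on primary1 and primary2 but fails on
# primary3, which keeps the older value "v1" (assumed to exist already).
stored = {"primary1": "v2", "primary2": "v2", "primary3": "v1"}
write_acks = [node for node, value in stored.items() if value == "v2"]
write_ok = len(write_acks) >= W  # two primaries acked, so the write succeeds

# Step 2: primary2 goes down and fallback2 takes its place in the preflist.
# Assumption: the fallback has no copy yet and answers "not found" (None).
read_preflist = [
    ("primary1", True, "v2"),
    ("fallback2", False, None),
    ("primary3", True, "v1"),
]

def pr_check(preflist, pr):
    """Count the primaries present in the preflist, not those that reply."""
    return sum(1 for _, is_primary, _ in preflist if is_primary) >= pr

def read(preflist, r, reply_order):
    """Return the values carried by the first r replies, in arrival order."""
    by_name = {name: value for name, _, value in preflist}
    return [by_name[name] for name in reply_order[:r]]

# Step 3: PR passes (two primaries are in the preflist), but the first two
# replies come from fallback2 and primary3, so R is met without "v2".
pr_ok = pr_check(read_preflist, PR)
replies = read(read_preflist, R, ["fallback2", "primary3", "primary1"])
```

Here write_ok and pr_ok both hold, yet replies does not contain the value
that was just written.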

The current best option I can think of is to set PR=R=PW=W=DW=N to ensure
all primary replicas and only primary replicas are involved.  Any failure
then makes those objects unavailable, but that may be better than
nothing for some use cases.
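
As a rough illustration of that trade-off, the same scenario can be checked
against PR=PW=DW=N. The counts below are taken from the steps earlier in this
message; the variable names are made up for the sketch and are not Riak
settings or API calls.

```python
# Hypothetical check of the earlier scenario with everything set to N.
N = 3
PR = PW = DW = N

# primary1 and primary2 stored the write; primary3 failed (out of disk space).
durable_write_acks = 2
# The later read preflist holds primary1 and primary3; fallback2 is not a primary.
primaries_in_read_preflist = 2

# With DW=N the write is reported as failed, so the client knows to retry.
write_reported_ok = durable_write_acks >= DW
# With PR=N the read is refused instead of returning stale data.
read_allowed = primaries_in_read_preflist >= PR
```

Both checks come out false: the stale read is avoided, at the cost of the
object being unavailable while any primary is down.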

I've added a task for us to investigate whether applying the PR/PW
constraint to the replies received from the vnodes is sufficient to make it
more useful. However, with the way that handoff works, there are likely to be
windows of inconsistency with PR/PW set to anything other than N. Riak is
at its heart an eventually consistent database, but investigating
alternate approaches to stronger consistency is a very interesting research
topic for us going into 2012.

Best regards,
Jon


On Tue, Jan 10, 2012 at 12:15 PM, Andrew Thompson <and...@hijacked.us> wrote:

> Thomas,
>
> I just replicated your setup (at least for the PR gets) and you can
> indeed violate PR/PW when you pause a node on a VM. The reason this
> happens is that Riak's check for PR/PW simply looks at the ring's
> preflist for a partition and checks that the required number of
> partitions for that preflist are marked as primaries.
>
> Now, when you pause a VM you interrupt any TCP connections that node has
> open, just like if you unplugged the network cable, but not like if the
> OS shut down or Riak itself crashed. In those cases a FIN packet is sent
> so that the other Erlang nodes notice that their persistent connections
> to that node have been reset. They will then reassign ownership of the
> partitions owned by that downed node, and PR/PW will start to fail.
>
> However, since FIN packets are not generated when you pause the VM, it
> takes a few moments for the Erlang network heartbeat to notice
> that the node is down, so the preflists aren't recalculated right away.
> This is the window where you see the mysterious behaviour.
>
> Now, this is arguably a bug, although fixing it might be challenging.
> I've filed https://issues.basho.com/show_bug.cgi?id=1318 to track this.
>
> I don't have a workaround that I can think of offhand, unfortunately.
>
> Andrew
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>



-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmered...@basho.com
