Thanks Jon, and congratulations on the new release.

We will be available to fill in any missing info and iterate on this. In the 
meantime we will be progressing with the port to the new version, so that we 
get a more precise idea of the implications.

Cheers,

Carlos

-----
Carlos Baquero
HASLab / INESC TEC,
Universidade do Minho,
Portugal

c...@di.uminho.pt
http://gsd.di.uminho.pt/cbm




On 22/02/2012, at 03:36, Jon Meredith wrote:

> Thanks Carlos. We've been busy with the 1.1.0 release. I'll digest all you've 
> written over the next couple days and get back to you. 
> 
> Cheers, Jon
> 
> On Mon, Feb 20, 2012 at 9:02 AM, Carlos Baquero <c...@di.uminho.pt> wrote:
> 
> Hi,
> 
> My colleagues and I are particularly interested in causality management
> in optimistic replication settings. Since most other systems use simple
> Last Writer Wins approaches, Riak is a good testbed: it allows exposing
> concurrency to the clients, offering more options than a pure LWW interface.
> With Riak 1.0 there were several changes to the way that causality is
> tracked in the system. See http://wiki.basho.com/Vector-Clocks.html
> 
> The use of client ids in Riak's (pre 1.0) vector clocks was recognized as a
> scalability problem. However, it had the advantage that any node could
> coordinate a Put request. Now in Riak (post 1.0), there is at most one entry
> in vector clocks per server node: client ids are no longer used, and the
> server node id is used instead. One change this required is that
> requests must now be forwarded to a node hosting the replica.
> 
> (Here we use the term vector clock, but version vector would be more
> appropriate in this data dependency context. More info in
> http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/)
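To make the terminology concrete, a version vector simply maps replica ids to counters, with dominance and merge defined pointwise. A minimal illustrative sketch (our function names, not Riak's API):

```python
# Minimal version-vector sketch (illustrative; function names are ours,
# not Riak's). A version vector maps a replica id to an event counter.

def descends(a, b):
    """True if `a` has seen every event recorded in `b`."""
    return all(a.get(node, 0) >= n for node, n in b.items())

def concurrent(a, b):
    """Neither vector dominates the other: causally independent."""
    return not descends(a, b) and not descends(b, a)

def merge(a, b):
    """Pointwise maximum: the least vector dominating both inputs."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in a.keys() | b.keys()}

print(descends({"A": 5, "B": 2}, {"A": 4, "B": 2}))    # True: a later version
print(concurrent({"A": 4, "B": 2}, {"A": 3, "B": 3}))  # True: independent
print(merge({"A": 4, "B": 2}, {"A": 3, "B": 3}) == {"A": 4, "B": 3})  # True
```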
> 
> Riak (post 1.0) is using an ingenious solution to register updates while
> having (at most) only one entry per replica node (a server node holding
> a replica).
> 
> From what we could check in the new code, we can construct this example:
> Suppose that we are in node A (of a two-node system) and have a data item
> with value "Mon" and version (A:4,B:2). We can denote this single sibling as
> (A:4,B:2)"Mon".
> 
> If a client reads this context and value (in a Get) and writes some new
> value "Tue" to node A, the node can detect that it is evolving the current
> installed version (as it gets the current vector) and change the state to
> (again a single sibling) (A:5,B:2)"Tue", where the local counter was
> advanced by 1. Now, if another (concurrent) client has also read
> (A:4,B:2)"Mon" and then Puts "Sun", the node detects concurrency (the
> vector clock from the client is no longer the current one) and registers
> a second sibling, becoming (A:6,B:2)"Tue""Sun".
> 
> Now, if we are lucky, some client will eventually read this two-sibling
> state and write back, say, "Sat" (with no other concurrent client
> interleaving), and we get a single-sibling state (A:7,B:2)"Sat" again.
> 
> So far everything looks fine.
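The happy-path run above can be replayed with a small sketch of this single-vector scheme. This is our reconstruction of the behavior described, not Riak's actual code:

```python
# Sketch of the post-1.0 scheme as described above (our reconstruction,
# not Riak's implementation): one vector for the whole object; the
# coordinator bumps its own entry on every put and keeps the incoming
# value as an extra sibling whenever the client's context is stale.

def descends(a, b):
    """True if vector `a` has seen every event recorded in `b`."""
    return all(a.get(node, 0) >= n for node, n in b.items())

class Obj:
    def __init__(self, vector, siblings):
        self.vector = dict(vector)      # single vector for all siblings
        self.siblings = list(siblings)

def put(obj, node, context, value):
    """Coordinate a put at `node` with the client's read context."""
    current = dict(obj.vector)
    obj.vector[node] = obj.vector.get(node, 0) + 1
    if descends(context, current):
        obj.siblings = [value]          # client saw everything: replace
    else:
        obj.siblings.append(value)      # stale context: add a sibling

o = Obj({"A": 4, "B": 2}, ["Mon"])      # (A:4,B:2)"Mon"
put(o, "A", {"A": 4, "B": 2}, "Tue")    # -> (A:5,B:2)"Tue"
put(o, "A", {"A": 4, "B": 2}, "Sun")    # -> (A:6,B:2)"Tue""Sun"
put(o, "A", {"A": 6, "B": 2}, "Sat")    # -> (A:7,B:2)"Sat"
print(o.vector, o.siblings)             # {'A': 7, 'B': 2} ['Sat']
```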
> 
> However, other runs are problematic. Again we start with (A:4,B:2)"Mon" and
> evolve to (A:5,B:2)"Tue" as before. Suppose that a client X reads this
> state. Next, the system evolves as before and becomes (A:6,B:2)"Tue""Sun".
> Now if client X submits a Put with "Wed", Riak will see that (A:5,B:2) is
> not the current version (A:6,B:2) and thus registers a third sibling,
> evolving to (A:7,B:2)"Tue""Sun""Wed". However, in terms of causality, "Wed"
> is not created concurrently with "Tue". In fact "Wed" succeeds "Tue", and
> the state should not have the three siblings depicted, but only two: "Sun"
> and "Wed".
> 
> This effect can be described as the creation of false concurrency, and when
> frequent, it leads to an artificial growth in the state and more
> reconciliation work at the client level. It is the same undesired situation
> that can occur as a consequence of pruning.
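Replaying the problematic run with a small sketch of the single-vector scheme (our reconstruction of the described behavior, not Riak's code) shows the false sibling appearing:

```python
# The problematic run, replayed with a single-vector model (our
# reconstruction, not Riak's code): client X's "Wed" causally follows
# "Tue", yet it is stored as a third, falsely concurrent sibling.

def descends(a, b):
    return all(a.get(node, 0) >= n for node, n in b.items())

class Obj:
    def __init__(self, vector, siblings):
        self.vector = dict(vector)      # single vector for all siblings
        self.siblings = list(siblings)

def put(obj, node, context, value):
    current = dict(obj.vector)
    obj.vector[node] = obj.vector.get(node, 0) + 1
    if descends(context, current):
        obj.siblings = [value]          # context is current: replace
    else:
        obj.siblings.append(value)      # context is stale: add a sibling

o = Obj({"A": 4, "B": 2}, ["Mon"])
put(o, "A", {"A": 4, "B": 2}, "Tue")    # (A:5,B:2)"Tue"; client X reads this
put(o, "A", {"A": 4, "B": 2}, "Sun")    # (A:6,B:2)"Tue""Sun"
put(o, "A", {"A": 5, "B": 2}, "Wed")    # X's context is one put behind
print(o.siblings)   # ['Tue', 'Sun', 'Wed'] although "Wed" supersedes "Tue"
```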
> 
> One aspect that may be a bit scary is that in the current Riak the number of
> falsely concurrent siblings is not bounded. This is the case because a Put
> by client X only simplifies siblings if there was no other Put (for the same
> key, coordinated by the same server) since the time client X did the Get;
> otherwise the Put will add the value to the current set of siblings. This
> means that for highly contended keys, especially if the get-calculate-put
> cycle takes some time, the set of siblings may keep growing indefinitely
> (if clients keep looping on that get-calculate-put cycle). As the number of
> siblings returned keeps increasing, clients take more time to resolve
> conflicts and calculate the new value, increasing latency and reducing the
> chance that some client can perform a get-calculate-put with no
> interleaving from other clients. I.e., we have a positive feedback loop,
> with undesirable consequences.
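The feedback loop can be simulated with a single-vector sketch of the described behavior (our reconstruction, not Riak's code): if every client's Get is followed by another client's Put before its own Put lands, each put after the first carries a stale context and adds one more sibling.

```python
# Rough simulation of the feedback loop (our sketch, not Riak's code).
# Clients pipeline get-calculate-put: each client Gets before the
# previous client's Put lands, so every context (after the first) is
# one put behind and every put adds a sibling.

def descends(a, b):
    return all(a.get(node, 0) >= n for node, n in b.items())

class Obj:
    def __init__(self, vector, siblings):
        self.vector = dict(vector)
        self.siblings = list(siblings)

def put(obj, node, context, value):
    current = dict(obj.vector)
    obj.vector[node] = obj.vector.get(node, 0) + 1
    if descends(context, current):
        obj.siblings = [value]
    else:
        obj.siblings.append(value)

o = Obj({"A": 0}, ["v0"])
ctx = dict(o.vector)              # the first client Gets
for i in range(1, 8):
    next_ctx = dict(o.vector)     # the next client Gets before this Put
    put(o, "A", ctx, f"v{i}")     # this context is now one put behind
    ctx = next_ctx
print(len(o.siblings))            # 7: siblings grow by one per put
```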
> 
> In the past year, my colleagues and I have been working on a mechanism
> that overcomes this. We had some talks with the Riak team, and at the time
> one of the drawbacks was the need to redirect all puts to a node that had
> the replica. Now, Riak is much closer to the required architecture, so we
> would like to share more broadly our proposal and see if it can be useful to
> the community.  One of the changes needed is to tag each sibling with an
> appropriate vector (a "special" version vector with one additional entry)
> and not use a single vector for the whole set of siblings. Each sibling
> can already store its own metadata, so the needed support is possibly
> already there.
> 
> We named the causality tracking mechanism Dotted Version Vectors, and its
> most current description
> (http://gsd.di.uminho.pt/members/vff/dotted-version-vectors-2012.pdf) also
> includes some performance evaluation in Riak 0.13, which this community
> might find interesting to check. We are currently modifying the Riak
> implementation to use it, now a much simpler task.
> 
> The mechanism code is available in
> https://github.com/ricardobcl/Dotted-Version-Vectors
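For readers who want the gist before diving into the paper, here is a minimal Python sketch of per-sibling tagging and the sync rule, simplified for exposition (the repository above has the real implementation): each sibling carries a "dot" (the single event that wrote it) plus the context vector it was written against, and a put discards exactly the siblings whose dot the writing client had already seen.

```python
# Minimal sketch of dotted-version-vector style per-sibling tagging
# (a simplification for exposition; see the paper and repo for the real
# thing). Each sibling is (dot, vector, value): `dot` is the unique
# event that wrote the value, `vector` the context it was written against.

def get_context(siblings):
    """Causal context returned on a Get: join of all vectors and dots."""
    ctx = {}
    for (node, n), vec, _ in siblings:
        for k, v in vec.items():
            ctx[k] = max(ctx.get(k, 0), v)
        ctx[node] = max(ctx.get(node, 0), n)
    return ctx

def put(siblings, node, context, value):
    """Discard exactly the siblings whose dot the client has seen."""
    survivors = [s for s in siblings if context.get(s[0][0], 0) < s[0][1]]
    counter = 1 + max([context.get(node, 0)] +
                      [d[1] for d, _, _ in siblings if d[0] == node])
    return survivors + [((node, counter), dict(context), value)]

# Replay the problematic run: start at (A:4,B:2)"Mon".
obj = [(("A", 4), {"A": 3, "B": 2}, "Mon")]
obj = put(obj, "A", {"A": 4, "B": 2}, "Tue")   # client 1 read "Mon"
x_ctx = get_context(obj)                       # client X reads "Tue" here
obj = put(obj, "A", {"A": 4, "B": 2}, "Sun")   # concurrent put of "Sun"
obj = put(obj, "A", x_ctx, "Wed")              # X's put supersedes "Tue"
print([v for _, _, v in obj])                  # ['Sun', 'Wed'], no false sibling
```

Note how the last put correctly removes "Tue" (its dot (A,5) is covered by client X's context) while keeping the truly concurrent "Sun".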
> 
> Although these "advanced" techniques add some complexity, we hope that
> overcoming false concurrency will be a worthwhile tradeoff.
> 
> Regards,
> 
> The DVV team,
> 
> Carlos Baquero
> Nuno Preguiça
> Paulo Almeida
> Ricardo Gonçalves
> Victor Fonte
> 
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> 
> -- 
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmered...@basho.com
> 

