Thanks Carlos. We've been busy with the 1.1.0 release. I'll digest all you've written over the next couple days and get back to you.
Cheers, Jon On Mon, Feb 20, 2012 at 9:02 AM, Carlos Baquero <c...@di.uminho.pt> wrote: Hi, Me and my colleagues are particularly interested on causality management in optimistic replications settings, and since most other systems use simple Last Writer Wins approaches, Riak is a good testbed as it allows exposing concurrency to the clients, offering more options than a pure LWW interface. With Riak 1.0 there were several changes to the way that causality is tracked in the system. See http://wiki.basho.com/Vector-Clocks.html The use of client ids in Riak's (pre 1.0) vector clocks, was recognized as a scalability problem. However, it had the advantage that any node could coordinate a Put request. Now in Riak (post 1.0), there is at most one entry in vector clocks per server node, as client ids are no longer used, but instead the server node id is used. One change that this required is that now requests must be forwarded to a node hosting the replica. (Here we use the term vector clock, but version vector would be more appropriate in this data dependency context. More info in http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/) Riak (post 1.0) is using an ingenious solution to register updates while having (at most) only one entry per replica node (a server node holding a replica). From what we could check from the new code, we can construct this example: Suppose that we are in node A (of a two node system) and have a data item with value "Mon" and version (A:4,B:2). We can see this one sibling as (A:4,B:2)"Mon". If a client reads this context and value (in a Get) and writes some new value "Tue" to node A, the node can detect that it is evolving the current installed version (as it gets the current vector) and change the state to (again a single sibling) (A:5,B:2)"Tue", where the local counter was advanced by 1. Now, if another (concurrent) client has also read (A:4,B:2)"Mon" and is now Putting "Sun", the node detects concurrency (the vector clock from the client is no longer the current one) and now registers a second sibling, becoming (A:6,B:2)"Tue""Sun". Now if we are lucky, some client will eventually read this two-sibling state and write back, say "Sat", (with no other concurrent client interleaving) and one gets again a single sibling state (A:7,B:2)"Sat". Till now everything looks fine. However other runs are problematic. Again we start with (A:4,B:2)"Mon" and evolve to (A:5,B:2)"Tue" as before. Suppose that a client X reads this state. Next, the system evolves as before and becomes (A:6,B:2)"Tue""Sun". Now if client X submits a put with "Wed", Riak will see that (A:5,B:2) is not the current version (A:6,B:2) and thus registers a third sibling and evolves to (A:7,B:2)"Tue""Sun""Wed". However in terms of causality "Wed" is not created concurrently with "Tue". In fact "Wed" succeeds "Tue" and the state should not have the three siblings depicted, but only two: "Sun" and "Wed". This effect can be described as the creation of false concurrency, and when frequent, it leads to an artificial growth in the state and more reconciliation work at the client level. It is the same undesired situation that can occur as a consequence of pruning. One aspect that may be a bit scary is that in the current Riak the number of falsely concurrent siblings is not bounded. This is the case because a Put by client X only simplifies siblings if there was no other Put (for the same key, coordinated by the same server) since the time client X did the Get; otherwise the Put will add the value to the current set of siblings. This means that for highly contended keys, specially if the get-calculate-put cycle takes some time, it may happen that the set of siblings keeps growing indefinitely (if clients keep looping on that get-calculate-put cycle). As in this case the number of siblings returned is increasing, it will make clients take more time to resolve conflicts and calculate the new value, increasing latency and reducing the chances for an opportunity that some client can perform a get-calculate-put with no interleaving from other client. I.e., we have a positive feedback loop, with undesirable consequences. In the past year, myself and colleagues have been working in a mechanism that overcomes this. We had some talks with the Riak team and at the time one of the drawbacks was the need to re-direct all puts to a node that had the replica. Now, Riak is much closer to the required architecture, so we would like to share more broadly our proposal and see if it can be useful to the community. One of the changes needed is to tag each sibling with an appropriate vector (a "special" version vector with one additional entry) and not use a single vector for the whole set of siblings. Each sibling already has support for storing its own metadata, so the support is possibly already there. We named the causality tracking mechanism as Dotted Version Vectors, and its most current description (http://gsd.di.uminho.pt/members/vff/dotted-version-vectors-2012.pdf) also includes some performance evaluation in Riak 0.13, that this community might find interesting to check. Currently we are in the process of modifying the current Riak implementation to use it, now a much simpler task. The mechanism code is available in https://github.com/ricardobcl/Dotted-Version-Vectors Although these "advanced" techniques add some additional complexity, we hope that overcoming false concurrency will be a good tradeoff. Regards, The DVV team, Carlos Baquero Nuno Preguiça Paulo Almeida Ricardo Gonçalves Victor Fonte ----- Carlos Baquero HASLab / INESC TEC, Universidade do Minho, Portugal c...@di.uminho.pt http://gsd.di.uminho.pt/cbm _______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com -- Jon Meredith Platform Engineering Manager Basho Technologies, Inc. jmered...@basho.com
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com