Thanks Jon, and congratulations on the new release. We will be available to fill in any missing info and iterate on this. In the meantime we will be progressing on the port to the new version, so that we get a more precise idea of the implications.
Cheers,
Carlos

-----
Carlos Baquero
HASLab / INESC TEC,
Universidade do Minho,
Portugal

c...@di.uminho.pt
http://gsd.di.uminho.pt/cbm

On 22/02/2012, at 03:36, Jon Meredith wrote:

> Thanks Carlos. We've been busy with the 1.1.0 release. I'll digest all you've
> written over the next couple of days and get back to you.
>
> Cheers, Jon
>
> On Mon, Feb 20, 2012 at 9:02 AM, Carlos Baquero <c...@di.uminho.pt> wrote:
>
> Hi,
>
> My colleagues and I are particularly interested in causality management
> in optimistic replication settings, and since most other systems use simple
> Last Writer Wins approaches, Riak is a good testbed, as it allows exposing
> concurrency to the clients, offering more options than a pure LWW interface.
> With Riak 1.0 there were several changes to the way that causality is
> tracked in the system. See http://wiki.basho.com/Vector-Clocks.html
>
> The use of client ids in Riak's (pre 1.0) vector clocks was recognized as a
> scalability problem. However, it had the advantage that any node could
> coordinate a Put request. Now in Riak (post 1.0), there is at most one entry
> in a vector clock per server node, as client ids are no longer used and
> the server node id is used instead. One change this required is that
> requests must now be forwarded to a node hosting the replica.
>
> (Here we use the term vector clock, but version vector would be more
> appropriate in this data dependency context. More info at
> http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/)
>
> Riak (post 1.0) uses an ingenious solution to register updates while
> having (at most) one entry per replica node (a server node holding
> a replica).
>
> From what we could check in the new code, we can construct this example:
> suppose that we are at node A (of a two-node system) and have a data item
> with value "Mon" and version (A:4,B:2). We can write this single sibling as
> (A:4,B:2)"Mon".
>
> If a client reads this context and value (in a Get) and writes some new
> value "Tue" to node A, the node can detect that it is evolving the currently
> installed version (as it receives the current vector) and change the state to
> (again a single sibling) (A:5,B:2)"Tue", where the local counter was
> advanced by 1. Now, if another (concurrent) client has also read
> (A:4,B:2)"Mon" and is now Putting "Sun", the node detects concurrency (the
> vector clock from the client is no longer the current one) and registers
> a second sibling, becoming (A:6,B:2)"Tue""Sun".
>
> Now, if we are lucky, some client will eventually read this two-sibling state
> and write back, say, "Sat" (with no other concurrent client interleaving),
> and we again get a single-sibling state (A:7,B:2)"Sat".
>
> So far everything looks fine.
>
> However, other runs are problematic. Again we start with (A:4,B:2)"Mon" and
> evolve to (A:5,B:2)"Tue" as before. Suppose that a client X reads this
> state. Next, the system evolves as before and becomes (A:6,B:2)"Tue""Sun".
> Now, if client X submits a Put with "Wed", Riak will see that (A:5,B:2) is
> not the current version (A:6,B:2) and thus registers a third sibling,
> evolving to (A:7,B:2)"Tue""Sun""Wed". However, in terms of causality "Wed" is
> not created concurrently with "Tue". In fact "Wed" succeeds "Tue", and the
> state should not have the three siblings depicted, but only two: "Sun" and
> "Wed".
>
> This effect can be described as the creation of false concurrency, and when
> frequent, it leads to artificial growth in the state and more
> reconciliation work at the client level. It is the same undesired situation
> that can occur as a consequence of pruning.
>
> One aspect that may be a bit scary is that in current Riak the number of
> falsely concurrent siblings is not bounded.
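The run above can be replayed with a small toy model. This is not Riak's actual Erlang code, just a hedged sketch of the behaviour described: one version vector per object, and a Put either replaces all siblings (if the client's context dominates the object's clock) or adds a new one.

```python
# Toy model of Riak's (post 1.0) single-version-vector-per-object Put logic.
# NOT Riak's implementation; a simplified sketch of the behaviour in the text.

def descends(a, b):
    """True if vector clock `a` dominates or equals `b`."""
    return all(a.get(node, 0) >= n for node, n in b.items())

class Obj:
    def __init__(self, clock, siblings):
        self.clock = dict(clock)        # a single VV for the whole object
        self.siblings = list(siblings)

def put(obj, node, context, value):
    """Coordinate a Put at `node` with the context the client read."""
    if descends(context, obj.clock):
        obj.siblings = [value]          # client saw the latest: replace all
    else:
        obj.siblings.append(value)      # stale context: add a sibling
    obj.clock[node] = obj.clock.get(node, 0) + 1  # advance local counter

# Start at (A:4,B:2)"Mon" and replay the problematic run:
obj = Obj({'A': 4, 'B': 2}, ['Mon'])
put(obj, 'A', {'A': 4, 'B': 2}, 'Tue')  # -> (A:5,B:2)"Tue"
ctx_x = dict(obj.clock)                 # client X reads (A:5,B:2)
put(obj, 'A', {'A': 4, 'B': 2}, 'Sun')  # -> (A:6,B:2)"Tue""Sun"
put(obj, 'A', ctx_x, 'Wed')             # -> (A:7,B:2)"Tue""Sun""Wed"
print(obj.clock, obj.siblings)          # "Wed" succeeds "Tue", yet both survive
```

Running this yields three siblings, even though "Wed" causally succeeds "Tue": exactly the false concurrency described above.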
> This is the case because a Put
> by client X only simplifies siblings if there was no other Put (for the same
> key, coordinated by the same server) since the time client X did the Get;
> otherwise the Put will add the value to the current set of siblings. This
> means that for highly contended keys, especially if the get-calculate-put
> cycle takes some time, the set of siblings may keep growing
> indefinitely (if clients keep looping on that get-calculate-put cycle). As
> the number of siblings returned keeps increasing, clients will take more
> time to resolve conflicts and calculate the new value,
> increasing latency and reducing the chances that some
> client can perform a get-calculate-put with no interleaving from another
> client. I.e., we have a positive feedback loop, with undesirable consequences.
>
> In the past year, my colleagues and I have been working on a mechanism
> that overcomes this. We had some talks with the Riak team, and at the time
> one of the drawbacks was the need to redirect all Puts to a node that held
> the replica. Now Riak is much closer to the required architecture, so we
> would like to share our proposal more broadly and see if it can be useful to
> the community. One of the changes needed is to tag each sibling with an
> appropriate vector (a "special" version vector with one additional entry),
> instead of using a single vector for the whole set of siblings. Each sibling
> already has support for storing its own metadata, so the support is possibly
> already there.
>
> We named the causality tracking mechanism Dotted Version Vectors, and its
> most current description
> (http://gsd.di.uminho.pt/members/vff/dotted-version-vectors-2012.pdf) also
> includes some performance evaluation in Riak 0.13, which this community might
> find interesting to check. Currently we are in the process of modifying the
> current Riak implementation to use it, now a much simpler task.
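To see why per-sibling tagging helps, here is a deliberately simplified sketch of the dotted-version-vectors idea: each sibling carries the "dot" (node, counter) of the event that created it, so a Put can discard exactly the siblings its context has already seen. This is a hypothetical reduction for illustration, not the authors' implementation (the paper and repository linked below have the real mechanism, which also handles replica-to-replica synchronization).

```python
# Minimal sketch of the per-sibling "dot" idea behind Dotted Version Vectors.
# Hypothetical simplification for illustration; see the paper for the real thing.

def seen(dot, context):
    """The sibling created by event `dot` is obsolete once the client's
    read context (a version vector) covers that event."""
    node, counter = dot
    return context.get(node, 0) >= counter

def put(siblings, node, context, value):
    """siblings: list of (dot, value); context: the VV the client read."""
    # Next event counter at this node: one past the highest counter seen here.
    counter = 1 + max([d[1] for d, _ in siblings if d[0] == node]
                      + [context.get(node, 0)])
    # Keep only siblings whose creating event the context has NOT seen.
    kept = [(d, v) for d, v in siblings if not seen(d, context)]
    return kept + [((node, counter), value)]

# Replay the problematic run; "Mon" was created by event A:4.
siblings = [(('A', 4), 'Mon')]
siblings = put(siblings, 'A', {'A': 4, 'B': 2}, 'Tue')  # replaces "Mon"
ctx_x = {'A': 5, 'B': 2}                                # client X reads here
siblings = put(siblings, 'A', {'A': 4, 'B': 2}, 'Sun')  # concurrent: "Tue","Sun"
siblings = put(siblings, 'A', ctx_x, 'Wed')             # "Wed" replaces "Tue"
print([v for _, v in siblings])                         # ['Sun', 'Wed']
```

With the dots, client X's Put of "Wed" discards "Tue" (whose dot A:5 its context covers) and keeps only the truly concurrent "Sun", so the state ends with two siblings instead of three.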
>
> The mechanism code is available at
> https://github.com/ricardobcl/Dotted-Version-Vectors
>
> Although these "advanced" techniques add some additional complexity, we hope
> that overcoming false concurrency will be a good tradeoff.
>
> Regards,
>
> The DVV team,
>
> Carlos Baquero
> Nuno Preguiça
> Paulo Almeida
> Ricardo Gonçalves
> Victor Fonte
>
> -----
> Carlos Baquero
> HASLab / INESC TEC,
> Universidade do Minho,
> Portugal
>
> c...@di.uminho.pt
> http://gsd.di.uminho.pt/cbm
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
> --
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmered...@basho.com