Thanks Carlos. We've been busy with the 1.1.0 release. I'll digest all you've 
written over the next couple days and get back to you. 

Cheers, Jon

On Mon, Feb 20, 2012 at 9:02 AM, Carlos Baquero <c...@di.uminho.pt> wrote:

Hi,

Me and my colleagues are particularly interested on causality management
in optimistic replications settings, and since most other systems use simple
Last Writer Wins approaches, Riak is a good testbed as it allows exposing
concurrency to the clients, offering more options than a pure LWW interface.
With Riak 1.0 there were several changes to the way that causality is
tracked in the system. See http://wiki.basho.com/Vector-Clocks.html

The use of client ids in Riak's (pre 1.0) vector clocks, was recognized as a
scalability problem. However, it had the advantage that any node could
coordinate a Put request. Now in Riak (post 1.0), there is at most one entry
in vector clocks per server node, as client ids are no longer used, but
instead the server node id is used. One change that this required is that
now requests must be forwarded to a node hosting the replica.

(Here we use the term vector clock, but version vector would be more
appropriate in this data dependency context. More info in
http://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/)

Riak (post 1.0) is using an ingenious solution to register updates while
having (at most) only one entry per replica node (a server node holding
a replica).

From what we could check from the new code, we can construct this example:
Suppose that we are in node A (of a two node system) and have a data item
with value "Mon" and version (A:4,B:2). We can see this one sibling as
(A:4,B:2)"Mon".

If a client reads this context and value (in a Get) and writes some new
value "Tue" to node A, the node can detect that it is evolving the current
installed version (as it gets the current vector) and change the state to
(again a single sibling) (A:5,B:2)"Tue", where the local counter was
advanced by 1. Now, if another (concurrent) client has also read
(A:4,B:2)"Mon" and is now Putting "Sun", the node detects concurrency (the
vector clock from the client is no longer the current one) and now registers
a second sibling, becoming (A:6,B:2)"Tue""Sun".

Now if we are lucky, some client will eventually read this two-sibling state
and write back, say "Sat", (with no other concurrent client interleaving)
and one gets again a single sibling state (A:7,B:2)"Sat".

Till now everything looks fine.

However other runs are problematic. Again we start with (A:4,B:2)"Mon" and
evolve to (A:5,B:2)"Tue" as before. Suppose that a client X reads this
state. Next, the system evolves as before and becomes (A:6,B:2)"Tue""Sun".
Now if client X submits a put with "Wed", Riak will see that (A:5,B:2) is
not the current version (A:6,B:2) and thus registers a third sibling and
evolves to (A:7,B:2)"Tue""Sun""Wed".  However in terms of causality "Wed" is
not created concurrently with "Tue". In fact "Wed" succeeds "Tue" and the
state should not have the three siblings depicted, but only two: "Sun" and
"Wed".

This effect can be described as the creation of false concurrency, and when
frequent, it leads to an artificial growth in the state and more
reconciliation work at the client level. It is the same undesired situation
that can occur as a consequence of pruning.

One aspect that may be a bit scary is that in the current Riak the number of
falsely concurrent siblings is not bounded. This is the case because a Put
by client X only simplifies siblings if there was no other Put (for the same
key, coordinated by the same server) since the time client X did the Get;
otherwise the Put will add the value to the current set of siblings.  This
means that for highly contended keys, specially if the get-calculate-put
cycle takes some time, it may happen that the set of siblings keeps growing
indefinitely (if clients keep looping on that get-calculate-put cycle). As
in this case the number of siblings returned is increasing, it will make
clients take more time to resolve conflicts and calculate the new value,
increasing latency and reducing the chances for an opportunity that some
client can perform a get-calculate-put with no interleaving from other
client. I.e., we have a positive feedback loop, with undesirable consequences.

In the past year, myself and colleagues have been working in a mechanism
that overcomes this. We had some talks with the Riak team and at the time
one of the drawbacks was the need to re-direct all puts to a node that had
the replica.  Now, Riak is much closer to the required architecture, so we
would like to share more broadly our proposal and see if it can be useful to
the community.  One of the changes needed is to tag each sibling with an
appropriate vector (a "special" version vector with one additional entry)
and not use a single vector for the whole set of siblings. Each sibling
already has support for storing its own metadata, so the support is possibly
already there.

We named the causality tracking mechanism as Dotted Version Vectors, and its
most current description
(http://gsd.di.uminho.pt/members/vff/dotted-version-vectors-2012.pdf) also
includes some performance evaluation in Riak 0.13, that this community might
find interesting to check. Currently we are in the process of modifying the 
current
Riak implementation to use it, now a much simpler task.

The mechanism code is available in
https://github.com/ricardobcl/Dotted-Version-Vectors

Although these "advanced" techniques add some additional complexity, we hope 
that
overcoming false concurrency will be a good tradeoff.

Regards,

The DVV team,

Carlos Baquero
Nuno Preguiça
Paulo Almeida
Ricardo Gonçalves
Victor Fonte


-----
Carlos Baquero
HASLab / INESC TEC,
Universidade do Minho,
Portugal

c...@di.uminho.pt
http://gsd.di.uminho.pt/cbm





_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



-- 
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.
jmered...@basho.com

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Reply via email to