The "not found" issue is a different one, but related. The issue there is that when a node joins the ring, the ring state is immediately changed. However, it takes time to handoff partitions to new owners. During that time, if a request comes in for data which has > r of its replicas on partitions which changed ownership, the new owners will reply "not found" if they don't have the data yet.
It's possible that *all* the partitions in a preflist changed ownership, especially in the re-striping circumstances I described, so no r value can help you there. And actually, even if only 2 of the 3 (assuming n_val=3) partitions in the preflist moved, a read with r=1 *still* won't work, because of an optimization called "basic quorum": if a majority of the replicas come back "not found", the coordinator replies "not found" instead of waiting to see whether the remaining nodes respond with something.

Furthermore, handoff can take a long time to finish (or even start), because vnodes wait for a period of inactivity before handing off, and there is also a limit on how many handoffs can run at once. You can tune those with the handoff_concurrency and vnode_inactivity_timeout parameters in the riak_core section of app.config.
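For example, something like this (just the riak_core section of app.config; the values here are illustrative rather than recommendations, and vnode_inactivity_timeout is in milliseconds):

{riak_core, [
    %% How many partitions a node will hand off at the same time.
    {handoff_concurrency, 4},

    %% How long (ms) a vnode must be idle before it starts handoff.
    {vnode_inactivity_timeout, 60000}

    %% ... the rest of your riak_core settings ...
]}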
I believe the next release will have an option for turning off basic quorum, as well as options that will let your client tell the difference between a real "not found" and one where r simply wasn't satisfied. And a further-out release will have a proper fix for this whole issue, probably involving forwarding requests to the old owners in cases where handoff hasn't finished.

-Greg

On Thursday, May 26, 2011 at 7:21 AM, Ben Tilly wrote:

Performance is fine. However, requests get a "not found" response for an extended period of time. See http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/thread.html#4078 for previous discussion of what sounds like the same issue.

>
> On Thu, May 26, 2011 at 6:57 AM, Jonathan Langevin <jlange...@loomlearning.com> wrote:
> > That sounds quite disconcerting. What happens to the performance of the cluster when this occurs?
> >
> > Jonathan Langevin
> > Systems Administrator
> > Loom Inc.
> > Wilmington, NC: (910) 241-0433 - jlange...@loomlearning.com - www.loomlearning.com - Skype: intel352
> >
> > On Thu, May 26, 2011 at 1:54 AM, Greg Nelson <gro...@dropcam.com> wrote:
> > > I've been doing some digging through the details of how a node joins a cluster. When you hear that Riak uses consistent hashing, you'd expect it to distribute keys to nodes by hashing keys onto the ring AND hashing nodes onto the ring. Keys belong to the closest node on the ring, in the clockwise direction. Add a node and it hashes onto the ring and takes over some keys. Ordinarily the node would hash onto the ring in several places, to achieve better spread. Some data (roughly 1 / #nodes) moves to the new node from each of the other nodes, and everything else stays the same.
> > >
> > > In what Amazon describes as operationally simpler (strategy 3 in the Dynamo paper), the ring is instead divided into equally sized partitions. Nodes are hashed onto the ring, and preflists are calculated by walking clockwise from a partition, skipping partitions on already-visited nodes. Riak does something similar: it divides the ring into equally sized partitions, and then nodes "randomly" claim partitions. However, the skipping bit isn't part of Riak's preflist calculation. Instead, nodes claim partitions in such a way as to be spaced out by target_n_val, to obviate the need for skipping.
> > >
> > > Now, getting back to what happens when a node joins. The new node calculates a new ring state that maintains the target_n_val invariant, while also trying to keep an even spread of partitions per node. The algorithm (default_choose_claim) is heuristic and greedy in nature, and recursively transfers partitions to the new node until optimal spread is achieved, maintaining target_n_val along the way. But if -- during one of those recursive calls -- it can't meet the target_n_val, it will throw up its hands and completely redo the whole ring (by calling claim_rebalance_n), striping the partitions across the nodes in round-robin fashion. When that happens, most of the data needs to be handed off between nodes.
> > >
> > > This happens a lot, with many ring sizes. With ring_creation_size=128 (i.e., 128 partitions), it will happen when adding node 9 (87.5% of data moves), node 12 (82%), node 15 (80%), and node 19 (94%). It happens with all ring sizes >= 128 (256, 512, 1024, ...). It appears that any ring_creation_size (64 by default) is safe for growing to 8 nodes or so, but if you want to go beyond that... A ring size of >= 128 with more than 8 nodes doesn't seem all that unusual, so surely someone has hit this before? I've filed a bug report here: https://issues.basho.com/show_bug.cgi?id=1111
> > >
> > > Anyway, this feels like a bit of a departure from consistent hashing. In fact, couldn't it be replaced by normal hashing plus a lookup table mapping intervals of the hash space to nodes? And isn't that simply sharding?
> > >
> > > At any rate, I believe the claim algorithm can be improved to avoid those "throw up hands and stripe everything" scenarios. In fact, here is such an implementation: https://github.com/basho/riak_core/pull/55. It is still heuristic and greedy, but it seems to do a better job of avoiding a re-stripe. Test results are attached in a zip on the bug linked above. I'd love to get the riak_core gurus at Basho to look at this and help validate it. It probably could use some cleaning up, but I want to make sure there aren't other invariants or considerations I'm leaving out -- besides maintaining target_n_val, keeping optimal partition spread, and minimizing handoff between ring states.
> > >
> > > -Greg
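P.S. For anyone who wants to play with the difference, here is a toy version of the "classic" consistent hashing described in my original message above: hash both keys and (virtual) node points onto the same ring and give each key to the first node clockwise. The module name, hash function, and point count are made up for illustration; this is nothing like the riak_core implementation.

%% chash_toy.erl -- illustrative only; not riak_core's chash module.
-module(chash_toy).
-export([owner/2]).

%% Each node is hashed onto the ring at several points ("virtual nodes")
%% to get a better spread of keys across nodes.
ring(Nodes) ->
    lists:sort([{erlang:phash2({Node, I}), Node}
                || Node <- Nodes, I <- lists:seq(1, 64)]).

%% A key belongs to the first node point clockwise from the key's hash.
owner(Key, Nodes) ->
    Ring = ring(Nodes),
    H = erlang:phash2(Key),
    case [N || {P, N} <- Ring, P >= H] of
        [Owner | _] -> Owner;
        []          -> element(2, hd(Ring))   %% wrap around the ring
    end.

%% e.g. chash_toy:owner(<<"some/key">>, [node_a, node_b, node_c]).

Adding a fourth node to that list only moves the keys whose first clockwise point now belongs to the new node -- roughly a quarter of them -- which is exactly the property the re-striping behaviour loses.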
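And a quick back-of-the-envelope check of the 87.5% figure quoted above, assuming the 8-node ring really was striped round-robin (partition I owned by node I rem 8) and that claim_rebalance_n re-stripes the 9-node ring as I rem 9 (run in the Erlang shell):

Moved = fun(Partitions, OldNodes, NewNodes) ->
            length([I || I <- lists:seq(0, Partitions - 1),
                         I rem OldNodes =/= I rem NewNodes])
        end.
Moved(128, 8, 9).
%% => 112, i.e. 112/128 = 87.5% of partitions change owner

Compare that to the roughly 1/9 (about 11%) you'd expect classic consistent hashing to move when the ninth node joins.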
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com