Performance is fine. However, requests get a "not found" response for an
extended period of time. See
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/thread.html#4078
for previous discussion of what sounds like the same issue.
On Thu, May 26, 2011 at 6:57 AM, Jonathan Langevin
<jlange...@loomlearning.com> wrote:
> That sounds quite disconcerting. What happens to the performance of
> the cluster when this occurs?
>
> On Thu, May 26, 2011 at 1:54 AM, Greg Nelson <gro...@dropcam.com> wrote:
>> I've been doing some digging through the details of how a node joins
>> a cluster. When you hear that Riak uses consistent hashing, you'd
>> expect it to distribute keys to nodes by hashing keys onto the ring
>> AND hashing nodes onto the ring. Keys belong to the closest node on
>> the ring, in the clockwise direction. Add a node, and it hashes onto
>> the ring and takes over some keys. Ordinarily the node would hash
>> onto the ring in several places, to achieve better spread. Some data
>> (roughly 1/#nodes) moves to the new node from each of the other
>> nodes, and everything else stays the same.
>>
>> In what Amazon describes as operationally simpler (strategy 3 in the
>> Dynamo paper), the ring is instead divided into equally-sized
>> partitions. Nodes are hashed onto the ring, and preflists are
>> calculated by walking clockwise from a partition, skipping
>> partitions on already-visited nodes. Riak does something similar: it
>> divides the ring into equally-sized partitions, then nodes
>> "randomly" claim partitions. However, the skipping bit isn't part of
>> Riak's preflist calculation. Instead, nodes claim partitions in such
>> a way as to be spaced out by target_n_val, to obviate the need for
>> skipping.
>>
>> Now, getting back to what happens when a node joins. The new node
>> calculates a new ring state that maintains the target_n_val
>> invariant, while also trying to keep an even spread of partitions
>> per node. The algorithm (default_choose_claim) is heuristic and
>> greedy in nature, and recursively transfers partitions to the new
>> node until optimal spread is achieved, maintaining target_n_val
>> along the way. But if -- during one of those recursive calls -- it
>> can't meet the target_n_val, it will throw up its hands and
>> completely redo the whole ring (by calling claim_rebalance_n),
>> striping the partitions across nodes in round-robin fashion. When
>> that happens, most of the data needs to be handed off between nodes.
>>
>> This happens a lot, with many ring sizes. With
>> ring_creation_size=128 (i.e., 128 partitions), it will happen when
>> adding node 9 (87.5% of data moves), node 12 (82%), node 15 (80%),
>> and node 19 (94%). It happens with all ring sizes >= 128 (256, 512,
>> 1024, ...). It appears that any ring_creation_size (64 by default)
>> is safe for growing to 8 nodes or so. But if you want to go beyond
>> that... A ring size >= 128 with more than 8 nodes doesn't seem all
>> that unusual; surely someone has hit this before? I've filed a bug
>> report here: https://issues.basho.com/show_bug.cgi?id=1111
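>>
>> To make the re-stripe cost concrete, here's a quick illustrative
>> sketch in plain Python -- not the riak_core code. It assumes the
>> 8-node ring was laid out as a perfect stripe, which is one balanced
>> layout satisfying target_n_val, and reproduces the node-9 figure:
>>
>>     Q = 128  # ring_creation_size
>>
>>     def stripe(num_nodes, q=Q):
>>         # Round-robin striping, as claim_rebalance_n produces:
>>         # partition i is claimed by node i mod num_nodes.
>>         return [i % num_nodes for i in range(q)]
>>
>>     before = stripe(8)  # assumed layout of the balanced 8-node ring
>>     after = stripe(9)   # re-striped ring after the 9th node joins
>>     moved = sum(1 for a, b in zip(before, after) if a != b)
>>     print("%d of %d partitions move (%.1f%%)"
>>           % (moved, Q, 100.0 * moved / Q))
>>     # -> 112 of 128 partitions move (87.5%)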
>>
>> Anyway, this feels like a bit of a departure from consistent
>> hashing. In fact, couldn't this be replaced by normal hashing plus a
>> lookup table mapping intervals of the hash space to nodes? And isn't
>> that simply sharding?
>>
>> At any rate, I believe the claim algorithm can be improved to avoid
>> those "throw up hands and stripe everything" scenarios. In fact,
>> here is such an implementation:
>> https://github.com/basho/riak_core/pull/55. It is still heuristic
>> and greedy, but it seems to do a better job of avoiding the
>> re-stripe. Test results are attached in a zip on the bug linked
>> above. I'd love to get the riak_core gurus at Basho to look at this
>> and help validate it. It probably could use some cleaning up, but I
>> want to make sure there aren't other invariants or considerations
>> I'm leaving out -- besides maintaining target_n_val, keeping optimal
>> partition spread, and minimizing handoff between ring states. Those
>> three properties are sketched below.
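>>
>> Concretely, here's roughly how I'd state those three properties as
>> checks on a candidate ring -- again an illustrative Python sketch,
>> with function names I made up rather than riak_core's API. A ring
>> here is just a list mapping partition index to owning node:
>>
>>     from collections import Counter
>>
>>     def meets_target_n_val(ring, target_n_val=4):
>>         # Every window of target_n_val consecutive partitions
>>         # (wrapping around the ring) must land on distinct nodes,
>>         # so any preflist up to that length hits distinct nodes.
>>         q = len(ring)
>>         return all(
>>             len({ring[(i + j) % q] for j in range(target_n_val)})
>>             == target_n_val
>>             for i in range(q))
>>
>>     def spread(ring):
>>         # Max minus min partitions per node; 0 is a perfectly
>>         # even spread.
>>         counts = Counter(ring).values()
>>         return max(counts) - min(counts)
>>
>>     def handoff(old_ring, new_ring):
>>         # Fraction of partitions whose owner changes between two
>>         # ring states; lower means less data to hand off.
>>         moved = sum(1 for a, b in zip(old_ring, new_ring) if a != b)
>>         return float(moved) / len(old_ring)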
>>
>> -Greg