John, All great points. The problem is that the ring changes immediately when a node is added. So now, all the sudden, the preflist is potentially pointing to nodes that don't have the data and they won't have that data until handoff occurs. The faster that data gets transferred, the less time your clients have to hit 'notfound'.
However, I agree completely with what you're saying. This is just a side effect of how the system currently works. In a perfect world we wouldn't care how long handoff takes and we would also do some sort of automatic congestion control akin to TCP Reno or something. The preflist would still point to the "old" partitions until all data has been successfully handed off, and then and only then would we flip the switch for that vnode. I'm pretty sure that's where we are heading (I say "pretty sure" b/c I just joined the team and haven't been heavily involved in these specific talks yet). It's all coming down the pipe... As for your specific I/O question re handoff_concurrecy, you might be right. I would think it depends on hardware/platform/etc. I was offering it as a possible stopgap to minimize Greg's pain. It's certainly a cure to a symptom, not the problem itself. -Ryan On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <m...@jdrowell.com> wrote: > Hi Ryan, Greg, > > > 2011/5/5 Ryan Zezeski <rzeze...@basho.com> > >> 1. For example, riak_core has a `handoff_concurrency` setting that >> determines how many vnodes can concurrently handoff on a given node. By >> default this is set to 4. That's going to take a while with your 2048 >> vnodes and all :) >> > > Won't that make the handoff situation potentially worse? From the thread I > understood that the main problem was that the cluster was shuffling too much > data around and thus becoming unresponsive and/or returning unexpected > results (like "not founds"). I'm attributing the concerns more to an > excessive I/O situation than to how long the handoff takes. If the handoff > can be made transparent (no or little side effects) I don't think most > people will really care (e.g. the "fix the cluster tomorrow" anecdote). > > How about using a percentage of available I/O to throttle the vnode handoff > concurrency? Start with 1, and monitor the node's I/O (kinda like 'atop' > does, collection CPU, disk and network metrics), if it is below the expected > usage, then increase the vnode handoff concurrency, and vice-versa. > > I for one would be perfectly happy if the handoff took several hours (even > days) if we could maintain the core riak_kv characteristics intact during > those events. We've all seen looooong RAID rebuild times, and it's usually > better to just sit tight and keep the rebuild speed low (slower I/O) while > keeping all of the dependent systems running smoothly. > > cheers > -jd >
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com