I just added node #5 to our cluster, and once again the experience during the 
subsequent 60-minute handoff period was pretty awful! I just don't understand 
why this would be expected behavior when adding a node. There doesn't seem to 
be any realistic way to join a node to an online cluster. As far as I'm 
concerned, this is a huge defect in Riak.

Read repair didn't seem to kick in immediately. My application was configured 
to retry GETs (with a few seconds of backoff) and still got 404s. I manually 
requested an object repeatedly for over 20 minutes before finally getting a 
result.
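
For reference, the retry loop in my application looks roughly like this 
sketch (Python against Riak's HTTP interface; the host, bucket, and key 
names are placeholders):

    import time
    import requests

    def get_with_backoff(bucket, key, retries=5, backoff=2.0):
        # GET via Riak's HTTP interface (default port 8098), backing
        # off a few seconds between attempts when we see a 404.
        url = "http://127.0.0.1:8098/riak/%s/%s" % (bucket, key)
        for attempt in range(retries):
            resp = requests.get(url)
            if resp.status_code == 200:
                return resp.content
            # A 404 here may just mean the vnodes that answered don't
            # have the data yet (mid-handoff), so wait and retry.
            time.sleep(backoff * (attempt + 1))
        return None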

I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the 
defect, but I'm wondering whether there is more to it than that, especially 
since read repair didn't quite seem to work.

Could what Daniel describes on that bug ("Only return not found when all vnodes 
have reported not found (or error)") be implemented as a configurable option? 
Maybe something one could enable when a node joins, until all handoffs are 
complete?
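
For illustration, the option I'm imagining would decide something like 
this (a toy Python sketch of the GET coordinator's choice, not Riak's 
actual code; the reply tags and quorum handling are my assumptions):

    def reply_for_get(vnode_replies, r):
        # vnode_replies: one entry per vnode in the preflist, e.g.
        # ("ok", value), ("notfound",), or ("error",)
        oks = [rep[1] for rep in vnode_replies if rep[0] == "ok"]
        if len(oks) >= r:
            return ("ok", oks)
        # Proposed option: only answer notfound once *every* vnode has
        # reported notfound (or error), rather than as soon as a basic
        # quorum of notfounds has arrived.
        if all(rep[0] in ("notfound", "error") for rep in vnode_replies):
            return ("notfound",)
        return ("wait",)  # keep waiting for more vnode replies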

What can we do to remedy this before I add nodes #6, #7, etc.? We're storing 
huge amounts of data, which means that a) we'll be adding nodes often, and b) 
the amount of data to hand off will be large, which means long handoff periods 
during which we can't afford downtime.

Greg
On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:
> Hi everyone,
> 
> I just want to note that I observed similar behaviour with a somewhat
> larger cluster of 10 or so nodes. I first noticed that handoff activity
> after a node join (or leave, for that matter) involved a lot more
> partitions than I would have expected. By comparing the old and the new
> ring files, I found that more than 80 percent of the partitions had to
> be moved to another node.
> My naive expectation was that joining a node to a cluster of size X
> would result in roughly ring_creation_size/(X+1) partitions being
> handed off, which would also be the minimum if one expects a balanced
> cluster afterwards.
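> 
> As a back-of-the-envelope check (assuming ring_creation_size = 2048 as
> in Greg's cluster; the figures are only illustrative):
> 
>     ring_creation_size = 2048
>     X = 10                                   # nodes before the join
>     minimum = ring_creation_size / (X + 1)   # roughly 186 partitions
>     observed = 0.80 * ring_creation_size     # over 1600 actually moved
> 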
> Furthermore, it would in theory be possible to move partitions in such
> a way that at least one partition from each preflist stays on the same
> node. Maybe for X > N it would even be possible to guarantee this for a
> basic quorum of each preflist, eliminating the notfound problem
> completely, but I am not sure about that.
> 
> I may be able to provide some ring files to analyze this behaviour if
> someone from Basho is interested.
> 
> Cheers, Nico
> 
> On Monday, 02.05.2011 at 23:14 -0400, Ryan Zezeski wrote:
> > Greg,
> > 
> > 
> > Your expectations are fair: just because you added a node doesn't
> > mean Riak should return notfounds. Unfortunately, we aren't quite
> > there yet. This is a side effect of how Riak currently implements
> > handoff: it immediately updates/gossips the new ring, causing many
> > partitions to begin handoff at once. If a request comes in that
> > relies on those partitions, it will get a notfound and perform read
> > repair. Your situation is amplified by the fact that you are going
> > from 3 nodes to 4; more vnode shuffling occurs because of the small
> > cluster size.
> > 
> > 
> > We're well aware of this and have it on our radar for improvement in a
> > future release.
> > 
> > 
> > All this said, your data will be eventually consistent. That is, all
> > your data will eventually be handed off and things will work as
> > normal. It's only during the handoff that you _may_ encounter
> > notfounds. In the meantime, it would be best to add new nodes to
> > your cluster at times of lowest load, and if you can spare the
> > additional hardware, starting with a few more nodes is an even
> > easier option.
> > 
> > 
> > -Ryan
> > 
> > On Mon, May 2, 2011 at 9:48 PM, Greg Nelson <gro...@dropcam.com>
> > wrote:
> >  Hello riak users! 
> > 
> > 
> >  I have a 4 node cluster that started out as 3 nodes.
> >  ring_creation_size = 2048, target_n_val is default (4), and
> >  all buckets have n_val = 3.
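> > 
> >  (For reference, reading a bucket's properties back through Riak's
> >  HTTP interface looks roughly like this sketch, where "mybucket" is
> >  a placeholder:)
> > 
> >      import requests
> >      props = requests.get("http://127.0.0.1:8098/riak/mybucket").json()
> >      assert props["props"]["n_val"] == 3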
> > 
> > 
> >  When I joined the 4th node, for a few minutes some GETs were
> >  returning 'not found' for data that was already in Riak.
> >  Eventually the data was returned, due to read repair I would
> >  assume. Is this expected? It seems that 'not found' and read
> >  repairs should only happen when something goes wrong, like a
> >  node going down, not when adding a node to the cluster, which
> >  is supposed to be part of normal operation!
> > 
> > 
> >  Any help or insight is appreciated!
> > 
> > 
> >  Greg
> > 