I've also been asking for this, and the current master has code to remedy these things, but it's not in an official release yet.
At the Erlang level, you can specify options to a RiakClient:get as follows:

    -type option() :: {r, pos_integer()} |         %% Minimum number of successful responses
                      {pr, non_neg_integer()} |    %% Minimum number of primary vnodes participating
                      {basic_quorum, boolean()} |  %% Whether to use basic quorum (return early
                                                   %% in some failure cases).
                      {notfound_ok, boolean()} |   %% Count notfound responses as successful.
                      {timeout, pos_integer() | infinity}. %% Timeout for vnode responses

So, to get the semantics I think you're asking for, do a GET (assuming N=3) with

    [{r, 2}, {pr, 2}, {basic_quorum, false}, {notfound_ok, true}]

This will work as you want as long as only one node is down. During handoff you may see a new kind of error, HTTP 503 / {error r_value_unsatisfied ...}, which is the behavior when basic quorum is disabled, i.e. the alternative to getting a notfound just because some node did not have the value.

Each of those is also available as a query parameter when doing an HTTP GET:

    curl 'http://127.0.0.1:8091/riak/buck/key?r=2&basic_quorum=false&notfound_ok=true'

I'm also looking forward to a release which has this, and I'm hoping that the defaults can somehow be simplified / strengthened so people new to this don't need to be so surprised about these things.

Kresten

On May 5, 2011, at 8:18 AM, Greg Nelson wrote:

I just added node #5 to our cluster, and once again the experience during the subsequent 60-minute handoff period was pretty awful! I just don't understand why this would be expected behavior while adding a node. There doesn't seem to be any realistic way to join a node to an online cluster. As far as I'm concerned this is a huge defect in Riak.

Read-repair didn't seem to kick in immediately for data. My application was configured to retry GETs (with a few seconds of backoff), and still got 404s. I manually requested an object repeatedly for over 20 minutes until finally getting a result.
I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the defect, but I'm wondering if there is more to it than this, especially since read-repair didn't quite seem to work. Could what Daniel describes on that bug ("Only return not found when all vnodes have reported not found (or error)") be implemented as a configurable option? Maybe something one could switch on when a node joins, until all handoffs are complete?

What can we do to remedy this before I add nodes #6, #7, etc.? We're storing huge amounts of data, which means that a) we'll be adding nodes often, and b) the amount of data handed off will be large, which means long periods of handoff during which we don't want to have downtime.

Greg

On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:

Hi everyone,

I just want to note that I observed similar behaviour with a somewhat larger cluster of 10 or so nodes. I first noticed that handoff activity after a node join (or leave, for that matter) involved a lot more partitions than I would have expected. By comparing the old and the new ring file, I found out that more than 80 percent of partitions had to be moved to another node.

My naive expectation was that joining a node to a cluster of size X would result in roughly ring_creation_size/(X+1) partitions being handed off, which would also be the minimum if one expects a balanced cluster afterwards. Furthermore, it would in theory be possible to move partitions in such a way that at least one partition from each preflist stays on the same node. Maybe for X>N it should even be possible to guarantee this for a basic quorum of each preflist, eliminating the notfound problem completely, but I am not sure about that.

I may be able to provide some ring files to analyze this behaviour if someone from Basho is interested.

Cheers,
Nico

On Monday, 02.05.2011, at 23:14 -0400, Ryan Zezeski wrote:

Greg,

Your expectations are fair; just because you added a node doesn't mean Riak should return notfounds.
Unfortunately, we aren't quite there yet. This is a side effect of how Riak currently implements handoff: it immediately updates/gossips the ring, causing many partitions to hand off immediately. If a request comes in that relies on these partitions, it will get a notfound and perform read repair. Your situation is compounded by the fact that you are going from 3 nodes to 4. More vnode shuffling occurs because of the small cluster size. We're well aware of this and have it on our radar for improvement in a future release.

All this said, your data will be eventually consistent. That is, all your data will eventually be handed off and things will work as normal. It's only during the handoff that you _may_ encounter notfounds. In this case it would be best to add a new node to your cluster at the lowest-load times, and if you can spare additional hardware, starting with a few more nodes is an even easier option.

-Ryan

On Mon, May 2, 2011 at 9:48 PM, Greg Nelson <gro...@dropcam.com> wrote:

Hello riak users!

I have a 4 node cluster that started out as 3 nodes. ring_creation_size = 2048, target_n_val is default (4), and all buckets have n_val = 3.

When I joined the 4th node, for a few minutes some GETs were returning 'not found' for data that was already in riak. Eventually the data was returned, due to read repair I would assume.

Is this expected? It seems that 'not found' and read repairs should only happen when something goes wrong, like a node going down. Not when adding a node to the cluster, which is supposed to be part of normal operation!

Any help or insight is appreciated!
Greg

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
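A note on Nico's arithmetic from earlier in the thread: his lower bound of ring_creation_size/(X+1) handed-off partitions can be compared against a deliberately naive ownership rule to see how far apart the two can be. The sketch below is only an illustration. The round-robin rule (partition i owned by node i mod X) and all function names are assumptions made for this example; they are not taken from Riak's actual claim algorithm.

```python
def min_handoffs(ring_size: int, old_nodes: int) -> int:
    """Nico's lower bound: partitions the new node must receive
    for the ring to be balanced after a single join."""
    return ring_size // (old_nodes + 1)


def round_robin_owner(partition: int, nodes: int) -> int:
    """ASSUMPTION (illustrative only): partition i is owned by node i mod nodes.
    This is not claimed to be how Riak assigns partitions."""
    return partition % nodes


def moved_under_round_robin(ring_size: int, old_nodes: int) -> int:
    """Count partitions whose owner changes when one node joins,
    if ownership is simply recomputed with the round-robin rule."""
    new_nodes = old_nodes + 1
    return sum(
        1
        for p in range(ring_size)
        if round_robin_owner(p, old_nodes) != round_robin_owner(p, new_nodes)
    )


if __name__ == "__main__":
    # 2048 is the ring_creation_size from Greg's cluster; 3 and 10 are the
    # cluster sizes mentioned by Greg and Nico respectively.
    for ring_size, old_nodes in [(2048, 3), (2048, 10)]:
        lower = min_handoffs(ring_size, old_nodes)
        moved = moved_under_round_robin(ring_size, old_nodes)
        print(
            f"{old_nodes} -> {old_nodes + 1} nodes: "
            f"minimum {lower} ({lower / ring_size:.0%}), "
            f"naive round-robin {moved} ({moved / ring_size:.0%})"
        )
```

Under this assumed rule, most partitions change owner on a join, while the balanced minimum is only a 1/(X+1) share of the ring. Whether this explains the figure Nico measured would need to be checked against real ring files.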