Hi Francisco

I've seen the same error in a dev environment running on a single Riak node 
with an n_val of 1, so in my case it was nothing to do with a failing node. I 
wasn't running Riak Search either. I posted a question about it to this list a 
week or so ago but haven't seen a reply yet. 

So indeed, does anyone know what's causing this error and how we can avoid it?

Regards,
Martin. 



On 28 Sep 2011, at 20:39, francisco treacy <francisco.tre...@gmail.com> wrote:

> Regarding (3), I found a "Forcing Read Repair" contrib function 
> (http://contrib.basho.com/bucket_inspector.html) which should help.
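> 
> As I understand it, "forcing read repair" from the client side just means 
> reading every key so that stale or missing replicas get repaired in the 
> background. Roughly, with the riak-erlang-client (bucket name and host are 
> placeholders, and list_keys is expensive, so I'd only do this for recovery):
> 
>     {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
>     %% Walk every key in the bucket and read it; each read gives Riak a
>     %% chance to repair replicas that are missing or out of date.
>     {ok, Keys} = riakc_pb_socket:list_keys(Pid, <<"assets">>),
>     [riakc_pb_socket:get(Pid, <<"assets">>, Key) || Key <- Keys].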
> 
> Otherwise, for the M/R error: all of my buckets use the default n_val and 
> write quorum. Could it be that some data never reached that particular node 
> in the cluster? That is, should I have used W=3?  During the failure, many 
> assets were returning 404s, which triggered read-repair (and they were fine 
> on subsequent requests), but no luck with the Map/Reduce function (it kept 
> failing).  Could it have something to do with Riak Search?
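> 
> For reference, if bumping W turns out to be the answer, this is roughly how 
> I'd set it explicitly per write instead of relying on the default quorum 
> (a riak-erlang-client sketch; bucket, key and value are made up, and Pid is 
> the connection from the snippet above):
> 
>     Obj = riakc_obj:new(<<"assets">>, <<"asset-1">>, <<"some value">>),
>     %% Require all three replicas to acknowledge the write (W=3).
>     ok = riakc_pb_socket:put(Pid, Obj, [{w, 3}]).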
> 
> Thanks,
> 
> Francisco
> 
> 
> 2011/9/26 francisco treacy <francisco.tre...@gmail.com>
> Hi all,
> 
> I have a 3-node Riak cluster, and I am simulating the scenario of physical 
> nodes crashing.
> 
> When 2 nodes go down and I query the remaining one, the query fails with:
> 
> {error,
>     {exit,
>         {{{error,
>               {no_candidate_nodes,exhausted_prefist,
>                   [{riak_kv_mapred_planner,claim_keys,3},
>                    {riak_kv_map_phase,schedule_input,5},
>                    {riak_kv_map_phase,handle_input,3},
>                    {luke_phase,executing,3},
>                    {gen_fsm,handle_msg,7},
>                    {proc_lib,init_p_do_apply,3}],
>                   []}},
>           {gen_fsm,sync_send_event,
>               [<0.31566.2330>,
>                {inputs,
> 
> (...)
> 
> Here I'm doing an M/R job, with inputs fed by Search.
> 
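> Stripped down, the failing call is shaped roughly like this (a 
> riak-erlang-client sketch; I've substituted literal bucket/key inputs for 
> the Search-fed inputs, and all names are made up):
> 
>     {ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
>     Inputs = [{<<"assets">>, <<"asset-1">>}, {<<"assets">>, <<"asset-2">>}],
>     %% A single map phase running a built-in JavaScript function and
>     %% keeping its output as the result.
>     Query  = [{map, {jsfun, <<"Riak.mapValuesJson">>}, undefined, true}],
>     {ok, Results} = riakc_pb_socket:mapred(Pid, Inputs, Query).
> 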
> (1) All of the involved buckets have N=3, and all involved requests use R=1 
> (I don't really need quorum for this use case).
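> 
> In other words, the bucket properties and reads look roughly like this 
> (riak-erlang-client sketch, placeholder names, Pid as above):
> 
>     ok = riakc_pb_socket:set_bucket(Pid, <<"assets">>, [{n_val, 3}]),
>     %% R=1: the read succeeds as soon as a single replica answers.
>     {ok, Obj} = riakc_pb_socket:get(Pid, <<"assets">>, <<"asset-1">>, [{r, 1}]).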
> 
> Why is it failing? I'm sure I'm missing something basic here.
> 
> (2) Probably worth noting: those 3 nodes are spread across *two* physical 
> servers (1 on a small one, 2 on a beefier one). I've heard this is "not a 
> good idea", though I'm not sure why. These two servers are still definitely 
> enough for our current load; should I consider adding a third one?
> 
> (3) To overcome the aforementioned error, I added a new node to the cluster 
> (installed on the small server). Now the setup looks like this: 4 nodes = 2 
> on the small server, 2 on the beefier one.
> 
> When 2 nodes go down, this works.  Which brings me to another topic... could 
> you point me to good strategies to pre-invoke read-repair? Is it up to 
> clients to scan the keyspace, forcing reads?  It's a usability disaster when 
> the first users start getting 404s all over the place.
> 
> Francisco
> 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
