Re: simulating physical node crash

francisco treacy Wed, 28 Sep 2011 12:39:48 -0700

Regarding (3), I found a "Forcing Read Repair" contrib function
(http://contrib.basho.com/bucket_inspector.html) which should help.
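
For anyone who would rather do this from the client side, here is a minimal sketch of the same idea: read every key once so that any stale or missing replica gets a chance to be repaired before real users hit it. It is not the contrib function itself. The /riak/<bucket>/<key> URL layout, the base address, and the helper names (key_url, http_status, sweep) are assumptions for illustration, and the list of keys has to come from somewhere else.

```python
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen


def key_url(base, bucket, key):
    # Assumed URL layout: <base>/riak/<bucket>/<key>, with both parts escaped.
    return f"{base}/riak/{quote(bucket, safe='')}/{quote(key, safe='')}"


def http_status(url):
    # Issue a plain GET and return the HTTP status code, including 4xx/5xx.
    try:
        with urlopen(url) as resp:
            return resp.status
    except HTTPError as err:
        return err.code


def sweep(base, bucket, keys, fetch=http_status):
    # Read each key once; collect the ones that came back 404 on this pass
    # so they can be re-read after read-repair has had a chance to run.
    missing = []
    for key in keys:
        if fetch(key_url(base, bucket, key)) == 404:
            missing.append(key)
    return missing
```

Running sweep() a second time over the returned keys shows which ones are still missing after the first pass. The fetch argument is only there so the loop can be tried without a live cluster.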


Otherwise, for the m/r error: all of my buckets use the default n_val and
write quorum. Could it be that some data never reached that particular node
in the cluster? That is, should I have used W=3? During the failure, many
assets were returning 404s, which triggered read-repair (they were fine on
the next request), but the Map/Reduce function kept failing. Could it have
something to do with Riak Search?
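
To make the W question concrete, here is a small sketch of the usual quorum arithmetic. It assumes the default quorum is a simple majority of N and that a read is only certain to see the latest write when R + W > N. It deliberately ignores fallback vnodes and hinted handoff, so it is a rule of thumb rather than a description of what Riak actually did during the outage. The function names are made up for illustration.

```python
def default_quorum(n_val):
    # Majority of the replicas: 2 when n_val is 3.
    return n_val // 2 + 1


def overlap_guaranteed(n_val, r, w):
    # The replicas that answered the read must intersect the replicas
    # that acknowledged the write only when R + W > N.
    return r + w > n_val


n = 3
w = default_quorum(n)
print(w)                            # 2
print(overlap_guaranteed(n, 1, w))  # False: R=1 may hit the unwritten replica
print(overlap_guaranteed(n, 1, 3))  # True: W=3 waits for every replica
```

So with N=3, R=1 and the default W, a single surviving node is not guaranteed to hold every object, which would be consistent with the 404s.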

Thanks,

Francisco


2011/9/26 francisco treacy <francisco.tre...@gmail.com>

> Hi all,
>
> I have a 3-node Riak cluster, and I am simulating the scenario of physical
> nodes crashing.
>
> When 2 nodes go down, and I query the remaining one, it fails with:
>
> {error,
>     {exit,
>         {{{error,
>               {no_candidate_nodes,exhausted_prefist,
>                   [{riak_kv_mapred_planner,claim_keys,3},
>                    {riak_kv_map_phase,schedule_input,5},
>                    {riak_kv_map_phase,handle_input,3},
>                    {luke_phase,executing,3},
>                    {gen_fsm,handle_msg,7},
>                    {proc_lib,init_p_do_apply,3}],
>                   []}},
>           {gen_fsm,sync_send_event,
>               [<0.31566.2330>,
>                {inputs,
>
> (...)
>
> Here I'm doing a M/R, inputs being fed by Search.
>
> (1) All of the involved buckets have N=3, and all involved requests use R=1
> (I don't really need quorum for this use case).
>
> Why is it failing? I'm sure I'm missing something basic here.
>
> (2) Probably worth noting, those 3 nodes are spread across *two* physical
> servers (1 on small one, 2 on beefier one). I've heard it is "not a good
> idea", not sure why though. These two servers are definitely enough still
> for our current load; should I consider adding a third one?
>
> (3) To overcome the aforementioned error, I added a new node to the cluster
> (installed on the small server). Now the setup looks like: 4 nodes = 2 on
> small server, 2 on beefier one.
>
> When 2 nodes go down, this works.  Which brings me to another topic...
> could you point me to good strategies to "pre-" invoke read-repair? Is it up
> to clients to scan the keyspace forcing reads?  It's a disaster
> usability-wise when first users start getting 404s all over the place.
>
> Francisco
>

