I increased the MapReduce timeout to 10 minutes, and the system has been
running for about a day and a half with no flow_timeout errors and no nodes
going down. The node crashes seem to be related to the MapReduce operations
timing out.
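
For anyone who wants to try the same change, here is a minimal sketch of a
per-job timeout override. It assumes the job is submitted as a JSON document
(for example to the HTTP /mapred endpoint), where a top-level "timeout" field
in milliseconds overrides the 60000 ms default. The bucket name and the map
function are placeholders for illustration, not taken from the real jobs.

```python
import json

DEFAULT_TIMEOUT_MS = 60_000  # documented default for MapReduce queries


def build_mapred_job(bucket, map_source, timeout_ms=DEFAULT_TIMEOUT_MS):
    """Return the JSON body for a MapReduce job with an explicit timeout."""
    job = {
        "inputs": bucket,
        "query": [
            {"map": {"language": "javascript", "source": map_source}},
        ],
        # Top-level timeout in milliseconds; overrides the default.
        "timeout": timeout_ms,
    }
    return json.dumps(job)


# Hypothetical bucket and map function, for illustration only.
body = build_mapred_job(
    "example_bucket",
    "function(value, keyData, arg) { return [value.key]; }",
    timeout_ms=10 * 60 * 1000,  # 10 minutes
)
```

The resulting string is what would be sent as the request body with a
Content-Type of application/json.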

I searched the machine and found no erl_crash.dump files. I'm going to turn
the timeout back down to the default and see if any nodes go down again. I'd
have to sanitize the logs before sending them in, as they contain work data.
If I can reproduce this, I'll see if I can get something set up to reproduce
it with dummy data and send the logs along.
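
A search like the one described above can be sketched as follows. This is
only an illustration: the function name and the root directory are
assumptions, and the dump may be written somewhere else entirely if the
ERL_CRASH_DUMP environment variable is set for the node.

```python
import os


def find_crash_dumps(root, name="erl_crash.dump"):
    """Walk the tree under root and return every path to a file called name."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Calling find_crash_dumps("/") scans the whole filesystem; pointing it at the
Riak install and data directories first is much faster.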

- Jeremy


On Tue, Jun 21, 2011 at 6:05 PM, David Mitchell
<david.mitch...@ixicorp.com> wrote:

> As a follow-up to my earlier post, I just reran all 208 MapReduce jobs, and
> this time I got four timeouts.  This time, riak03 was the culprit (rather
> than riak02).
>
> The first timeout wrote to the error log after seven seconds.  The second
> and third wrote to the error log after five seconds.  The fourth timeout
> wrote to the error log after eight seconds.
>
> The only thing in the logs on riak03 was the following:
>
> =INFO REPORT==== 21-Jun-2011::17:19:53 ===
> Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook)
> host starting (<0.169.0>)
>
> Eshell V5.7.5  (abort with ^G)
> (riak@10.0.60.210)1>
>
> =INFO REPORT==== 21-Jun-2011::17:23:26 ===
> Merged ["data/mr_queue",[],
>         ["data/mr_queue/1308685475.bitcask.data",
>          "data/mr_queue/1308687934.bitcask.data"]] in 0.023939 seconds.
>
> On riak01, I was getting the standard timeout errors:
>
> =ERROR REPORT==== 21-Jun-2011::17:27:08 ===
> ** State machine <0.14622.2> terminating
> ** Last event in was {mapexec_error,{785880,'riak@10.0.60.210'},
>                                     {error,timeout}}
> ** When State == executing
> **      Data  == {state,0,riak_kv_map_phase,
>                   {state,true,
>                    {javascript,
>                     {map,
>                      {jsanon,
>                       <<"function(value,keyData,arg){var
>
> David
>
> *From:* David Mitchell
> *Sent:* Tuesday, June 21, 2011 5:10 PM
> *To:* 'riak-users@lists.basho.com'
> *Subject:* Re: Riak crash on 0.14.2 riak_kv_stat terminating
>
> Erlang: R13B04
> Riak: 0.14.2
>
> I am having the same issue as Jeremy.
>
> I just ran 208 MapReduce jobs using anonymous JavaScript functions in the
> map and reduce phases.  I am sending the MapReduce jobs to a single node,
> riak01.  Out of the 208 jobs, I got two “mapexec_error” {error,timeout}
> errors on riak02.
>
> I read on the Basho wiki that the default timeout is 60 seconds.
> http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html
>
> Map/Reduce queries have a default timeout of 60000 milliseconds (60
> seconds).
>
> I have discovered that if a MapReduce job does not complete within 10
> seconds, then it is likely having issues.  Most MapReduce jobs complete in
> one to two seconds.  I can try increasing the MapReduce timeout to 120
> seconds, but I doubt that this will help.
>
> I have discovered that if there are several timeouts, then the beam process
> can terminate.
>
> Any help would be appreciated.
>
> The following is from the sasl-error.log on riak01.
>
> =ERROR REPORT==== 21-Jun-2011::16:29:11 ===
> ** State machine <0.11130.0> terminating
> ** Last event in was {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
>                                     {error,timeout}}
> ** When State == executing
> **      Data  == {state,0,riak_kv_map_phase,
>                   {state,true,
>                    {javascript,
>                     {map,
>                      {jsanon,
> ……………..
>
> =ERROR REPORT==== 21-Jun-2011::16:29:11 ===
> ** State machine <0.11127.0> terminating
> ** Last message in was {'EXIT',<0.11130.0>,{error,timeout}}
> ** When State == executing
> **      Data  == {state,41465578,
>                         [<0.11130.0>,[<0.11129.0>,<0.11128.0>]],
>                         <0.10971.0>,66000,
>                         {1308688220159363,#Ref<0.0.0.198634>},
>                         #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
> ** Reason for termination =
> ** {error,{phase_error,{error,timeout}}}
>
> =CRASH REPORT==== 21-Jun-2011::16:29:11 ===
>   crasher:
>     initial call: luke_flow:init/1
>     pid: <0.11127.0>
>     registered_name: []
>     exception exit: {error,{phase_error,{error,timeout}}}
>       in function  gen_fsm:terminate/7
>       in call from proc_lib:init_p_do_apply/3
>     ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
>     messages: []
>     links: [<0.11128.0>,<0.11129.0>,<0.93.0>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 233
>     stack_size: 24
>     reductions: 23099
>   neighbours:
>     neighbour:
> [{pid,<0.11129.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11128.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,6765},{stack_size,10},{reductions,4926}]
>     neighbour:
> [{pid,<0.11128.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11129.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,4905}]
>
> The second timeout error:
>
> =ERROR REPORT==== 21-Jun-2011::16:31:10 ===
> ** State machine <0.15144.0> terminating
> ** Last message in was flow_timeout
> ** When State == executing
> **      Data  == {state,78575179,
>                         [<0.15147.0>,[<0.15146.0>,<0.15145.0>]],
>                         <0.15118.0>,66000,
>                         {1308688285727293,#Ref<0.0.1.11874>},
>                         #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
> ** Reason for termination =
> ** {error,flow_timeout}
>
> =CRASH REPORT==== 21-Jun-2011::16:31:10 ===
>   crasher:
>     initial call: luke_flow:init/1
>     pid: <0.15144.0>
>     registered_name: []
>     exception exit: {error,flow_timeout}
>       in function  gen_fsm:terminate/7
>       in call from proc_lib:init_p_do_apply/3
>     ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
>     messages: []
>     links: [<0.15145.0>,<0.15147.0>,<0.15146.0>,<0.93.0>]
>     dictionary: []
>     trap_exit: true
>     status: running
>     heap_size: 233
>     stack_size: 24
>     reductions: 20791
>   neighbours:
>     neighbour:
> [{pid,<0.15146.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15145.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6554}]
>     neighbour:
> [{pid,<0.15145.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15146.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6274}]
>
> =SUPERVISOR REPORT==== 21-Jun-2011::16:31:10 ===
>      Supervisor: {local,luke_flow_sup}
>      Context:    child_terminated
>      Reason:     {error,flow_timeout}
>      Offender:
> [{pid,<0.15144.0>},{name,undefined},{mfa,{luke_flow,start_link,[<0.15118.0>,78575179,[{riak_kv_map_phase,[],[{javascript,{map,{jsanon,<<"function(value,keyData,
> …………….
>
> David
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
