As a follow-up to my earlier post, I just reran all 208 MapReduce jobs and got four timeouts. This time, riak03 was the culprit (rather than riak02).
The first timeout wrote to the error log after seven seconds. The second and third wrote to the error log after five seconds. The fourth timeout wrote to the error log after eight seconds.

The only thing in the logs on riak03 was the following:

=INFO REPORT==== 21-Jun-2011::17:19:53 ===
Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.169.0>)
Eshell V5.7.5 (abort with ^G)
(riak@10.0.60.210)1>
=INFO REPORT==== 21-Jun-2011::17:23:26 ===
Merged ["data/mr_queue",[],
        ["data/mr_queue/1308685475.bitcask.data",
         "data/mr_queue/1308687934.bitcask.data"]] in 0.023939 seconds.

On riak01, I was getting the standard timeout errors:

=ERROR REPORT==== 21-Jun-2011::17:27:08 ===
** State machine <0.14622.2> terminating
** Last event in was {mapexec_error,{785880,'riak@10.0.60.210'},
                      {error,timeout}}
** When State == executing
** Data == {state,0,riak_kv_map_phase,
            {state,true,
             {javascript,
              {map,
               {jsanon,
                <<"function(value,keyData,arg){var

David

From: David Mitchell
Sent: Tuesday, June 21, 2011 5:10 PM
To: 'riak-users@lists.basho.com'
Subject: Re: Riak crash on 0.14.2 riak_kv_stat terminating

Erlang: R13B04
Riak: 0.14.2

I am having the same issue as Jeremy.

I just did 208 MapReduce jobs using anonymous JavaScript functions in the map and reduce phases. I am sending the MapReduce jobs to a single node, riak01. Out of the 208 jobs, I got two "mapexec_error" {error,timeout} on riak02.

I read on the Basho wiki that the default timeout is 60 seconds.

http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html
"Map/Reduce queries have a default timeout of 60000 milliseconds (60 seconds)."

I have found that if a MapReduce job does not complete within 10 seconds, it is likely having issues. Most MapReduce jobs complete in one to two seconds. I can try increasing the MapReduce timeout to 120 seconds, but I doubt that this will help.

I have also found that if there are several timeouts, the beam process can terminate.
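For reference, here is a minimal sketch of how the 120-second timeout mentioned above could be set on a single job. It assumes the jobs are submitted as JSON to Riak's HTTP /mapred endpoint, which the post does not actually state. The bucket name, the JavaScript function bodies, and the helper name build_mapred_job are placeholders for illustration, not the functions used in the 208 jobs.

```python
import json


def build_mapred_job(inputs, map_source, reduce_source, timeout_ms=120000):
    """Build the JSON body of a MapReduce job with an explicit timeout.

    The top-level "timeout" field is given in milliseconds; 120000 ms is
    the 120 seconds discussed in the post (the documented default is 60000).
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be a positive number of milliseconds")
    job = {
        "inputs": inputs,
        "query": [
            {"map": {"language": "javascript", "source": map_source}},
            {"reduce": {"language": "javascript", "source": reduce_source}},
        ],
        "timeout": timeout_ms,
    }
    return json.dumps(job)


if __name__ == "__main__":
    # Placeholder bucket and anonymous functions, for illustration only.
    body = build_mapred_job(
        "example_bucket",
        "function(value, keyData, arg) { return [1]; }",
        "function(values, arg) { return [values.length]; }",
    )
    # This string would be sent as the body of a POST to /mapred
    # with Content-Type: application/json.
    print(body)
```

Since most jobs finish in one to two seconds, a longer timeout may only delay the error rather than prevent it, but setting it per job makes that easy to check.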
Any help would be appreciated.

The following is from the sasl-error.log on riak01.

=ERROR REPORT==== 21-Jun-2011::16:29:11 ===
** State machine <0.11130.0> terminating
** Last event in was {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
                      {error,timeout}}
** When State == executing
** Data == {state,0,riak_kv_map_phase,
            {state,true,
             {javascript,
              {map,
               {jsanon,
.................

=ERROR REPORT==== 21-Jun-2011::16:29:11 ===
** State machine <0.11127.0> terminating
** Last message in was {'EXIT',<0.11130.0>,{error,timeout}}
** When State == executing
** Data == {state,41465578,
            [<0.11130.0>,[<0.11129.0>,<0.11128.0>]],
            <0.10971.0>,66000,
            {1308688220159363,#Ref<0.0.0.198634>},
            #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
** Reason for termination =
** {error,{phase_error,{error,timeout}}}

=CRASH REPORT==== 21-Jun-2011::16:29:11 ===
  crasher:
    initial call: luke_flow:init/1
    pid: <0.11127.0>
    registered_name: []
    exception exit: {error,{phase_error,{error,timeout}}}
      in function gen_fsm:terminate/7
      in call from proc_lib:init_p_do_apply/3
    ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
    messages: []
    links: [<0.11128.0>,<0.11129.0>,<0.93.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 233
    stack_size: 24
    reductions: 23099
  neighbours:
    neighbour: [{pid,<0.11129.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11128.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,6765},{stack_size,10},{reductions,4926}]
    neighbour: [{pid,<0.11128.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11129.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,4905}]

The second timeout error:

=ERROR REPORT==== 21-Jun-2011::16:31:10 ===
** State machine <0.15144.0> terminating
** Last message in was flow_timeout
** When State == executing
** Data == {state,78575179,
            [<0.15147.0>,[<0.15146.0>,<0.15145.0>]],
            <0.15118.0>,66000,
            {1308688285727293,#Ref<0.0.1.11874>},
            #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
** Reason for termination =
** {error,flow_timeout}

=CRASH REPORT==== 21-Jun-2011::16:31:10 ===
  crasher:
    initial call: luke_flow:init/1
    pid: <0.15144.0>
    registered_name: []
    exception exit: {error,flow_timeout}
      in function gen_fsm:terminate/7
      in call from proc_lib:init_p_do_apply/3
    ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
    messages: []
    links: [<0.15145.0>,<0.15147.0>,<0.15146.0>,<0.93.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 233
    stack_size: 24
    reductions: 20791
  neighbours:
    neighbour: [{pid,<0.15146.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15145.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6554}]
    neighbour: [{pid,<0.15145.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15146.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6274}]

=SUPERVISOR REPORT==== 21-Jun-2011::16:31:10 ===
     Supervisor: {local,luke_flow_sup}
     Context:    child_terminated
     Reason:     {error,flow_timeout}
     Offender:   [{pid,<0.15144.0>},{name,undefined},{mfa,{luke_flow,start_link,[<0.15118.0>,78575179,[{riak_kv_map_phase,[],[{javascript,{map,{jsanon,<<"function(value,keyData,
................

David
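The node behind each timeout can be read from the mapexec_error tuple in the reports above. Counting them by hand across 208 jobs is tedious, so the following is a rough sketch of how the tally could be automated. The regular expression is an assumption based only on the two mapexec_error lines quoted in this thread, and the function name timeouts_by_node is hypothetical.

```python
import re
from collections import Counter

# Matches tuples such as
#   {mapexec_error,{785880,'riak@10.0.60.210'}, {error,timeout}}
# and captures the node name. Whitespace (including the line break the
# error logger inserts) is allowed between the two inner tuples.
MAPEXEC_TIMEOUT = re.compile(
    r"\{mapexec_error,\s*\{[^{}]*,\s*'([^']+)'\},\s*\{error,timeout\}\}"
)


def timeouts_by_node(log_text):
    """Count mapexec_error timeouts per node name found in log_text."""
    return Counter(MAPEXEC_TIMEOUT.findall(log_text))


if __name__ == "__main__":
    # The two "Last event in" lines quoted in this thread.
    sample = """
** Last event in was {mapexec_error,{785880,'riak@10.0.60.210'},
                      {error,timeout}}
** Last event in was {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
                      {error,timeout}}
"""
    for node, count in timeouts_by_node(sample).items():
        print(node, count)
```

Reports that only record flow_timeout, like the second error above, do not name a node and are not counted by this sketch.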