As a follow-up to my earlier post, I just reran all 208 MapReduce jobs, and 
this time I got four timeouts.  The culprit was riak03 (rather than riak02, 
as before).

The first timeout wrote to the error log after seven seconds.  The second and 
third wrote to the error log after five seconds.  The fourth timeout wrote to 
the error log after eight seconds.

The only thing in the logs on riak03 was the following:
=INFO REPORT==== 21-Jun-2011::17:19:53 ===
Spidermonkey VM (thread stack: 16MB, max heap: 8MB, pool: riak_kv_js_hook) host starting (<0.169.0>)
Eshell V5.7.5  (abort with ^G)
(riak@10.0.60.210)1>
=INFO REPORT==== 21-Jun-2011::17:23:26 ===
Merged ["data/mr_queue",[],
        ["data/mr_queue/1308685475.bitcask.data",
         "data/mr_queue/1308687934.bitcask.data"]] in 0.023939 seconds.


On riak01, I was getting the standard timeout errors:

=ERROR REPORT==== 21-Jun-2011::17:27:08 ===
** State machine <0.14622.2> terminating
** Last event in was {mapexec_error,{785880,'riak@10.0.60.210'},
                                    {error,timeout}}
** When State == executing
**      Data  == {state,0,riak_kv_map_phase,
                  {state,true,
                   {javascript,
                    {map,
                     {jsanon,
                      <<"function(value,keyData,arg){var

David



From: David Mitchell
Sent: Tuesday, June 21, 2011 5:10 PM
To: 'riak-users@lists.basho.com'
Subject: Re: Riak crash on 0.14.2 riak_kv_stat terminating

Erlang: R13B04
Riak: 0.14.2

I am having the same issue as Jeremy.

I just did 208 MapReduce jobs using anonymous JavaScript functions in the map 
and reduce phases.  I am sending the MapReduce jobs to a single node, riak01.  
Out of the 208 jobs, I got two "mapexec_error" {error,timeout} on riak02.
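
To tally which node the timeouts point at, the error log can be scanned for mapexec_error events. The following is only a minimal sketch: it assumes the log entries look like the excerpts quoted in this thread, and the function name and sample text are illustrative, not part of Riak.

```python
import re
from collections import Counter

# Matches the node name inside a mapexec_error timeout event, e.g.
#   {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
#                                       {error,timeout}}
# \s* lets the {error,timeout} part sit on the following line.
MAPEXEC_RE = re.compile(
    r"\{mapexec_error,\{[^']*,'(?P<node>[^']+)'\},\s*\{error,timeout\}"
)


def count_timeouts_by_node(log_text):
    """Return a Counter of mapexec_error timeouts keyed by node name."""
    return Counter(m.group("node") for m in MAPEXEC_RE.finditer(log_text))


# Sample text copied from the log excerpts in this thread.
SAMPLE = """\
** Last event in was {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
                                    {error,timeout}}
** Last event in was {mapexec_error,{785880,'riak@10.0.60.210'},
                                    {error,timeout}}
"""

if __name__ == "__main__":
    for node, count in count_timeouts_by_node(SAMPLE).items():
        print(node, count)
```

Running the same function over the full contents of the error log on the node that receives the jobs would give a per-node count across all 208 jobs.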

I read on the Basho wiki that the default timeout is 60 seconds:
http://wiki.basho.com/Loading-Data-and-Running-MapReduce-Queries.html
"Map/Reduce queries have a default timeout of 60000 milliseconds (60 seconds)."

I have found that if a MapReduce job does not complete within 10 seconds, 
it is likely having issues.  Most MapReduce jobs complete in one to two 
seconds.  I can try increasing the MapReduce timeout to 120 seconds, but I 
doubt that this will help.
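
For reference, a job submitted over HTTP can carry its own timeout as a top-level "timeout" field in milliseconds. The sketch below builds such a job with a 120-second timeout. The host name and port in the URL, the bucket name, and the two JavaScript functions are placeholders chosen for illustration, not the jobs discussed in this thread.

```python
import json
import urllib.request


def build_job(inputs, map_source, reduce_source, timeout_ms=120000):
    """Build a MapReduce job body with an explicit timeout in milliseconds."""
    return {
        "inputs": inputs,
        "query": [
            {"map": {"language": "javascript", "source": map_source}},
            {"reduce": {"language": "javascript", "source": reduce_source}},
        ],
        "timeout": timeout_ms,
    }


def submit(job, url="http://riak01:8098/mapred"):
    """POST the job as JSON. The URL is an assumption; adjust for your cluster."""
    req = urllib.request.Request(
        url,
        data=json.dumps(job).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # Let the client wait slightly longer than the server-side timeout.
    with urllib.request.urlopen(req, timeout=job["timeout"] / 1000 + 5) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    job = build_job(
        "example_bucket",  # hypothetical bucket name
        "function(value, keyData, arg) { return [1]; }",
        "function(values, arg) {"
        " return [values.reduce(function(a, b) { return a + b; }, 0)]; }",
    )
    # Print the body only; call submit(job) to actually send it.
    print(json.dumps(job, indent=2))
```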

I have discovered that if there are several timeouts, then the beam process can 
terminate.

Any help would be appreciated.

The following is from the sasl-error.log on riak01.

=ERROR REPORT==== 21-Jun-2011::16:29:11 ===
** State machine <0.11130.0> terminating
** Last event in was {mapexec_error,{<<"46">>,'riak@10.0.60.209'},
                                    {error,timeout}}
** When State == executing
**      Data  == {state,0,riak_kv_map_phase,
                  {state,true,
                   {javascript,
                    {map,
                     {jsanon,
.................

=ERROR REPORT==== 21-Jun-2011::16:29:11 ===
** State machine <0.11127.0> terminating
** Last message in was {'EXIT',<0.11130.0>,{error,timeout}}
** When State == executing
**      Data  == {state,41465578,
                        [<0.11130.0>,[<0.11129.0>,<0.11128.0>]],
                        <0.10971.0>,66000,
                        {1308688220159363,#Ref<0.0.0.198634>},
                        #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
** Reason for termination =
** {error,{phase_error,{error,timeout}}}

=CRASH REPORT==== 21-Jun-2011::16:29:11 ===
  crasher:
    initial call: luke_flow:init/1
    pid: <0.11127.0>
    registered_name: []
    exception exit: {error,{phase_error,{error,timeout}}}
      in function  gen_fsm:terminate/7
      in call from proc_lib:init_p_do_apply/3
    ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
    messages: []
    links: [<0.11128.0>,<0.11129.0>,<0.93.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 233
    stack_size: 24
    reductions: 23099
  neighbours:
    neighbour: [{pid,<0.11129.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11128.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,6765},{stack_size,10},{reductions,4926}]
    neighbour: [{pid,<0.11128.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.11127.0>,<0.11129.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,4905}]

The second timeout error:

=ERROR REPORT==== 21-Jun-2011::16:31:10 ===
** State machine <0.15144.0> terminating
** Last message in was flow_timeout
** When State == executing
**      Data  == {state,78575179,
                        [<0.15147.0>,[<0.15146.0>,<0.15145.0>]],
                        <0.15118.0>,66000,
                        {1308688285727293,#Ref<0.0.1.11874>},
                        #Fun<riak_kv_mapred_json.jsonify_not_found.1>,[],[]}
** Reason for termination =
** {error,flow_timeout}

=CRASH REPORT==== 21-Jun-2011::16:31:10 ===
  crasher:
    initial call: luke_flow:init/1
    pid: <0.15144.0>
    registered_name: []
    exception exit: {error,flow_timeout}
      in function  gen_fsm:terminate/7
      in call from proc_lib:init_p_do_apply/3
    ancestors: [luke_flow_sup,luke_sup,<0.91.0>]
    messages: []
    links: [<0.15145.0>,<0.15147.0>,<0.15146.0>,<0.93.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 233
    stack_size: 24
    reductions: 20791
  neighbours:
    neighbour: [{pid,<0.15146.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15145.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6554}]
    neighbour: [{pid,<0.15145.0>},{registered_name,[]},{initial_call,{luke_phase,init,[Argument__1]}},{current_function,{gen_fsm,loop,7}},{ancestors,[luke_phase_sup,luke_sup,<0.91.0>]},{messages,[]},{links,[<0.15144.0>,<0.15146.0>,<0.94.0>]},{dictionary,[]},{trap_exit,false},{status,waiting},{heap_size,4181},{stack_size,10},{reductions,6274}]

=SUPERVISOR REPORT==== 21-Jun-2011::16:31:10 ===
     Supervisor: {local,luke_flow_sup}
     Context:    child_terminated
     Reason:     {error,flow_timeout}
     Offender:   [{pid,<0.15144.0>},{name,undefined},{mfa,{luke_flow,start_link,[<0.15118.0>,78575179,[{riak_kv_map_phase,[],[{javascript,{map,{jsanon,<<"function(value,keyData,
................

David
