I'm getting crash reports about system_limit again:

2013-07-19 14:30:58 UTC =CRASH REPORT====
  crasher:
    initial call: riak_kv_exchange_fsm:init/1
    pid: <0.25883.24>
    registered_name: []
    exception exit: {{{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_server,call,[<0.1187.0>,{compare,{856348615623575928634971581669697081829647974400,3},#Fun<riak_kv_exchange_fsm.0.49629222>,#Fun<riak_kv_exchange_fsm.1.49629222>},infinity]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,611}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_kv_entropy_manager,riak_kv_sup,<0.569.0>]
    messages: [{'DOWN',#Ref<0.0.26.196075>,process,<0.1187.0>,{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}]
    links: []
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1597
    stack_size: 24
    reductions: 380
  neighbours:
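In the crash report above, the first stack frame after system_limit is erlang:spawn, which suggests the limit being hit is the Erlang VM's own process table rather than an operating-system limit. The following is only a rough sketch for triaging such reports from a log file. The helper name classify_system_limit and the hint table are hypothetical (they are not part of Riak or OTP), and the mapping from the raising function to a VM limit is an assumption that should be checked against the emulator documentation for your release.

```python
import re

# Hypothetical helper, not part of Riak or OTP: find which module:function
# raised system_limit in a crash report and hint at the limit to inspect.
SYSTEM_LIMIT_RE = re.compile(r"\{system_limit,\s*\[\{(\w+),(\w+),")

# Assumed mapping from the raising function to the VM limit worth checking.
HINTS = {
    ("erlang", "spawn"): "Erlang process limit (+P in vm.args)",
    ("erlang", "open_port"): "Erlang port limit",
}


def classify_system_limit(report):
    """Return a hint for the first system_limit stack frame found in report."""
    match = SYSTEM_LIMIT_RE.search(report)
    if match is None:
        return "no system_limit found"
    origin = (match.group(1), match.group(2))
    return HINTS.get(origin, "unknown origin: %s:%s" % origin)


# Excerpt copied from the crash report above.
sample = (
    "exception exit: {{{system_limit,[{erlang,spawn,"
    "[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},"
)
print(classify_system_limit(sample))
```

For the excerpt above, the sketch points at the +P setting, which matches the "Too many processes" message quoted further down in this thread.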
I'm now trying to increase the Erlang process limit, but "system_limit" looks
like a "system" limit and not an "Erlang" limit?! These are the limits for the
process:

[root@kriak47-9:/var/log/riak]# cat /proc/17313/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             unlimited            unlimited            processes
Max open files            30000                30000                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       16382                16382                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

On Fri, 19 Jul 2013 16:08:44 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> only after restarting the Riak instance on this node the awaiting
> handoffs were processed..
> this is weird :(
>
> On Fri, 19 Jul 2013 15:55:43 +0200
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>
> > It looked good for some hours but now again we got
> >
> > 2013-07-19 13:27:07.800 UTC [error]
> > <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff
> > transfer of riak_kv_vnode from 'riak@10.46.109.207'
> > 1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202'
> > 1136089163393944065322395631681798128560666312704 failed because of TCP
> > recv timeout
> >
> > and on the destination host I see:
> >
> > 2013-07-19 13:25:04.455 UTC [error]
> > <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for
> > partition 1136089163393944065322395631681798128560666312704 exited
> > abnormally after processing 2 objects:
> > {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
> >
> > so both show a timeout. How could I track this down?
> >
> > - could this happen when many read repairs occur (through AAE)?
> >
> > Also our "fsm PUT time is going higher but not really the GET time".. is
> > this the normal behavior in LOAD/read repair situations?
> >
> > Also is this a bigger problem with eLevelDB or would it be the same case
> > for Bitcask?
> >
> > Cheers
> > Simon
> >
> >
> > On Fri, 19 Jul 2013 10:25:05 +0200
> > Simon Effenberg <seffenb...@team.mobile.de> wrote:
> >
> > > once again with the list included... argh
> > >
> > > Hey Christian,
> > >
> > > so it could also be an Erlang limit? I found out why my riak instances
> > > all have different process limits. My mcollectived daemons have
> > > different limits, and when I triggered a puppetrun through
> > > mcollective they got this process limit as well.
> > >
> > > Also in the crash log I see:
> > >
> > > exception exit: {{system_limit,[{erlang,spawn
> > >
> > > for the too many processes. So it doesn't look like an Erlang limit,
> > > does it? But I will keep this +P in my mind!! Thanks a lot.
> > >
> > > The zdbbl is now at 100MB.
> > >
> > > Cheers
> > > Simon
> > >
> > > On Fri, 19 Jul 2013 08:49:50 +0100
> > > Christian Dahlqvist <christ...@basho.com> wrote:
> > >
> > > > Hi Simon,
> > > >
> > > > If you have objects that can be as big as 15MB, it is probably wise to
> > > > increase the size of +zdbbl in order to avoid filling up buffers when
> > > > these large objects need to be transferred between nodes. What an
> > > > appropriate level is depends a lot on the size distribution of your
> > > > data and your access patterns, so I would recommend benchmarking to
> > > > find a suitable value.
> > > >
> > > > Erlang also has a default process limit of 32768 (at least in R15B01),
> > > > which may be what you are hitting. You can override this to 256k by
> > > > adding the following line to the vm.args file:
> > > >
> > > > +P 262144
> > > >
> > > > Best regards,
> > > >
> > > > Christian
> > > >
> > > >
> > > >
> > > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de>
> > > > wrote:
> > > >
> > > > > The +zdbbl parameter helped a lot but the hinted handoffs didn't
> > > > > disappear completely. I have no more busy dist port errors in the
> > > > > _console.log_ (why aren't they in the error.log? To me it looks like
> > > > > a serious problem.. at least our cluster was not behaving that
> > > > > nicely).
> > > > >
> > > > > I'll try to increase the buffer size to a higher value because my
> > > > > guess is that the objects sent from one node to another are also
> > > > > stored therein, and we sometimes have objects which are up to 15MB.
> > > > >
> > > > > But I saw now also some crashes in the last 6 hours on only two
> > > > > machines complaining about too many processes
> > > > >
> > > > > ----------------
> > > > > console.log
> > > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process
> > > > > <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > > >
> > > > > crash.log
> > > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > > Too many processes
> > > > > ----------------
> > > > >
> > > > > the process has a process limit of 95142. So I will increase it now
> > > > > but I never saw any information about such problems on the linux
> > > > > tuning page. Am I missing something?
> > > > >
> > > > > Cheers
> > > > > Simon
> > > > >
> > > > >
> > > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > > Guido Medina <guido.med...@temetra.com> wrote:
> > > > >
> > > > >> If what you are describing is happening for 1.4, type riak-admin
> > > > >> diag and see the new recommended kernel parameters, also, on vm.args
> > > > >> uncomment the +zdbbl 32768 parameter, since what you are describing
> > > > >> is similar to what happened to us when we upgraded to 1.4.
> > > > >>
> > > > >> HTH,
> > > > >>
> > > > >> Guido.
> > > > >>
> > > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > > >>> Hi @list,
> > > > >>>
> > > > >>> I see sometimes logs talking about "hinted_handoff transfer of ..
> > > > >>> failed because of TCP recv timeout".
> > > > >>> Also riak-admin transfers shows me many handoffs (is it possible to
> > > > >>> give some insights about "how many" handoffs happened through
> > > > >>> "riak-admin status"?).
> > > > >>>
> > > > >>> - Is it a normal behavior to have up to 30 handoffs from/to
> > > > >>> different nodes?
> > > > >>> - How can I get down to the problem with the TCP recv timeout? I'm
> > > > >>> not sure if this is a network problem or if the other node is too
> > > > >>> slow.
> > > > >>> The load is ok on the machines (some IOwait but not 100%).
> > > > >>> Maybe interfering with AAE?
> > > > >>>
> > > > >>> Here is the log information about the TCP recv timeout. That does
> > > > >>> not happen often, but handoffs happen really often:
> > > > >>>
> > > > >>> 2013-07-18 16:22:05.654 UTC [error]
> > > > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff
> > > > >>> transfer of riak_kv_vnode from 'riak@10.46.109.207'
> > > > >>> 1118962191081472546749696200048404186924073353216 to
> > > > >>> 'riak@10.46.109.205'
> > > > >>> 1118962191081472546749696200048404186924073353216 failed because of
> > > > >>> TCP recv timeout
> > > > >>> 2013-07-18 16:22:05.673 UTC [error]
> > > > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound
> > > > >>> handoff of partition riak_kv_vnode
> > > > >>> 1118962191081472546749696200048404186924073353216 was terminated
> > > > >>> for reason: {shutdown,timeout}
> > > > >>>
> > > > >>>
> > > > >>> Thanks in advance
> > > > >>> Simon
> > > > >>>
> > > > >>> _______________________________________________
> > > > >>> riak-users mailing list
> > > > >>> riak-users@lists.basho.com
> > > > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > >
> > > > >
> > > > > --
> > > > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > > > Fon: + 49-(0)30-8109 - 7173
> > > > > Fax: + 49-(0)30-8109 - 7131
> > > > >
> > > > > Mail: seffenb...@team.mobile.de
> > > > > Web: www.mobile.de
> > > > >
> > > > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > > >
> > > > >
> > > > > Geschäftsführer: Malte Krüger
> > > > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > > > Sitz der Gesellschaft: Kleinmachnow
> > > > >
--
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon: + 49-(0)30-8109 - 7173
Fax: + 49-(0)30-8109 - 7131

Mail: seffenb...@team.mobile.de
Web: www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany

Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow
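
The quoted replies in this thread mention two vm.args settings, +P (Erlang process limit) and +zdbbl (distribution buffer busy limit, given in kilobytes). As a rough sketch only, the snippet below reads vm.args-style text and reports both values, so that a figure such as "100MB" can be checked against what is actually configured. The function names and the sample file content are made up for illustration. The fallback of 32768 is the default process limit quoted in this thread for R15B01 and may differ on other releases.

```python
def parse_vm_args(text):
    """Collect '+flag value' pairs from vm.args-style text, skipping comments."""
    flags = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line.startswith("+"):
            continue
        parts = line.split(None, 1)
        flags[parts[0]] = parts[1] if len(parts) > 1 else ""
    return flags


def summarize(text, default_process_limit=32768):
    """Report the process limit and the +zdbbl size converted from KB to MB."""
    flags = parse_vm_args(text)
    process_limit = int(flags.get("+P", default_process_limit))
    zdbbl_kb = flags.get("+zdbbl")
    zdbbl_mb = int(zdbbl_kb) / 1024 if zdbbl_kb is not None else None
    return {"process_limit": process_limit, "zdbbl_mb": zdbbl_mb}


# Hypothetical vm.args excerpt: +P as suggested in the thread, +zdbbl at 100 MB.
sample = """
## Hypothetical vm.args excerpt
+P 262144
+zdbbl 102400
"""
print(summarize(sample))
```

With the sample above, a 100 MB buffer corresponds to +zdbbl 102400, since the flag takes kilobytes.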