Re: TCP recv timeout and handoffs almost all the time

Simon Effenberg Fri, 19 Jul 2013 08:21:19 -0700

I'm again getting crash reports about system_limit:

2013-07-19 14:30:58 UTC =CRASH REPORT====
  crasher:
    initial call: riak_kv_exchange_fsm:init/1
    pid: <0.25883.24>
    registered_name: []
    exception exit: 
{{{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]},{gen_server,call,[<0.1187.0>,{compare,{856348615623575928634971581669697081829647974400,3},#Fun<riak_kv_exchange_fsm.0.49629222>,#Fun<riak_kv_exchange_fsm.1.49629222>},infinity]}},[{gen_fsm,terminate,7,[{file,"gen_fsm.erl"},{line,611}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
    ancestors: [riak_kv_entropy_manager,riak_kv_sup,<0.569.0>]
    messages: 
[{'DOWN',#Ref<0.0.26.196075>,process,<0.1187.0>,{system_limit,[{erlang,spawn,[riak_kv_get_put_monitor,spawned,[gets,<0.11717.31>]],[]},{riak_kv_get_put_monitor,get_fsm_spawned,1,[{file,"src/riak_kv_get_put_monitor.erl"},{line,53}]},{riak_kv_get_fsm,init,1,[{file,"src/riak_kv_get_fsm.erl"},{line,135}]},{gen_fsm,init_it,6,[{file,"gen_fsm.erl"},{line,361}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}]
    links: []
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1597
    stack_size: 24
    reductions: 380
  neighbours:


I'm now trying to increase the Erlang process limit, but "system_limit" always 
looks to me like a "system" (OS) limit and not an "Erlang" limit?!
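
As far as I understand it, erlang:spawn exits with system_limit when the VM's
own process table is full, and that table is sized by the +P flag in vm.args
rather than by the OS. Below is a minimal Python sketch for checking which +P
value a vm.args file sets. The helper name and the sample text are made up for
illustration; only the "+P 262144" line comes from this thread, and treating
"#" as the comment marker is an assumption.

```python
import re
from typing import Optional


def erlang_process_limit(vm_args_text: str) -> Optional[int]:
    """Return the value of an uncommented +P flag, or None if there is none.

    If the flag is repeated, this sketch simply keeps the last one.
    """
    limit = None
    for line in vm_args_text.splitlines():
        # Drop comments (assumed to start with '#') and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        match = re.fullmatch(r"\+P\s+(\d+)", line)
        if match:
            limit = int(match.group(1))
    return limit


# Hypothetical vm.args content; a real file contains more flags.
sample = """
-name riak@127.0.0.1
## +zdbbl 32768
+P 262144
"""

# None would mean no +P flag is set and the VM default applies.
print(erlang_process_limit(sample))
```

If this returns None for the real vm.args, the node is running with the VM's
default process limit.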

These are the limits for the process:

[root@kriak47-9:/var/log/riak]# cat /proc/17313/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             unlimited            unlimited            processes 
Max open files            30000                30000                files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       16382                16382                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
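
Since "Max processes" is unlimited here, the OS-level limit does not look like
the bottleneck. Below is a rough Python sketch for pulling the relevant rows
out of /proc/<pid>/limits. The function name is made up, and it assumes the
columns are separated by at least two spaces, which holds for the output above.

```python
import re


def parse_proc_limits(text: str) -> dict:
    """Map each limit name to its (soft, hard) pair from /proc/<pid>/limits text."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces (assumption).
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip blank lines and the header row
        limits[fields[0]] = (fields[1], fields[2])
    return limits


# A few rows copied from the output above.
sample = """\
Limit                     Soft Limit           Hard Limit           Units
Max processes             unlimited            unlimited            processes
Max open files            30000                30000                files
Max nice priority         0                    0
"""

limits = parse_proc_limits(sample)
print(limits["Max processes"])   # OS-level cap on processes for the user
print(limits["Max open files"])
```

On a live node the text would come from reading /proc/<pid>/limits for the
Riak beam process instead of the sample string.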



On Fri, 19 Jul 2013 16:08:44 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> only after restarting the Riak instance on this node were the pending
> handoffs processed.. this is weird :(
> 
> On Fri, 19 Jul 2013 15:55:43 +0200
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
> 
> > It looked good for some hours, but now we got this again:
> > 
> > 2013-07-19 13:27:07.800 UTC [error] 
> > <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff 
> > transfer of riak_kv_vnode from 'riak@10.46.109.207' 
> > 1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202' 
> > 1136089163393944065322395631681798128560666312704 failed because of TCP 
> > recv timeout
> > 
> > and on the destination host I see:
> > 
> > 
> > 2013-07-19 13:25:04.455 UTC [error] 
> > <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for 
> > partition 1136089163393944065322395631681798128560666312704 exited 
> > abnormally after processing 2 objects: 
> > {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
> > 
> > so both sides show a timeout. How can I track this down?
> > 
> > - could this happen when many read repairs occur (through AAE)?
> > 
> > Also, our FSM PUT time is going up, but not really the GET time.. is 
> > this normal behavior under load / in read repair situations?
> > 
> > Also is this a bigger problem with eLevelDB or would it be the same case 
> > for Bitcask?
> > 
> > Cheers
> > Simon
> > 
> > 
> > On Fri, 19 Jul 2013 10:25:05 +0200
> > Simon Effenberg <seffenb...@team.mobile.de> wrote:
> > 
> > > once again with the list included... argh
> > > 
> > > Hey Christian,
> > > 
> > > so it could also be an Erlang limit? I found out why my Riak instances
> > > all have different process limits. My mcollectived daemons have
> > > different limits, and when I triggered a puppet run through
> > > mcollective the Riak instances got those process limits as well.
> > > 
> > > Also in the crash log I see:
> > > 
> > > exception exit: {{system_limit,[{erlang,spawn
> > > 
> > > for the "too many processes" case. So it doesn't look like an Erlang limit,
> > > does it? But I will keep this +P in mind!! Thanks a lot.
> > > 
> > > The zdbbl is now at 100MB.
> > > 
> > > Cheers
> > > Simon
> > > 
> > > On Fri, 19 Jul 2013 08:49:50 +0100
> > > Christian Dahlqvist <christ...@basho.com> wrote:
> > > 
> > > > Hi Simon,
> > > > 
> > > > If you have objects that can be as big as 15MB, it is probably wise to 
> > > > increase the size of +zdbbl in order to avoid filling up buffers when 
> > > > these large objects need to be transferred between nodes. What an 
> > > > appropriate level is depends a lot on the size distribution of your 
> > > > data and your access patterns, so I would recommend benchmarking to 
> > > > find a suitable value.
> > > > 
> > > > Erlang also has a default process limit of 32768 (at least in R15B01), 
> > > > which may be what you are hitting. You can override this to 256k by 
> > > > adding the following line to the vm.args file:
> > > > 
> > > >     +P 262144
> > > > 
> > > > Best regards,
> > > > 
> > > > Christian
> > > > 
> > > > 
> > > > 
> > > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de> 
> > > > wrote:
> > > > 
> > > > > The +zdbbl parameter helped a lot but the hinted handoffs didn't
> > > > > disappear completely. I have no more busy dist port errors in the
> > > > > _console.log_ (why aren't they in the error.log? To me it looks like
> > > > > a serious problem to have.. at least our cluster was not behaving
> > > > > that nicely).
> > > > > 
> > > > > I'll try to increase the buffer size to a higher value because my
> > > > > guess is that the objects sent from one node to another are also
> > > > > stored in it, and we sometimes have objects of up to 15MB.
> > > > > 
> > > > > But I have now also seen some crashes in the last 6 hours, on only two
> > > > > machines, complaining about too many processes
> > > > > 
> > > > > ----------------
> > > > > console.log
> > > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process 
> > > > > <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > > > 
> > > > > crash.log
> > > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > > Too many processes
> > > > > ----------------
> > > > > 
> > > > > The process has a process limit of 95142, so I will increase it now, 
> > > > > but I never saw any information about such problems on the Linux 
> > > > > tuning page. Am I missing something?
> > > > > 
> > > > > Cheers
> > > > > Simon
> > > > > 
> > > > > 
> > > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > > Guido Medina <guido.med...@temetra.com> wrote:
> > > > > 
> > > > > >> If what you are describing is happening on 1.4, run riak-admin diag 
> > > > > >> and see the new recommended kernel parameters. Also, in vm.args 
> > > > > >> uncomment the +zdbbl 32768 parameter, since what you are describing 
> > > > > >> is similar to what happened to us when we upgraded to 1.4.
> > > > >> 
> > > > >> HTH,
> > > > >> 
> > > > >> Guido.
> > > > >> 
> > > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > > >>> Hi @list,
> > > > >>> 
> > > > > >>> I sometimes see log entries saying "hinted_handoff transfer of .. 
> > > > > >>> failed because of TCP recv timeout".
> > > > >>> Also riak-admin transfers shows me many handoffs (is it possible to 
> > > > >>> give some insights about "how many" handoffs happened through 
> > > > >>> "riak-admin status"?).
> > > > >>> 
> > > > > >>> - Is it normal behavior to have up to 30 handoffs from/to 
> > > > > >>> different nodes?
> > > > > >>> - How can I get to the bottom of the TCP recv timeout? I'm 
> > > > >>> not sure if this is a network problem or if the other node is too 
> > > > >>> slow. The load is ok on the machines (some IOwait but not 100%). 
> > > > >>> Maybe interfering with AAE?
> > > > >>> 
> > > > > >>> Here is the log information about the TCP recv timeout. It does not 
> > > > > >>> occur that often, but handoffs happen really often:
> > > > >>> 
> > > > >>> 2013-07-18 16:22:05.654 UTC [error] 
> > > > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff 
> > > > >>> transfer of riak_kv_vnode from 'riak@10.46.109.207' 
> > > > >>> 1118962191081472546749696200048404186924073353216 to 
> > > > >>> 'riak@10.46.109.205' 
> > > > >>> 1118962191081472546749696200048404186924073353216 failed because of 
> > > > >>> TCP recv timeout
> > > > >>> 2013-07-18 16:22:05.673 UTC [error] 
> > > > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound 
> > > > >>> handoff of partition riak_kv_vnode 
> > > > >>> 1118962191081472546749696200048404186924073353216 was terminated 
> > > > >>> for reason: {shutdown,timeout}
> > > > >>> 
> > > > >>> 
> > > > >>> Thanks in advance
> > > > >>> Simon
> > > > >>> 
> > > > >>> _______________________________________________
> > > > >>> riak-users mailing list
> > > > >>> riak-users@lists.basho.com
> > > > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > >> 
> > > > >> 
> > > > >> _______________________________________________
> > > > >> riak-users mailing list
> > > > >> riak-users@lists.basho.com
> > > > >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > > 
> > > > > 
> > > > > -- 
> > > > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > > > Fon:     + 49-(0)30-8109 - 7173
> > > > > Fax:     + 49-(0)30-8109 - 7131
> > > > > 
> > > > > Mail:     seffenb...@team.mobile.de
> > > > > Web:    www.mobile.de
> > > > > 
> > > > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > > > 
> > > > > 
> > > > > Geschäftsführer: Malte Krüger
> > > > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > > > Sitz der Gesellschaft: Kleinmachnow 
> > > > > 
> > > > > _______________________________________________
> > > > > riak-users mailing list
> > > > > riak-users@lists.basho.com
> > > > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > 
> > > 
> > > 
> > > -- 
> > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > Fon:     + 49-(0)30-8109 - 7173
> > > Fax:     + 49-(0)30-8109 - 7131
> > > 
> > > Mail:     seffenb...@team.mobile.de
> > > Web:    www.mobile.de
> > > 
> > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > 
> > > 
> > > Geschäftsführer: Malte Krüger
> > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > Sitz der Gesellschaft: Kleinmachnow 
> > > 
> > > _______________________________________________
> > > riak-users mailing list
> > > riak-users@lists.basho.com
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > 
> > 
> > -- 
> > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > Fon:     + 49-(0)30-8109 - 7173
> > Fax:     + 49-(0)30-8109 - 7131
> > 
> > Mail:     seffenb...@team.mobile.de
> > Web:    www.mobile.de
> > 
> > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > 
> > 
> > Geschäftsführer: Malte Krüger
> > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > Sitz der Gesellschaft: Kleinmachnow 
> > 
> > _______________________________________________
> > riak-users mailing list
> > riak-users@lists.basho.com
> > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon:     + 49-(0)30-8109 - 7173
Fax:     + 49-(0)30-8109 - 7131

Mail:     seffenb...@team.mobile.de
Web:    www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow 

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
