Re: TCP recv timeout and handoffs almost all the time

Simon Effenberg Fri, 19 Jul 2013 09:25:16 -0700

Only after restarting the Riak instance on this node were the pending
handoffs processed. This is weird :(
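
To check whether the timeouts cluster on particular node pairs, something like the following could tally them from console.log. It is only a sketch: the regular expression is modelled on the riak_core_handoff_sender error quoted below, it assumes each entry sits on a single line in the real log (the mail client wrapped it here), and the function name count_failed_handoffs is made up for illustration.

```python
import re
from collections import Counter

# Modelled on the handoff sender error quoted in this thread.
# Assumption: the whole entry is on one line in console.log.
HANDOFF_FAILURE = re.compile(
    r"(?P<kind>\w+) transfer of (?P<vnode>\w+) "
    r"from '(?P<src>[^']+)' (?P<partition>\d+) "
    r"to '(?P<dst>[^']+)' \d+ "
    r"failed because of (?P<reason>.+)$"
)


def count_failed_handoffs(lines):
    """Count failed handoffs per (source node, destination node, reason)."""
    counts = Counter()
    for line in lines:
        match = HANDOFF_FAILURE.search(line.strip())
        if match:
            counts[(match["src"], match["dst"], match["reason"])] += 1
    return counts
```

Feeding it an open log file (for example `count_failed_handoffs(open(path))`, where the path depends on the installation) and printing `most_common()` would show which source/destination pairs fail most often.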


On Fri, 19 Jul 2013 15:55:43 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> It looked good for a few hours, but now we got this again:
> 
> 2013-07-19 13:27:07.800 UTC [error] 
> <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer 
> of riak_kv_vnode from 'riak@10.46.109.207' 
> 1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202' 
> 1136089163393944065322395631681798128560666312704 failed because of TCP recv 
> timeout
> 
> and on the destination host I see:
> 
> 
> 2013-07-19 13:25:04.455 UTC [error] 
> <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for 
> partition 1136089163393944065322395631681798128560666312704 exited abnormally 
> after processing 2 objects: 
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
> 
> So both sides show a timeout. How can I track this down?
> 
> - Could this happen when many read repairs occur (through AAE)?
> 
> Also, our FSM PUT time is going up, but the GET time not really. Is this 
> normal behavior under load or during read repair?
> 
> Also, is this a bigger problem with eLevelDB, or would it be the same for 
> Bitcask?
> 
> Cheers
> Simon
> 
> 
> On Fri, 19 Jul 2013 10:25:05 +0200
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
> 
> > Once again, with the list included... argh
> > 
> > Hey Christian,
> > 
> > so it could also be an Erlang limit? I found out why my Riak instances
> > all have different process limits: my mcollectived daemons have
> > different limits, and when I triggered a Puppet run through
> > mcollective they picked up that process limit too.
> > 
> > Also in the crash log I see:
> > 
> > exception exit: {{system_limit,[{erlang,spawn
> > 
> > for the too many processes. So it doesn't look like an Erlang limit, does
> > it? But I will keep this +P in mind!! Thanks a lot.
> > 
> > The zdbbl is now at 100MB.
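> > 
> > For the record, a sketch of how the relevant vm.args lines would look,
> > assuming +zdbbl is given in kilobytes (so 100MB corresponds to 102400).
> > The +P line is the value suggested below and is not applied here yet:
> > 
> >     ## distribution buffer busy limit, in KB (assumed: 100MB = 102400)
> >     +zdbbl 102400
> >     ## Erlang process limit, as suggested below (not yet set)
> >     +P 262144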
> > 
> > Cheers
> > Simon
> > 
> > On Fri, 19 Jul 2013 08:49:50 +0100
> > Christian Dahlqvist <christ...@basho.com> wrote:
> > 
> > > Hi Simon,
> > > 
> > > > If you have objects that can be as big as 15MB, it is probably wise to 
> > > increase the size of +zdbbl in order to avoid filling up buffers when 
> > > these large objects need to be transferred between nodes. What an 
> > > appropriate level is depends a lot on the size distribution of your data 
> > > and your access patterns, so I would recommend benchmarking to find a 
> > > suitable value.
> > > 
> > > Erlang also has a default process limit of 32768 (at least in R15B01), 
> > > which may be what you are hitting. You can override this to 256k by 
> > > adding the following line to the vm.args file:
> > > 
> > >     +P 262144
> > > 
> > > Best regards,
> > > 
> > > Christian
> > > 
> > > 
> > > 
> > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de> 
> > > wrote:
> > > 
> > > > > The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> > > > > disappear completely. I have no more busy dist port errors in the
> > > > > _console.log_ (why aren't they in the error.log? To me they look like a
> > > > > serious problem; at least our cluster was not behaving that
> > > > > nicely).
> > > > > 
> > > > > I'll try to increase the buffer size further, because my
> > > > > guess is that the objects sent from one node to another are also
> > > > > stored in it, and we sometimes have objects of up to 15MB.
> > > > > 
> > > > > But I have now also seen some crashes in the last 6 hours, on only two
> > > > > machines, complaining about too many processes:
> > > > 
> > > > ----------------
> > > > console.log
> > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process 
> > > > <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > > 
> > > > crash.log
> > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > Too many processes
> > > > ----------------
> > > > 
> > > > > The process has a process limit of 95142, so I will increase it now. But 
> > > > > I never saw any information about such problems on the Linux tuning 
> > > > > page. Am I missing something?
> > > > 
> > > > Cheers
> > > > Simon
> > > > 
> > > > 
> > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > Guido Medina <guido.med...@temetra.com> wrote:
> > > > 
> > > > >> If what you are describing is happening on 1.4, run riak-admin diag 
> > > > >> to see the new recommended kernel parameters. Also, in vm.args, 
> > > > >> uncomment the +zdbbl 32768 parameter, since what you are describing is 
> > > > >> similar to what happened to us when we upgraded to 1.4.
> > > >> 
> > > >> HTH,
> > > >> 
> > > >> Guido.
> > > >> 
> > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > >>> Hi @list,
> > > >>> 
> > > > >>> I sometimes see log entries saying "hinted_handoff transfer of .. 
> > > > >>> failed because of TCP recv timeout".
> > > > >>> Also, riak-admin transfers shows me many handoffs (is it possible to 
> > > > >>> get some insight into how many handoffs happened through 
> > > > >>> "riak-admin status"?).
> > > > >>> 
> > > > >>> - Is it normal to have up to 30 handoffs from/to different 
> > > > >>> nodes?
> > > > >>> - How can I get to the bottom of the TCP recv timeout? I'm 
> > > > >>> not sure if this is a network problem or if the other node is too 
> > > > >>> slow. The load is OK on the machines (some IOwait but not 100%). 
> > > > >>> Could it be interfering with AAE?
> > > > >>> 
> > > > >>> Here is the log output for the TCP recv timeout. It does not occur 
> > > > >>> that often, but handoffs happen really often:
> > > >>> 
> > > >>> 2013-07-18 16:22:05.654 UTC [error] 
> > > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff 
> > > >>> transfer of riak_kv_vnode from 'riak@10.46.109.207' 
> > > >>> 1118962191081472546749696200048404186924073353216 to 
> > > >>> 'riak@10.46.109.205' 
> > > >>> 1118962191081472546749696200048404186924073353216 failed because of 
> > > >>> TCP recv timeout
> > > >>> 2013-07-18 16:22:05.673 UTC [error] 
> > > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound 
> > > >>> handoff of partition riak_kv_vnode 
> > > >>> 1118962191081472546749696200048404186924073353216 was terminated for 
> > > >>> reason: {shutdown,timeout}
> > > >>> 
> > > >>> 
> > > >>> Thanks in advance
> > > >>> Simon
> > > >>> 
> > > >>> _______________________________________________
> > > >>> riak-users mailing list
> > > >>> riak-users@lists.basho.com
> > > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > >> 
> > > >> 
> > > >> _______________________________________________
> > > >> riak-users mailing list
> > > >> riak-users@lists.basho.com
> > > >> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > > 
> > > > 
> > > > -- 
> > > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > > Fon:     + 49-(0)30-8109 - 7173
> > > > Fax:     + 49-(0)30-8109 - 7131
> > > > 
> > > > Mail:     seffenb...@team.mobile.de
> > > > Web:    www.mobile.de
> > > > 
> > > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > > 
> > > > 
> > > > Geschäftsführer: Malte Krüger
> > > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > > Sitz der Gesellschaft: Kleinmachnow 
> > > 
> > 
> > 
> 
> 


