Re: TCP recv timeout and handoffs almost all the time

Simon Effenberg Fri, 19 Jul 2013 06:58:17 -0700

It looked good for some hours, but now we got this again:

2013-07-19 13:27:07.800 UTC [error] 
<0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer of 
riak_kv_vnode from 'riak@10.46.109.207' 
1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202' 
1136089163393944065322395631681798128560666312704 failed because of TCP recv 
timeout


and on the destination host I see:


2013-07-19 13:25:04.455 UTC [error] 
<0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for 
partition 1136089163393944065322395631681798128560666312704 exited abnormally 
after processing 2 objects: 
{timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}

So both show a timeout. How can I track this down?
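
To see which node pairs and partitions are affected most, the sender-side
errors could be tallied with a small script. Below is a minimal Python sketch.
It assumes the message layout of the riak_core_handoff_sender line quoted above
(which the mail client wrapped across several lines); the names
parse_sender_error and count_failures are made up for illustration and are not
part of Riak. Other Riak versions may word the message differently.

```python
import re
from collections import Counter

# Pattern for the sender-side handoff error, modelled on the log line in this
# thread. The layout is an assumption taken from that single example.
SENDER_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) UTC \[error\] "
    r"\S+@riak_core_handoff_sender:start_fold:\d+ "
    r"(?P<kind>\w+) transfer of (?P<mod>\w+) "
    r"from '(?P<src>[^']+)' (?P<src_idx>\d+) "
    r"to '(?P<dst>[^']+)' (?P<dst_idx>\d+) "
    r"failed because of (?P<reason>.+)$"
)


def parse_sender_error(entry):
    """Return the fields of one handoff sender error, or None if no match.

    Whitespace (including line wraps) is collapsed first, so a wrapped
    entry pasted from a mail or terminal still matches.
    """
    match = SENDER_RE.search(" ".join(entry.split()))
    return match.groupdict() if match else None


def count_failures(entries):
    """Count failed transfers per (source node, destination node, reason)."""
    counts = Counter()
    for entry in entries:
        rec = parse_sender_error(entry)
        if rec:
            counts[(rec["src"], rec["dst"], rec["reason"])] += 1
    return counts
```

Feeding count_failures the error entries from each node's log would show
whether the timeouts cluster on one destination (pointing at a slow node) or
are spread across the whole cluster.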

- Could this happen when many read repairs occur (through AAE)?

Also, our FSM PUT time is going up, but the GET time is not really. Is this 
normal behavior under load or during read repair?

Also, is this a bigger problem with eLevelDB, or would it be the same with 
Bitcask?

Cheers
Simon


On Fri, 19 Jul 2013 10:25:05 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> once again with the list included... argh
> 
> Hey Christian,
> 
> so it could also be an Erlang limit? I found out why my Riak instances
> all have different process limits: my mcollectived daemons have
> different limits, and when I triggered a Puppet run through
> mcollective, the Riak processes got that process limit as well.
> 
> Also in the crash log I see:
> 
> exception exit: {{system_limit,[{erlang,spawn
> 
> for the "too many processes" error. So it doesn't look like an Erlang
> limit, does it? But I will keep +P in mind. Thanks a lot.
> 
> +zdbbl is now at 100 MB.
> 
> Cheers
> Simon
> 
> On Fri, 19 Jul 2013 08:49:50 +0100
> Christian Dahlqvist <christ...@basho.com> wrote:
> 
> > Hi Simon,
> > 
> > If you have objects that can be as big as 15 MB, it is probably wise to 
> > increase the size of +zdbbl in order to avoid filling up buffers when these 
> > large objects need to be transferred between nodes. What an appropriate 
> > level is depends a lot on the size distribution of your data and your 
> > access patterns, so I would recommend benchmarking to find a suitable value.
> > 
> > Erlang also has a default process limit of 32768 (at least in R15B01), 
> > which may be what you are hitting. You can override this to 256k by adding 
> > the following line to the vm.args file:
> > 
> >     +P 262144
> > 
> > Best regards,
> > 
> > Christian
> > 
> > 
> > 
> > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de> wrote:
> > 
> > > The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> > > disappear completely. I have no more busy dist port errors in the
> > > _console.log_ (why aren't they in the error.log? To me it looks like a
> > > serious problem; at least our cluster was not behaving well).
> > > 
> > > I'll try increasing the buffer size further, because my guess is that
> > > the objects sent from one node to another are also held in that buffer,
> > > and we sometimes have objects of up to 15 MB.
> > > 
> > > But I have now also seen some crashes in the last 6 hours, on only two
> > > machines, complaining about too many processes:
> > > 
> > > ----------------
> > > console.log
> > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process 
> > > <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > 
> > > crash.log
> > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > Too many processes
> > > ----------------
> > > 
> > > The process has a process limit of 95142, so I will increase it now, but 
> > > I never saw any information about such problems on the Linux tuning page. 
> > > Am I missing something?
> > > 
> > > Cheers
> > > Simon
> > > 
> > > 
> > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > Guido Medina <guido.med...@temetra.com> wrote:
> > > 
> > > >> If what you are describing is happening on 1.4, run riak-admin diag 
> > > >> and check the new recommended kernel parameters. Also, in vm.args, 
> > > >> uncomment the +zdbbl 32768 parameter, since what you are describing is 
> > > >> similar to what happened to us when we upgraded to 1.4.
> > >> 
> > >> HTH,
> > >> 
> > >> Guido.
> > >> 
> > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > >>> Hi @list,
> > >>> 
> > > >>> I sometimes see log entries saying "hinted_handoff transfer of .. 
> > > >>> failed because of TCP recv timeout".
> > > >>> Also, riak-admin transfers shows me many handoffs (is it possible to 
> > > >>> get some insight into how many handoffs happened through 
> > > >>> "riak-admin status"?).
> > > >>> 
> > > >>> - Is it normal behavior to have up to 30 handoffs from/to different 
> > > >>> nodes?
> > > >>> - How can I get to the bottom of the TCP recv timeout? I'm not 
> > > >>> sure if this is a network problem or if the other node is too slow. The 
> > > >>> load on the machines is OK (some IOwait but not 100%). Maybe it is 
> > > >>> interfering with AAE?
> > > >>> 
> > > >>> Here is the log output for the TCP recv timeout. It does not occur 
> > > >>> that often, but handoffs happen really often:
> > >>> 
> > >>> 2013-07-18 16:22:05.654 UTC [error] 
> > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff 
> > >>> transfer of riak_kv_vnode from 'riak@10.46.109.207' 
> > >>> 1118962191081472546749696200048404186924073353216 to 
> > >>> 'riak@10.46.109.205' 1118962191081472546749696200048404186924073353216 
> > >>> failed because of TCP recv timeout
> > >>> 2013-07-18 16:22:05.673 UTC [error] 
> > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound handoff 
> > >>> of partition riak_kv_vnode 
> > >>> 1118962191081472546749696200048404186924073353216 was terminated for 
> > >>> reason: {shutdown,timeout}
> > >>> 
> > >>> 
> > >>> Thanks in advance
> > >>> Simon
> > >>> 
> > >>> _______________________________________________
> > >>> riak-users mailing list
> > >>> riak-users@lists.basho.com
> > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > >> 
> > >> 
> > > 
> > > 
> > > -- 
> > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > Fon:     + 49-(0)30-8109 - 7173
> > > Fax:     + 49-(0)30-8109 - 7131
> > > 
> > > Mail:     seffenb...@team.mobile.de
> > > Web:    www.mobile.de
> > > 
> > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > 
> > > 
> > > Geschäftsführer: Malte Krüger
> > > HRB Nr.: 18517 P, Amtsgericht Potsdam
> > > Sitz der Gesellschaft: Kleinmachnow 
> > > 
> > 
> 
> 


-- 
Simon Effenberg | Site Ops Engineer | mobile.international GmbH
Fon:     + 49-(0)30-8109 - 7173
Fax:     + 49-(0)30-8109 - 7131

Mail:     seffenb...@team.mobile.de
Web:    www.mobile.de

Marktplatz 1 | 14532 Europarc Dreilinden | Germany


Geschäftsführer: Malte Krüger
HRB Nr.: 18517 P, Amtsgericht Potsdam
Sitz der Gesellschaft: Kleinmachnow 
