Only after restarting the Riak instance on this node were the pending handoffs processed. This is weird :(

On Fri, 19 Jul 2013 15:55:43 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> It looked good for some hours, but now we again got
>
> 2013-07-19 13:27:07.800 UTC [error]
> <0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff transfer
> of riak_kv_vnode from 'riak@10.46.109.207'
> 1136089163393944065322395631681798128560666312704 to 'riak@10.47.109.202'
> 1136089163393944065322395631681798128560666312704 failed because of TCP recv
> timeout
>
> and on the destination host I see:
>
> 2013-07-19 13:25:04.455 UTC [error]
> <0.28632.25>@riak_core_handoff_receiver:handle_info:80 Handoff receiver for
> partition 1136089163393944065322395631681798128560666312704 exited abnormally
> after processing 2 objects:
> {timeout,{gen_fsm,sync_send_all_state_event,[<0.1107.0>,{handoff_data,<<141,146,205,110,211,64,20,133,237,4,211,132,2,170,80,69,37,150,22,203,186,216,249,105,210,172,42,149,95,137,162,2,5,177,129,232,120,102,156,153,137,61,78,237,113,72,10,172,186,101,195,51,176,224,1,120,12,158,130,55,97,198,173,68,83,177,192,35,223,197,55,231,156,185,158,235,27,155,36,87,115,86,148,208,34,87,227,146,145,130,233,242,206,173,46,153,204,59,60,18,125,61,91,208,123,223,188,51,190,70,157,86,49,206,99,201,136,206,28,199,249,167,209,110,172,122,83,67,92,222,164,78,187,24,27,135,102,74,243,54,117,174,81,65,52,60,108,152,213,194,17,66,190,33,175,60,220,189,204,108,78,195,150,117,123,198,205,139,168,64,47,103,12,26,12,11,83,31,96,134,20,128,128,170,245,91,86,186,254,46,120,37,48,13,222,30,99,130,1,158,152,213,67,132,199,168,240,26,7,72,12,123,134,23,198,25,154,247,33,30,225,16,18,39,56,56,63,210,173,139,205,241,132,162,108,33,175,226,205,139,248,231,40,117,112,152,83,145,8,70,121,51,54,134,15,177,211,252,252,59,118,218,223,127,94,114,93,183,174,53,194,81,148,76,227,13,142,77,43,1,134,82,90,254,227,147,111,238,212,31,69,219,126,44,168,63,242,211,124,206,210,101,86,149,130,116,250,251,147,12,34,221,33,121,230,111,251,101,189,207,243,100,63,143,89,161,4,83,59,148,25,30,151,6,79,39,162,43,62,46,79,213,105,181,103,181,150,173,140,197,64,208,58,33,234,134,123,195,97,212,11,13,210,70,23,117,7,189,78,103,216,31,12,118,67,211,6,169,69,187,211,98,113,50,226,18,75,213,77,184,255,229,252,115,120,195,246,220,58,186,251,244,236,101,182,117,159,55,224,42,207,193,215,247,191,110,203,191,67,118,255,127,200,114,229,122,169,227,145,148,65,153,32,93,84,76,74,243,19,85,102,8,137,80,140,254,1>>},60000]}}
>
> so both show a timeout. How can I track this down?
>
> - could this happen when many read repairs occur (through AAE)?
>
> Also our "fsm PUT time is going higher but not really the GET time". Is this
> the normal behavior in load/read-repair situations?
>
> Also, is this a bigger problem with eLevelDB, or would it be the same for
> Bitcask?
>
> Cheers
> Simon
>
>
> On Fri, 19 Jul 2013 10:25:05 +0200
> Simon Effenberg <seffenb...@team.mobile.de> wrote:
>
> > once again with the list included... argh
> >
> > Hey Christian,
> >
> > so it could also be an Erlang limit? I found out why my Riak instances
> > all have different process limits. My mcollectived daemons have
> > different limits, and when I triggered a Puppet run through
> > mcollective they got this process limit as well.
> >
> > Also in the crash log I see:
> >
> > exception exit: {{system_limit,[{erlang,spawn
> >
> > for the "too many processes" case. So it doesn't look like an Erlang
> > limit, does it? But I will keep this +P in mind!! Thanks a lot.
> >
> > The zdbbl is now at 100MB.
> >
> > Cheers
> > Simon
> >
> > On Fri, 19 Jul 2013 08:49:50 +0100
> > Christian Dahlqvist <christ...@basho.com> wrote:
> >
> > > Hi Simon,
> > >
> > > If you have objects that can be as big as 15MB, it is probably wise to
> > > increase the size of +zdbbl in order to avoid filling up buffers when
> > > these large objects need to be transferred between nodes.
> > > What an appropriate level is depends a lot on the size distribution of
> > > your data and your access patterns, so I would recommend benchmarking
> > > to find a suitable value.
> > >
> > > Erlang also has a default process limit of 32768 (at least in R15B01),
> > > which may be what you are hitting. You can override this to 256k by
> > > adding the following line to the vm.args file:
> > >
> > > +P 262144
> > >
> > > Best regards,
> > >
> > > Christian
> > >
> > >
> > >
> > > On 19 Jul 2013, at 08:24, Simon Effenberg <seffenb...@team.mobile.de>
> > > wrote:
> > >
> > > > The +zdbbl parameter helped a lot, but the hinted handoffs didn't
> > > > disappear completely. I have no more busy dist port errors in the
> > > > _console.log_ (why aren't they in the error.log? To me it looks like
> > > > a serious problem to have.. at least our cluster was not behaving
> > > > that nicely).
> > > >
> > > > I'll try to increase the buffer size to a higher value, because my
> > > > guess is that the objects sent from one node to another are also
> > > > stored in it, and we sometimes have objects which are up to 15MB.
> > > >
> > > > But I now also saw some crashes in the last 6 hours, on only two
> > > > machines, complaining about too many processes
> > > >
> > > > ----------------
> > > > console.log
> > > > 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process
> > > > <0.12813.29> with 15 neighbours exited with reason: {system_limit
> > > >
> > > > crash.log
> > > > 2013-07-19 02:04:21 UTC =ERROR REPORT====
> > > > Too many processes
> > > > ----------------
> > > >
> > > > The process has a process limit of 95142. So I will increase it now,
> > > > but I never saw any information about such problems on the Linux
> > > > tuning page. Am I missing something?
> > > >
> > > > Cheers
> > > > Simon
> > > >
> > > >
> > > > On Thu, 18 Jul 2013 19:34:18 +0100
> > > > Guido Medina <guido.med...@temetra.com> wrote:
> > > >
> > > >> If what you are describing is happening on 1.4, run riak-admin diag
> > > >> and check the new recommended kernel parameters. Also, in vm.args,
> > > >> uncomment the +zdbbl 32768 parameter, since what you are describing is
> > > >> similar to what happened to us when we upgraded to 1.4.
> > > >>
> > > >> HTH,
> > > >>
> > > >> Guido.
> > > >>
> > > >> On 18/07/13 19:21, Simon Effenberg wrote:
> > > >>> Hi @list,
> > > >>>
> > > >>> I sometimes see log entries about "hinted_handoff transfer of ..
> > > >>> failed because of TCP recv timeout".
> > > >>> Also, riak-admin transfers shows me many handoffs (is it possible to
> > > >>> get some insight into "how many" handoffs happened through
> > > >>> "riak-admin status"?).
> > > >>>
> > > >>> - Is it normal behavior to have up to 30 handoffs from/to different
> > > >>> nodes?
> > > >>> - How can I get to the bottom of the TCP recv timeout problem? I'm
> > > >>> not sure if this is a network problem or if the other node is too
> > > >>> slow. The load is ok on the machines (some IOwait but not 100%).
> > > >>> Maybe it is interfering with AAE?
> > > >>>
> > > >>> Here is the log information about the TCP recv timeout.
> > > >>> It does not occur that often, but handoffs happen really often:
> > > >>>
> > > >>> 2013-07-18 16:22:05.654 UTC [error]
> > > >>> <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff
> > > >>> transfer of riak_kv_vnode from 'riak@10.46.109.207'
> > > >>> 1118962191081472546749696200048404186924073353216 to
> > > >>> 'riak@10.46.109.205'
> > > >>> 1118962191081472546749696200048404186924073353216 failed because of
> > > >>> TCP recv timeout
> > > >>> 2013-07-18 16:22:05.673 UTC [error]
> > > >>> <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound
> > > >>> handoff of partition riak_kv_vnode
> > > >>> 1118962191081472546749696200048404186924073353216 was terminated for
> > > >>> reason: {shutdown,timeout}
> > > >>>
> > > >>>
> > > >>> Thanks in advance
> > > >>> Simon
> > > >>>
> > > >>> _______________________________________________
> > > >>> riak-users mailing list
> > > >>> riak-users@lists.basho.com
> > > >>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> > > >
> > > > --
> > > > Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> > > > Fon: + 49-(0)30-8109 - 7173
> > > > Fax: + 49-(0)30-8109 - 7131
> > > >
> > > > Mail: seffenb...@team.mobile.de
> > > > Web: www.mobile.de
> > > >
> > > > Marktplatz 1 | 14532 Europarc Dreilinden | Germany
> > > >
> > > >
> > > > Managing Director: Malte Krüger
> > > > HRB No.: 18517 P, Amtsgericht Potsdam
> > > > Registered office: Kleinmachnow
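
To see whether the "TCP recv timeout" failures quoted in this thread cluster on particular node pairs, one option is to tally the sender-side errors from console.log. The following is only a minimal Python sketch, assuming each log entry sits on a single line in the format quoted above (the mail client wrapped them here). The regular expression, the function name `tally_handoff_failures`, and the `SAMPLE_LINES` list are illustrative assumptions, not part of Riak or its tooling.

```python
import re
from collections import Counter

# Assumed format, taken from the riak_core_handoff_sender errors quoted in
# this thread: "<kind> transfer of <vnode> from '<src>' <partition>
# to '<dst>' <partition> failed because of <reason>".
HANDOFF_FAILURE = re.compile(
    r"(?P<kind>\w+) transfer of (?P<vnode>\w+) "
    r"from '(?P<src>[^']+)' (?P<partition>\d+) "
    r"to '(?P<dst>[^']+)' \d+ "
    r"failed because of (?P<reason>.+)$"
)


def tally_handoff_failures(lines):
    """Count failed handoffs per (source node, destination node, reason)."""
    counts = Counter()
    for line in lines:
        # Normalise whitespace so stray tabs or double spaces do not matter.
        flat = " ".join(line.split())
        match = HANDOFF_FAILURE.search(flat)
        if match:
            key = (match.group("src"), match.group("dst"), match.group("reason"))
            counts[key] += 1
    return counts


# Sample entries copied from the thread and joined back onto single lines.
SAMPLE_LINES = [
    "2013-07-18 16:22:05.654 UTC [error] "
    "<0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff "
    "transfer of riak_kv_vnode from 'riak@10.46.109.207' "
    "1118962191081472546749696200048404186924073353216 to "
    "'riak@10.46.109.205' "
    "1118962191081472546749696200048404186924073353216 failed because of "
    "TCP recv timeout",
    # Manager-side entry: deliberately not matched by the sender pattern.
    "2013-07-18 16:22:05.673 UTC [error] "
    "<0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound "
    "handoff of partition riak_kv_vnode "
    "1118962191081472546749696200048404186924073353216 was terminated for "
    "reason: {shutdown,timeout}",
    "2013-07-19 13:27:07.800 UTC [error] "
    "<0.18747.29>@riak_core_handoff_sender:start_fold:216 hinted_handoff "
    "transfer of riak_kv_vnode from 'riak@10.46.109.207' "
    "1136089163393944065322395631681798128560666312704 to "
    "'riak@10.47.109.202' "
    "1136089163393944065322395631681798128560666312704 failed because of "
    "TCP recv timeout",
]

if __name__ == "__main__":
    for (src, dst, reason), n in tally_handoff_failures(SAMPLE_LINES).items():
        print(f"{src} -> {dst}: {n} failure(s), reason: {reason}")
```

To run it against a real node, the sample list could be swapped for an open file handle on that node's console.log. A count that is skewed toward one destination would point at that node or its network path rather than at the cluster as a whole.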