wow.. now I have something to search for..

riak46-1  Max processes  unlimited  unlimited  processes
riak46-2  Max processes  unlimited  unlimited  processes
riak46-3  Max processes  unlimited  unlimited  processes
riak46-4  Max processes  unlimited  unlimited  processes
riak46-5  Max processes  unlimited  unlimited  processes
riak46-6  Max processes  unlimited  unlimited  processes
riak46-7  Max processes  95142      95142      processes
riak46-8  Max processes  unlimited  unlimited  processes
riak46-9  Max processes  95142      95142      processes
riak47-1  Max processes  191896     191896     processes
riak47-2  Max processes  192920     192920     processes
riak47-3  Max processes  unlimited  unlimited  processes
riak47-4  Max processes  unlimited  unlimited  processes
riak47-5  Max processes  unlimited  unlimited  processes
riak47-6  Max processes  unlimited  unlimited  processes
riak47-7  Max processes  95142      95142      processes
riak47-8  Max processes  95142      95142      processes
riak47-9  Max processes  95142      95142      processes
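The listing above has the shape of the `Max processes` row from `/proc/<pid>/limits` (limit name, soft limit, hard limit, unit), one row per host. A minimal sketch, assuming exactly that per-host format, of how the hosts with a finite limit could be picked out automatically. The names `parse_limits`, `hosts_with_finite_limit` and `SAMPLE` are made up for illustration and are not part of Riak or any tool mentioned in this thread.

```python
import re
from collections import defaultdict

# Matches a "Max processes" row as found in /proc/<pid>/limits, prefixed by a
# host name. The host prefix is an assumption about how the listing was collected.
ROW = re.compile(
    r"^(?P<host>\S+)\s+Max processes\s+(?P<soft>\S+)\s+(?P<hard>\S+)\s+processes\s*$"
)


def parse_limits(text):
    """Return {host: (soft, hard)}; the value 'unlimited' is mapped to None."""
    def to_limit(value):
        return None if value == "unlimited" else int(value)

    limits = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            limits[match["host"]] = (to_limit(match["soft"]), to_limit(match["hard"]))
    return limits


def hosts_with_finite_limit(limits):
    """Group hosts by soft limit, skipping hosts whose limit is unlimited."""
    groups = defaultdict(list)
    for host, (soft, _hard) in limits.items():
        if soft is not None:
            groups[soft].append(host)
    return dict(groups)


# A few rows copied from the listing above.
SAMPLE = """\
riak46-6 Max processes unlimited unlimited processes
riak46-7 Max processes 95142 95142 processes
riak47-1 Max processes 191896 191896 processes
riak47-7 Max processes 95142 95142 processes
"""

if __name__ == "__main__":
    grouped = hosts_with_finite_limit(parse_limits(SAMPLE))
    for limit, hosts in sorted(grouped.items()):
        print(limit, ", ".join(hosts))
```

Grouping by limit value makes it easier to see whether the limited hosts share one setting (which would hint at a common cause) or differ from each other.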
riak46-{7..9}, riak47-1 and riak47-{7..9} were reinstalled quite recently, but all of them with Puppet, and in theory there is nothing special about them compared to the other ones. I need to have a look and will probably try to enforce an "unlimited" process limit.

Cheers
Simon

On Fri, 19 Jul 2013 09:24:07 +0200
Simon Effenberg <seffenb...@team.mobile.de> wrote:

> The +zdbbl parameter helped a lot but the hinted handoffs didn't
> disappear completely. I have no more busy dist port errors in the
> _console.log_ (why aren't they in the error.log? it looks for me like a
> serious problem you have.. at least our cluster was behaving not that
> nice).
>
> I'll try to increase the buffer size to a higher value because my
> suggestion is that also the objects send from one to another are also
> stored therein and we have sometimes objects which are up to 15MB.
>
> But I saw now also some crashes in the last 6 hours on only two machines
> complaining about too many processes
>
> ----------------
> console.log
> 2013-07-19 02:04:21.962 UTC [error] <0.12813.29> CRASH REPORT Process
> <0.12813.29> with 15 neighbours exited with reason: {system_limit
>
> crash.log
> 2013-07-19 02:04:21 UTC =ERROR REPORT====
> Too many processes
> ----------------
>
> the process has a process limit of 95142. So I will increase it now but I
> never saw any information about such problems on the linux tuning page. Am I
> missing something?
>
> Cheers
> Simon
>
>
> On Thu, 18 Jul 2013 19:34:18 +0100
> Guido Medina <guido.med...@temetra.com> wrote:
>
> > If what you are describing is happening for 1.4, type riak-admin diag
> > and see the new recommended kernel parameters, also, on vm.args
> > uncomment the +zdbbl 32768 parameter, since what you are describing is
> > similar to what happened to us when we upgraded to 1.4.
> >
> > HTH,
> >
> > Guido.
> >
> > On 18/07/13 19:21, Simon Effenberg wrote:
> > > Hi @list,
> > >
> > > I see sometimes logs talking about "hinted_handoff transfer of .. failed
> > > because of TCP recv timeout".
> > > Also riak-admin transfers shows me many handoffs (is it possible to give
> > > some insights about "how many" handoffs happened through "riak-admin
> > > status"?).
> > >
> > > - Is it a normal behavior to have up to 30 handoffs from/to different
> > > nodes?
> > > - How can I get down to the problem with the TCP recv timeout? I'm not
> > > sure if this is a network problem or if the other node is too slow. The
> > > load is ok on the machines (some IOwait but not 100%). Maybe interfering
> > > with AAE?
> > >
> > > Here the log information about the TCP recv timeout. But that is not that
> > > often but handoffs happens really often:
> > >
> > > 2013-07-18 16:22:05.654 UTC [error]
> > > <0.28933.14>@riak_core_handoff_sender:start_fold:216 hinted_handoff
> > > transfer of riak_kv_vnode from 'riak@10.46.109.207'
> > > 1118962191081472546749696200048404186924073353216 to 'riak@10.46.109.205'
> > > 1118962191081472546749696200048404186924073353216 failed because of TCP
> > > recv timeout
> > > 2013-07-18 16:22:05.673 UTC [error]
> > > <0.202.0>@riak_core_handoff_manager:handle_info:282 An outbound handoff
> > > of partition riak_kv_vnode
> > > 1118962191081472546749696200048404186924073353216 was terminated for
> > > reason: {shutdown,timeout}
> > >
> > >
> > > Thanks in advance
> > > Simon
> > >
> > > _______________________________________________
> > > riak-users mailing list
> > > riak-users@lists.basho.com
> > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>
>
> --
> Simon Effenberg | Site Ops Engineer | mobile.international GmbH
> Fon: + 49-(0)30-8109 - 7173
> Fax: + 49-(0)30-8109 - 7131
>
> Mail: seffenb...@team.mobile.de
> Web: www.mobile.de
>
> Marktplatz 1 | 14532 Europarc Dreilinden | Germany
>
> Geschäftsführer: Malte Krüger
> HRB Nr.: 18517 P, Amtsgericht Potsdam
> Sitz der Gesellschaft: Kleinmachnow