Hello, the Riak version is 1.1.4-1. We set the transfer limit in the config and made it equal to 4.
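Roughly what we changed in app.config — I am assuming the relevant entry is handoff_concurrency under riak_core, since this version has no riak-admin transfer-limit command; the exact entry on our nodes may differ:

    %% /etc/riak/app.config (excerpt; our assumption of the setting involved)
    {riak_core, [
        %% allow up to 4 concurrent partition handoffs per node
        {handoff_concurrency, 4}
    ]},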
I don't think we have riak-admin transfer-limit or riak-admin cluster plan. The problem is that the nodes can't pass the partition between each other, probably because the partitions are too big: each one is about 5k files (LevelDB backend) and weighs about 10 GB. There are no problems with smaller partitions. We can't find anything useful about the handoff failure in the Riak or system logs. The ulimit and Erlang port limits seem to be set far higher than needed; we already increased them four times today.

It begins like:

2014-07-14 12:22:45.518 UTC [info] <0.10544.0>@riak_core_handoff_sender:start_fold:83 Starting handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133'

And ends like:

2014-07-14 08:43:28.829 UTC [error] <0.2264.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 1318000 objects in 1455.15 seconds: closed

2014-07-14 10:40:18.294 UTC [error] <0.11555.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 911000 objects in 2734.48 seconds: closed

2014-07-14 09:43:43.197 UTC [error] <0.26922.2>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 32000 objects in 963.06 seconds: timeout

Maybe we need to check something else on the target node? It also keeps running into long GC pauses:

2014-07-14 12:30:03.579 UTC [info] <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.468.0> [{initial_call,{riak_kv_js_vm,init,1}},{almost_current_function,{xmerl_ucs,expand_utf8_1,3}},{message_queue_len,0}] [{timeout,118},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,136165}]

2014-07-14 12:30:44.386 UTC [info] <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.713.0> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] [{timeout,126},{old_heap_block_size,0},{heap_block_size,1597},{mbuf_size,0},{stack_size,38},{old_heap_size,0},{heap_size,658}]

We probably have some CPU issues here, but the node is not under load at the moment.
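For reference, this is roughly how we are checking the handoff and file-descriptor situation on both nodes (a sketch of what we run; the beam.smp process name for the Riak Erlang VM is an assumption, adjust for your install):

    # which partitions are still awaiting handoff on this node
    riak-admin transfers

    # ring ownership, same output format as quoted below
    riak-admin member_status

    # effective open-file limit of the running Riak VM (assumes the process is beam.smp)
    grep -i 'open files' /proc/$(pgrep -o -f beam.smp)/limits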
Thank you,
Leonid

2014-07-14 16:11 GMT+04:00 Ciprian Manea <cipr...@basho.com>:

> Hi Leonid,
>
> Which Riak version are you running?
>
> Have you committed* the cluster plan after issuing the cluster
> force-remove <node> commands?
>
> What is the output of $ riak-admin transfer-limit, run from one of your
> Riak nodes?
>
>
> *Do not run this command yet if you have not done it already.
> Please run riak-admin cluster plan and attach its output here.
>
>
> Thanks,
> Ciprian
>
>
> On Mon, Jul 14, 2014 at 2:41 PM, Леонид Рябоштан <
> leonid.riabosh...@twiket.com> wrote:
>
>> Hello, guys,
>>
>> It seems like we ran into an emergency. I wonder if there is any way you
>> can help with it.
>>
>> Everything that happened below came from trying to rebalance the space
>> used by nodes that were running out of disk.
>>
>> The cluster is 7 machines now; member_status looks like:
>>
>> Attempting to restart script through sudo -u riak
>> ================================= Membership ==================================
>> Status     Ring      Pending    Node
>> -------------------------------------------------------------------------------
>> valid      15.6%     20.3%      'riak@192.168.135.180'
>> valid       0.0%      0.0%      'riak@192.168.152.90'
>> valid       0.0%      0.0%      'riak@192.168.153.182'
>> valid      26.6%     23.4%      'riak@192.168.164.133'
>> valid      27.3%     21.1%      'riak@192.168.177.36'
>> valid       8.6%     15.6%      'riak@192.168.194.138'
>> valid      21.9%     19.5%      'riak@192.168.194.149'
>> -------------------------------------------------------------------------------
>> Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>>
>> The two nodes with 0.0% Ring were force-removed from the cluster; they still
>> hold plenty of data, which now seems to be inaccessible. Handoffs appear to be
>> stuck. Node 'riak@192.168.152.90' (in the same situation as
>> 'riak@192.168.153.182') tries to hand off partitions to 'riak@192.168.164.133'
>> but fails for an unknown reason after huge timeouts (from 5 to 40 minutes). The
>> partition it is trying to move is about 10 GB in size. It grows slowly on the
>> target node, but that is probably just normal writes from regular operation. It
>> does not get any smaller on the source node.
>>
>> I wonder, is there any way to tell the cluster that those partitions should
>> actually stay on the source node, so there is no need to transfer them? How can
>> we redo the cluster ownership balance and revert this force-leave?
>>
>> Thank you,
>> Leonid
>>
>> _______________________________________________
>> riak-users mailing list
>> riak-users@lists.basho.com
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com