Hi Leonid,

Let's try to increase the handoff_timeout and see if it solves your problem.
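For reference, 5400000 ms is 90 minutes. A value set with application:set_env only lasts until the node restarts, so if the larger timeouts do help, you will probably want to persist them as well. I can't see your app.config, but a minimal sketch of the corresponding entries, assuming the stock layout where riak_core settings live in their own section, would be:

    %% app.config (sketch only; merge these into the riak_core section of your existing file)
    {riak_core, [
        %% ... your existing riak_core settings ...
        {handoff_timeout, 5400000},           %% handoff timeout, in milliseconds
        {handoff_receive_timeout, 5400000}    %% receive-side handoff timeout, in milliseconds
    ]}

For now, though, the runtime change below is enough to test with.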
Could you please paste the code below at a $ riak attach prompt:

riak_core_util:rpc_every_member_ann(application,set_env,[riak_core, handoff_timeout, 5400000],infinity).
riak_core_util:rpc_every_member_ann(application,set_env,[riak_core, handoff_receive_timeout, 5400000],infinity).

You should be able to exit back to the shell prompt by pressing ^D. (See the P.S. at the bottom for a quick way to verify that the change took effect on every member.)

Could you please also archive/compress and send me directly by email:

+ the ring directory (including its contents) from one of your Riak nodes
+ recent log files (console.log, error.log, and crash.log if any) from the same node

Thanks,
Ciprian


On Mon, Jul 14, 2014 at 3:33 PM, Леонид Рябоштан <leonid.riabosh...@twiket.com> wrote:

> Hello,
>
> The Riak version is 1.1.4-1. We set the transfer limit in the config and made it equal to 4.
>
> I don't think we have riak-admin transfer-limit or riak-admin cluster plan.
>
> The problem is that the nodes can't pass the partition between each other, probably because it is too big: around 5k files (LevelDB backend) weighing about 10 GB. There are no problems with smaller partitions. We can't find anything useful about the handoff failure in the Riak or system logs. The ulimit and Erlang port limits seem to be way higher than needed; we increased them 4x today.
>
> It begins like:
> 2014-07-14 12:22:45.518 UTC [info] <0.10544.0>@riak_core_handoff_sender:start_fold:83 Starting handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133'
>
> And ends like:
> 2014-07-14 08:43:28.829 UTC [error] <0.2264.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 1318000 objects in 1455.15 seconds: closed
> 2014-07-14 10:40:18.294 UTC [error] <0.11555.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 911000 objects in 2734.48 seconds: closed
> 2014-07-14 09:43:43.197 UTC [error] <0.26922.2>@riak_core_handoff_sender:start_fold:152 Handoff of partition riak_kv_vnode 68507889249886074290797726533575766546371837952 from 'riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 32000 objects in 963.06 seconds: timeout
>
> Maybe we need to check something else on the target node? It actually always runs into GC problems:
> 2014-07-14 12:30:03.579 UTC [info] <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.468.0> [{initial_call,{riak_kv_js_vm,init,1}},{almost_current_function,{xmerl_ucs,expand_utf8_1,3}},{message_queue_len,0}] [{timeout,118},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,136165}]
> 2014-07-14 12:30:44.386 UTC [info] <0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.713.0> [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}] [{timeout,126},{old_heap_block_size,0},{heap_block_size,1597},{mbuf_size,0},{stack_size,38},{old_heap_size,0},{heap_size,658}]
>
> We probably have some CPU issues here, but the node is not under load currently.
>
> Thank you,
> Leonid
>
>
> 2014-07-14 16:11 GMT+04:00 Ciprian Manea <cipr...@basho.com>:
>
>> Hi Leonid,
>>
>> Which Riak version are you running?
>>
>> Have you committed* the cluster plan after issuing the cluster force-remove <node> commands?
>>
>> What is the output of $ riak-admin transfer-limit, run from one of your Riak nodes?
>>
>> *Do not run this command yet if you have not done so already. Please run a riak-admin cluster plan and attach its output here.
>>
>> Thanks,
>> Ciprian
>>
>>
>> On Mon, Jul 14, 2014 at 2:41 PM, Леонид Рябоштан <leonid.riabosh...@twiket.com> wrote:
>>
>>> Hello, guys,
>>>
>>> It seems like we ran into an emergency. I wonder if there is any help to be had.
>>>
>>> Everything that happened below was because we were trying to rebalance the space used by nodes that were running out of disk space.
>>>
>>> The cluster is 7 machines now, and member_status looks like:
>>> Attempting to restart script through sudo -u riak
>>> ================================= Membership ==================================
>>> Status     Ring    Pending    Node
>>> -------------------------------------------------------------------------------
>>> valid      15.6%     20.3%    'riak@192.168.135.180'
>>> valid       0.0%      0.0%    'riak@192.168.152.90'
>>> valid       0.0%      0.0%    'riak@192.168.153.182'
>>> valid      26.6%     23.4%    'riak@192.168.164.133'
>>> valid      27.3%     21.1%    'riak@192.168.177.36'
>>> valid       8.6%     15.6%    'riak@192.168.194.138'
>>> valid      21.9%     19.5%    'riak@192.168.194.149'
>>> -------------------------------------------------------------------------------
>>> Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>>>
>>> The 2 nodes with 0% ring were made to force-leave the cluster; they have plenty of data on them which now seems to be inaccessible. The handoffs appear to be stuck. Node 'riak@192.168.152.90' (in the same situation as 'riak@192.168.153.182') tries to hand off partitions to 'riak@192.168.164.133' but fails for an unknown reason after huge timeouts (from 5 to 40 minutes). The partition it is trying to move is about 10 GB in size. It grows slowly on the target node, but that is probably just the usual writes from normal operation. It doesn't get any smaller on the source node.
>>>
>>> I wonder, is there any way to let the cluster know that we want those partitions to stay owned by the source nodes, so there is no actual need to transfer them? How do we redo the cluster ownership balance and revert this force-leave?
>>>
>>> Thank you,
>>> Leonid
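P.S. Once the set_env calls above have been run, you can sanity-check that every member actually picked up the new values from the same riak attach prompt, using the same rpc helper (a quick sketch; each member should come back with something like {ok,5400000}):

riak_core_util:rpc_every_member_ann(application,get_env,[riak_core, handoff_timeout],infinity).
riak_core_util:rpc_every_member_ann(application,get_env,[riak_core, handoff_receive_timeout],infinity).

riak-admin transfers should also tell you whether the two 0% nodes are still waiting to hand off partitions while we dig through the logs.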
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com