Hello,

Riak version is 1.1.4-1. We set the transfer limit in the config and made it
equal to 4.
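
For reference, the setting we changed should look roughly like this in
app.config (a sketch; assuming the "transfer limit" here is riak_core's
handoff_concurrency setting):

%% app.config, riak_core section -- assumption: "transfer limit" means
%% handoff_concurrency
{riak_core, [
  %% maximum number of concurrent handoff transfers per node
  {handoff_concurrency, 4}
]}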

I don't think we have riak-admin transfer-limit or riak-admin cluster plan in
this version.

The problem is that the nodes can't pass partitions between each other,
probably because the partitions are too big: each is about 5k files (LevelDB
backend) and weighs about 10GB. There are no problems with smaller
partitions. We can't find anything useful about the handoff failures in the
Riak or system logs. The ulimit and Erlang port limits seem to be well above
what's needed; we increased them 4 times today.
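
For what it's worth, this is roughly how we checked those limits (a sketch;
the beam.smp process match and the ERL_MAX_PORTS setting in vm.args are
assumptions about our setup):

# open-file limit of the running Erlang VM (Linux)
cat /proc/$(pgrep -f beam.smp)/limits | grep 'open files'

# inside `riak attach`: count ports currently open in the VM, to compare
# against the -env ERL_MAX_PORTS value in vm.args
length(erlang:ports()).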

It begins like this:
2014-07-14 12:22:45.518 UTC [info]
<0.10544.0>@riak_core_handoff_sender:start_fold:83 Starting handoff of
partition riak_kv_vnode 68507889249886074290797726533575766546371837952
from 'riak@192.168.153.182' to 'riak@192.168.164.133'

And it ends like this:
2014-07-14 08:43:28.829 UTC [error]
<0.2264.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition
riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending
1318000 objects in 1455.15 seconds: closed
2014-07-14 10:40:18.294 UTC [error]
<0.11555.0>@riak_core_handoff_sender:start_fold:152 Handoff of partition
riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 911000
objects in 2734.48 seconds: closed
2014-07-14 09:43:43.197 UTC [error]
<0.26922.2>@riak_core_handoff_sender:start_fold:152 Handoff of partition
riak_kv_vnode 68507889249886074290797726533575766546371837952 from '
riak@192.168.153.182' to 'riak@192.168.164.133' FAILED after sending 32000
objects in 963.06 seconds: timeout
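
Since the failures end with "closed" and "timeout", our guess is that the
receiver side drops the connection. This is what we planned to check (a
sketch; the log paths are assumptions about our install, and the receiver
module name is inferred from the sender module in the log above):

# overall handoff activity across the cluster
riak-admin transfers

# receiver-side handoff errors on the target node
grep -i riak_core_handoff_receiver /var/log/riak/console.log
grep -i handoff /var/log/riak/error.log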

Maybe we need to check something else on the target node? It actually keeps
running into GC problems:
2014-07-14 12:30:03.579 UTC [info]
<0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.468.0>
[{initial_call,{riak_kv_js_vm,init,1}},{almost_current_function,{xmerl_ucs,expand_utf8_1,3}},{message_queue_len,0}]
[{timeout,118},{old_heap_block_size,0},{heap_block_size,196418},{mbuf_size,0},{stack_size,45},{old_heap_size,0},{heap_size,136165}]
2014-07-14 12:30:44.386 UTC [info]
<0.99.0>@riak_core_sysmon_handler:handle_event:85 monitor long_gc <0.713.0>
[{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{gen_fsm,loop,7}},{message_queue_len,0}]
[{timeout,126},{old_heap_block_size,0},{heap_block_size,1597},{mbuf_size,0},{stack_size,38},{old_heap_size,0},{heap_size,658}]
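
The long_gc thresholds behind those messages presumably come from the
riak_sysmon settings; for context, roughly (assumed values, not verified
against our app.config):

%% app.config, riak_sysmon section -- assumed defaults
{riak_sysmon, [
  %% report a warning when a single GC pause exceeds this many ms
  {gc_ms_limit, 100},
  %% report a warning when a process heap exceeds this many words
  {heap_word_limit, 40111000}
]}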

Probably we have some CPU issues here, but the node is not under load at the
moment.

Thank you,
Leonid


2014-07-14 16:11 GMT+04:00 Ciprian Manea <cipr...@basho.com>:

> Hi Leonid,
>
> Which Riak version are you running?
>
> Have you committed* the cluster plan after issuing the cluster
> force-remove <node> commands?
>
> What is the output of $ riak-admin transfer-limit, run from one of your
> riak nodes?
>
>
> *Do not run this command yet if you have not done it already.
> Please run a riak-admin cluster plan and attach its output here.
>
>
> Thanks,
> Ciprian
>
>
> On Mon, Jul 14, 2014 at 2:41 PM, Леонид Рябоштан <
> leonid.riabosh...@twiket.com> wrote:
>
>> Hello, guys,
>>
>> It seems like we ran into an emergency. I wonder if there is any help
>> available for this.
>>
>> Everything that happened below was because we were trying to rebalance
>> space used by nodes that were running out of space.
>>
>> The cluster is 7 machines now, and member_status looks like this:
>> Attempting to restart script through sudo -u riak
>> ================================= Membership
>> ==================================
>> Status     Ring    Pending    Node
>>
>> -------------------------------------------------------------------------------
>> valid      15.6%     20.3%    'riak@192.168.135.180'
>> valid       0.0%      0.0%    'riak@192.168.152.90'
>> valid       0.0%      0.0%    'riak@192.168.153.182'
>> valid      26.6%     23.4%    'riak@192.168.164.133'
>> valid      27.3%     21.1%    'riak@192.168.177.36'
>> valid       8.6%     15.6%    'riak@192.168.194.138'
>> valid      21.9%     19.5%    'riak@192.168.194.149'
>>
>> -------------------------------------------------------------------------------
>> Valid:7 / Leaving:0 / Exiting:0 / Joining:0 / Down:0
>>
>> The 2 nodes with 0% Ring were made to force-leave the cluster; they have
>> plenty of data on them which now seems to be inaccessible. Handoffs seem
>> to be stuck. Node 'riak@192.168.152.90' (in the same situation as
>> 'riak@192.168.153.182') tries to hand off partitions to
>> 'riak@192.168.164.133' but fails for an unknown reason after huge timeouts
>> (from 5 to 40 minutes). The partition it's trying to move is about 10GB in
>> size. It grows slowly on the target node, but that's probably just the
>> usual writes from normal operation. It doesn't get any smaller on the
>> source node.
>>
>> I wonder, is there any way to let the cluster know that we want those
>> nodes to actually stay members, and that there's no real need to transfer
>> the partitions? How can we redo the cluster ownership balance and revert
>> this force-leave?
>>
>> Thank you,
>> Leonid
>>
>>
>>
>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
