Hello,

I have a problem with persistent timeouts during ownership handoffs. I've
searched the Internet and the mailing list archives, without success.

I have a Riak 1.4.12 cluster with 17 nodes. Almost all nodes use the multi
backend with bitcask and eleveldb as storage backends (we need multiple
backends for Riak CS 1.5.0 integration).

I am now migrating the Riak cluster to eleveldb as the primary and only
backend. So far, 2 nodes in the same cluster run the eleveldb backend.

During the ownership handoff process I constantly see errors about handoff
receivers and senders timing out.

Here is partial output of riak-admin transfers:
...
transfer type: ownership_transfer
vnode type: riak_kv_vnode
partition: 331121464707782692405522344912282871640797216768
started: 2015-10-21 08:32:55 [46.66 min ago]
last update: no updates seen
total size: unknown
objects transferred: unknown

                           unknown
riak@taipan.pleiad.uaprom  =======>  r...@eggeater.pleiad.uaprom
        |                                           |   0%
                           unknown

transfer type: ownership_transfer
vnode type: riak_kv_vnode
partition: 336830455478606531929755488790080852186328203264
started: 2015-10-21 08:32:54 [46.68 min ago]
last update: no updates seen
total size: unknown
objects transferred: unknown
...

For some partitions the handoff state never updates; others terminate
after transferring part of their objects and never start again.

I see nothing in the logs except the following:

On receiver side:

2015-10-21 11:33:55.131 [error]
<0.25390.1266>@riak_core_handoff_receiver:handle_info:105 Handoff receiver
for partition 331121464707782692405522344912282871640797216768 timed out
after processing 0 objects.

On sender side:

2015-10-21 11:01:58.879 [error] <0.13177.1401> CRASH REPORT Process
<0.13177.1401> with 0 neighbours crashed with reason: no function clause
matching webmachine_request:peer_from_peername({error,enotconn},
{webmachine_request,{wm_reqstate,#Port<0.50978116>,[],undefined,undefined,undefined,{wm_reqdata,...},...}})
line 150
2015-10-21 11:32:50.055 [error] <0.207.0> Supervisor
riak_core_handoff_sender_sup had child riak_core_handoff_sender started
with {riak_core_handoff_sender,start_link,undefined} at <0.22312.1090> exit
with reason max_concurrency in context child_terminated

{error, enotconn} seems to point to a network issue, but I have no network
problems. All hosts resolve their neighbors correctly, and /etc/hosts on
each node is correct.
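
Name resolution alone does not show that the handoff port is reachable, so
a direct TCP check between nodes may be worth running as well. Below is a
minimal sketch of such a check. It assumes the handoff listener is on port
8099 (Riak's default handoff_port; a different value may be set in
app.config), and the helper names can_connect and check_nodes are
hypothetical, not part of Riak.

```python
import socket

# Assumption: Riak's default handoff_port. Change this if app.config
# sets a different handoff_port for riak_core.
HANDOFF_PORT = 8099


def can_connect(host, port=HANDOFF_PORT, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False


def check_nodes(hosts, port=HANDOFF_PORT, timeout=3.0):
    """Map each host name to whether its handoff port accepted a connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Running check_nodes from every node with the host names of all other
cluster members would show whether any pair of nodes cannot open a handoff
connection. It only tests that a connection can be opened, not that a long
transfer stays alive.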

I've tried increasing handoff_timeout and handoff_receive_timeout, but
without success.

Forcing handoffs helped, but only for a short period of time:

rpc:multicall([node() | nodes()], riak_core_vnode_manager, force_handoffs, []).


I see the handoffs make progress (riak-admin transfers), but then they
time out again.


A week ago I joined 4 nodes with bitcask, and there were no such problems.


I'm a little confused and need to understand my next steps in
troubleshooting this issue.
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
