Possible handoff stalls

Armon Dadgar Mon, 19 Mar 2012 11:24:05 -0700

I wanted to ping the mailing list and see if anybody else has encountered
stalls in the partition handoffs on Riak 1.1. We added a new node to our cluster
last Friday, but noticed that the partition handoffs appear to have stopped
after about 7-8 hours.

Most of the handoffs completed, and the only handoffs that remained were from node 3 to node 2.
The ring claimant (node 1) indicated that node 3 was unreachable (via ring_status).
However, Riak Control did not indicate that node 3 was unreachable, and in fact it was
live and continuing to serve requests.

To resolve this, I tried to just restart node 3. I ran "riak stop" multiple times, but this did
not actually seem to do anything (the node continued to run and serve requests).
Next, I attached to the node and ran "init:stop()." This started to shut down various
sub-systems, but the node was still running. Sending a SIGTERM signal to the BEAM VM
finally killed it. Restarting the node with "riak start" worked as expected,
and the node promptly resumed the handoffs and finished them in a few hours.
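
The escalation described above (try a graceful stop first, and send SIGTERM only if the VM is still alive) can be sketched as a small shell helper. This is an illustrative sketch, not the commands that were actually run: the function name stop_with_fallback, the grace period, and passing the PID as an argument are all assumptions, and how the BEAM PID is looked up is left out.

```shell
#!/bin/sh
# Hypothetical helper (not part of Riak): run a graceful stop command,
# wait up to GRACE seconds for the process to exit, then fall back to SIGTERM.
# Usage: stop_with_fallback PID GRACE_SECONDS [STOP_COMMAND ...]
stop_with_fallback() {
    pid="$1"
    grace="$2"
    shift 2

    # Ask for a graceful shutdown; ignore failures so we can still escalate.
    "$@" || true

    i=0
    while [ "$i" -lt "$grace" ]; do
        # kill -0 only checks whether the process can still be signalled.
        kill -0 "$pid" 2>/dev/null || { echo "graceful"; return 0; }
        sleep 1
        i=$((i + 1))
    done

    # Still running after the grace period: escalate to SIGTERM.
    kill -TERM "$pid" 2>/dev/null || true
    echo "sigterm"
}

# Example (assumes BEAM_PID holds the PID of the node's beam process):
# stop_with_fallback "$BEAM_PID" 60 riak stop
```

The helper prints "graceful" if the process exited within the grace period and "sigterm" if it had to escalate, which makes it easy to note in a log which path was taken.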

I'm not sure exactly what the issue was, but something seemed to cause the
handoffs to stall.

I've attached the contents of our console.log, erlang.log, error.log and crash.log
from the relevant times in case they are useful.

Best Regards,

Armon Dadgar

logs.tar.gz
Description: GNU Zip compressed data

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
