I wanted to ping the mailing list to see if anybody else has encountered stalls in partition handoffs on Riak 1.1.

We added a new node to our cluster last Friday, but noticed that the partition handoffs appeared to stop after about 7-8 hours. Most of the handoffs completed, and the only ones remaining were from node 3 to node 2. The ring claimant (node 1) indicated, via ring_status, that node 3 was unreachable. However, Riak Control did not show node 3 as unreachable, and the node was in fact live and continuing to serve requests.

To resolve this, I tried simply restarting node 3. I ran "riak stop" multiple times, but this did not seem to do anything (the node kept running and serving requests). Next, I attached to the node and ran "init:stop()." This started shutting down various subsystems, but the node was still running. Sending a SIGTERM signal to the BEAM VM finally killed it. Restarting the node with "riak start" worked as expected; the node promptly resumed the handoffs and finished them in a few hours.

I'm not sure exactly what the issue was, but something seemed to cause the handoffs to stall. I've attached the contents of our console.log, erlang.log, error.log and crash.log from the relevant times in case that is useful.

Best Regards,
Armon Dadgar
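
For anyone who hits the same thing, the stop escalation described above (graceful stop first, then SIGTERM if the process is still alive) could be sketched roughly as below. This is only a sketch under assumptions: the function names and the GRACE_SECS variable are hypothetical and not part of Riak, and the timeouts are arbitrary rather than what was actually used on our node.

```shell
#!/bin/sh
# Sketch only: escalate from a graceful stop command to SIGTERM.
# GRACE_SECS and both function names are hypothetical, not part of Riak.
GRACE_SECS=${GRACE_SECS:-5}

# Return 0 once process $1 has exited, polling once per second
# for up to $2 seconds; return non-zero if it is still alive.
wait_for_exit() {
    wfe_pid=$1
    wfe_left=$2
    while [ "$wfe_left" -gt 0 ]; do
        kill -0 "$wfe_pid" 2>/dev/null || return 0
        sleep 1
        wfe_left=$((wfe_left - 1))
    done
    ! kill -0 "$wfe_pid" 2>/dev/null
}

# Usage: stop_with_fallback PID COMMAND [ARGS...]
# Runs the graceful stop command, waits, then falls back to SIGTERM.
stop_with_fallback() {
    swf_pid=$1
    shift
    "$@" >/dev/null 2>&1 || true                  # graceful attempt; failure is tolerated
    wait_for_exit "$swf_pid" "$GRACE_SECS" && return 0
    kill -TERM "$swf_pid" 2>/dev/null || true     # fallback: SIGTERM to the VM process
    wait_for_exit "$swf_pid" "$GRACE_SECS"
}
```

With the PID of the node's beam process in a variable (looked up however you prefer; BEAM_PID here is just a placeholder), this would be called as `stop_with_fallback "$BEAM_PID" riak stop`, followed by `riak start`. After that, `riak-admin transfers` and `riak-admin ring_status` show whether the handoffs have resumed.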
logs.tar.gz
Description: GNU Zip compressed data
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com