Re: Possible handoff stalls

Armon Dadgar Mon, 19 Mar 2012 11:40:44 -0700

Okay, good to know this is a known issue. I attached the
logs for the last time this occurred in my original email.


I'll try to capture this information if the problem occurs again.
Thanks.

Best Regards,

Armon Dadgar

On Mar 19, 2012, at 11:36 AM, Jon Meredith wrote:

> Hi Armon,
> 
> We've recently patched an issue that affects handoffs here 
> https://github.com/basho/riak_core/pull/153
> 
> If the issue repeats for you, it would be very useful if, as well as the 
> logs, you could follow the instructions from the pull request above to run 
> the 'riak_core_handoff_manager:status().' command against all nodes.
> 
> The pull request works around an issue where it looks like the kernel has 
> closed a socket (no evidence of it any longer with netstat/ss) but the Erlang 
> process is still stuck in a receive call from it (gen_tcp:recv/2 to be more 
> precise).
> 
> Please let us know if you hit it again.
> 
> Best, Jon.
> 
> On Mon, Mar 19, 2012 at 12:10 PM, Armon Dadgar <armon.dad...@gmail.com> wrote:
> I wanted to ping the mailing list and see if anybody else has encountered
> stalls in the partition handoffs on Riak 1.1. We added a new node to our
> cluster last Friday, but noticed that the partition handoffs appear to have
> stopped after about 7-8 hours.
> 
> Most of the handoffs completed, and the only handoffs that remained were from
> node 3 to node 2. The ring claimant (node 1) indicated that node 3 was
> unreachable (via ring_status). However, Riak Control did not indicate that
> node 3 was unreachable, and in fact it was live and continuing to serve
> requests.
> 
> To resolve this, I tried to just restart node 3. I ran "riak stop" multiple
> times, but this did not actually seem to do anything (the node was continuing
> to run and serve requests). Next, I attached to the node and ran
> "init:stop()." This started to shut down various sub-systems, but the node
> was still running. Sending a SIGTERM signal to the beam VM finally killed it.
> Restarting the node with "riak start" worked as expected, and the node
> promptly resumed the handoffs, and finished in a few hours.
> 
> I'm not sure exactly what the issue was, but something seemed to cause a
> stalling of the handoffs.
> 
> I've attached the contents of our console.log, erlang.log, error.log and
> crash.log from the relevant times if that is useful.
> 
> Best Regards,
> 
> Armon Dadgar
> 
> 
> 
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
> 
> 
> 
> 
> -- 
> Jon Meredith
> Platform Engineering Manager
> Basho Technologies, Inc.
> jmered...@basho.com
>
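
A note on the failure mode Jon describes above: gen_tcp:recv/2 waits with no
timeout, so a process can sit in it indefinitely if it never learns that the
connection is gone, whereas gen_tcp:recv/3 accepts a timeout. The sketch below
shows the same general idea in Python rather than Erlang. It is only an
analogy and is not a description of what the pull request does. The helper
name recv_with_timeout and the 0.2 second timeout are made up for
illustration.

```python
import socket


def recv_with_timeout(sock, nbytes, timeout_s):
    """Read up to nbytes from sock, waiting at most timeout_s seconds.

    Returns the received bytes, b"" if the peer closed the connection,
    or None if nothing arrived before the timeout (hypothetical helper).
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # No data and no close notification: give up instead of blocking forever.
        return None


# Demo on a local, connected socket pair standing in for sender and receiver.
a, b = socket.socketpair()

silent = recv_with_timeout(a, 1024, 0.2)   # peer sends nothing: times out

b.sendall(b"handoff-data")
data = recv_with_timeout(a, 1024, 0.2)     # peer sent data: returned as-is

b.close()
closed = recv_with_timeout(a, 1024, 0.2)   # peer closed cleanly: empty bytes
a.close()

print(silent, data, closed)
```

With a bounded wait, the caller can tell "no data yet" apart from "peer
closed" and decide whether to retry or abandon the transfer, instead of
hanging in the receive call.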
