On 10 Apr 2014, at 4:49 am, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote:
>> Looks like your fencing (stonith) failed.
>
> Where? If I'm reading the logs correctly, it looks like stonith worked.
> Here's the stonith:
>
> Apr 8 09:53:21 lotus-4vm6 stonith-ng[2492]: notice: log_operation:
> Operation 'reboot' [3306] (call 2 from crmd.2496) for host 'lotus-4vm5'
> with device 'st-fencing' returned: 0 (OK)
>
> and then corosync reporting that the node left the cluster:
>
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> lost: lotus-4vm5 3176140298
>
> Yes? Or am I misunderstanding that message?
>
> Doesn't the following also further indicate that the vm5 node did actually
> get stonithed?
>
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info:
> ais_mark_unseen_peer_dead: Node lotus-4vm5 was not seen in the previous
> transition
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info: update_member:
> Node 3176140298/lotus-4vm5 is now: lost
>
> crmd and cib also seem to notice that the node has gone away, don't they
> here:
>
> Apr 8 09:53:26 lotus-4vm6 cib[2491]: notice: plugin_handle_membership:
> Membership 20: quorum lost
> Apr 8 09:53:26 lotus-4vm6 cib[2491]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost
> (was member)
> Apr 8 09:53:26 lotus-4vm6 crmd[2496]: notice: plugin_handle_membership:
> Membership 20: quorum lost
> Apr 8 09:53:26 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost
> (was member)
>
> And then the node comes back:
>
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] notice:
> pcmk_peer_update: Transitional membership event on ring 24: memb=1, new=0,
> lost=0
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> memb: lotus-4vm6 3192917514
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] notice:
> pcmk_peer_update: Stable membership event on ring 24: memb=2, new=1,
> lost=0
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: update_member:
> Node 3176140298/lotus-4vm5 is now: member
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> NEW: lotus-4vm5 3176140298
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> MEMB: lotus-4vm5 3176140298
>
> And now crmd realizes the node is back:
>
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member
> (was lost)
>
> As does cib:
>
> Apr 8 09:54:04 lotus-4vm6 cib[2491]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member
> (was lost)
>
> And stonith-ng and crmd report a successful reboot:
>
> Apr 8 09:54:04 lotus-4vm6 stonith-ng[2492]: notice: remote_op_done:
> Operation reboot of lotus-4vm5 by lotus-4vm6 for
> crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/nohh...@public.gmane.org: OK
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: tengine_stonith_callback:
> Stonith operation 2/13:0:0:f325afae-64b0-4812-a897-70556ab1e806: OK (0)
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: tengine_stonith_notify: Peer
> lotus-4vm5 was terminated (reboot) by lotus-4vm6 for lotus-4vm6: OK
> (ref=ae82b411-b07a-4235-be55-5a30a00b323b) by client crmd.2496
>
> But all of a sudden, crmd reports the node is "lost" again?
>
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was
> member)
>
> But why?

Brian: the detective work above is highly appreciated.

Little-known fact: I too got sick of trying to figure out why nodes were
being marked up/down, so the second '${tag}:' in these messages is actually
the source of the change. In this case it's the function
send_stonith_update(), and I recognise the problem from a few weeks ago.

Essentially the node is returning "too fast" (specifically, before the
fencing notification arrives), causing Pacemaker to forget that the node is
up and healthy. The fix for this is
https://github.com/beekhof/pacemaker/commit/e777b17 and is present in
1.1.11.

> Not surprising that we get these messages (below) if crmd thinks it was
> suddenly "lost" (when it was in fact not, according to the log for vm5):
>
> Apr 8 09:54:11 lotus-4vm6 crmd[2496]: warning: crmd_cs_dispatch: Recieving
> messages from a node we think is dead: lotus-4vm5[-1118826998]
> Apr 8 09:54:31 lotus-4vm6 crmd[2496]: notice: do_election_count_vote:
> Election 2 (current: 2, owner: lotus-4vm5): Processed vote from lotus-4vm5
> (Peer is not part of our cluster)
>
> So I think the question is, why did crmd suddenly believe the node to be
> "lost" when there was no evidence that it was lost?
>
> b.
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org