On 10 Apr 2014, at 4:49 am, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote:
>> Looks like your fencing (stonith) failed.
>
> Where? If I'm reading the logs correctly, it looks like stonith worked.
> Here's the stonith:
>
> Apr 8 09:53:21 lotus-4vm6 stonith-ng[2492]: notice: log_operation:
> Operation 'reboot' [3306] (call 2 from crmd.2496) for host 'lotus-4vm5'
> with device 'st-fencing' returned: 0 (OK)
>
> and then corosync reporting that the node left the cluster:
>
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> lost: lotus-4vm5 3176140298
>
> Yes? Or am I misunderstanding that message?
>
> Doesn't the following also further indicate that the vm5 node did actually
> get stonithed?
>
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info:
> ais_mark_unseen_peer_dead: Node lotus-4vm5 was not seen in the previous
> transition
> Apr 8 09:53:26 lotus-4vm6 corosync[2442]: [pcmk ] info: update_member:
> Node 3176140298/lotus-4vm5 is now: lost
>
> crmd and cib also seem to notice that the node has gone away, don't they
> here:
>
> Apr 8 09:53:26 lotus-4vm6 cib[2491]: notice: plugin_handle_membership:
> Membership 20: quorum lost
> Apr 8 09:53:26 lotus-4vm6 cib[2491]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost
> (was member)
> Apr 8 09:53:26 lotus-4vm6 crmd[2496]: notice: plugin_handle_membership:
> Membership 20: quorum lost
> Apr 8 09:53:26 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost
> (was member)
>
> And then the node comes back:
>
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] notice:
> pcmk_peer_update: Transitional membership event on ring 24: memb=1, new=0,
> lost=0
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> memb: lotus-4vm6 3192917514
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] notice:
> pcmk_peer_update: Stable membership event on ring 24: memb=2, new=1,
> lost=0
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: update_member:
> Node 3176140298/lotus-4vm5 is now: member
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> NEW: lotus-4vm5 3176140298
> Apr 8 09:54:04 lotus-4vm6 corosync[2442]: [pcmk ] info: pcmk_peer_update:
> MEMB: lotus-4vm5 3176140298
>
> And now crmd realizes the node is back:
>
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member
> (was lost)
>
> As does cib:
>
> Apr 8 09:54:04 lotus-4vm6 cib[2491]: notice: crm_update_peer_state:
> plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member
> (was lost)
>
> And stonith-ng and crmd report a successful reboot:
>
> Apr 8 09:54:04 lotus-4vm6 stonith-ng[2492]: notice: remote_op_done:
> Operation reboot of lotus-4vm5 by lotus-4vm6 for
> crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/nohh...@public.gmane.org: OK
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: tengine_stonith_callback:
> Stonith operation 2/13:0:0:f325afae-64b0-4812-a897-70556ab1e806: OK (0)
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: tengine_stonith_notify: Peer
> lotus-4vm5 was terminated (reboot) by lotus-4vm6 for lotus-4vm6: OK
> (ref=ae82b411-b07a-4235-be55-5a30a00b323b) by client crmd.2496
>
> But all of a sudden, crmd reports the node is "lost" again?
>
> Apr 8 09:54:04 lotus-4vm6 crmd[2496]: notice: crm_update_peer_state:
> send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was
> member)
>
> But why?

Brian: the detective work above is highly appreciated.

Little-known fact: I too got sick of trying to figure out why nodes were
being marked up/down, so the second '${tag}:' in these messages is actually
the source of the change. In this case it's the function
send_stonith_update(), and I recognise the problem from a few weeks ago.

Essentially the node is returning "too fast" (specifically, before the
fencing notification arrives), causing Pacemaker to forget that the node is
up and healthy. The fix for this is
https://github.com/beekhof/pacemaker/commit/e777b17 and is present in
1.1.11.

> Not surprising that we get these messages (below) if crmd thinks it was
> suddenly "lost" (when it was in fact not, according to the log for vm5):
>
> Apr 8 09:54:11 lotus-4vm6 crmd[2496]: warning: crmd_cs_dispatch: Recieving
> messages from a node we think is dead: lotus-4vm5[-1118826998]
> Apr 8 09:54:31 lotus-4vm6 crmd[2496]: notice: do_election_count_vote:
> Election 2 (current: 2, owner: lotus-4vm5): Processed vote from lotus-4vm5
> (Peer is not part of our cluster)
>
> So I think the question is, why did crmd suddenly believe the node to be
> "lost" when there was no evidence that it was lost?
>
> b.
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org