On 17/05/2013, at 12:23 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm finding
> strange behavior with "stonith_admin -B node2". It seems to shut the
> node down but not start it back up and ends up reporting a timer
> expired:
>
> # stonith_admin -B node2
> Command failed: Timer expired
>
> The pacemaker log for the operation is:
>
> May 16 13:50:41 node1 stonith_admin[23174]: notice: crm_log_args: Invoked: stonith_admin -B node2
> May 16 13:50:41 node1 stonith-ng[1673]: notice: handle_request: Client stonith_admin.23174.4a093de2 wants to fence (reboot) 'node2' with device '(any)'
> May 16 13:50:41 node1 stonith-ng[1673]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node2: aa230634-6a38-42b7-8ed4-0a0eb64af39a (0)
> May 16 13:50:41 node1 cibadmin[23176]: notice: crm_log_args: Invoked: cibadmin --query
> May 16 13:50:49 node1 corosync[1376]: [TOTEM ] A processor failed, forming new configuration.
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 76: memb=1, new=0, lost=1
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: lost: node2 2608507072
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 76: memb=1, new=0, lost=0
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node2 was not seen in the previous transition
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: update_member: Node 2608507072/node2 is now: lost
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: send_member_notification: Sending membership update 76 to 2 children
> May 16 13:50:55 node1 corosync[1376]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 16 13:50:55 node1 corosync[1376]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:2 left:1)
> May 16 13:50:55 node1 corosync[1376]: [MAIN ] Completed service synchronization, ready to provide service.
> May 16 13:50:55 node1 cib[1672]: notice: ais_dispatch_message: Membership 76: quorum lost
> May 16 13:50:55 node1 cib[1672]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost
> May 16 13:50:55 node1 crmd[1677]: notice: ais_dispatch_message: Membership 76: quorum lost
> May 16 13:50:55 node1 crmd[1677]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost
> May 16 13:50:55 node1 crmd[1677]: warning: match_down_event: No match for shutdown action on node2
> May 16 13:50:55 node1 crmd[1677]: notice: peer_update_callback: Stonith/shutdown of node2 not matched
> May 16 13:50:55 node1 crmd[1677]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resource1 (1368710825)
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> May 16 13:50:58 node1 pengine[1676]: notice: unpack_config: On loss of CCM Quorum: Ignore
> May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now'
> [the line above repeats a further 20-odd times]
> May 16 13:50:58 node1 crmd[1677]: notice: run_graph: Transition 247 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-981.bz2): Complete
> May 16 13:50:58 node1 crmd[1677]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 16 13:50:58 node1 pengine[1676]: notice: process_pe_message: Calculated Transition 247: /var/lib/pacemaker/pengine/pe-input-981.bz2
> May 16 13:53:05 node1 stonith-ng[1673]: error: remote_op_done: Operation reboot of node2 by node1 for stonith_admin.23174@node1.aa230634: Timer expired
> May 16 13:53:05 node1 crmd[1677]: notice: tengine_stonith_notify: Peer node2 was not terminated (st_notify_fence) by node1 for node1: Timer expired (ref=aa230634-6a38-42b7-8ed4-0a0eb64af39a) by client stonith_admin.23174
> May 16 13:53:05 node1 stonith-ng[1673]: warning: stonith_notify_client: st_notify_fence notification of client stonith_admin.23174.4a093d failed: Broken pipe (-32)
>
> What is this timer that is expiring?

The fencing operation itself.

> FWIW, stonith_admin -[F|U] node2 operates as expected, as does utilizing
> the fence agent's "-o reboot" functionality directly.

That's interesting. Can we see your pacemaker config?

Also, have you configured a logfile? The contents of that would be useful:
http://blog.clusterlabs.org/blog/2013/pacemaker-logging/

> Other required fence agent functions also work as expected, I believe:
>
> # echo "action=list" | my_fence_agent
> node1,
> node2,
> # echo -e "node=node2\nport=node2\naction=reboot" | my_fence_agent
> # echo $?
> 0
> [ and of course, the node is rebooted successfully ]
>
> Also, but secondarily, what is that "crit: get_timet_now: Defaulting to
> 'now'" all about? It seems concerning given that it's labeled a "crit"
> which I take to mean "critical".

Normally, yes. In this case it's debug code that I forgot to remove before committing :-(

> Cheers,
> b.
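For anyone wanting to reproduce the stdin tests above without real hardware: fence agents are driven with one key=value pair per line on stdin. A rough sketch of that parsing loop follows; "fence_stub" and its node list are made up for illustration, and a real agent would contact a power device rather than echoing.

```shell
# Sketch of the stdin key=value protocol a fence agent consumes.
# "fence_stub" is a hypothetical name; a real agent would actually
# power-cycle the requested port instead of printing a message.
fence_stub() {
    action="" port=""
    # One key=value pair arrives per line on stdin
    while IFS='=' read -r key value; do
        case "$key" in
            action)    action=$value ;;
            port|node) port=$value ;;
        esac
    done
    case "$action" in
        list)          printf 'node1,\nnode2,\n' ;;
        reboot|off|on) echo "would $action $port" ;;
        *)             return 1 ;;
    esac
}

# Same invocations as in the message above:
echo "action=list" | fence_stub
# prints: node1,
#         node2,
printf 'node=node2\nport=node2\naction=reboot\n' | fence_stub
# prints: would reboot node2
```

A zero exit status from the agent is what stonith-ng treats as success, which is why the "echo $?" check above matters as much as the reboot itself.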
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org