On 17/05/2013, at 12:23 AM, Brian J. Murrell <br...@interlinx.bc.ca> wrote:
> Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm finding
> strange behavior with "stonith_admin -B node2". It seems to shut the
> node down but not start it back up and ends up reporting a timer
> expired:
>
> # stonith_admin -B node2
> Command failed: Timer expired
>
> The pacemaker log for the operation is:
>
> May 16 13:50:41 node1 stonith_admin[23174]: notice: crm_log_args: Invoked: stonith_admin -B node2
> May 16 13:50:41 node1 stonith-ng[1673]: notice: handle_request: Client stonith_admin.23174.4a093de2 wants to fence (reboot) 'node2' with device '(any)'
> May 16 13:50:41 node1 stonith-ng[1673]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node2: aa230634-6a38-42b7-8ed4-0a0eb64af39a (0)
> May 16 13:50:41 node1 cibadmin[23176]: notice: crm_log_args: Invoked: cibadmin --query
> May 16 13:50:49 node1 corosync[1376]: [TOTEM ] A processor failed, forming new configuration.
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 76: memb=1, new=0, lost=1
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: memb: node1 4252674240
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: lost: node2 2608507072
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 76: memb=1, new=0, lost=0
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node2 was not seen in the previous transition
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: update_member: Node 2608507072/node2 is now: lost
> May 16 13:50:55 node1 corosync[1376]: [pcmk ] info: send_member_notification: Sending membership update 76 to 2 children
> May 16 13:50:55 node1 corosync[1376]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 16 13:50:55 node1 corosync[1376]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.253) r(1) ip(10.0.0.253) ; members(old:2 left:1)
> May 16 13:50:55 node1 corosync[1376]: [MAIN ] Completed service synchronization, ready to provide service.
> May 16 13:50:55 node1 cib[1672]: notice: ais_dispatch_message: Membership 76: quorum lost
> May 16 13:50:55 node1 cib[1672]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost
> May 16 13:50:55 node1 crmd[1677]: notice: ais_dispatch_message: Membership 76: quorum lost
> May 16 13:50:55 node1 crmd[1677]: notice: crm_update_peer_state: crm_update_ais_node: Node node2[2608507072] - state is now lost
> May 16 13:50:55 node1 crmd[1677]: warning: match_down_event: No match for shutdown action on node2
> May 16 13:50:55 node1 crmd[1677]: notice: peer_update_callback: Stonith/shutdown of node2 not matched
> May 16 13:50:55 node1 crmd[1677]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-resource1 (1368710825)
> May 16 13:50:57 node1 attrd[1675]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> May 16 13:50:58 node1 pengine[1676]: notice: unpack_config: On loss of CCM Quorum: Ignore
> May 16 13:50:58 node1 pengine[1676]: crit: get_timet_now: Defaulting to 'now'
> [the line above repeats a further 20-odd times]
> May 16 13:50:58 node1 crmd[1677]: notice: run_graph: Transition 247 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-981.bz2): Complete
> May 16 13:50:58 node1 crmd[1677]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 16 13:50:58 node1 pengine[1676]: notice: process_pe_message: Calculated Transition 247: /var/lib/pacemaker/pengine/pe-input-981.bz2
> May 16 13:53:05 node1 stonith-ng[1673]: error: remote_op_done: Operation reboot of node2 by node1 for stonith_admin.23174@node1.aa230634: Timer expired
> May 16 13:53:05 node1 crmd[1677]: notice: tengine_stonith_notify: Peer node2 was not terminated (st_notify_fence) by node1 for node1: Timer expired (ref=aa230634-6a38-42b7-8ed4-0a0eb64af39a) by client stonith_admin.23174
> May 16 13:53:05 node1 stonith-ng[1673]: warning: stonith_notify_client: st_notify_fence notification of client stonith_admin.23174.4a093d failed: Broken pipe (-32)
>
> What is this timer that is expiring?

The fencing operation itself.

> FWIW, stonith_admin -[F|U] node2 operates as expected, as does utilizing
> the fence agent's "-o reboot" functionality directly.

That's interesting. Can we see your pacemaker config?

Also, have you configured a logfile? The contents of that would be useful:
http://blog.clusterlabs.org/blog/2013/pacemaker-logging/

> Other required fence agent functions also work as expected, I believe:
>
> # echo "action=list" | my_fence_agent
> node1,
> node2,
> # echo -e "node=node2\nport=node2\naction=reboot" | my_fence_agent
> # echo $?
> 0
> [ and of course, the node is rebooted successfully ]
>
> Also, but secondarily, what is that "crit: get_timet_now: Defaulting to
> 'now'" all about? It seems concerning given that it's labeled a "crit"
> which I take to mean "critical".

Normally, yes. In this case it's debug code that I forgot to remove before committing :-(

> Cheers,
> b.
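For anyone wanting to reproduce the stdin tests above without real hardware: fence agents are driven with one key=value pair per line on stdin. A rough sketch of that parsing loop follows; "fence_stub" and its node list are made up for illustration, and a real agent would contact a power device rather than echoing.

```shell
# Sketch of the stdin key=value protocol a fence agent consumes.
# "fence_stub" is a hypothetical name; a real agent would actually
# power-cycle the requested port instead of printing a message.
fence_stub() {
    action="" port=""
    # One key=value pair arrives per line on stdin
    while IFS='=' read -r key value; do
        case "$key" in
            action)    action=$value ;;
            port|node) port=$value ;;
        esac
    done
    case "$action" in
        list)          printf 'node1,\nnode2,\n' ;;
        reboot|off|on) echo "would $action $port" ;;
        *)             return 1 ;;
    esac
}

# Same invocations as in the message above:
echo "action=list" | fence_stub
# prints: node1,
#         node2,
printf 'node=node2\nport=node2\naction=reboot\n' | fence_stub
# prints: would reboot node2
```

A zero exit status from the agent is what stonith-ng treats as success, which is why the "echo $?" check above matters as much as the reboot itself.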
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org