I have a Pacemaker 1.0.10 installation on RHEL 5, but I can't manage to get a working STONITH configuration. I have tested my STONITH device manually with the stonith command and it works fine (the rough invocation I used is shown after the log below). What doesn't seem to happen is pacemaker/stonithd actually carrying out the STONITH. In my log I get:
Oct 18 08:54:23 mds1 stonithd: [4645]: ERROR: Failed to STONITH the node oss1: optype=RESET, op_result=TIMEOUT
Oct 18 08:54:23 mds1 crmd: [4650]: info: tengine_stonith_callback: call=-975, optype=1, node_name=oss1, result=2, node_list=, action=17:1023:0:4e12e206-e0be-4915-bfb8-b4e052057f01
Oct 18 08:54:23 mds1 crmd: [4650]: ERROR: tengine_stonith_callback: Stonith of oss1 failed (2)... aborting transition.
Oct 18 08:54:23 mds1 crmd: [4650]: info: abort_transition_graph: tengine_stonith_callback:402 - Triggered transition abort (complete=0) : Stonith failed
Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Oct 18 08:54:23 mds1 crmd: [4650]: info: update_abort_priority: Abort action done superceeded by restart
Oct 18 08:54:23 mds1 crmd: [4650]: info: run_graph: ====================================================
Oct 18 08:54:23 mds1 crmd: [4650]: notice: run_graph: Transition 1023 (Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pengine/pe-warn-5799.bz2): Stopped
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_graph_trigger: Transition 1023 is now complete
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke: Query 1307: Requesting the current CIB: S_POLICY_ENGINE
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_pe_invoke_callback: Invoking the PE: query=1307, ref=pe_calc-dc-1318942463-1164, seq=16860, quorate=0
Oct 18 08:54:23 mds1 pengine: [4649]: notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 18 08:54:23 mds1 pengine: [4649]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node oss1 will be fenced because it is un-expectedly down
Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status: Node oss1 is unclean
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: pe_fence_node: Node mds2 will be fenced because it is un-expectedly down
Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=pending, expected=member
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: determine_online_status: Node mds2 is unclean
Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status_fencing: Node oss2 is down
Oct 18 08:54:23 mds1 pengine: [4649]: info: determine_online_status: Node mds1 is online
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: MGS_2 (ocf::hydra:Target): Started mds1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: testfs-MDT0000_3 (ocf::hydra:Target): Started mds2
Oct 18 08:54:23 mds1 pengine: [4649]: notice: native_print: testfs-OST0000_4 (ocf::hydra:Target): Started oss1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: clone_print: Clone Set: fencing
Oct 18 08:54:23 mds1 pengine: [4649]: notice: short_print: Stopped: [ st-pm:0 st-pm:1 st-pm:2 st-pm:3 ]
Oct 18 08:54:23 mds1 pengine: [4649]: info: get_failcount: testfs-MDT0000_3 has failed 10 times on mds1
Oct 18 08:54:23 mds1 pengine: [4649]: notice: common_apply_stickiness: testfs-MDT0000_3 can fail 999990 more times on mds1 before being forced off
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource testfs-OST0000_4 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:0 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:1 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:2 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_color: Resource st-pm:3 cannot run anywhere
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action testfs-MDT0000_3_stop_0 on mds2 is unrunnable (offline)
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node mds2 unclean
Oct 18 08:54:23 mds1 pengine: [4649]: notice: RecurringOp: Start recurring monitor (120s) for testfs-MDT0000_3 on mds1
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Action testfs-OST0000_4_stop_0 on oss1 is unrunnable (offline)
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: custom_action: Marking node oss1 unclean
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node oss1 for STONITH
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints: testfs-OST0000_4_stop_0 is implicit after oss1 is fenced
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: stage6: Scheduling Node mds2 for STONITH
Oct 18 08:54:23 mds1 pengine: [4649]: info: native_stop_constraints: testfs-MDT0000_3_stop_0 is implicit after mds2 is fenced
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource MGS_2 (Started mds1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Move resource testfs-MDT0000_3 (Started mds2 -> mds1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Stop resource testfs-OST0000_4 (oss1)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:0 (Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:1 (Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:2 (Stopped)
Oct 18 08:54:23 mds1 pengine: [4649]: notice: LogActions: Leave resource st-pm:3 (Stopped)
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Oct 18 08:54:23 mds1 crmd: [4650]: info: unpack_graph: Unpacked transition 1024: 9 actions in 9 synapses
Oct 18 08:54:23 mds1 crmd: [4650]: info: do_te_invoke: Processing graph 1024 (ref=pe_calc-dc-1318942463-1164) derived from /var/lib/pengine/pe-warn-5800.bz2
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_pseudo_action: Pseudo action 15 fired and confirmed
Oct 18 08:54:23 mds1 crmd: [4650]: info: te_fence_node: Executing reboot fencing operation (17) on oss1 (timeout=60000)
Oct 18 08:54:23 mds1 stonithd: [4645]: info: client tengine [pid: 4650] requests a STONITH operation RESET on node oss1
Oct 18 08:54:23 mds1 stonithd: [4645]: info: we can't manage oss1, broadcast request to other nodes
Oct 18 08:54:23 mds1 stonithd: [4645]: info: Broadcasting the message succeeded: require others to stonith node oss1.
Oct 18 08:54:23 mds1 pengine: [4649]: WARN: process_pe_message: Transition 1024: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-5800.bz2
Oct 18 08:54:23 mds1 pengine: [4649]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
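For reference, the manual test I ran was along these lines (the parameters are the ones from the st-pm primitive in my configuration below, and the node name is just an example):

  stonith -t external/powerman serverhost="192.168.122.1:10101" poweroff="0" -l
  stonith -t external/powerman serverhost="192.168.122.1:10101" poweroff="0" -T reset oss1

The first lists the hosts the device controls and the second resets a node through it; both behaved as expected when run by hand.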
My configuration is:

# crm configure show
node mds1
node mds2
node oss1
node oss2
primitive MGS_2 ocf:hydra:Target \
        meta target-role="Started" \
        operations $id="MGS_2-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params target="MGS"
primitive st-pm stonith:external/powerman \
        params serverhost="192.168.122.1:10101" poweroff="0"
primitive testfs-MDT0000_3 ocf:hydra:Target \
        meta target-role="Started" \
        operations $id="testfs-MDT0000_3-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params target="testfs-MDT0000"
primitive testfs-OST0000_4 ocf:hydra:Target \
        meta target-role="Started" \
        operations $id="testfs-OST0000_4-operations" \
        op monitor interval="120" timeout="60" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="300" \
        params target="testfs-OST0000"
clone fencing st-pm
location MGS_2-primary MGS_2 20: mds1
location MGS_2-secondary MGS_2 10: mds2
location testfs-MDT0000_3-primary testfs-MDT0000_3 20: mds2
location testfs-MDT0000_3-secondary testfs-MDT0000_3 10: mds1
location testfs-OST0000_4-primary testfs-OST0000_4 20: oss1
location testfs-OST0000_4-secondary testfs-OST0000_4 10: oss2
property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        expected-quorum-votes="4" \
        symmetric-cluster="false" \
        cluster-infrastructure="openais" \
        dc-version="1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3" \
        stonith-enabled="true"

Any ideas why stonith is failing?

Cheers,
b.
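P.S. One thing I noticed while writing this up: the cluster is asymmetric (symmetric-cluster="false") and there are no location constraints for the fencing clone, which would match the "Resource st-pm:X cannot run anywhere" messages above. Would I need to explicitly allow the clone to run before stonithd can use the device, along the lines of the following? (The constraint names and scores here are just a guess.)

  location st-pm-on-mds1 fencing 20: mds1
  location st-pm-on-mds2 fencing 20: mds2
  location st-pm-on-oss1 fencing 20: oss1
  location st-pm-on-oss2 fencing 20: oss2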