Hi -
I am new to Pacemaker and now have a shiny new configuration that will not stonith. This is a test system using KVM and external/libvirt - all VMs are running CentOS 5. I am (really) hoping someone might be willing to help troubleshoot this configuration. Thank you for your time and effort!

The items that are suspect to me are:

1. st-nodes has no 'location' entry
2. the logs report node_list=
3. the resource st-nodes is Stopped

A clip of the configuration is included below; the full configuration and log file may be found at http://pastebin.com/bS87FXUr

Per 'stonith -t external/libvirt -h', I have configured stonith using:

primitive st-nodes stonith:external/libvirt \
        params hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60"
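To help rule out the agent itself, I plan to exercise it by hand from st15-mds2 with the same parameters, going by the stonith(8) man page (a sketch only - I have not verified these exact invocations):

# List the hosts the device claims it can fence (-l) and check its
# status (-S), using the same parameters as the primitive above:
stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
        hypervisor_uri="qemu+ssh://wc0008/system" -lS

# If that looks sane, a (destructive!) test reset of one node:
stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
        hypervisor_uri="qemu+ssh://wc0008/system" -T reset st15-mds1

As I understand it, the plugin drives the hypervisor through virsh, so 'virsh -c qemu+ssh://wc0008/system list' should also work non-interactively as root from each node.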
And a section of the log file is:

Jun 29 11:02:07 st15-mds2 stonithd: [4485]: ERROR: Failed to STONITH the node st15-mds1: optype=RESET, op_result=TIMEOUT
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: tengine_stonith_callback: call=-65, optype=1, node_name=st15-mds1, result=2, node_list=, action=23:90:0:aac961e7-b06b-4dfd-ae60-c882407b16b5
Jun 29 11:02:07 st15-mds2 crmd: [4490]: ERROR: tengine_stonith_callback: Stonith of st15-mds1 failed (2)... aborting transition.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: abort_transition_graph: tengine_stonith_callback:409 - Triggered transition abort (complete=0) : Stonith failed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort action done superceeded by restart
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: run_graph: ====================================================
Jun 29 11:02:07 st15-mds2 crmd: [4490]: notice: run_graph: Transition 90 (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pengine/pe-warn-173.bz2): Stopped
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_graph_trigger: Transition 90 is now complete
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke: Query 299: Requesting the current CIB: S_POLICY_ENGINE
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke_callback: Invoking the PE: query=299, ref=pe_calc-dc-1340982127-223, seq=396, quorate=1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-mds2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: pe_fence_node: Node st15-mds1 will be fenced because it is un-expectedly down
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: determine_online_status: Node st15-mds1 is unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss1 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0000 (ocf::heartbeat:Filesystem): Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0001 (ocf::heartbeat:Filesystem): Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0002 (ocf::heartbeat:Filesystem): Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0003 (ocf::heartbeat:Filesystem): Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-MDT0000 (ocf::heartbeat:Filesystem): Started st15-mds1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: st-nodes (stonith:external/libvirt): Stopped
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_color: Resource st-nodes cannot run anywhere
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Action lustre-MDT0000_stop_0 on st15-mds1 is unrunnable (offline)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Marking node st15-mds1 unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: RecurringOp: Start recurring monitor (120s) for lustre-MDT0000 on st15-mds2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: stage6: Scheduling Node st15-mds1 for STONITH
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_stop_constraints: lustre-MDT0000_stop_0 is implicit after st15-mds1 is fenced
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0000 (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0001 (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0002 (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0003 (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Move resource lustre-MDT0000 (Started st15-mds1 -> st15-mds2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource st-nodes (Stopped)
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: process_pe_message: Transition 91: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: unpack_graph: Unpacked transition 91: 7 actions in 7 synapses
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_te_invoke: Processing graph 91 (ref=pe_calc-dc-1340982127-223) derived from /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_pseudo_action: Pseudo action 21 fired and confirmed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_fence_node: Executing reboot fencing operation (23) on st15-mds1 (timeout=60000)
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: client tengine [pid: 4490] requests a STONITH operation RESET on node st15-mds1
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: we can't manage st15-mds1, broadcast request to other nodes
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: Broadcasting the message succeeded: require others to stonith node st15-mds1.

Thank you!

Brett Lee
Everything Penguin - http://etpenguin.com
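P.S. Regarding suspect item 1: given "native_color: Resource st-nodes cannot run anywhere" in the log above, I wonder whether st-nodes simply has no node it is allowed to run on (e.g. if the cluster is opt-in, symmetric-cluster=false). Is something along these lines what is missing? The constraint ids and scores below are made up, and I have not tested this:

location st-nodes-on-mds1 st-nodes 100: st15-mds1
location st-nodes-on-mds2 st-nodes 100: st15-mds2
location st-nodes-on-oss1 st-nodes 100: st15-oss1
location st-nodes-on-oss2 st-nodes 100: st15-oss2

Or should the device instead be cloned, so each node can run its own copy:

clone st-nodes-clone st-nodes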