Hi -
I am new to Pacemaker and now have a shiny new configuration that will not stonith. This is a test system using KVM and external/libvirt - all VMs are running CentOS 5. I am (really) hoping someone might be willing to help troubleshoot this configuration. Thank you for your time and effort!

The items that are suspect to me are:

1. st-nodes has no 'location' entry
2. the logs report node_list=
3. the resource st-nodes is Stopped

A clip of the configuration is included below; the full configuration and log file may be found at http://pastebin.com/bS87FXUr

Per 'stonith -t external/libvirt -h', I have configured stonith using:

primitive st-nodes stonith:external/libvirt \
        params hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="60"
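To help rule out the agent itself, I plan to exercise it by hand from st15-mds2 with the same parameters, going by the stonith(8) man page (a sketch only - I have not verified these exact invocations):

# List the hosts the device claims it can fence (-l) and check its
# status (-S), using the same parameters as the primitive above:
stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
        hypervisor_uri="qemu+ssh://wc0008/system" -lS

# If that looks sane, a (destructive!) test reset of one node:
stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
        hypervisor_uri="qemu+ssh://wc0008/system" -T reset st15-mds1

As I understand it, the plugin drives the hypervisor through virsh, so 'virsh -c qemu+ssh://wc0008/system list' should also work non-interactively as root from each node.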
And a section of the log file is:

Jun 29 11:02:07 st15-mds2 stonithd: [4485]: ERROR: Failed to STONITH the node st15-mds1: optype=RESET, op_result=TIMEOUT
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: tengine_stonith_callback: call=-65, optype=1, node_name=st15-mds1, result=2, node_list=, action=23:90:0:aac961e7-b06b-4dfd-ae60-c882407b16b5
Jun 29 11:02:07 st15-mds2 crmd: [4490]: ERROR: tengine_stonith_callback: Stonith of st15-mds1 failed (2)... aborting transition.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: abort_transition_graph: tengine_stonith_callback:409 - Triggered transition abort (complete=0) : Stonith failed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort action done superceeded by restart
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: run_graph: ====================================================
Jun 29 11:02:07 st15-mds2 crmd: [4490]: notice: run_graph: Transition 90 (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pengine/pe-warn-173.bz2): Stopped
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_graph_trigger: Transition 90 is now complete
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: All 3 cluster nodes are eligible to run resources.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke: Query 299: Requesting the current CIB: S_POLICY_ENGINE
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke_callback: Invoking the PE: query=299, ref=pe_calc-dc-1340982127-223, seq=396, quorate=1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-mds2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: pe_fence_node: Node st15-mds1 will be fenced because it is un-expectedly down
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status_fencing: ha_state=active, ccm_state=false, crm_state=online, join_state=member, expected=member
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: determine_online_status: Node st15-mds1 is unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss1 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status: Node st15-oss2 is online
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0000 (ocf::heartbeat:Filesystem): Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0001 (ocf::heartbeat:Filesystem): Started st15-oss1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0002 (ocf::heartbeat:Filesystem): Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-OST0003 (ocf::heartbeat:Filesystem): Started st15-oss2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: lustre-MDT0000 (ocf::heartbeat:Filesystem): Started st15-mds1
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: st-nodes (stonith:external/libvirt): Stopped
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_color: Resource st-nodes cannot run anywhere
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Action lustre-MDT0000_stop_0 on st15-mds1 is unrunnable (offline)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Marking node st15-mds1 unclean
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: RecurringOp: Start recurring monitor (120s) for lustre-MDT0000 on st15-mds2
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: stage6: Scheduling Node st15-mds1 for STONITH
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_stop_constraints: lustre-MDT0000_stop_0 is implicit after st15-mds1 is fenced
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0000 (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0001 (Started st15-oss1)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0002 (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource lustre-OST0003 (Started st15-oss2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Move resource lustre-MDT0000 (Started st15-mds1 -> st15-mds2)
Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave resource st-nodes (Stopped)
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: process_pe_message: Transition 91: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: unpack_graph: Unpacked transition 91: 7 actions in 7 synapses
Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_te_invoke: Processing graph 91 (ref=pe_calc-dc-1340982127-223) derived from /var/lib/pengine/pe-warn-174.bz2
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_pseudo_action: Pseudo action 21 fired and confirmed
Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_fence_node: Executing reboot fencing operation (23) on st15-mds1 (timeout=60000)
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: client tengine [pid: 4490] requests a STONITH operation RESET on node st15-mds1
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: we can't manage st15-mds1, broadcast request to other nodes
Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: Broadcasting the message succeeded: require others to stonith node st15-mds1.

Thank you!

Brett Lee
Everything Penguin - http://etpenguin.com
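P.S. Regarding suspect item 1: given "native_color: Resource st-nodes cannot run anywhere" in the log above, I wonder whether st-nodes simply has no node it is allowed to run on (e.g. if the cluster is opt-in, symmetric-cluster=false). Is something along these lines what is missing? The constraint ids and scores below are made up, and I have not tested this:

location st-nodes-on-mds1 st-nodes 100: st15-mds1
location st-nodes-on-mds2 st-nodes 100: st15-mds2
location st-nodes-on-oss1 st-nodes 100: st15-oss1
location st-nodes-on-oss2 st-nodes 100: st15-oss2

Or should the device instead be cloned, so each node can run its own copy:

clone st-nodes-clone st-nodes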