Hi Folks,
The configuration problem has been resolved. In short, my configuration did not
allow any location where the stonith resource could run. Adding an appropriate
'location' entry gave the stonith resource a node to run on, and stonith then
immediately worked as expected.
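For anyone who finds this thread later, the change was roughly of this form in
the crm shell (the constraint id, score and node below are illustrative only;
the point is simply that st-nodes needs at least one node it is allowed to run
on):

    location loc-st-nodes st-nodes 100: st15-mds2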
To be clear, I did not resolve this issue myself; instead I enlisted support from
Andreas Kurz & friends at Hastexo.com. I received exactly the type of support that
was needed; it was very fast, it was very professional, and, most importantly, I
would *certainly* enlist their support again. Thank you, Andreas!
Brett Lee
Everything Penguin - http://etpenguin.com
>________________________________
> From: Dejan Muhamedagic <deja...@fastmail.fm>
>To: Brett Lee <brett...@yahoo.com>; The Pacemaker cluster resource manager
><pacemaker@oss.clusterlabs.org>
>Sent: Monday, July 2, 2012 7:27 AM
>Subject: Re: [Pacemaker] newb - stonith not working - require others to
>stonith node
>
>Hi,
>
>On Fri, Jun 29, 2012 at 08:43:24AM -0700, Brett Lee wrote:
>> Hi -
>>
>>
>> Am new to pacemaker and now have a shiny new configuration that will not
>> stonith. This is a test system using KVM and external/libvirt - all VMs are
>> running CentOS 5.
>>
>> Am (really) hoping someone might be willing to help troubleshoot this
>> configuration. Thank you for your time and effort!
>>
>>
>>
>> The items that are suspect to me are:
>> 1. st-nodes has no 'location' entry
>> 2. logs report node_list=
>> 3. resource st-nodes is Stopped
>>
>> Have attached a clip of the configuration below. The full configuration and
>> log file may be found at - http://pastebin.com/bS87FXUr
>>
>> Per 'stonith -t external/libvirt -h' I have configured stonith using:
>
>Did you try fencing manually with this stonith program? You can
>do it like this:
>
>stonith -t external/libvirt hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
>  hypervisor_uri="qemu+ssh://wc0008/system" -T reset st15-mds1
>
>> primitive st-nodes stonith:external/libvirt \
>> params hostlist="st15-mds1,st15-mds2,st15-oss1,st15-oss2" \
>> hypervisor_uri="qemu+ssh://wc0008/system" stonith-timeout="30" \
>
>I'm not sure if ',' works here as a separator, better use a
>space.
>
>stonith-timeout is effectively ignored here. Use the cluster
>property for that.
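>
>For example, something like this sets it cluster-wide (crm shell syntax; the
>value is just an illustration, pick one that suits your environment):
>
>crm configure property stonith-timeout=60s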
>
>> op start interval="0" timeout="60" \
>> op stop interval="0" timeout="60" \
>> op monitor interval="60"
>>
>> And a section of the log file is:
>>
>> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: ERROR: Failed to STONITH the
>> node st15-mds1: optype=RESET, op_result=TIMEOUT
>
>This indicates that fencing was attempted. But it timed out.
>Perhaps take a look at the libvirt logs?
>
>Thanks,
>
>Dejan
>
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: tengine_stonith_callback:
>> call=-65, optype=1, node_name=st15-mds1, result=2, node_list=,
>> action=23:90:0:aac961e7-b06b-4dfd-ae60-c882407b16b5
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: ERROR: tengine_stonith_callback:
>> Stonith of st15-mds1 failed (2)... aborting transition.
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: abort_transition_graph:
>> tengine_stonith_callback:409 - Triggered transition abort (complete=0) :
>> Stonith failed
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort
>> priority upgraded from 0 to 1000000
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: update_abort_priority: Abort
>> action done superceeded by restart
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: run_graph:
>> ====================================================
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: notice: run_graph: Transition 90
>> (Complete=2, Pending=0, Fired=0, Skipped=5, Incomplete=0,
>> Source=/var/lib/pengine/pe-warn-173.bz2): Stopped
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_graph_trigger: Transition
>> 90 is now complete
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State
>> transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=notify_crmd ]
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: All 3
>> cluster nodes are eligible to run resources.
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke: Query 299:
>> Requesting the current CIB: S_POLICY_ENGINE
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_pe_invoke_callback:
>> Invoking the PE: query=299,
>> ref=pe_calc-dc-1340982127-223, seq=396, quorate=1
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: unpack_config: Node scores:
>> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
>> Node st15-mds2 is online
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: pe_fence_node: Node
>> st15-mds1 will be fenced because it is un-expectedly down
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info:
>> determine_online_status_fencing: ha_state=active, ccm_state=false,
>> crm_state=online, join_state=member, expected=member
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: determine_online_status:
>> Node st15-mds1 is unclean
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
>> Node st15-oss1 is online
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: determine_online_status:
>> Node st15-oss2 is online
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
>> lustre-OST0000 (ocf::heartbeat:Filesystem): Started st15-oss1
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
>> lustre-OST0001 (ocf::heartbeat:Filesystem): Started st15-oss1
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
>> lustre-OST0002 (ocf::heartbeat:Filesystem): Started st15-oss2
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
>> lustre-OST0003 (ocf::heartbeat:Filesystem): Started st15-oss2
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print:
>> lustre-MDT0000 (ocf::heartbeat:Filesystem): Started st15-mds1
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: native_print: st-nodes
>> (stonith:external/libvirt): Stopped
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_color: Resource
>> st-nodes cannot run anywhere
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Action
>> lustre-MDT0000_stop_0 on st15-mds1 is unrunnable (offline)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: custom_action: Marking node
>> st15-mds1 unclean
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: RecurringOp: Start
>> recurring monitor (120s) for lustre-MDT0000 on st15-mds2
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: stage6: Scheduling Node
>> st15-mds1 for STONITH
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: native_stop_constraints:
>> lustre-MDT0000_stop_0 is implicit after st15-mds1 is fenced
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
>> resource lustre-OST0000 (Started st15-oss1)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
>> resource lustre-OST0001 (Started st15-oss1)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
>> resource lustre-OST0002 (Started st15-oss2)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
>> resource lustre-OST0003 (Started st15-oss2)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Move
>> resource lustre-MDT0000 (Started st15-mds1 -> st15-mds2)
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: notice: LogActions: Leave
>> resource st-nodes (Stopped)
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_state_transition: State
>> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
>> cause=C_IPC_MESSAGE origin=handle_response ]
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: WARN: process_pe_message:
>> Transition 91: WARNINGs found during PE processing. PEngine Input stored in:
>> /var/lib/pengine/pe-warn-174.bz2
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: unpack_graph: Unpacked
>> transition 91: 7 actions in 7 synapses
>> Jun 29 11:02:07 st15-mds2 pengine: [4489]: info: process_pe_message:
>> Configuration WARNINGs found during PE processing. Please run "crm_verify
>> -L" to identify issues.
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: do_te_invoke: Processing graph
>> 91 (ref=pe_calc-dc-1340982127-223) derived from
>> /var/lib/pengine/pe-warn-174.bz2
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_pseudo_action: Pseudo
>> action 21 fired and confirmed
>> Jun 29 11:02:07 st15-mds2 crmd: [4490]: info: te_fence_node: Executing
>> reboot fencing operation (23) on st15-mds1 (timeout=60000)
>> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: client tengine [pid: 4490]
>> requests a STONITH operation RESET on node st15-mds1
>> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: we can't manage st15-mds1,
>> broadcast request to other nodes
>> Jun 29 11:02:07 st15-mds2 stonithd: [4485]: info: Broadcasting the message
>> succeeded: require others to stonith node st15-mds1.
>>
>> Thank you!
>>
>>
>> Brett Lee
>> Everything Penguin - http://etpenguin.com
>
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org