* Florian Haas <[email protected]> [20120229 08:12]:
> Jean-François,
>
> I realize I'm late to this discussion, however allow me to chime in here
> anyhow:
>
> On Mon, Feb 27, 2012 at 11:45 PM, Jean-Francois Malouin
> <[email protected]> wrote:
> >> Have you looked at fence_virt?
> >> http://www.clusterlabs.org/wiki/Guest_Fencing
> >
> > Yes I did.
> >
> > I had a quick go last week at compiling it on Debian/Squeeze with
> > backports but with no luck.
>
> Seeing as you're on Debian, there really is no need to use fence_virt.
> Instead, you should just be able to use the "external/libvirt" STONITH
> plugin that ships with cluster-glue (in squeeze-backports). That
> plugin works like a charm and I've used it in testing many times. No
> need to compile anything.
>
> http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
> may be a helpful resource.
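[For reference, a minimal setup along the lines of that article might look like the crm snippet below. This is only a sketch: the resource names, the hostlist value and the Xen hypervisor URI are placeholders, not anyone's actual configuration.]

```shell
# Hedged sketch of an external/libvirt STONITH resource in the crm shell.
# "node1" and the xen+ssh URI are example values -- adjust to your setup.
crm configure primitive fence_node1 stonith:external/libvirt \
    params hostlist="node1" \
           hypervisor_uri="xen+ssh://hypervisor.example.com/" \
    op monitor interval="60s"
# Make sure a node can never run its own fencing device:
crm configure location l-fence_node1 fence_node1 -inf: node1
```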
Thanks Florian! Exactly what I needed! I set it up as you explained
above. I can virsh from the guests to the physical host but I'm
experiencing a few oddities...

If I manually stonith node1 from node2 (or killall -9 corosync on
node1) I get repeated console messages:

node2 stonith: [31734]: CRIT: external_reset_req: 'libvirt reset' for host node1 failed with rc 1

and syslog shows:

Mar 1 14:00:51 node2 pengine: [991]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
Mar 1 14:00:51 node2 pengine: [991]: WARN: determine_online_status: Node node1 is unclean
Mar 1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node1_last_failure_0 found resource fence_node1 active on node2
Mar 1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node2_last_failure_0 found resource fence_node2 active on node1
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action resPing:0_stop_0 on node1 is unrunnable (offline)
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action fence_node2_stop_0 on node1 is unrunnable (offline)
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar 1 14:00:51 node2 pengine: [991]: WARN: stage6: Scheduling Node node1 for STONITH
...
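[To narrow down the rc 1 from external_reset_req, it may help to exercise the plugin by hand with the stonith(8) CLI from cluster-glue, outside of stonith-ng. A sketch only: the hostlist and hypervisor_uri values below are placeholders and must match whatever is configured on the fence_node1 resource.]

```shell
# Hypothetical manual test of the external/libvirt plugin (run on node2).
# hostlist/hypervisor_uri are assumptions -- substitute your real values.
stonith -t external/libvirt \
        hostlist="node1" \
        hypervisor_uri="xen+ssh://your-hypervisor/" \
        -T reset node1
```

If this fails the same way outside the cluster, the problem is between the plugin and libvirt rather than in the Pacemaker configuration.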
Mar 1 14:00:52 node2 stonith-ng: [987]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 339d69d4-7d46-46a0-8256-e2c9a6637f09
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: Refreshing port list for fence_node1
Mar 1 14:00:52 node2 stonith-ng: [987]: WARN: parse_host_line: Could not parse (0 0):
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar 1 14:00:52 node2 stonith-ng: [987]: info: call_remote_stonith: Requesting that node2 perform op reboot node1
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_target="node1" st_device_action="reboot" st_timeout="54000" src="node2" seq="3" />
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Found 1 matching devices for 'node1'
...
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_command: Processed st_fence from node2: rc=-1
Mar 1 14:00:52 node2 stonith-ng: [987]: info: make_args: reboot-ing node 'node1' as 'port=node1'
Mar 1 14:00:52 node2 pengine: [991]: WARN: process_pe_message: Transition 1: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-8.bz2
Mar 1 14:00:52 node2 pengine: [991]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Mar 1 14:00:57 node2 external/libvirt[31741]: [31769]: notice: Domain node1 was stopped
Mar 1 14:01:02 node2 external/libvirt[31741]: [31783]: ERROR: Failed to start domain node1
Mar 1 14:01:02 node2 external/libvirt[31741]: [31789]: ERROR: error: failed to get domain 'node1'
Mar 1 14:01:02 node2 external/libvirt[31741]: [31789]: error: Domain not found: xenUnifiedDomainLookupByName

At this point I can't restart the stonith'ed node1: the CIB lists it as
UNCLEAN, and I first have to manually wipe it clean with 'crm node
clearstate node1', as otherwise the surviving node2 just keeps shooting
it, and some dummy resources (and an IP resource colocated with a ping
to the hypervisor) don't restart properly by themselves.

Must be something simple that I overlooked... Any ideas?
jf

> Cheers,
> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
>
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
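[P.S. One guess, given the "Domain not found: xenUnifiedDomainLookupByName" right after the successful stop: if node1 is a transient libvirt domain (created with virsh create rather than virsh define), it disappears from libvirt the moment it is destroyed, so the plugin's subsequent start has nothing to start. A quick check on the hypervisor; the domain name and XML path below are assumptions:]

```shell
# After fencing, a persistent domain should still appear as "shut off";
# a transient one is simply gone, which would match the error above.
virsh list --all
virsh dominfo node1    # look for "Persistent: yes"
# If it is transient, defining it makes it persistent, e.g.:
# virsh define /etc/xen/node1.xml    # path is a guess
```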
