* Florian Haas <[email protected]> [20120229 08:12]:
> Jean-François,
>
> I realize I'm late to this discussion, however allow me to chime in here
> anyhow:
>
> On Mon, Feb 27, 2012 at 11:45 PM, Jean-Francois Malouin
> <[email protected]> wrote:
> >> Have you looked at fence_virt?
> >> http://www.clusterlabs.org/wiki/Guest_Fencing
> >
> > Yes I did.
> >
> > I had a quick go last week at compiling it on Debian/Squeeze with
> > backports but with no luck.
>
> Seeing as you're on Debian, there really is no need to use fence_virt.
> Instead, you should just be able to use the "external/libvirt" STONITH
> plugin that ships with cluster-glue (in squeeze-backports). That
> plugin works like a charm and I've used it in testing many times. No
> need to compile anything.
>
> http://www.hastexo.com/resources/hints-and-kinks/fencing-virtual-cluster-nodes
> may be a helpful resource.
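[For reference, a minimal setup along the lines of that article might look like the crm snippet below. This is only a sketch: the resource names, the hostlist value and the Xen hypervisor URI are placeholders, not anyone's actual configuration.]

```shell
# Hedged sketch of an external/libvirt STONITH resource in the crm shell.
# "node1" and the xen+ssh URI are example values -- adjust to your setup.
crm configure primitive fence_node1 stonith:external/libvirt \
    params hostlist="node1" \
           hypervisor_uri="xen+ssh://hypervisor.example.com/" \
    op monitor interval="60s"
# Make sure a node can never run its own fencing device:
crm configure location l-fence_node1 fence_node1 -inf: node1
```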
Thanks Florian! Exactly what I needed! I set it up as you explained
above. I can virsh from the guests to the physical host but I'm
experiencing a few oddities...

If I manually stonith node1 from node2 (or killall -9 corosync on
node1) I get repeated console messages:

node2 stonith: [31734]: CRIT: external_reset_req: 'libvirt reset' for host node1 failed with rc 1

and syslog shows:

Mar 1 14:00:51 node2 pengine: [991]: WARN: pe_fence_node: Node node1 will be fenced because it is un-expectedly down
Mar 1 14:00:51 node2 pengine: [991]: WARN: determine_online_status: Node node1 is unclean
Mar 1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node1_last_failure_0 found resource fence_node1 active on node2
Mar 1 14:00:51 node2 pengine: [991]: notice: unpack_rsc_op: Operation fence_node2_last_failure_0 found resource fence_node2 active on node1
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action resPing:0_stop_0 on node1 is unrunnable (offline)
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Action fence_node2_stop_0 on node1 is unrunnable (offline)
Mar 1 14:00:51 node2 pengine: [991]: WARN: custom_action: Marking node node1 unclean
Mar 1 14:00:51 node2 pengine: [991]: WARN: stage6: Scheduling Node node1 for STONITH
...
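[To narrow down the rc 1 from external_reset_req, it may help to exercise the plugin by hand with the stonith(8) CLI from cluster-glue, outside of stonith-ng. A sketch only: the hostlist and hypervisor_uri values below are placeholders and must match whatever is configured on the fence_node1 resource.]

```shell
# Hypothetical manual test of the external/libvirt plugin (run on node2).
# hostlist/hypervisor_uri are assumptions -- substitute your real values.
stonith -t external/libvirt \
        hostlist="node1" \
        hypervisor_uri="xen+ssh://your-hypervisor/" \
        -T reset node1
```

If this fails the same way outside the cluster, the problem is between the plugin and libvirt rather than in the Pacemaker configuration.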
Mar 1 14:00:52 node2 stonith-ng: [987]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 339d69d4-7d46-46a0-8256-e2c9a6637f09
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: Refreshing port list for fence_node1
Mar 1 14:00:52 node2 stonith-ng: [987]: WARN: parse_host_line: Could not parse (0 0):
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar 1 14:00:52 node2 stonith-ng: [987]: info: call_remote_stonith: Requesting that node2 perform op reboot node1
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="339d69d4-7d46-46a0-8256-e2c9a6637f09" st_target="node1" st_device_action="reboot" st_timeout="54000" src="node2" seq="3" />
Mar 1 14:00:52 node2 stonith-ng: [987]: info: can_fence_host_with_device: fence_node1 can fence node1: dynamic-list
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_fence: Found 1 matching devices for 'node1'
...
Mar 1 14:00:52 node2 stonith-ng: [987]: info: stonith_command: Processed st_fence from node2: rc=-1
Mar 1 14:00:52 node2 stonith-ng: [987]: info: make_args: reboot-ing node 'node1' as 'port=node1'
Mar 1 14:00:52 node2 pengine: [991]: WARN: process_pe_message: Transition 1: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-8.bz2
Mar 1 14:00:52 node2 pengine: [991]: notice: process_pe_message: Configuration WARNINGs found during PE processing. Please run "crm_verify -L" to identify issues.
Mar 1 14:00:57 node2 external/libvirt[31741]: [31769]: notice: Domain node1 was stopped
Mar 1 14:01:02 node2 external/libvirt[31741]: [31783]: ERROR: Failed to start domain node1
Mar 1 14:01:02 node2 external/libvirt[31741]: [31789]: ERROR: error: failed to get domain 'node1'
Mar 1 14:01:02 node2 external/libvirt[31741]: [31789]: error: Domain not found: xenUnifiedDomainLookupByName

At this point I can't restart the stonith'ed node1: the CIB lists it as
UNCLEAN, and I first have to manually wipe it clean with 'crm node
clearstate node1', as otherwise the surviving node2 just keeps shooting
it, and some dummy resources (and an IP resource colocated with a ping
to the hypervisor) don't restart properly by themselves.

Must be something simple that I overlooked... Any ideas?
jf

> Cheers,
> Florian
>
> --
> Need help with High Availability?
> http://www.hastexo.com/now
>
> _______________________________________________
> Pacemaker mailing list: [email protected]
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
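[P.S. One guess, given the "Domain not found: xenUnifiedDomainLookupByName" right after the successful stop: if node1 is a transient libvirt domain (created with virsh create rather than virsh define), it disappears from libvirt the moment it is destroyed, so the plugin's subsequent start has nothing to start. A quick check on the hypervisor; the domain name and XML path below are assumptions:]

```shell
# After fencing, a persistent domain should still appear as "shut off";
# a transient one is simply gone, which would match the error above.
virsh list --all
virsh dominfo node1    # look for "Persistent: yes"
# If it is transient, defining it makes it persistent, e.g.:
# virsh define /etc/xen/node1.xml    # path is a guess
```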
