Hi,

On the Pacemaker-1.1 + Corosync stack, stonithd is respawned after an abnormal exit, but STONITH is never performed afterwards.
I am using the newest devel.
- pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
- corosync  : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
- glue      : 2686:66d5f0c135c9

* 0. The cluster's state.

[root@vm1 ~]# crm_mon -r1
============
Last updated: Wed May 2 16:07:29 2012
Last change: Wed May 2 16:06:35 2012 via cibadmin on vm1
Stack: corosync
Current DC: vm1 (1) - partition WITHOUT quorum
Version: 1.1.7-db5e167
2 Nodes configured, unknown expected votes
3 Resources configured.
============

Online: [ vm1 vm2 ]

Full list of resources:

 prmDummy     (ocf::pacemaker:Dummy):       Started vm2
 prmStonith1  (stonith:external/libvirt):   Started vm2
 prmStonith2  (stonith:external/libvirt):   Started vm1

[root@vm1 ~]# crm configure show
node $id="1" vm1
node $id="2" vm2
primitive prmDummy ocf:pacemaker:Dummy \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="60s" on-fail="fence" \
        op stop interval="0s" timeout="60s" on-fail="stop"
primitive prmStonith1 stonith:external/libvirt \
        params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
        op start interval="0s" timeout="60s" \
        op monitor interval="3600s" timeout="60s" \
        op stop interval="0s" timeout="60s"
primitive prmStonith2 stonith:external/libvirt \
        params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
        op start interval="0s" timeout="60s" \
        op monitor interval="3600s" timeout="60s" \
        op stop interval="0s" timeout="60s"
location rsc_location-prmDummy prmDummy \
        rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
location rsc_location-prmStonith1 prmStonith1 \
        rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
        rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
location rsc_location-prmStonith2 prmStonith2 \
        rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
        rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-db5e167" \
        cluster-infrastructure="corosync" \
        no-quorum-policy="ignore" \
        stonith-enabled="true" \
        startup-fencing="false" \
        stonith-timeout="120s"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"

* 1. Terminate stonithd forcibly.

[root@vm1 ~]# pkill -9 stonithd

* 2. Trigger STONITH. stonith-ng reports that no matching device is found and the node is never fenced.

[root@vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
[root@vm1 ~]# grep Found /var/log/ha-debug
May 2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
May 2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
May 2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
May 2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
(snip)
[root@vm1 ~]#

After stonithd is respawned, it seems that either the STONITH resource or lrmd also has to be restarted before fencing works again, e.g.:

 # crm resource restart <STONITH resource (prmStonith2)>
or
 # /usr/lib64/heartbeat/lrmd -r    (on the node where stonithd was respawned)

Is this the designed behavior?

----
Best regards,
Kazunori INOUE
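P.S. One way to check that the respawned stonith-ng really has lost its device registrations (rather than just failing to match vm2) is to query it directly. This is only a sketch - it assumes stonith_admin from the Pacemaker CLI tools is installed on vm1, and the option names may differ slightly between versions:

 [root@vm1 ~]# stonith_admin --list-registered   # devices currently registered with the local stonith-ng
 [root@vm1 ~]# stonith_admin --list vm2          # devices stonith-ng believes can fence vm2

If the behavior described above is the cause, prmStonith2 would show up before the 'pkill -9' and both queries would come back empty afterwards, until the STONITH resource (or lrmd) is restarted.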