Thanks Andrew,
I opened a Bugzilla entry about this:
* http://bugs.clusterlabs.org/show_bug.cgi?id=5094

Best Regards,
Kazunori INOUE

(12.08.07 20:30), Andrew Beekhof wrote:
On Wed, Aug 1, 2012 at 7:26 PM, Kazunori INOUE
<inouek...@intellilink.co.jp> wrote:
Hi,

This problem has not been fixed yet. (2012 Jul 29, 33119da31c)
When stonithd terminates abnormally, shouldn't crmd restart, as it
does when lrmd is terminated?

The following patch restarts crmd if the connection with stonithd
breaks. I confirmed that it fixes this problem, but I cannot judge
the full extent of its impact...

It's quite severe :-)
I'd like to see if we can come up with something a little less brutal.

Could you file a bugzilla for me please?


[root@dev1 pacemaker]# git diff
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index f6a7550..deb4513 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
  {
      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
          crm_crit("Fencing daemon connection failed");
+        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
          mainloop_set_trigger(stonith_reconnect);

      } else {
[root@dev1 pacemaker]#

Best regards,
Kazunori INOUE


(12.05.09 16:11), Andrew Beekhof wrote:

On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
<inouek...@intellilink.co.jp> wrote:

Hi,

On the Pacemaker-1.1 + Corosync stack, although stonithd restarts
after an abnormal termination, STONITH is not performed afterwards.

I am using the latest devel code.
- pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
- corosync  : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
- glue      : 2686:66d5f0c135c9


* 0. cluster's state.

   [root@vm1 ~]# crm_mon -r1
   ============
   Last updated: Wed May  2 16:07:29 2012
   Last change: Wed May  2 16:06:35 2012 via cibadmin on vm1
   Stack: corosync
   Current DC: vm1 (1) - partition WITHOUT quorum
   Version: 1.1.7-db5e167
   2 Nodes configured, unknown expected votes
   3 Resources configured.
   ============

   Online: [ vm1 vm2 ]

   Full list of resources:

   prmDummy       (ocf::pacemaker:Dummy): Started vm2
   prmStonith1    (stonith:external/libvirt):     Started vm2
   prmStonith2    (stonith:external/libvirt):     Started vm1

   [root@vm1 ~]# crm configure show
   node $id="1" vm1
   node $id="2" vm2
   primitive prmDummy ocf:pacemaker:Dummy \
          op start interval="0s" timeout="60s" on-fail="restart" \
          op monitor interval="10s" timeout="60s" on-fail="fence" \
          op stop interval="0s" timeout="60s" on-fail="stop"
   primitive prmStonith1 stonith:external/libvirt \
          params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
          op start interval="0s" timeout="60s" \
          op monitor interval="3600s" timeout="60s" \
          op stop interval="0s" timeout="60s"
   primitive prmStonith2 stonith:external/libvirt \
          params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
          op start interval="0s" timeout="60s" \
          op monitor interval="3600s" timeout="60s" \
          op stop interval="0s" timeout="60s"
   location rsc_location-prmDummy prmDummy \
          rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
   location rsc_location-prmStonith1 prmStonith1 \
          rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
          rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
   location rsc_location-prmStonith2 prmStonith2 \
          rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
          rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
   property $id="cib-bootstrap-options" \
          dc-version="1.1.7-db5e167" \
          cluster-infrastructure="corosync" \
          no-quorum-policy="ignore" \
          stonith-enabled="true" \
          startup-fencing="false" \
          stonith-timeout="120s"
   rsc_defaults $id="rsc-options" \
          resource-stickiness="INFINITY" \
          migration-threshold="1"


* 1. terminate stonithd forcibly.

   [root@vm1 ~]# pkill -9 stonithd


* 2. I trigger a STONITH, but stonith-ng reports that no matching
    device is found and does not fence.

   [root@vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
   [root@vm1 ~]# grep Found /var/log/ha-debug
   May  2 16:13:07 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
   May  2 16:13:19 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
   May  2 16:13:31 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
   May  2 16:13:43 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
   (snip)
   [root@vm1 ~]#


After stonithd restarts, it seems that the STONITH resource or lrmd
needs to be restarted... is this the designed behavior?


No, that sounds like a bug.


   # crm resource restart <STONITH resource (prmStonith2)>
   or
   # /usr/lib64/heartbeat/lrmd -r  (on the node which stonithd rebooted)

----
Best regards,
Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
