Hi,

This problem has not been fixed yet (as of 2012 Jul 29, commit 33119da31c).
When stonithd terminates abnormally, shouldn't crmd restart, just as it
does when lrmd terminates?

The following patch restarts crmd if the connection to stonithd breaks.
I confirmed that it fixes the problem, but I cannot fully grasp the
extent of its impact...

[root@dev1 pacemaker]# git diff
diff --git a/crmd/te_utils.c b/crmd/te_utils.c
index f6a7550..deb4513 100644
--- a/crmd/te_utils.c
+++ b/crmd/te_utils.c
@@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
 {
     if (is_set(fsa_input_register, R_ST_REQUIRED)) {
         crm_crit("Fencing daemon connection failed");
+        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
         mainloop_set_trigger(stonith_reconnect);

     } else {
[root@dev1 pacemaker]#
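
For context, here is roughly how the whole callback would read with the
patch applied. The hunk above shows only the if branch, so the body of
the else branch below is my assumption, not the actual source:

  static void
  tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t * e)
  {
      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
          /* Fencing is still required, so losing stonithd is fatal:
           * escalate to the FSA (which recovers by restarting crmd)
           * and queue a reconnection attempt. */
          crm_crit("Fencing daemon connection failed");
          register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
          mainloop_set_trigger(stonith_reconnect);
      } else {
          /* Assumed: on an expected shutdown the disconnect is normal,
           * so just note it without escalating. */
          crm_info("Fencing daemon disconnected");
      }
  }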

Best regards,
Kazunori INOUE

On 2012-05-09 16:11, Andrew Beekhof wrote:
On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
<inouek...@intellilink.co.jp> wrote:
Hi,

On the Pacemaker-1.1 + Corosync stack, although stonithd restarts after
an abnormal termination, STONITH is not performed after that.

I am using the latest devel code.
- pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
- corosync  : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
- glue      : 2686:66d5f0c135c9


* 0. The cluster's state.

  [root@vm1 ~]# crm_mon -r1
  ============
  Last updated: Wed May  2 16:07:29 2012
  Last change: Wed May  2 16:06:35 2012 via cibadmin on vm1
  Stack: corosync
  Current DC: vm1 (1) - partition WITHOUT quorum
  Version: 1.1.7-db5e167
  2 Nodes configured, unknown expected votes
  3 Resources configured.
  ============

  Online: [ vm1 vm2 ]

  Full list of resources:

  prmDummy       (ocf::pacemaker:Dummy): Started vm2
  prmStonith1    (stonith:external/libvirt):     Started vm2
  prmStonith2    (stonith:external/libvirt):     Started vm1

  [root@vm1 ~]# crm configure show
  node $id="1" vm1
  node $id="2" vm2
  primitive prmDummy ocf:pacemaker:Dummy \
         op start interval="0s" timeout="60s" on-fail="restart" \
         op monitor interval="10s" timeout="60s" on-fail="fence" \
         op stop interval="0s" timeout="60s" on-fail="stop"
  primitive prmStonith1 stonith:external/libvirt \
         params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
         op start interval="0s" timeout="60s" \
         op monitor interval="3600s" timeout="60s" \
         op stop interval="0s" timeout="60s"
  primitive prmStonith2 stonith:external/libvirt \
         params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
         op start interval="0s" timeout="60s" \
         op monitor interval="3600s" timeout="60s" \
         op stop interval="0s" timeout="60s"
  location rsc_location-prmDummy prmDummy \
         rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
  location rsc_location-prmStonith1 prmStonith1 \
         rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
         rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
  location rsc_location-prmStonith2 prmStonith2 \
         rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
         rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
  property $id="cib-bootstrap-options" \
         dc-version="1.1.7-db5e167" \
         cluster-infrastructure="corosync" \
         no-quorum-policy="ignore" \
         stonith-enabled="true" \
         startup-fencing="false" \
         stonith-timeout="120s"
  rsc_defaults $id="rsc-options" \
         resource-stickiness="INFINITY" \
         migration-threshold="1"


* 1. Terminate stonithd forcibly.

  [root@vm1 ~]# pkill -9 stonithd


* 2. I trigger STONITH, but stonithd reports that no device is found
   and does not fence the node.

  [root@vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
  [root@vm1 ~]# grep Found /var/log/ha-debug
  May  2 16:13:07 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
  May  2 16:13:19 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
  May  2 16:13:31 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
  May  2 16:13:43 vm1 stonith-ng[15115]:    debug: stonith_query: Found 0 matching devices for 'vm2'
  (snip)
  [root@vm1 ~]#
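
This looks consistent with the device registrations dying together with
the old stonith-ng process: the external/libvirt devices were registered
by the running STONITH resources, so the respawned daemon starts with an
empty device list until those resources are restarted. Assuming
stonith_admin behaves here as in current builds, this can be checked with:

  [root@vm1 ~]# stonith_admin --list-registered

which I would expect to list no devices after the kill. That would also
explain why the workarounds below help: starting a STONITH resource makes
the agent register its device with the new stonith-ng instance.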


After stonithd restarts, it seems that the STONITH resource or lrmd
needs to be restarted... is this the designed behavior?

No, that sounds like a bug.


  # crm resource restart <STONITH resource (prmStonith2)>
  or
  # /usr/lib64/heartbeat/lrmd -r  (on the node where stonithd restarted)

----
Best regards,
Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
