On Wed, Aug 1, 2012 at 7:26 PM, Kazunori INOUE <inouek...@intellilink.co.jp> wrote:
> Hi,
>
> This problem has not been fixed yet. (2012 Jul 29, 33119da31c)
> When stonithd is terminated abnormally, shouldn't crmd restart,
> as it does when lrmd is terminated?
>
> The following patch restarts crmd if the connection with stonithd
> breaks. I confirmed that the problem is fixed with this patch, but I
> cannot grasp the full extent of the impact...
It's quite severe :-)
I'd like to see if we can come up with something a little less brutal.
Could you file a bugzilla for me please?

> [root@dev1 pacemaker]# git diff
> diff --git a/crmd/te_utils.c b/crmd/te_utils.c
> index f6a7550..deb4513 100644
> --- a/crmd/te_utils.c
> +++ b/crmd/te_utils.c
> @@ -83,6 +83,7 @@ tengine_stonith_connection_destroy(stonith_t * st, stonith_event_t *e)
>  {
>      if (is_set(fsa_input_register, R_ST_REQUIRED)) {
>          crm_crit("Fencing daemon connection failed");
> +        register_fsa_input(C_FSA_INTERNAL, I_ERROR, NULL);
>          mainloop_set_trigger(stonith_reconnect);
>
>      } else {
> [root@dev1 pacemaker]#
>
> Best regards,
> Kazunori INOUE
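For anyone skimming the thread, here is a minimal, self-contained sketch of
the idea behind that one-line change: when the connection to a daemon the
process cannot operate without is lost, escalate to a fatal internal error so
the process is restarted (and re-registers its fencing devices), rather than
only scheduling a reconnect. This is plain C with entirely hypothetical
names; the flags below only stand in for the roles of R_ST_REQUIRED,
register_fsa_input(I_ERROR) and mainloop_set_trigger(stonith_reconnect) in
the hunk above, and none of it is Pacemaker code.

/*
 * Hypothetical sketch: "required daemon connection lost => escalate".
 * Not Pacemaker code; every identifier here is made up for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

static bool fencer_is_required = true;   /* stand-in for R_ST_REQUIRED       */
static bool fatal_error_queued = false;  /* stand-in for the I_ERROR input   */
static bool reconnect_scheduled = false; /* stand-in for stonith_reconnect   */

/* Invoked by the IPC layer when the connection to the fencing daemon drops. */
static void on_fencer_connection_destroyed(void)
{
    if (fencer_is_required) {
        fprintf(stderr, "crit: fencing daemon connection failed\n");

        /* Escalate: queue a fatal internal error so the controller is
         * restarted and rebuilds its fencing state, instead of carrying
         * on with a stale view of the fencing devices.                   */
        fatal_error_queued = true;

        /* Also schedule a reconnect attempt, as the existing code does.  */
        reconnect_scheduled = true;
    } else {
        fprintf(stderr, "info: fencing daemon disconnected, not required\n");
    }
}

int main(void)
{
    /* Simulate stonithd dying out from under the controller. */
    on_fencer_connection_destroyed();

    if (reconnect_scheduled) {
        fprintf(stderr, "info: reconnect to the fencing daemon scheduled\n");
    }
    if (fatal_error_queued) {
        /* In the real daemon, the FSA would drive recovery/restart here. */
        fprintf(stderr, "notice: restarting to recover fencing state\n");
        return 1;
    }
    return 0;
}

The trade-off Andrew alludes to is visible here: queuing the fatal error is
simple and reliable, but it restarts the whole controller when a reconnect
plus re-registration of devices might be enough. The earlier report that
prompted the patch is quoted below.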
>
> (12.05.09 16:11), Andrew Beekhof wrote:
>>
>> On Mon, May 7, 2012 at 7:52 PM, Kazunori INOUE
>> <inouek...@intellilink.co.jp> wrote:
>>>
>>> Hi,
>>>
>>> On the Pacemaker-1.1 + Corosync stack, after stonithd restarts
>>> following an abnormal exit, STONITH is no longer performed.
>>>
>>> I am using the newest devel.
>>> - pacemaker : db5e16736cc2682fbf37f81cd47be7d17d5a2364
>>> - corosync  : 88dd3e1eeacd64701d665f10acbc40f3795dd32f
>>> - glue      : 2686:66d5f0c135c9
>>>
>>>
>>> * 0. The cluster's state.
>>>
>>> [root@vm1 ~]# crm_mon -r1
>>> ============
>>> Last updated: Wed May  2 16:07:29 2012
>>> Last change: Wed May  2 16:06:35 2012 via cibadmin on vm1
>>> Stack: corosync
>>> Current DC: vm1 (1) - partition WITHOUT quorum
>>> Version: 1.1.7-db5e167
>>> 2 Nodes configured, unknown expected votes
>>> 3 Resources configured.
>>> ============
>>>
>>> Online: [ vm1 vm2 ]
>>>
>>> Full list of resources:
>>>
>>>  prmDummy    (ocf::pacemaker:Dummy):         Started vm2
>>>  prmStonith1 (stonith:external/libvirt):     Started vm2
>>>  prmStonith2 (stonith:external/libvirt):     Started vm1
>>>
>>> [root@vm1 ~]# crm configure show
>>> node $id="1" vm1
>>> node $id="2" vm2
>>> primitive prmDummy ocf:pacemaker:Dummy \
>>>         op start interval="0s" timeout="60s" on-fail="restart" \
>>>         op monitor interval="10s" timeout="60s" on-fail="fence" \
>>>         op stop interval="0s" timeout="60s" on-fail="stop"
>>> primitive prmStonith1 stonith:external/libvirt \
>>>         params hostlist="vm1" hypervisor_uri="qemu+ssh://f/system" \
>>>         op start interval="0s" timeout="60s" \
>>>         op monitor interval="3600s" timeout="60s" \
>>>         op stop interval="0s" timeout="60s"
>>> primitive prmStonith2 stonith:external/libvirt \
>>>         params hostlist="vm2" hypervisor_uri="qemu+ssh://g/system" \
>>>         op start interval="0s" timeout="60s" \
>>>         op monitor interval="3600s" timeout="60s" \
>>>         op stop interval="0s" timeout="60s"
>>> location rsc_location-prmDummy prmDummy \
>>>         rule $id="rsc_location-prmDummy-rule" 200: #uname eq vm2
>>> location rsc_location-prmStonith1 prmStonith1 \
>>>         rule $id="rsc_location-prmStonith1-rule" 200: #uname eq vm2 \
>>>         rule $id="rsc_location-prmStonith1-rule-0" -inf: #uname eq vm1
>>> location rsc_location-prmStonith2 prmStonith2 \
>>>         rule $id="rsc_location-prmStonith2-rule" 200: #uname eq vm1 \
>>>         rule $id="rsc_location-prmStonith2-rule-0" -inf: #uname eq vm2
>>> property $id="cib-bootstrap-options" \
>>>         dc-version="1.1.7-db5e167" \
>>>         cluster-infrastructure="corosync" \
>>>         no-quorum-policy="ignore" \
>>>         stonith-enabled="true" \
>>>         startup-fencing="false" \
>>>         stonith-timeout="120s"
>>> rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="INFINITY" \
>>>         migration-threshold="1"
>>>
>>>
>>> * 1. Terminate stonithd forcibly.
>>>
>>> [root@vm1 ~]# pkill -9 stonithd
>>>
>>>
>>> * 2. I trigger STONITH, but stonith-ng reports that no matching
>>> device is found and does not fence.
>>>
>>> [root@vm1 ~]# ssh vm2 'rm /var/run/Dummy-prmDummy.state'
>>> [root@vm1 ~]# grep Found /var/log/ha-debug
>>> May  2 16:13:07 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
>>> May  2 16:13:19 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
>>> May  2 16:13:31 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
>>> May  2 16:13:43 vm1 stonith-ng[15115]: debug: stonith_query: Found 0 matching devices for 'vm2'
>>> (snip)
>>> [root@vm1 ~]#
>>>
>>>
>>> After stonithd restarts, it seems that the STONITH resource or lrmd
>>> needs to be restarted as well... is this the designed behavior?
>>
>>
>> No, that sounds like a bug.
>>
>>>
>>> # crm resource restart <STONITH resource (prmStonith2)>
>>> or
>>> # /usr/lib64/heartbeat/lrmd -r   (on the node where stonithd restarted)
>>>
>>> ----
>>> Best regards,
>>> Kazunori INOUE

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org