On Fri, Mar 2, 2012 at 5:07 PM, Junko IKEDA <tsukishima...@gmail.com> wrote:
> Hi,
>
> OK, we have to set up STONITH to handle this.
> By the way, I tried to run the group resource and do the same test.
>
> crm configuration:
>
> property \
>     no-quorum-policy="ignore" \
>     stonith-enabled="false" \
>     crmd-transition-delay="2s" \
>     cluster-recheck-interval="60s"
>
> rsc_defaults \
>     resource-stickiness="INFINITY" \
>     migration-threshold="1"
>
> primitive dummy01 ocf:heartbeat:Dummy \
>     op start timeout="60s" interval="0s" on-fail="restart" \
>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>     op stop timeout="60s" interval="0s" on-fail="block"
>
> primitive dummy02 ocf:heartbeat:Dummy-stop-NG \
>     op start timeout="60s" interval="0s" on-fail="restart" \
>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>     op stop timeout="60s" interval="0s" on-fail="block"
>
> group dummy-g dummy01 dummy02
>
> In this case, dummy02's stop operation fails (returns NG).
> dummy02 goes into unmanaged status, and after that, the
> Pacemaker shutdown freezes;
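For context: a stop blocked by on-fail="block" stays on record until an administrator clears it. Once the admin has manually verified that the resource is really down, the failure can be cleared by hand so the cluster can proceed. A sketch using the standard crm shell commands (resource name taken from the configuration above; this is illustrative, not part of the thread's test):

```
# After manually confirming dummy02 is stopped on the node,
# clear the failed stop from the resource's history:
crm resource cleanup dummy02

# Equivalent low-level command:
crm_resource --resource dummy02 --cleanup
```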
On the one hand the admin is saying "always stop A before B", but then
also asking for "stop B" while preventing "stop A". So the admin is
making incompatible demands; which one do you want us to ignore?

> it seems that Pacemaker is waiting for some cleanup operations on the
> unmanaged resources.
> If dummy01's stop fails instead, the Pacemaker shutdown works well.
> See attached hb_report.
>
> Thanks,
> Junko
>
> 2012/3/1 Andrew Beekhof <and...@beekhof.net>:
>> On Wed, Feb 29, 2012 at 6:32 PM, Junko IKEDA <tsukishima...@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm running the following simple configuration with Pacemaker 1.1.6,
>>> and trying the test case "resource stop fails (NG), then shut down
>>> Pacemaker".
>>>
>>> property \
>>>     no-quorum-policy="ignore" \
>>>     stonith-enabled="false" \
>>>     crmd-transition-delay="2s"
>>>
>>> rsc_defaults \
>>>     resource-stickiness="INFINITY" \
>>>     migration-threshold="1"
>>>
>>> primitive dummy01 ocf:heartbeat:Dummy-stop-NG \
>>>     op start timeout="60s" interval="0s" on-fail="restart" \
>>>     op monitor timeout="60s" interval="7s" on-fail="restart" \
>>>     op stop timeout="60s" interval="0s" on-fail="block"
>>>
>>> The "Dummy-stop-NG" RA just returns a stop failure (NG) to Pacemaker:
>>>
>>> # diff -urNp Dummy Dummy-stop-NG
>>> --- Dummy               2011-06-30 17:43:37.000000000 +0900
>>> +++ Dummy-stop-NG       2012-02-28 19:11:12.850207767 +0900
>>> @@ -108,6 +108,8 @@ dummy_start() {
>>>  }
>>>
>>>  dummy_stop() {
>>> +    exit $OCF_ERR_GENERIC
>>> +
>>>      dummy_monitor
>>>      if [ $? = $OCF_SUCCESS ]; then
>>>          rm ${OCF_RESKEY_state}
>>>
>>> Before the test, the resource is running on "bl460g6a".
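The injected `exit $OCF_ERR_GENERIC` in the diff above follows the OCF resource agent exit-code convention: OCF_ERR_GENERIC (1) from a stop is a hard failure, and with on-fail="block" Pacemaker leaves the resource unmanaged. A minimal, self-contained sketch of what the modified stop returns (the function name mirrors the RA, but this snippet is illustrative only, not the RA itself):

```shell
# OCF exit codes, as defined by the OCF resource agent API:
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Mimics the injected failure in Dummy-stop-NG: fail before any
# of the real stop logic runs.
dummy_stop() {
    return $OCF_ERR_GENERIC
}

dummy_stop
rc=$?
echo "stop rc=$rc"
```

Any non-zero stop result other than OCF_NOT_RUNNING is treated as a stop failure, which is what triggers the on-fail="block" path seen in the thread.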
>>>
>>> # crm_simulate -S -x pe-input-1.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Stopped
>>>
>>> Transition Summary:
>>> crm_simulate[14195]: 2012/02/29_15:46:57 notice: LogActions: Start
>>> dummy01  (bl460g6a)
>>>
>>> Executing cluster transition:
>>>  * Executing action 6: dummy01_monitor_0 on bl460g6b
>>>  * Executing action 4: dummy01_monitor_0 on bl460g6a
>>>  * Executing action 7: dummy01_start_0 on bl460g6a
>>>  * Executing action 8: dummy01_monitor_7000 on bl460g6a
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>> Stop Pacemaker on "bl460g6a":
>>> # service heartbeat stop
>>>
>>> Pacemaker first tries to stop the resource and move it to "bl460g6b":
>>> # crm_simulate -S -x pe-input-2.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>>
>>> Transition Summary:
>>> crm_simulate[12195]: 2012/02/29_15:35:02 notice: LogActions: Move
>>> dummy01  (Started bl460g6a -> bl460g6b)
>>>
>>> Executing cluster transition:
>>>  * Executing action 6: dummy01_stop_0 on bl460g6a
>>>  * Executing action 7: dummy01_start_0 on bl460g6b
>>>  * Executing action 8: dummy01_monitor_7000 on bl460g6b
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6b
>>>
>>> but this stop action fails, so the resource goes into the unmanaged state:
>>> # crm_simulate -S -x pe-input-3.bz2
>>>
>>> Current cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>> (unmanaged) FAILED
>>>
>>> Transition Summary:
>>>
>>> Executing cluster transition:
>>>
>>> Revised cluster status:
>>> Online: [ bl460g6a bl460g6b ]
>>>
>>>  dummy01        (ocf::heartbeat:Dummy-stop-NG): Started bl460g6a
>>> (unmanaged) FAILED
>>>
>>> The Pacemaker shutdown on "bl460g6a" still completes successfully;
>>> it seems that the following patch works well:
>>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>>>
>>> At this point, the resource on "bl460g6a" (where Pacemaker has
>>> already shut down) might still be running, because it failed to stop.
>>
>> This is because we ignore the status section of any offline nodes when
>> stonith-enabled=false.
>>
>>> In fact, the resource didn't start on "bl460g6b" after its stop
>>> failure and "bl460g6a"'s shutdown, and this is the expected behavior,
>>> but I could still start it on "bl460g6b" with the crm command.
>>> This holds the potential for an unexpected active/active status.
>>> Is it possible to prevent its start in this situation?
>>
>> Only by disabling the logic in
>> https://github.com/ClusterLabs/pacemaker/commit/07976fe5eb04c432f1d1c9aebb1b1587ba7f0bcf#pengine/graph.c
>> when stonith is disabled.
>>
>>> For example:
>>> (1) Dummy runs on node-a
>>> (2) Shut down Pacemaker on node-a, and Dummy's stop fails
>>> (3) Dummy can not run on other nodes
>>> (4) * clean up the unmanaged status of Dummy after checking its
>>>       manual operation on node-a
>>> (5) * start Dummy on other nodes
>>> This would be the safe way.
>>>
>>> See attached hb_report.
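As noted above, the status section of offline nodes is only trusted when fencing is configured, which is why the thread concludes that STONITH has to be set up. A sketch of what that could look like in the crm shell, using external/ssh purely as a placeholder test device (hostnames taken from the thread; substitute a real fencing agent in production):

```
# Sketch only: enable fencing so the state of offline nodes
# is no longer ignored, and a failed-to-stop node can be fenced.
property stonith-enabled="true"

# external/ssh is for testing only; use a real power/IPMI device in production.
primitive st-ssh stonith:external/ssh \
    params hostlist="bl460g6a bl460g6b" \
    op monitor interval="60s"
clone st-ssh-clone st-ssh
```

With fencing enabled, a stop failure during shutdown leads to the node being fenced rather than leaving an unmanaged resource potentially active on a departed node.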
>>>
>>> Thanks,
>>> Junko IKEDA
>>>
>>> NTT DATA INTELLILINK CORPORATION
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org