On Fri, Jul 8, 2011 at 2:43 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> On Wed, Jul 6, 2011 at 10:26 PM, Andreas Kurz <andreas.k...@linbit.com> wrote:
>> On 2011-07-05 00:24, Andrew Beekhof wrote:
>>> On Fri, Jul 1, 2011 at 9:23 PM, Andreas Kurz <andreas.k...@linbit.com>
>>> wrote:
>>>> Hello,
>>>>
>>>> In a cluster without stonith enabled (yes, I know ...) the monitor
>>>> failure of one resource followed by the stop failure of a dependent
>>>> resource led to a cascade of errors, especially because the cluster did
>>>> not stop the shutdown sequence on stop (timeout) failures:
>>>>
>>>> WARN: should_dump_input: Ignoring requirement that
>>>> resource_fs_home_stop_0 comeplete before ms_drbd_home_demote_0:
>>>> unmanaged failed resources cannot prevent clone shutdown
>>>>
>>>> ... and that is really ugly in a DRBD environment, because demote/stop
>>>> will not work while the DRBD device is in use -- so in this case this
>>>> stop-order requirement must not be ignored.
>>>
>>> Did you ask the cluster to shut down before or after the first resource
>>> failed?
>>
>> Neither the former nor the latter ...
>>
>> * the IP resource had a monitor failure
>> * the restart triggered a restart of all dependent resources
>> * one resource had a stop failure
>> * the cluster decided that this failed resource must move away
>> * the cluster decided to move the IP to the second node, so all dependent
>>   resources had to follow
>> * a cascading stop began
>> * two file systems were unable to unmount --> stop failure/unmanaged,
>>   ignored
>>
>> ... and then the really ugly things happened: demote and stop on the DRBD
>> ms resources were triggered although the file systems were still online.
>>
>> Furthermore, the cluster additionally tried to promote DRBD on the second
>> node, which is also impossible while the other side is not demoted/stopped.
>>
>> Of course there are clone/ms resources that can stop independently of
>> their dependent resources, but DRBD is one that can't.
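[For readers following along, the kind of constraint setup under discussion looks roughly like this. The resource names resource_fs_home and ms_drbd_home are taken from the log message above; the syntax is a crm-shell sketch, not the poster's actual configuration:]

```
# Illustrative crm-shell sketch of the constraints being discussed.
# resource_fs_home and ms_drbd_home come from the quoted log line;
# everything else is hypothetical.
order o_drbd_before_fs inf: ms_drbd_home:promote resource_fs_home:start
colocation c_fs_on_drbd inf: resource_fs_home ms_drbd_home:Master
```

On shutdown Pacemaker reverses the order constraint, so the filesystem must stop before ms_drbd_home is demoted -- the quoted WARN line shows exactly that requirement being dropped once the filesystem's stop has failed and the resource is unmanaged.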
>>
>> So I think there should be a way to tell a clone/ms resource to _not_
>> ignore the order requirements on stop failures of dependent resources.
>>
>>>
>>>> The result was a lot of unmanaged resources, and the cluster even tried
>>>> to promote the ms resource on the other node although the second
>>>> instance was neither demoted nor stopped.
>>>
>>> We seem to lose either way.
>>> If we have the cluster block, people complain shutdown takes too long.
>>
>> This seems sensible for a cluster shutdown, but I don't think it is
>> good behavior on resource migration. The cluster was really in a heavy
>> mess after this stop cascade.
>
> Which was caused by a failed stop ;-)
> None of this would have happened if stonith was enabled or if the
> original stop was reliable (which it is required to be).
>
>>
>> I would expect the cluster to block on the first stop error and wait for
>> manual intervention if no fencing is configured.
>
> I think you'd need to set on-fail=block for that.
> To be honest, no vendor supports clusters without a fencing device, so
> we don't seriously test it.
>
>>
>>>
>>> Basically, at the point a resource fails and stonith is not configured,
>>> shutdown is best-effort.
>>
>> I agree, shutdown of the failed resource can be best-effort ...
>> especially on cluster shutdown ... even then I'd like to have a choice.
>>
>> I don't agree that ignoring the stop-order requirements is also
>> best-effort on a "simple" resource stop/migration ... I'd like to tweak my
>> resource to insist on stop order for dependent resources.
>
> The problem (normally) is that the cluster is being given conflicting
> instructions.
> It can't both shut down and observe the constraint.
>
> Probably the cluster should not have tried to shut down the node as a
> substitute for lack of stonith.
>
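[As a concrete illustration of the on-fail=block suggestion above: on-fail is a per-operation option, so it can be set on the stop operation of the affected resource. A crm-shell sketch; the primitive definition and its parameters are hypothetical, only on-fail="block" is the point being made:]

```
# Hypothetical primitive: when a stop fails, block (leave the resource
# where it is and wait for an administrator) instead of escalating.
primitive resource_fs_home ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/home" fstype="ext4" \
    op stop interval="0" timeout="60s" on-fail="block"
```

With on-fail="block" the cluster stops processing further actions for that resource on a stop failure, which is the "block on the first stop error and wait for manual intervention" behavior requested above.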
Correction, I just checked the code and it does not try to do this.
So some part of your sequence of events is off -- someone initiated a
shutdown too.

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker