On Sat, Nov 10, 2012 at 6:35 AM, David Vossel <dvos...@redhat.com> wrote:
> ----- Original Message -----
>> From: "Lars Marowsky-Bree" <l...@suse.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Friday, November 9, 2012 11:54:16 AM
>> Subject: Re: [Pacemaker] Enable remote monitoring
>>
>> On 2012-11-09T11:46:59, David Vossel <dvos...@redhat.com> wrote:
>>
>> > What if we made something similar to the concept of an "un-managed"
>> > resource: one that is only ever monitored, but is otherwise treated
>> > like a normal resource?  Start/stop could still execute, but start
>> > would really just be the first "monitor" operation, and stop would
>> > just mean the recurring "monitor" is cancelled.
>> >
>> > Having "start" redirect to "monitor" in pacemaker would take care
>> > of the timeout problem you were all talking about with the first
>> > failure.  Set the start operation to some larger timeout.
>> > Basically, start would just verify that monitor passed once; then
>> > you could move on to the normal monitor timeouts/intervals.  Stop
>> > would always return success and cancel whatever recurring monitors
>> > are running.
>>
>> That's exactly the kind of abstraction a resource agent class can
>> provide for the nagios agents, though - no need to have that special
>> knowledge in the PE.  The LRM can hide this, which is partly its
>> purpose.
>
> I know nothing about the nagios agents, but if we are taking that
> route, why not just have the nagios agents map the "start" action to
> "monitor" instead of making a new class?  Then the PE and LRMD
> wouldn't need any special knowledge of this.

It needs to be a new class because the scripts (I'm pretty sure) follow
a completely different API to anything else we support.
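From the cluster's side, though, a resource of such a class could be
configured like any other primitive.  Something along these lines,
perhaps - a sketch only, with the class name, type, parameter and
timeout values all invented for illustration:

  <primitive id="web-check" class="nagios" type="check_http">
    <instance_attributes id="web-check-params">
      <nvpair id="web-check-host" name="hostname" value="www.example.com"/>
    </instance_attributes>
    <operations>
      <!-- "start" is really a first one-shot monitor; give it a larger
           timeout so the service has time to come up, as David suggests -->
      <op id="web-check-start" name="start" interval="0" timeout="120s"/>
      <op id="web-check-monitor" name="monitor" interval="30s" timeout="30s"/>
      <!-- "stop" only cancels the recurring monitor and always succeeds -->
      <op id="web-check-stop" name="stop" interval="0" timeout="20s"/>
    </operations>
  </primitive>

The class (or the lrmd on its behalf) would turn both "start" and each
recurring "monitor" into the same one-shot plugin invocation, and turn
"stop" into cancelling the monitor and returning success - exactly the
mapping described above.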
>
>> > Now that I think about it, I'm not even sure we need the new
>> > container Andrew and I talked about at all if we introduce
>> > "monitor-only" resources.
>>
>> Yes. We'd still need it.
>>
>> > At this point we could just have a group where the first member
>> > launches the vm, and all the members after that are the
>> > monitor-only resources that start/stop like normal resources as
>> > far as the PE is concerned.  If any of the group members fail, I
>> > guess we'd need the whole group to be recovered in the right order.
>>
>> That's the point - the "right order" for a container is not quite the
>> right order for a regular group.  Basically, the group semantics
>> would recover from the failed resource onward, never the VM resource
>> (the container).
>
> Seems like it would be possible to create a group option to recover
> all group members from the first resource onward on a failure.  As
> long as the vm remains first, would the right order not be preserved?

Please.  Not a group.  Groups are groups and these are different.
Please don't make groups any worse than they already are ;-)

>
>> If you look at my proposal, I actually made "container=" a group
>> attribute - because we need to map monitor failures to the container,
>> as well as ignore any stop failures (the service is down cleanly as
>> long as the container is eventually stopped).
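(For reference, that proposal would presumably come out in the CIB
looking something like the sketch below; the "container" meta-attribute
name comes from Lars's description, while the ids and resource details
are guesses for illustration:)

  <group id="vm_and_services">
    <meta_attributes id="vm_and_services-meta">
      <!-- names the member that hosts the others, so monitor failures
           on the other members are mapped back to it -->
      <nvpair id="vm_and_services-container" name="container" value="vm"/>
    </meta_attributes>
    <primitive id="vm" class="ocf" provider="heartbeat" type="VirtualDomain"/>
    <primitive id="web-check" class="nagios" type="check_http"/>
  </group>

A failed monitor on "web-check" would then be recovered by restarting
"vm", rather than by the usual group semantics of recovering from the
failed member onward.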
> I see what you are saying.  This is basically the same concept I saw
> earlier, where the monitor resources were defined in the operation
> tags of a resource.  This abstraction moves the resource to the
> container and makes the monitor operations resource primitives that
> are only monitored.
>
> I don't think we should have to worry about stop failures at all,
> though.  Stop failures shouldn't occur except possibly at the vm
> resource.  With the "monitor-only" resources I outlined, or with the
> new resource class you proposed, stop should just always succeed for
> these monitor resources.  No ignoring of stop failures should have to
> occur.
>
>> I think the shell might render this differently, even if we express
>> it as a group + meta-attribute(s) in the XML (which seems to be the
>> way to go).  "container ..." is easier on the eyes ;-)
>
> It doesn't matter to me how this is represented for the user through
> whatever cli/shell tool someone uses.
>
> Assuming we figure out a way to make these nagios resources map
> "start" to "monitor", and stop always succeeds (however we agree on
> doing that - new class, resource option, whatever), would the
> abstraction below work if the group could be made to recover from the
> first resource onward for any failure in the chain?
>
> <group id="vm_and_resources">
>    <primitive id="vm" ... />
>    <primitive id="rsc-monitor-only-whatever" ... />
>    <primitive id="rsc-monitor-only-somethingelse" ... />
> </group>
>
> If the above works, we get clones of these groups for free, and the
> implementation should be fairly straightforward.

Trust me, don't go there.  Groups are already a sufficiently tortured
construct.

> -- Vossel
>
>> Regards,
>>     Lars

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org