On Sat, Nov 10, 2012 at 6:35 AM, David Vossel <dvos...@redhat.com> wrote:
> ----- Original Message -----
>> From: "Lars Marowsky-Bree" <l...@suse.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>> Sent: Friday, November 9, 2012 11:54:16 AM
>> Subject: Re: [Pacemaker] Enable remote monitoring
>>
>> On 2012-11-09T11:46:59, David Vossel <dvos...@redhat.com> wrote:
>>
>> > What if we made something similar to the concept of an "un-managed"
>> > resource, in that it is only ever monitored, but treated it like a
>> > normal resource.  Meaning start/stop could still execute, but
>> > start is really just the first "monitor" operation and stop just
>> > means the recurring "monitor" cancels.
>> >
>> > Having "start" redirect to "monitor" in pacemaker would take care
>> > of that timeout problem you all were talking about with the first
>> > failure.  Set the start operation to some larger timeout.
>> >  Basically start would just verify that monitor passed once, then
>> > you could move on to the normal monitor timeouts/intervals.  Stop
>> > would always return success and cancel whatever recurring monitors
>> > are running.
>>
>> That's exactly the kind of abstraction a resource agent class can
>> provide though for the nagios agents - no need to have that special
>> knowledge in the PE. The LRM can hide this, which is partly its
>> purpose.
>
> I know nothing about the nagios agents, but if we are taking that route, why 
> not just have the nagios agents map the "start" action to "monitor" instead 
> of making a new class?  Then the PE and LRMD don't need any special knowledge 
> of this.

It needs to be a new class because the scripts (I'm pretty sure)
follow a completely different API to anything else we support.
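
For context: nagios plugins are one-shot checks that exit 0/1/2/3 for
OK/WARNING/CRITICAL/UNKNOWN, whereas OCF agents take an action argument
(start/stop/monitor/...) and use the OCF exit-code set. A rough sketch
of a wrapper in the spirit of David's mapping - the plugin path is a
placeholder, nothing like this actually exists - could look like:

  #!/bin/sh
  # Sketch only: expose a one-shot nagios-style check as an OCF-ish
  # agent where "start" is the first check and "stop" always succeeds.
  PLUGIN="/usr/lib/nagios/plugins/check_something"  # hypothetical path

  OCF_SUCCESS=0
  OCF_ERR_UNIMPLEMENTED=3
  OCF_NOT_RUNNING=7

  do_monitor() {
      "$PLUGIN"  # nagios convention: 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN
      case $? in
          0|1) return $OCF_SUCCESS ;;      # OK/WARNING: service is up
          2)   return $OCF_NOT_RUNNING ;;  # CRITICAL: treat as stopped
          *)   return 1 ;;                 # UNKNOWN: generic error
      esac
  }

  case "$1" in
      start|monitor) do_monitor ;;         # "start" just verifies one check
      stop)          exit $OCF_SUCCESS ;;  # always succeeds; the recurring
                                           # monitor is simply cancelled
      *)             exit $OCF_ERR_UNIMPLEMENTED ;;  # meta-data etc. omitted
  esac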

>
>> > Now that I think about it, I'm not even sure we need the new
>> > container Andrew and I talked about at all if we introduce
>> > "monitor-only" resources.
>>
>> Yes. We'd still need it.
>>
>> > At this point we could just have a group where the first member
>> > launches the vm, and all the members after that are the
>> > monitor-only resources that start/stop similar to normal resources
>> > for the PE.  If any of the group members fail, I guess we'd need
>> > the whole group to be recovered in the right order.
>>
>> That's the point - the "right order" for a container is not quite the
>> same as for a regular group. Basically, the group semantics would
>> recover from the failed resource onward, never the VM resource
>> (container).
>
> Seems like it would be possible to create a group option to recover all group 
> members from the first resource onward on a failure.  As long as the vm 
> remains first, would the right order not be preserved?
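
Purely for illustration - no such group option exists, and the
meta-attribute name below is invented - David's suggestion might be
spelled something like this in the CIB:

  <group id="vm_and_resources">
    <meta_attributes id="vm_and_resources-meta">
      <!-- hypothetical: on any member failure, restart the whole
           group starting from the first member, i.e. the VM -->
      <nvpair id="vm_and_resources-recovery" name="on-member-failure"
              value="restart-from-first"/>
    </meta_attributes>
    <primitive id="vm" ... />
    <primitive id="rsc-monitor-only-whatever" ... />
  </group>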

Please. Not a group. Groups are groups and these are different. Please
don't make groups any worse than they already are ;-)

>
>> If you look at my proposal, I actually made "container=" a group
>> attribute - because we need to map monitor failures to the container,
>> as well as ignore any stop failures (the service is down cleanly as
>> long as the container is eventually stopped).
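
Taken literally, Lars's proposal might look roughly like this in the
XML - again hypothetical, since no "container" group attribute existed
at this point:

  <group id="vm_and_resources">
    <meta_attributes id="vm_and_resources-meta">
      <!-- hypothetical: member monitor failures are mapped to (and
           recovered via) the named container resource; member stop
           failures are ignored as long as the container itself stops -->
      <nvpair id="vm_and_resources-container" name="container" value="vm"/>
    </meta_attributes>
    <primitive id="vm" ... />
    <primitive id="rsc-monitor-only-whatever" ... />
  </group>

In the shell, this whole construct could then be rendered as
"container ...", as Lars notes below.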
>
> I see what you are saying. This is basically the same concept I saw earlier, 
> where the monitor resources were defined in the operation tags of a resource. 
> This abstraction makes the resource itself the container and turns the monitor 
> operations into resource primitives that are only monitored.
>
> I don't think we should have to worry about stop failures at all though.  
> Stop failures shouldn't occur except possibly at the vm resource.  With the 
> "monitor-only" resources I outlined, or with the new resource class you 
> proposed, stop should just always succeed for these monitor resources, so 
> there should be no need to ignore stop failures.
>
>>
>> I think the shell might render this differently, even if we express it
>> as a group + meta-attribute(s) in the XML (which seems to be the way to
>> go). "container ..." is easier on the eyes ;-)
>
> It doesn't matter to me how this is represented for the user through whatever 
> cli/shell tool someone uses.
>
> Assuming we figure out a way to make these nagios resources map "start" to 
> "monitor", and stop always succeeds (however we agree on doing that, new 
> class, resource option, whatever), would the abstraction below work if the 
> group could be made to recover from the first resource onward for any failure 
> in the chain?
>
> <group id="vm_and_resources">
>   <primitive id="vm" ... />
>   <primitive id="rsc-monitor-only-whatever" ... />
>   <primitive id="rsc-monitor-only-somethingelse" ... />
> </group>
>
> If the above works we get clones of these groups for free, and the 
> implementation should be fairly straightforward.
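
The "clones for free" part is just the standard clone wrapper around a
group, e.g.:

  <clone id="vm_and_resources-clone">
    <group id="vm_and_resources">
      <primitive id="vm" ... />
      <primitive id="rsc-monitor-only-whatever" ... />
    </group>
  </clone>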

Trust me, don't go there.
Groups are already a sufficiently tortured construct.

> -- Vossel
>
>>
>>
>> Regards,
>>     Lars
>>
>> --
>> Architect Storage/HA
>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
>> Imendörffer, HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar
>> Wilde
