Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T09:30:48, Michael Smith wrote: > >the resource agent basically does a "xm list --long" while > >monitoring, which takes less than half a second in a console. > I think sometimes xend hangs for a while. 30 seconds should be good. There's a pending fix for this, which introduces a fa
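For context, a minimal crm shell sketch of the kind of monitor timeout tuning being discussed; the resource name, xmfile path, and interval values are assumptions rather than the poster's actual configuration:

    primitive intranet1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/intranet1.cfg" \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="300s"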

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Michael Smith
Bart Coninckx wrote: By the way: things seem better when I change the monitor timeout to 30 seconds instead of 10 seconds. Very strange though, because the resource agent basically does a "xm list --long" while monitoring, which takes less than half a second in a console. I think sometimes

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Florian Haas
On 2011-01-13 13:16, Bart Coninckx wrote: > On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: >> On 2011-01-13T11:48:41, Bart Coninckx wrote: >>> I notice that you work at Novell, this is a SLES11SP1 installation so if the >>> resource agent for Xen is faulty I guess you know about it? >

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: > On 2011-01-13T11:48:41, Bart Coninckx wrote: > > I notice that you work at Novell, this is a SLES11SP1 installation so if the > > resource agent for Xen is faulty I guess you know about it? > > Yes, I think I'd know about it. The Xen R

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T11:48:41, Bart Coninckx wrote: > I notice that you work at Novell, this is a SLES11SP1 installation so if the > resource agent for Xen is faulty I guess you know about it? Yes, I think I'd know about it. The Xen RA doesn't have any known bugs at the moment, but make sure that all mai

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:13:42 Lars Marowsky-Bree wrote: > On 2011-01-13T11:08:49, Bart Coninckx wrote: > > thx for your answer. > > So do I get this straight: > > - resource undergoes monitor operation > > - monitor reports failure > > - a restart of the resource is issued (stop and start)

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T11:08:49, Bart Coninckx wrote: > thx for your answer. > So do I get this straight: > - resource undergoes monitor operation > - monitor reports failure > - a restart of the resource is issued (stop and start) > - stop fails > - PE decides to fence the node because of this regardles
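A hedged illustration of the escalation path sketched in this exchange, written as crm shell operation and meta attribute settings; the specific values are assumptions, not taken from the thread:

    primitive intranet1 ocf:heartbeat:Xen \
        params xmfile="/etc/xen/vm/intranet1.cfg" \
        meta migration-threshold="3" failure-timeout="120s" \
        op monitor interval="30s" timeout="30s" on-fail="restart" \
        op stop interval="0" timeout="300s" on-fail="fence"

With STONITH enabled, a failed stop escalates to fencing regardless of the monitor's on-fail setting, because the cluster can no longer verify that the resource is really down on that node.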

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 09:51:16 Lars Marowsky-Bree wrote: > On 2011-01-12T22:52:14, Bart Coninckx wrote: > > Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing > > failed op intranet1_stop_0 on xen1: unknown exec error (-2) > > > > My monitors are set to restart a resource.

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-12T22:52:14, Bart Coninckx wrote: > Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing failed > op intranet1_stop_0 on xen1: unknown exec error (-2) > My monitors are set to restart a resource. What makes the PE decide to fence > the node instead of first trying t

[Pacemaker] fencing to recover from failed resources

2011-01-12 Thread Bart Coninckx
Hi, I get a lot of fencing on my two node cluster with these messages: Jan 12 22:20:34 xen2 pengine: [6633]: info: get_failcount: intranet1 has failed INFINITY times on xen1 Jan 12 22:20:34 xen2 pengine: [6633]: info: get_failcount: intranet1 has failed INFINITY times on xen1 Jan 12 22:20:34 xe
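The INFINITY failcount shown in these logs can be inspected and, once the underlying problem is fixed, cleared from the crm shell; a sketch, assuming the resource and node names from the log excerpt and a crmsh of roughly this vintage:

    # show the accumulated failcount for intranet1 on xen1
    crm resource failcount intranet1 show xen1

    # clear the failure history so the resource is eligible to run there again
    crm resource cleanup intranet1 xen1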