[Pacemaker] How to delete warning information

2011-01-13 Thread jiaju liu
When I use the command crm configure property start-failure-is-fatal=FALSE, it shows WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized WARNING: status: operation not recognized WARNIN

[Pacemaker] Help with configuring pacemaker automatically with chef

2011-01-13 Thread Todd Nine
Hi guys, I'm having a hard time finding the info I need to configure pacemaker from an input file. I've been using Zookeeper a lot in our application tier, so I'm familiar with clusters; however, I'm struggling to adapt that knowledge to the pacemaker configuration. Here is an overview of our c

Re: [Pacemaker] Howto write a STONITH agent

2011-01-13 Thread Bob Haxo
Hi Christoph, Have you taken a look in /usr/lib64/stonith/plugins/external? The "ipmi" plugin might serve as a coding example/template. Or maybe the "drac5" plugin. At first glance, "drac5" appears to be using ssh. Bob Haxo On Thu, 2011-01-13 at 21:09 +0100, Christoph Herrmann wrote: > Hi, >

Re: [Pacemaker] Node doesn't rejoin automatically after reboot - POSSIBLE CAUSE

2011-01-13 Thread Bob Haxo
Hi Tom (and Andrew), I figured out an easy fix for the problem that I encountered. However, there would seem to be a problem lurking in the code. Here is what I found. On one of the servers that was online and hosting resources: r2lead1:~ # netstat -a | grep crm Proto RefCnt Flags Type

[Pacemaker] Howto write a STONITH agent

2011-01-13 Thread Christoph Herrmann
Hi, I have some brand new HP Blades with ILO Boards (iLO 2 Standard Blade Edition 1.81 ...) But I'm not able to connect with them via the external/riloe agent. When I try: stonith -t external/riloe -p "hostlist=node1 ilo_hostname=ilo1 ilo_user=ilouser ilo_password=ilopass ilo_can_reset=1 ilo_p

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Bob Haxo
So, Tom ... how do you get the failed node online? I've re-installed with the same image that is running on three other nodes, but it still fails. This node was quite happy for the past 3 months. As I'm testing installs, this and other nodes have been installed a significant number of times withou

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Tom Tux
I don't know. I still have this issue (and it seems that I'm not the only one...). I'll check whether pacemaker updates are available through the zypper update channel (sles11-sp1). Regards, Tom 2011/1/13 Bob Haxo : > Tom, others, > > Please, what was the solution to this issue? > > Tha

Re: [Pacemaker] Node doesn't rejoin automatically after reboot

2011-01-13 Thread Bob Haxo
Tom, others, Please, what was the solution to this issue? Thanks, Bob Haxo On Mon, 2010-09-06 at 09:50 +0200, Tom Tux wrote: > Yes, corosync is running after the reboot. It comes up with the > regular init-procedure (runlevel 3 in my case). > > 2010/9/6 Andrew Beekhof : > > On Mon, Sep 6, 20

[Pacemaker] [Ubuntu-ha] startup problem DLM on ubuntu lucid

2011-01-13 Thread Jake Smith
I read the thread related to this startup problem (dlm segfaults when server comes up with corosync auto starting up). I just have one follow-up question: The 3.07 package in Ubuntu-HA has not been patched for Lucid yet and there is not a backport of 3.0.12 for Lucid to fix this problem. So i

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T09:30:48, Michael Smith wrote: > >the resource agent basically does a "xm list --long" while > >monitoring, which takes less than half a second in a console. > I think sometimes xend hangs for a while. 30 seconds should be good. There's a pending fix for this, which introduces a fa

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Michael Smith
Bart Coninckx wrote: By the way: things seem better when I change the monitor timeout to 30 seconds instead of 10 seconds. Very strange though, because the resource agent basically does a "xm list --long" while monitoring, which takes less than half a second in a console. I think sometimes

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Florian Haas
On 2011-01-13 13:16, Bart Coninckx wrote: > On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: >> On 2011-01-13T11:48:41, Bart Coninckx wrote: >>> I notice that you work for Novell, this is a SLES11SP1 installation so if the >>> resource agent for Xen is faulty I guess you know about it? >

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:58:03 Lars Marowsky-Bree wrote: > On 2011-01-13T11:48:41, Bart Coninckx wrote: > > I notice that you work for Novell, this is a SLES11SP1 installation so if the > > resource agent for Xen is faulty I guess you know about it? > > Yes, I think I'd know about it. The Xen R

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T11:48:41, Bart Coninckx wrote: > I notice that you work for Novell, this is a SLES11SP1 installation so if the > resource agent for Xen is faulty I guess you know about it? Yes, I think I'd know about it. The Xen RA doesn't have any known bugs at the moment, but make sure that all mai

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:13:42 Lars Marowsky-Bree wrote: > On 2011-01-13T11:08:49, Bart Coninckx wrote: > > thx for your answer. > > So do I get this straight: > > - resource undergoes monitor operation > > - monitor reports failure > > - a restart of the resource is issued (stop and start)

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 11:13:42 Lars Marowsky-Bree wrote: > On 2011-01-13T11:08:49, Bart Coninckx wrote: > > thx for your answer. > > So do I get this straight: > > - resource undergoes monitor operation > > - monitor reports failure > > - a restart of the resource is issued (stop and start)

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-13T11:08:49, Bart Coninckx wrote: > thx for your answer. > So do I get this straight: > - resource undergoes monitor operation > - monitor reports failure > - a restart of the resource is issued (stop and start) > - stop fails > - PE decides to fence the node because of this regardles

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Bart Coninckx
On Thursday 13 January 2011 09:51:16 Lars Marowsky-Bree wrote: > On 2011-01-12T22:52:14, Bart Coninckx wrote: > > Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing > > failed op intranet1_stop_0 on xen1: unknown exec error (-2) > > > > My monitors are set to restart a resource.

[Pacemaker] "Stretched" cluster support

2011-01-13 Thread Valentin Vidic
On Thu, Jan 13, 2011 at 10:14:09AM +0100, Lars Marowsky-Bree wrote: > Introduction: At LPC 2010, we discussed (once more) that a key feature > for pacemaker in 2011 would be improved support for multi-site clusters; > by multi-site, we mean two (or more) sites with a local cluster each, > and some

[Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

2011-01-13 Thread Lars Marowsky-Bree
Hi all, sorry for the delay in posting this. Introduction: At LPC 2010, we discussed (once more) that a key feature for pacemaker in 2011 would be improved support for multi-site clusters; by multi-site, we mean two (or more) sites with a local cluster each, and some higher level entity coordinat

Re: [Pacemaker] fencing to recover from failed resources

2011-01-13 Thread Lars Marowsky-Bree
On 2011-01-12T22:52:14, Bart Coninckx wrote: > Jan 12 22:20:34 xen2 pengine: [6633]: WARN: unpack_rsc_op: Processing failed > op intranet1_stop_0 on xen1: unknown exec error (-2) > My monitors are set to restart a resource. What makes the PE decide to fence > the node instead of first trying t