-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of
[email protected]
Sent: Thursday, April 12, 2012 4:04 AM
To: [email protected]
Subject: Linux-HA Digest, Vol 101, Issue 11
Send Linux-HA mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.linux-ha.org/mailman/listinfo/linux-ha
or, via email, send a message with subject or body 'help' to
[email protected]
You can reach the person managing the list at
[email protected]
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Linux-HA digest..."
Today's Topics:
1. Problem with stonith RA using external/ipmi over lan or
lanplus interface (Pham, Tom)
2. Re: pacemaker+drbd promotion delay (Lars Ellenberg)
3. Re: Problem with stonith RA using external/ipmi over lan or
lanplus interface (emmanuel segura)
4. Re: ocf:heartbeat:apache resource agent and timeouts
(Lars Ellenberg)
5. Re: ocf:heartbeat:apache resource agent and timeouts
(Lars Ellenberg)
6. Re: ocf:heartbeat:apache resource agent and timeouts
(Lars Ellenberg)
7. Re: Problem with stonith RA using external/ipmi over lan or
lanplus interface (Nikita Michalko)
----------------------------------------------------------------------
Message: 1
Date: Wed, 11 Apr 2012 21:00:41 +0000
From: "Pham, Tom" <[email protected]>
Subject: [Linux-HA] Problem with stonith RA using external/ipmi over
lan or lanplus interface
To: "'[email protected]'" <[email protected]>
Message-ID:
<b7fdce9d6f80aa40aabdaa98d00c4a3e066...@wdc1exchmbxp03.hq.corp.viasat.com>
Content-Type: text/plain; charset="us-ascii"
Hi everyone,
I am trying to test a two-node cluster with a stonith resource using
external/ipmi (I tried external/riloe first, but it does not seem to work).
My system: HP ProLiant BL460c G7 with iLO 3 card, firmware 1.15
SUSE 11
Corosync version 1.2.7; Pacemaker 1.0.9
When I use the lan or lanplus interface, the stonith resource fails to start.
I get the error below:
external/ipmi[12173]: [12184]: ERROR: error executing ipmitool: Error: Unable
to establish IPMI v2 / RMCP+ session Unable to get Chassis Power Status
However, when I used interface = open instead of lan/lanplus, the stonith
resource started fine. When I tried kill -9 on corosync on node1, I expected
it to reboot node1 and start all resources on node2, but it only rebooted
node1. Someone mentioned that the open interface is a local interface and
only allows a node to fence itself.
Does anyone know why lan/lanplus does not work?
Thanks
Tom Pham
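[Editor's note] The failure above can usually be narrowed down outside the cluster first. A minimal diagnostic sketch (host, user, and password are placeholders; iLO 3 implements IPMI v2.0, which is what the lanplus interface speaks):

```shell
# Placeholders: replace ILO_IP, USER, PASS with the iLO's real values.
# iLO 3 speaks IPMI v2.0 (RMCP+), so lanplus is the interface to test:
ipmitool -I lanplus -H ILO_IP -U USER -P PASS chassis power status
# If that fails, check from the host itself whether IPMI-over-LAN is
# enabled on the BMC's LAN channel (often channel 1 or 2):
ipmitool -I open lan print 1
```

If the lanplus call works from the shell but the stonith resource still fails, the next step is to compare the RA's interface/user/password parameters against what worked on the command line.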
------------------------------
Message: 2
Date: Thu, 12 Apr 2012 09:26:36 +0200
From: Lars Ellenberg <[email protected]>
Subject: Re: [Linux-HA] pacemaker+drbd promotion delay
To: General Linux-HA mailing list <[email protected]>
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8
On Wed, Apr 11, 2012 at 08:22:59AM +1000, Andrew Beekhof wrote:
> It looks like the drbd RA is calling crm_master during the monitor action.
> That wouldn't seem like a good idea as the value isn't counted until
> the resource is started and if the transition is interrupted (as it is
> here) then the PE won't try to promote it (because the value didn't
> change).
I did not get the last part.
Why would it not be promoted,
even though it has a positive master score?
> Has the drbd RA always done this?
Yes.
When else should we call crm_master?
Preference changes: we may lose a local disk,
we may have been outdated or inconsistent,
then sync up, etc.
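[Editor's note] For context, the preference changes described here are announced to the cluster with crm_master from inside the resource agent; a sketch of the two common calls (the score value is illustrative):

```shell
# Inside an OCF resource agent (sketch): publish or withdraw this node's
# suitability for promotion. "-l reboot" keeps the value for the node's
# lifetime only; it is rebuilt after a restart.
crm_master -Q -l reboot -v 10000   # data is good: prefer promotion here
crm_master -Q -l reboot -D         # data degraded: drop the preference
```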
> On Sat, Mar 31, 2012 at 2:56 AM, William Seligman
> <[email protected]> wrote:
> > On 3/30/12 1:13 AM, Andrew Beekhof wrote:
> >> On Fri, Mar 30, 2012 at 2:57 AM, William Seligman
> >> <[email protected]> wrote:
> >>> On 3/29/12 3:19 AM, Andrew Beekhof wrote:
> >>>> On Wed, Mar 28, 2012 at 9:12 AM, William Seligman
> >>>> <[email protected]> wrote:
> >>>>> The basics: Dual-primary cman+pacemaker+drbd cluster running on
> >>>>> RHEL6.2; spec
> >>>>> files and versions below.
> >>>>>
> >>>>> Problem: If I restart both nodes at the same time, or even just start
> >>>>> pacemaker
> >>>>> on both nodes at the same time, the drbd ms resource starts, but both
> >>>>> nodes stay
> >>>>> in slave mode. They'll both stay in slave mode until one of the
> >>>>> following occurs:
> >>>>>
> >>>>> - I manually type "crm resource cleanup <ms-resource-name>"
> >>>>>
> >>>>> - 15 minutes elapse. Then the "PEngine Recheck Timer" is fired, and the
> >>>>> ms
> >>>>> resources are promoted.
> >>>>>
> >>>>> The key resource definitions:
> >>>>>
> >>>>> primitive AdminDrbd ocf:linbit:drbd \
> >>>>>         params drbd_resource="admin" \
> >>>>>         op monitor interval="59s" role="Master" timeout="30s" \
> >>>>>         op monitor interval="60s" role="Slave" timeout="30s" \
> >>>>>         op stop interval="0" timeout="100" \
> >>>>>         op start interval="0" timeout="240" \
> >>>>>         meta target-role="Master"
> >>>>> ms AdminClone AdminDrbd \
> >>>>>         meta master-max="2" master-node-max="1" clone-max="2" \
> >>>>>         clone-node-max="1" notify="true" interleave="true"
> >>>>> # The lengthy definition of "FilesystemGroup" is in the crm pastebin
> >>>>> below
> >>>>> clone FilesystemClone FilesystemGroup \
> >>>>>         meta interleave="true" target-role="Started"
> >>>>> colocation Filesystem_With_Admin inf: FilesystemClone AdminClone:Master
> >>>>> order Admin_Before_Filesystem inf: AdminClone:promote
> >>>>> FilesystemClone:start
> >>>>>
> >>>>> Note that I stuck in "target-role" options to try to solve the problem;
> >>>>> no effect.
> >>>>>
> >>>>> When I look in /var/log/messages, I see no error messages or
> >>>>> indications why the
> >>>>> promotion should be delayed. The 'admin' drbd resource is reported as
> >>>>> UpToDate
> >>>>> on both nodes. There are no error messages when I force the issue with:
> >>>>>
> >>>>> crm resource cleanup AdminClone
> >>>>>
> >>>>> It's as if pacemaker, at start, needs some kind of "kick" after the drbd
> >>>>> resource is ready to be promoted.
> >>>>>
> >>>>> This is not just an abstract case for me. At my site, it's not uncommon
> >>>>> for
> >>>>> there to be lengthy power outages that will bring down the cluster.
> >>>>> Both systems
> >>>>> will come up when power is restored, and I need for cluster services to
> >>>>> be
> >>>>> available shortly afterward, not 15 minutes later.
> >>>>>
> >>>>> Any ideas?
> >>>>
> >>>> Not without any logs
> >>>
> >>> Sure! Here's an extract from the log: <http://pastebin.com/L1ZnsQ0R>
> >>>
> >>> Before you click on the link (it's a big wall of text),
> >>
> >> I'm used to trawling the logs. Grep is a wonderful thing :-)
> >>
> >> At this stage it is apparent that I need to see
> >> /var/lib/pengine/pe-input-4.bz2 from hypatia-corosync.
> >> Do you have this file still?
> >
> > No, so I re-ran the test. Here's the log extract from the test I did today
> > <http://pastebin.com/6QYH2jkf>.
> >
> > Based on what you asked for from the previous extract, I think what you want
> > from this test is pe-input-5. Just to play it safe, I copied and bunzip2'ed
> > all
> > three pe-input files mentioned in the log messages:
> >
> > pe-input-4: <http://pastebin.com/Txx50BJp>
> > pe-input-5: <http://pastebin.com/zzppL6DF>
> > pe-input-6: <http://pastebin.com/1dRgURK5>
> >
> > I pray to the gods of Grep that you find a clue in all of that!
> >
> >>> here are what I think
> >>> are the landmarks:
> >>>
> >>> - The extract starts just after the node boots, at the start of syslog at
> >>> time
> >>> 10:49:21.
> >>> - I've highlighted when pacemakerd starts, at 10:49:46.
> >>> - I've highlighted when drbd reports that the 'admin' resource is
> >>> UpToDate, at
> >>> 10:50:10.
> >>> - One last highlight: When pacemaker finally promotes the drbd resource to
> >>> Primary on both nodes, at 11:05:11.
> >>>
> >>>> Details:
> >>>>>
> >>>>> # rpm -q kernel cman pacemaker drbd
> >>>>> kernel-2.6.32-220.4.1.el6.x86_64
> >>>>> cman-3.0.12.1-23.el6.x86_64
> >>>>> pacemaker-1.1.6-3.el6.x86_64
> >>>>> drbd-8.4.1-1.el6.x86_64
> >>>>>
> >>>>> Output of crm_mon after two-node reboot or pacemaker restart:
> >>>>> <http://pastebin.com/jzrpCk3i>
> >>>>> cluster.conf: <http://pastebin.com/sJw4KBws>
> >>>>> "crm configure show": <http://pastebin.com/MgYCQ2JH>
> >>>>> "drbdadm dump all": <http://pastebin.com/NrY6bskk>
> >
> > --
> > Bill Seligman             | Phone: (914) 591-2823
> > Nevis Labs, Columbia Univ | mailto://[email protected]
> > PO Box 137                |
> > Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
> >
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
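[Editor's note] An aside on the 15-minute delay reported in this thread: it matches Pacemaker's default cluster-recheck-interval, which is when the "PEngine Recheck Timer" fires. Shortening it is only a workaround for the stalled promotion, not a fix, but as a sketch:

```shell
# Workaround sketch: have the policy engine re-evaluate the cluster more
# often than the 15-minute default, so a stalled transition is retried sooner.
crm configure property cluster-recheck-interval="2min"
```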
------------------------------
Thanks Emmanuel, I will take a look at ipmilan. BTW, which Linux distribution
did you use with ipmilan?
Tom
Message: 3
Date: Thu, 12 Apr 2012 09:51:05 +0200
From: emmanuel segura <[email protected]>
Subject: Re: [Linux-HA] Problem with stonith RA using external/ipmi
over lan or lanplus interface
To: General Linux-HA mailing list <[email protected]>
Message-ID:
<cae7pj3b_9ca3juenh+hdpse4t1xcsggs4qs3oxgbdzngkbh...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
If I remember well, to use the iLO 3 card you should use the cluster agent
ipmilan.
On 11 April 2012 23:00, Pham, Tom <[email protected]> wrote:
> Hi everyone,
>
> I am trying to test a two-node cluster with a stonith resource using
> external/ipmi (I tried external/riloe first, but it does not seem to work).
> My system: HP ProLiant BL460c G7 with iLO 3 card, firmware 1.15
> SUSE 11
> Corosync version 1.2.7; Pacemaker 1.0.9
>
> When I use the lan or lanplus interface, the stonith resource fails to
> start. I get the error below:
> external/ipmi[12173]: [12184]: ERROR: error executing ipmitool: Error:
> Unable to establish IPMI v2 / RMCP+ session Unable to get Chassis Power
> Status
>
> However, when I used interface = open instead of lan/lanplus, the stonith
> resource started fine. When I tried kill -9 on corosync on node1, I
> expected it to reboot node1 and start all resources on node2, but it only
> rebooted node1. Someone mentioned that the open interface is a local
> interface and only allows a node to fence itself.
>
> Does anyone know why lan/lanplus does not work?
>
> Thanks
>
> Tom Pham
>
--
this is my life and I live it for as long as God wills
------------------------------
Message: 4
Date: Thu, 12 Apr 2012 12:00:12 +0200
From: Lars Ellenberg <[email protected]>
Subject: Re: [Linux-HA] ocf:heartbeat:apache resource agent and
timeouts
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii
On Sun, Apr 08, 2012 at 03:03:58PM +0200, David Gubler wrote:
> On 05.04.2012 17:14, Dejan Muhamedagic wrote:
> > Hmm, the process running the monitor operation should be removed
> > (killed) by lrmd on timeout. If that doesn't happen, then you
> > just hit a jackpot bug!
>
> Ok, that's crucial information I've been missing, and thus I
> misinterpreted my test results. Back to square one...
>
> TEST 1: *Unpatched* Apache resource agent with this configuration:
>
> root@node2:/etc/ha.d# crm configure show
> node $id="aa9dea56-ae1e-42a9-a37b-f7c9f5dc5860" node1
> node $id="aec6cf09-e141-415d-8957-a7b94e09df7f" node2
> primitive apache ocf:heartbeat:apache \
> params statusurl="http://localhost/server-status" \
> op monitor interval="15s" timeout="5s" \
> meta is-managed="false"
> clone apacheClone apache
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1333886776"
>
>
> crm_mon shows
> Clone Set: apacheClone [apache]
> apache:0 (ocf::heartbeat:apache): Started node2 (unmanaged)
> apache:1 (ocf::heartbeat:apache): Started node1 (unmanaged)
> Thus all is well.
Nothing is well.
They are "unmanaged" already ...
Which means the cluster will still attempt to monitor for changes,
but will not take action.
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
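[Editor's note] For reference, the per-resource unmanaged state discussed above can also be toggled from the crm shell (resource name as in the configuration quoted earlier):

```shell
# Toggle the is-managed meta attribute via the crm shell:
crm resource unmanage apache   # cluster keeps monitoring but takes no action
crm resource manage apache     # cluster resumes start/stop/recovery
```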
------------------------------
Message: 5
Date: Thu, 12 Apr 2012 12:06:54 +0200
From: Lars Ellenberg <[email protected]>
Subject: Re: [Linux-HA] ocf:heartbeat:apache resource agent and
timeouts
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii
On Sun, Apr 08, 2012 at 03:16:17PM +0200, David Gubler wrote:
> Hi Lars,
>
> On 05.04.2012 18:53, Lars Ellenberg wrote:
> > Uhm, "invalid test case".
> >
> > rather try:
> > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
> > or even
> > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with tcp-reset
> Yes, then it works, but that's not surprising, because in this case the
> operations return immediately and never time out. But why should a
> non-responsive apache be an invalid test case? We've reached apache's
> connection limit more than once, and from the client's point of view
> this produces a very similar effect to '-j DROP'.
>
>
> > Pacemaker behaviour is just the same,
> > whether a monitor action "timed out", or "failed".
>
> I've come to the conclusion that this just isn't true, please see my
> other mail, I've listed all the steps I did in detail.
>
>
> >
> > After the monitor action timed out or failed,
> > the recovery action by pacemaker would be to stop the service,
> > and restart it (there or elsewhere).
> >
> > Did that not happen?
> >
> > The start operation of the apache RA internally does monitor as well,
> > so it likely times out as well.
> >
> > I'd expect the cluster to move the unresponsive apache to some other
> > node, after monitor and restart timed out. Which I think is the right
> > thing to do.
>
> I'm using unmanaged resources, because for our application there's no
> point in having Pacemaker shut down apache (apache can be used on all
> hosts in parallel and without restrictions). So no stop/start for us.
Right. So the resources are not managed.
Did you mention that before?
I won't argue with that; if you think that is how it should be, so be it.
By default, Pacemaker does not monitor resources that are supposed to be
stopped, to see whether they "revive on their own".
I suggest you add a "monitor" action for role="Stopped"
(with a different interval!).
So the better subject would have been:
How to configure Pacemaker to monitor (unmanaged) stopped resources
in case they resurrect on their own?
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
------------------------------
Message: 6
Date: Thu, 12 Apr 2012 12:22:42 +0200
From: Lars Ellenberg <[email protected]>
Subject: Re: [Linux-HA] ocf:heartbeat:apache resource agent and
timeouts
To: [email protected]
Message-ID: <[email protected]>
Content-Type: text/plain; charset=us-ascii
On Thu, Apr 12, 2012 at 12:06:54PM +0200, Lars Ellenberg wrote:
> On Sun, Apr 08, 2012 at 03:16:17PM +0200, David Gubler wrote:
> > Hi Lars,
> >
> > On 05.04.2012 18:53, Lars Ellenberg wrote:
> > > Uhm, "invalid test case".
> > >
> > > rather try:
> > > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT
> > > or even
> > > iptables -I INPUT -p tcp --dport 80 -i lo -j REJECT --reject-with
> > > tcp-reset
> > Yes, then it works, but that's not surprising, because in this case the
> > operations return immediately and never time out. But why should a
> > non-responsive apache be an invalid test case? We've reached apache's
> > connection limit more than once, and from the client's point of view
> > this produces a very similar effect to '-j DROP'.
> >
> >
> > > Pacemaker behaviour is just the same,
> > > whether a monitor action "timed out", or "failed".
> >
> > I've come to the conclusion that this just isn't true, please see my
> > other mail, I've listed all the steps I did in detail.
> >
> >
> > >
> > > After the monitor action timed out or failed,
> > > the recovery action by pacemaker would be to stop the service,
> > > and restart it (there or elsewhere).
> > >
> > > Did that not happen?
> > >
> > > The start operation of the apache RA internally does monitor as well,
> > > so it likely times out as well.
> > >
> > > I'd expect the cluster to move the unresponsive apache to some other
> > > node, after monitor and restart timed out. Which I think is the right
> > > thing to do.
> >
> > I'm using unmanaged resources, because for our application there's no
> > point in having Pacemaker shut down apache (apache can be used on all
> > hosts in parallel and without restrictions). So no stop/start for us.
>
> Right. So the resources are not managed.
> Did you mention that before?
Hm. So you did. Guess my auto-correction while reading dropped that line...
primitive apache ocf:heartbeat:apache \
        params testconffile="/etc/ha.d/doodletest.pm" testname="doodle" \
        op monitor interval="30" timeout="20" \
        op monitor interval="31" timeout="20" role="Stopped" \
        meta is-managed="false"
I think that "monitor role=Stopped" thing works for primitives.
It may work for clones; I'd have to double-check that.
IIRC, it does not work for ms resources.
At least not last time I checked.
> I won't argue with that, if you think that is how it should be, so be it.
>
> Pacemaker does not monitor resources that are supposed to
> be stopped for "reviving on their own".
> Not by default, at least.
>
> I suggest you add a "monitor" action for "role=Stopped"
> (with a different interval!)
>
> So the better subject would have been
> How to configure Pacemaker to monitor (unmanaged) stopped resources
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
------------------------------
Hi Nikita,
Thanks for your response. Yes, I did start ipmi on both nodes. I checked it
with the lsmod and chkconfig commands and see
ipmi_devintf 8183 0
ipmi_si 43402 0
ipmi_msghandler
When I tried "ipmitool -I lan -U root -H ip -a chassis power cycle", it did
not work, but it worked with the -I open interface.
What should I do to enable lan/lanplus on SUSE 11?
Tom
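[Editor's note] One hedged sketch of enabling IPMI-over-LAN with ipmitool, assuming the BMC's LAN channel is channel 2 and the user ID is 2 (both vary by board, and on iLO 3 the network side is normally managed through the iLO interface itself; check "channel info" and "user list" first):

```shell
# Assumed channel/user IDs; verify them before changing anything:
#   ipmitool -I open channel info 2
#   ipmitool -I open user list 2
ipmitool -I open lan set 2 ipsrc static
ipmitool -I open lan set 2 ipaddr 10.0.0.11      # placeholder address
ipmitool -I open lan set 2 access on
ipmitool -I open user set password 2 'secret'    # placeholder password
```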
Message: 7
Date: Thu, 12 Apr 2012 13:03:15 +0200
From: Nikita Michalko <[email protected]>
Subject: Re: [Linux-HA] Problem with stonith RA using external/ipmi
over lan or lanplus interface
To: "General Linux-HA mailing list" <[email protected]>
Message-ID: <[email protected]>
Content-Type: Text/Plain; charset="iso-8859-1"
Hi,
did you properly configure ipmi with ipmitool on BOTH nodes? And is ipmi
started?
/etc/init.d/ipmi start
What does this command say:
ipmitool -I lan -H IP_OF_OTHER_NODE -U SOMEUSER -A MD5 -P SOMEPASSWORD power status
HTH
Nikita Michalko
On Wednesday, 11 April 2012 23:00:41, Pham, Tom wrote:
> Hi everyone,
>
> I am trying to test a two-node cluster with a stonith resource using
> external/ipmi (I tried external/riloe first, but it does not seem to work).
> My system: HP ProLiant BL460c G7 with iLO 3 card, firmware 1.15
> SUSE 11
> Corosync version 1.2.7; Pacemaker 1.0.9
>
> When I use the lan or lanplus interface, the stonith resource fails to
> start. I get the error below: external/ipmi[12173]: [12184]: ERROR:
> error executing ipmitool: Error: Unable to establish IPMI v2 / RMCP+
> session Unable to get Chassis Power Status
>
> However, when I used interface = open instead of lan/lanplus, the stonith
> resource started fine. When I tried kill -9 on corosync on node1, I
> expected it to reboot node1 and start all resources on node2, but it only
> rebooted node1. Someone mentioned that the open interface is a local
> interface and only allows a node to fence itself.
>
> Does anyone know why lan/lanplus does not work?
>
> Thanks
>
> Tom Pham
>
------------------------------
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
End of Linux-HA Digest, Vol 101, Issue 11
*****************************************