On Wed, Sep 1, 2010 at 2:51 PM, Ron Kerry <rke...@sgi.com> wrote:
> I have taken over working this issue from Vince. The ping clone resource and
> constraints were set up as described in the prior attached link. Things were
> still not working correctly and the resources were not failing over as
> expected when we ifconfig'd one of the monitored interfaces down. I
> discovered a bug in the pacemaker/ping script (from the SLE11 HAE
> distribution) where a "*" in an expr statement had not been quoted and was
> thus being interpreted by the shell.
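For readers unfamiliar with this class of bug, here is a minimal sketch of the quoting problem described above. It is not the actual line from the ping resource agent; the variable names `active` and `multiplier` and their values are made up for illustration.

```shell
#!/bin/sh
# Hypothetical values, for illustration only (not taken from the ping RA).
active=2          # number of hosts that answered
multiplier=1000   # score contributed by each reachable host

# Broken form (left commented out): the shell expands the bare * into the
# file names of the current directory before expr ever sees it.
#   score=$(expr $active * $multiplier)

# Fixed form: escape the * so expr receives it as the multiplication operator.
score=$(expr "$active" \* "$multiplier")
echo "$score"

# POSIX arithmetic expansion sidesteps the problem, since no globbing
# happens inside $(( )).
score2=$(( active * multiplier ))
echo "$score2"
```

Whether the broken form fails outright or silently computes something else depends on what files happen to be in the working directory, which is likely why this kind of bug can go unnoticed.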
Also fixed upstream.

> I fixed this problem and I was able to
> get a single failover to occur, but after that failover the ping monitor was
> canceled on the node that had the downed interface. Even after configuring
> the interface back up, the monitor task never ran again to notice that fact.
> This essentially leaves that node with a lower score and improper interface
> monitoring. I can clear the problem by stopping and then starting the ping
> clone resource. Note that I have tried pulling up the full ping resource
> agent script from the SLE11 HAE SP1 distribution and that does not improve
> this particular problem (though it fixes a few others).
>
> I have attached the full hb_report output, but here is a log snip of what is
> occurring.
>
> Sep 1 06:43:50 hpcnas2 root: ifconfig eth3 down
> Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #13 eth3, 10.10.20.32#123, interface stats: received=0, sent=0, dropped=0, active_time=42600 secs
> Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #15 eth3, 10.10.20.33#123, interface stats: received=0, sent=0, dropped=0, active_time=41100 secs
> Sep 1 06:44:01 hpcnas2 ping[28882]: [28887]: INFO: ping monitor invoked
> Sep 1 06:44:05 hpcnas2 ping[28882]: [28895]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (2000)
> Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_perform_update: Sent update 56: pingd=2000
> Sep 1 06:44:14 hpcnas2 crmd: [13678]: info: do_lrm_rsc_op: Performing key=34:686:0:bbe666a5-2b9f-4419-9728-803197b6e643 op=NFS_stop_0 )
> Sep 1 06:44:14 hpcnas2 lrmd: [13675]: info: rsc:NFS:83: stop
> ...
> resources failover
> ...
> Sep 1 06:45:09 hpcnas2 ping[29241]: [29246]: INFO: ping monitor invoked
> Sep 1 06:45:13 hpcnas2 ping[29241]: [29254]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep 1 06:45:17 hpcnas2 crmd: [13678]: info: process_lrm_event: LRM operation ping:1_monitor_60000 (call=82, status=1, cib-update=0, confirmed=true) Cancelled
> Sep 1 06:45:32 hpcnas2 kernel: bnx2: eth3: using MSIX
> Sep 1 06:45:35 hpcnas2 kernel: bnx2: eth3 NIC Copper Link is Up, 1000 Mbps full duplex
> Sep 1 06:45:38 hpcnas2 root: ifconfig eth3 up
> Sep 1 06:48:08 hpcnas2 root: ping monitor appears to be no longer running
>
> The concern is the "process_lrm_event: LRM operation ping:1_monitor_60000 ()
> Cancelled" event.

Was the resource stopped?
That's the only time I could imagine a recurring operation being cancelled.

> NOTE: The "ping monitor invoked" messages are a debug statement I added to
> the RA script so I know when the ping_monitor() routine is called.
>
> Thanks for any assistance you can provide -- Ron
>
> Nate Pearlstein wrote:
>>
>> Subject: Re: [Pacemaker] IPaddr2 not failing-over
>> From: "Andrew Beekhof" <and...@beekhof.net>
>> Date: Thu, 26 Aug 2010 02:47:46 -0500
>> To: "The Pacemaker cluster resource manager" <pacemaker@oss.clusterlabs.org>
>>
>> On Wed, Aug 11, 2010 at 10:55 PM, Vince Gabriel <vin...@sgi.com> wrote:
>> > Hi everyone,
>> >
>> > I have a new cluster that works exceptionally well with the exception of
>> > the IPaddr2 virtual interface initiated failovers. If the interface is
>> > downed or the cable disconnected, a failover never happens. I’ve attempted to
>> > incorporate pingd, however that has not helped either. It’s my understanding
>> > a pingd clone should not be needed any longer?
>>
>> If you want to move services based on connectivity, then you need a
>> ping(d) clone and some rules that make use of the properties it sets.
>>
>> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ch09s03s03.html
>>
>> > nas1:~ # rpm -qa | grep hear
>> > heartbeat-resources-3.0.0-0.2.8
>> > heartbeat-common-3.0.0-0.6.5
>> > libheartbeat2-3.0.0-0.6.5
>> >
>> > cnas1:~ # rpm -qa | grep -i pace
>> > pacemaker-pygui-1.99.2-0.2.6
>> > libpacemaker3-1.0.5-0.5.6
>> > pacemaker-1.0.5-0.5.6
>> >
>> > primitive HA3-ip ocf:heartbeat:IPaddr2 \
>> >     operations $id="HA3-ip-operations" \
>> >     op monitor interval="60s" start-delay="0" timeout="30s" on-fail="restart" \
>> >     op start interval="0" timeout="90" on-fail="restart" requires="fencing" \
>> >     op stop interval="0" timeout="100" on-fail="fence" \
>> >     params ip="10.10.20.33" nic="eth3" cidr_netmask="24" \
>> >     meta resource-stickiness="1" migration-threshold="1"
>> >
>> > It’s my understanding…please correct me if I’m wrong….if the interface fails
>> > it will attempt to restart the interface once,
>>
>> No, only if the resource fails.
>> Your logic only holds if the RA reports failure when the interface fails.
>>
>> > if it happens again the group
>> > it’s associated with should failover to the standby node based on
>> > “migration-threshold="1"”.
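As a rough illustration of the "ping(d) clone and some rules" advice earlier in this thread, a crm configuration along the following lines is one common shape. This is only a sketch under assumptions: the clone name `ping-clone`, the constraint id `HA3-ip-connected`, and the `multiplier` value are invented for the example, and the thread does not show the configuration that was actually used. The host `10.10.20.30`, the 60s monitor interval, and the `pingd` attribute name are the ones visible in the logs.

```
# Sketch only: ids and parameter values below are assumptions, not the poster's config.
primitive ping ocf:pacemaker:ping \
    params host_list="10.10.20.30" multiplier="1000" \
    op monitor interval="60s" timeout="30s"
clone ping-clone ping
# Keep HA3-ip off any node where the pingd attribute is missing or zero.
location HA3-ip-connected HA3-ip \
    rule -inf: not_defined pingd or pingd lte 0
```

If the IP address is part of a group, the location constraint would normally name the group rather than the individual primitive, so that the whole group moves together.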
>> >
>> > Thanks in Advance,
>> >
>> > -Vince
>> >
>> > --
>> > Vince Gabriel
>> > Field Technical Analyst
>> > SGI
>> > office: 361.729.9151
>> > cell: 409.392.8083
>> >
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>
> --
> Ron Kerry rke...@sgi.com
> Field Technical Support - SGI Federal
> Home Office: 248 375-5671 Cell: 248 761-7204
>
> --------------
> NB: Information in this message is SGI confidential. It is intended solely for
> the person(s) to whom it is addressed and may not be copied, used, disclosed or
> distributed to others without SGI consent. If you are not the intended
> recipient please notify me by email or telephone, delete the message from your
> system immediately and destroy any printed copies.