Andrew Beekhof wrote:
On Wed, Sep 1, 2010 at 2:51 PM, Ron Kerry <rke...@sgi.com> wrote:
> I have taken over working this issue from Vince. The ping clone resource and
> constraints were set up as described in the prior attached link. Things were
> still not working correctly and the resources were not failing over as
> expected when we ifconfig'd one of the monitored interfaces down. I
> discovered a bug in the pacemaker/ping script (from the SLE11 HAE
> distribution) where a "*" in an expr statement had not been quoted and was
> thus being interpreted by the shell.
Also fixed upstream.
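
For anyone hitting the same thing, here is a minimal sketch of that class of bug. The exact line from the ping RA is not shown in this thread, so the variable names and values below are hypothetical and only illustrate why an unquoted "*" breaks an expr call.

```shell
#!/bin/sh
# Hypothetical values; these names are not taken from the ping RA.
active=2
multiplier=1000

# Unquoted: the shell glob-expands * against the current directory before
# expr runs, so expr usually sees file names instead of an operator and
# fails or misbehaves. (In an empty directory the * happens to stay literal,
# which is why this kind of bug can go unnoticed.)
# score=`expr $active * $multiplier`

# Quoted: expr receives a literal * and performs the multiplication.
score=`expr "$active" '*' "$multiplier"`
echo "$score"

# POSIX arithmetic expansion sidesteps both expr and the quoting problem.
score2=$((active * multiplier))
echo "$score2"
```

Either of the last two forms is safe regardless of what files exist in the working directory.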
> I fixed this problem and I was able to
> get a single failover to occur, but after that failover the ping monitor was
> canceled on the node that had the downed interface. Even after configuring
> the interface back up, the monitor task never ran again to notice that fact.
> This essentially leaves that node with a lower score and improper interface
> monitoring. I can clear the problem by stopping and then starting the ping
> clone resource. Note that I have tried pulling up the full ping resource
> agent script from the SLE11 HAE SP1 distribution and that does not improve
> this particular problem (though it fixes a few others).
>
> I have attached the full hb_report output, but here is a log snip of what is
> occurring.
>
> Sep 1 06:43:50 hpcnas2 root: ifconfig eth3 down
> Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #13 eth3, 10.10.20.32#123, interface stats: received=0, sent=0, dropped=0, active_time=42600 secs
> Sep 1 06:43:59 hpcnas2 ntpd[10303]: Deleting interface #15 eth3, 10.10.20.33#123, interface stats: received=0, sent=0, dropped=0, active_time=41100 secs
> Sep 1 06:44:01 hpcnas2 ping[28882]: [28887]: INFO: ping monitor invoked
> Sep 1 06:44:05 hpcnas2 ping[28882]: [28895]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (2000)
> Sep 1 06:44:14 hpcnas2 attrd: [13676]: info: attrd_perform_update: Sent update 56: pingd=2000
> Sep 1 06:44:14 hpcnas2 crmd: [13678]: info: do_lrm_rsc_op: Performing key=34:686:0:bbe666a5-2b9f-4419-9728-803197b6e643 op=NFS_stop_0 )
> Sep 1 06:44:14 hpcnas2 lrmd: [13675]: info: rsc:NFS:83: stop
> ...
> resources failover
> ...
> Sep 1 06:45:09 hpcnas2 ping[29241]: [29246]: INFO: ping monitor invoked
> Sep 1 06:45:13 hpcnas2 ping[29241]: [29254]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 5 10.10.20.30' 2: connect: Network is unreachable
> Sep 1 06:45:17 hpcnas2 crmd: [13678]: info: process_lrm_event: LRM operation ping:1_monitor_60000 (call=82, status=1, cib-update=0, confirmed=true) Cancelled
> Sep 1 06:45:32 hpcnas2 kernel: bnx2: eth3: using MSIX
> Sep 1 06:45:35 hpcnas2 kernel: bnx2: eth3 NIC Copper Link is Up, 1000 Mbps full duplex
> Sep 1 06:45:38 hpcnas2 root: ifconfig eth3 up
> Sep 1 06:48:08 hpcnas2 root: ping monitor appears to be no longer running
>
>
> The concern is the "process_lrm_event: LRM operation ping:1_monitor_60000 ()
> Cancelled" event.
Was the resource stopped? That's the only time I could imagine a
recurring operation being cancelled.
No, it was not stopped. In fact, the "crm_mon" output included with the hb_report output shows
the resource still running on both HA cluster nodes. How can I dig further to figure out what is
canceling the monitor operation, and why?
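
As a starting point, the cancellation events can at least be pulled out of the logs so they can be lined up against other cluster activity. This is only a rough sketch: the function name is made up for illustration, and it assumes each crmd entry sits on a single line in the real syslog file (the snip quoted in this message is wrapped by the mailer).

```shell
#!/bin/sh
# find_cancelled_ops (illustrative name): read log text on stdin and print
# the crmd process_lrm_event entries that report the given operation key
# as Cancelled.
find_cancelled_ops() {
    op_key=$1
    awk -v key="$op_key" '
        index($0, "process_lrm_event") && index($0, key) && index($0, "Cancelled")
    '
}

# Example usage (the log path is an assumption and varies by system):
#   find_cancelled_ops ping:1_monitor_60000 < /var/log/messages
```

The timestamps of the matching entries can then be compared with the transition that was running at that moment.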
> NOTE: The "ping monitor invoked" messages are a debug statement I added to
> the RA script so I know when the ping_monitor() routine is called.
>
> Thanks for any assistance you can provide -- Ron
>
--
Ron Kerry rke...@sgi.com
Field Technical Support - SGI Federal
Home Office: 248 375-5671 Cell: 248 761-7204