hi there,

i think we found the reason for the syntax error in the ping RA:
the heartbeat crash produced a core dump
in /var/lib/heartbeat/cores/root, which is the working directory of
the ping RA. the ping RA uses an unquoted * symbol:

score=`expr $active * $OCF_RESKEY_multiplier`

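since there is now a core dump in the working directory, the shell expands
the unquoted * to the file name(s) found there, so expr gets a file name
instead of the multiplication operator and fails with a syntax error.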
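
the effect can be reproduced outside the cluster. the following is only a
sketch: the scratch directory, the file name core.25702 and the values of
active and OCF_RESKEY_multiplier are made up for illustration and are not
taken from the RA. as long as the directory is empty, an unmatched * stays
a literal *, which would explain why the RA worked before the crash.

```shell
# Sketch: show how an unquoted * breaks expr once a file exists in the cwd.
# Runs in a subshell ( ... ) so the caller's working directory is untouched.
demo_glob() (
    demo_dir=$(mktemp -d) || exit 1
    cd "$demo_dir" || exit 1
    touch core.25702              # hypothetical stand-in for the core dump

    active=2                      # illustrative values, not from the RA
    OCF_RESKEY_multiplier=1000

    # Unquoted: the shell turns * into "core.25702" before expr runs,
    # so expr is called as "expr 2 core.25702 1000" and prints nothing
    # on stdout (the syntax error goes to stderr, discarded here).
    broken=`expr $active * $OCF_RESKEY_multiplier 2>/dev/null` || true

    # Escaped: expr receives a literal * and multiplies.
    fixed=`expr $active \* $OCF_RESKEY_multiplier`

    echo "broken='$broken' fixed='$fixed'"
    cd / && rm -rf "$demo_dir"
)

demo_glob
```

the empty value in the first case matches the empty "-v" argument in the
attrd_updater lines of the log.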
A patch for that would be:

--- ping        2011-01-07 18:31:35.000000000 +0100
+++ ping.new    2011-01-07 18:32:50.000000000 +0100
@@ -241,7 +241,7 @@
            *) ocf_log err "Unexpected result for '$p_exe $p_args $OCF_RESKEY_options $host' $rc: $p_out";;
        esac
     done
-    score=`expr $active * $OCF_RESKEY_multiplier`
+    score=`expr $active \* $OCF_RESKEY_multiplier`
     attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
 }
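
as an alternative to escaping the *, the multiplication could be done with
POSIX arithmetic expansion, where * is never treated as a file name pattern.
this is not what the patch above does, only a sketch of another option; the
function name compute_score is hypothetical and not part of the RA.

```shell
# Sketch: multiply without expr, so no glob expansion can interfere.
# compute_score is a hypothetical helper, not a function of the ping RA.
compute_score() {
    active=$1
    multiplier=$2
    # Inside $(( )) the * is an arithmetic operator, never a glob pattern.
    echo $((active * multiplier))
}

score=$(compute_score 2 1000)
echo "$score"
```

in the RA itself the call would presumably look like
score=$(compute_score "$active" "$OCF_RESKEY_multiplier").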


apart from that, we still have the SIGXCPU problem.
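
for anyone who wants to check whether their heartbeat process has the cpu
time limit set, a small helper like the one below reads it from
/proc/<pid>/limits. it is a sketch that assumes a Linux /proc file system;
the function name cpu_limit_of is made up here.

```shell
# Sketch: print the soft "Max cpu time" limit of a process (Linux only).
# cpu_limit_of is a hypothetical helper name.
cpu_limit_of() {
    # The row looks like: "Max cpu time  <soft>  <hard>  seconds",
    # so the soft limit is the fourth whitespace-separated field.
    awk '/^Max cpu time/ { print $4 }' "/proc/$1/limits"
}

# Example: show the limit of the current shell.
cpu_limit_of "$$"
```

for heartbeat, pass the pid of the heartbeat process instead of $$, for
example the master process pid that appears in the log lines.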

bests

daniel


On Friday, 07.01.2011, at 12:12 +0100, Daniel Krambrock wrote:
> hi there,
> 
> we have a 12-node cluster for managing KVM-based virtual machines.
> we are using fedora 12 on the nodes with pacemaker
> (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
> (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).
> 
> we had a crash of heartbeat with SIGXCPU
> 
> Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
> 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
> 25702 dumped core
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
> Beginning communications restart process for comm channel 0.
> Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
> 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
> for channel 0 have died.  Restarting.
> Jan  2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
> heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
> (ttl=1 loop=0)
> Jan  2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
> succeeded.
> Jan  2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
> 0xffffffff
> Jan  2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
> 0xffffffff
> 
> we figured out that if debug mode is turned on, heartbeat sets a
> max cpu time limit of 4143 (you can see that with
> cat /proc/<heartbeat-pid>/limits). if debug mode is turned off, you
> don't have that limit.
> 
> directly after the heartbeat crash the pacemaker ping RA stops working;
> it only produces syntax errors:
> 
> Jan  2 01:21:24 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:24 node09 attrd_updater: [22148]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s 
> Jan  2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan  2 01:21:38 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:38 node09 attrd_updater: [22172]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s 
> Jan  2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan  2 01:21:52 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:52 node09 attrd_updater: [22191]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
> 
> on every machine that had the SIGXCPU crash, the ping RA is not working
> any more.
> 
> my questions are:
> - do we have to turn debug mode off to get rid of the max cpu time
> limit? is that the right thing to do, or are we using too much cpu time
> for the heartbeat process?
> - how do we fix the ping RA? is my cluster somehow broken, given that
> the ping RA is not working any more?
> 
> bests
> 
> daniel
> 


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
