I am very interested in the SIGXCPU problem. I cannot deploy my solution until it is resolved.
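For anyone who wants to check whether their own nodes carry the CPU time limit Daniel describes below, here is a minimal sketch that pulls the "Max cpu time" row out of text in the /proc/<pid>/limits format. The helper name cpu_time_limits is my own, not part of heartbeat, and the example only inspects the current shell; on a cluster node you would substitute the heartbeat PID.

```shell
#!/bin/sh
# Print the soft and hard "Max cpu time" values from text in the format of
# /proc/<pid>/limits, read on standard input.
# cpu_time_limits is a hypothetical helper name, not part of heartbeat.
cpu_time_limits() {
    # Columns: "Max" "cpu" "time" <soft> <hard> <units>
    awk '/^Max cpu time/ { print $4, $5 }'
}

# Example: inspect the current shell. Replace $$ with the heartbeat PID
# on a node to see whether a limit is set there.
if [ -r "/proc/$$/limits" ]; then
    cpu_time_limits < "/proc/$$/limits"
fi
```

A value of "unlimited" in the first column means no soft CPU limit is set for that process, so it would not be sent SIGXCPU for exceeding one.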
On Fri, Jan 7, 2011 at 11:41 AM, Daniel Krambrock <[email protected]> wrote:
> hi there,
>
> i think we found the reason for the syntax error in ping RA:
> the crash of heartbeat had produced a coredump
> in /var/lib/heartbeat/cores/root , which is the working directory of
> the ping RA. ping RA makes use of an unquoted * symbol:
>
> score=`expr $active * $OCF_RESKEY_multiplier`
>
> since we have a coredump in the working directory, the shell is
> misunderstanding the * symbol.
> A patch for that would be:
>
> --- ping 2011-01-07 18:31:35.000000000 +0100
> +++ ping.new 2011-01-07 18:32:50.000000000 +0100
> @@ -241,7 +241,7 @@
> *) ocf_log err "Unexpected result for '$p_exe $p_args
> $OCF_RESKEY_options $host' $rc: $p_out";;
> esac
> done
> - score=`expr $active * $OCF_RESKEY_multiplier`
> + score=`expr $active \* $OCF_RESKEY_multiplier`
> attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
> }
>
>
> besides that we still have the SIGXCPU problem.
>
> bests
>
> daniel
>
>
> On Friday, 07.01.2011, at 12:12 +0100, Daniel Krambrock wrote:
> > hi there,
> >
> > we have got a 12 node cluster for managing KVM based virtual machines.
> > we are using fedora 12 for the node systems with pacemaker
> > (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
> > (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).
> >
> > we had a crash of heartbeat with SIGXCPU
> >
> > Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
> > 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> > Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
> > 25702 dumped core
> > Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
> > Beginning communications restart process for comm channel 0.
> > Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
> > 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
> > Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
> > for channel 0 have died. Restarting.
> > Jan 2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
> > heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
> > (ttl=1 loop=0)
> > Jan 2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
> > succeeded.
> > Jan 2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
> > 0xffffffff
> > Jan 2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
> > 0xffffffff
> >
> > we figured out that if debug mode is turned on, heartbeat is setting a
> > max cpu time limit to 4143 (you can see that in the
> > cat /proc/<heartbeat-pid>/limits file). if debug mode is turned off you
> > don't have that limit.
> >
> > directly after the heartbeat crash the pacemaker ping RA is not working
> > any more, it is producing only syntax errors:
> >
> > Jan 2 01:21:24 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan 2 01:21:24 node09 attrd_updater: [22148]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> > Jan 2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update:
> > Connecting to cluster... 5 retries remaining
> > Jan 2 01:21:38 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan 2 01:21:38 node09 attrd_updater: [22172]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> > Jan 2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update:
> > Connecting to cluster... 5 retries remaining
> > Jan 2 01:21:52 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan 2 01:21:52 node09 attrd_updater: [22191]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> >
> > on every machine that had that SIGXCPU crash ping RA is not working any
> > more.
> >
> > my questions are:
> > - do we have to turn debug mode off to get rid of the max cpu time
> > limit? is that the right thing to do, or are we using too much cpu time
> > for the heartbeat process?
> > - how to fix the ping RA? is my cluster somehow screwed up, that ping RA
> > is not working any more?
> >
> > bests
> >
> > daniel
> >
> >
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
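
The unquoted * described above only breaks once a file shows up in the working directory: with no files to match, the shell passes * through to expr unchanged. The following is a small reproduction sketch, not the RA code itself. The scratch directory, the file name core.12345 and the values of active and multiplier are all made up for illustration.

```shell
#!/bin/sh
# Reproduce the failure in a scratch directory that contains one file,
# standing in for the coredump (the file name is made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
: > core.12345

active=3        # hypothetical number of reachable hosts
multiplier=5    # hypothetical stand-in for $OCF_RESKEY_multiplier

# Unquoted: the shell expands * to "core.12345" before expr runs, so expr
# reports a syntax error on stderr and prints nothing on stdout.
unquoted=$(expr $active * $multiplier 2>/dev/null) || true

# Escaped, as in the patch: expr receives a literal * and multiplies.
escaped=$(expr "$active" \* "$multiplier")

# Shell arithmetic avoids expr and pathname expansion altogether.
arith=$((active * multiplier))

echo "unquoted='$unquoted' escaped=$escaped arith=$arith"
# prints: unquoted='' escaped=15 arith=15

cd / && rm -rf "$workdir"
```

The empty result in the unquoted case matches the logged call "attrd_updater -n pingd_stornet -v -d 5s", where no value follows -v. Whether to escape the * as in the patch or switch to $(( )) is a style choice; both give the same score here.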
