I am very interested in the SIGXCPU problem.

I cannot deploy my solution while it persists.


On Fri, Jan 7, 2011 at 11:41 AM, Daniel Krambrock <[email protected]> wrote:

> hi there,
>
> i think we found the reason for the syntax error in the ping RA:
> the crash of heartbeat produced a coredump
> in /var/lib/heartbeat/cores/root, which is the working directory of
> the ping RA. the ping RA uses an unquoted * symbol:
>
> score=`expr $active * $OCF_RESKEY_multiplier`
>
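> since there is a coredump in the working directory, the shell expands
> the unquoted * to the file names in that directory, so expr receives a
> file name instead of the multiplication operator and fails.
> a patch for that would be: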
>
> --- ping        2011-01-07 18:31:35.000000000 +0100
> +++ ping.new    2011-01-07 18:32:50.000000000 +0100
> @@ -241,7 +241,7 @@
>            *) ocf_log err "Unexpected result for '$p_exe $p_args
> $OCF_RESKEY_options $host' $rc: $p_out";;
>        esac
>     done
> -    score=`expr $active * $OCF_RESKEY_multiplier`
> +    score=`expr $active \* $OCF_RESKEY_multiplier`
>     attrd_updater -n $OCF_RESKEY_name -v $score -d $OCF_RESKEY_dampen
>  }
>
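
For anyone who wants to see the effect outside the cluster, here is a minimal sketch that reproduces the problem in a scratch directory. The file name core.12345 and the values 3 and 2 are made up for illustration; only the two expr lines mirror the ping RA. Note that the logs further down show attrd_updater being called with an empty -v value, which matches expr failing and printing nothing to stdout.

```shell
#!/bin/sh
# Sketch only: core.12345 and the numbers below are illustrative assumptions.
demo_dir=$(mktemp -d)
touch "$demo_dir/core.12345"   # stands in for the coredump left by heartbeat

active=3
multiplier=2                   # stands in for $OCF_RESKEY_multiplier

# Unquoted *: the shell replaces it with "core.12345" before expr runs,
# so expr is called as "expr 3 core.12345 2" and reports a syntax error.
broken=$(cd "$demo_dir" && expr $active * $multiplier 2>&1) || true
echo "unquoted: $broken"

# Escaped \*: expr receives a literal * and multiplies.
fixed=$(cd "$demo_dir" && expr "$active" \* "$multiplier")
echo "escaped:  $fixed"

# POSIX shell arithmetic is another option; no glob expansion happens here.
arith=$((active * multiplier))
echo "arith:    $arith"

rm -rf "$demo_dir"
```

The escaped form is what the patch above does. The arithmetic form is only an alternative I would consider, not something the RA uses.
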
>
> besides that, we still have the SIGXCPU problem.
>
> bests
>
> daniel
>
>
> On Friday, 07.01.2011, at 12:12 +0100, Daniel Krambrock wrote:
> > hi there,
> >
> > we have a 12-node cluster for managing KVM-based virtual machines.
> > we are using Fedora 12 for the node systems with pacemaker
> > (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
> > (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).
> >
> > we had a crash of heartbeat with SIGXCPU
> >
> > Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
> > 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> > Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
> > 25702 dumped core
> > Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
> > Beginning communications restart process for comm channel 0.
> > Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
> > 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
> > Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
> > for channel 0 have died.  Restarting.
> > Jan  2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
> > heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
> > (ttl=1 loop=0)
> > Jan  2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
> > succeeded.
> > Jan  2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
> > 0xffffffff
> > Jan  2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
> > 0xffffffff
> >
> > we figured out that if debug mode is turned on, heartbeat sets a
> > max cpu time limit of 4143 seconds (you can see that with
> > cat /proc/<heartbeat-pid>/limits). if debug mode is turned off, that
> > limit is not set.
> >
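
To compare nodes with and without debug mode, a small sketch like the following prints just the relevant row of the limits file. The helper name cpu_limit_of is hypothetical and not part of heartbeat; the example call inspects the current shell, so substitute a real heartbeat PID.

```shell
#!/bin/sh
# Sketch: print the "Max cpu time" row of /proc/<pid>/limits (Linux only).
# cpu_limit_of is a made-up helper name for this example.
cpu_limit_of() {
    grep '^Max cpu time' "/proc/$1/limits"
}

# Example call against the current shell; use a heartbeat PID in practice.
cpu_limit_of "$$"
```

The row lists the soft limit, the hard limit, and the unit. A process that exceeds the soft limit gets SIGXCPU, which is what the HBREAD log line above reports.
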
> > directly after the heartbeat crash, the pacemaker ping RA stopped
> > working; it only produces syntax errors:
> >
> > Jan  2 01:21:24 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan  2 01:21:24 node09 attrd_updater: [22148]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> > Jan  2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update:
> > Connecting to cluster... 5 retries remaining
> > Jan  2 01:21:38 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan  2 01:21:38 node09 attrd_updater: [22172]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> > Jan  2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update:
> > Connecting to cluster... 5 retries remaining
> > Jan  2 01:21:52 node09 lrmd: [31341]: info: RA output:
> > (pingd_stornet:8:monitor:stderr) expr: syntax error
> > Jan  2 01:21:52 node09 attrd_updater: [22191]: info: Invoked:
> > attrd_updater -n pingd_stornet -v -d 5s
> >
> > on every machine that had the SIGXCPU crash, the ping RA no longer
> > works.
> >
> > my questions are:
> > - do we have to turn debug mode off to get rid of the max cpu time
> > limit? is that the right thing to do, or is the heartbeat process
> > using too much cpu time?
> > - how do we fix the ping RA? is my cluster somehow screwed up, so that
> > the ping RA no longer works?
> >
> > bests
> >
> > daniel
> >
>
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
