Re: [Linux-HA] Heartbeat dies with SIGXCPU, pacemaker ping RA syntax error

Igor Chudov Fri, 07 Jan 2011 04:45:50 -0800

I have the same problem (on Ubuntu).

Very interested in an answer.


i

On Fri, Jan 7, 2011 at 5:12 AM, Daniel Krambrock <[email protected]> wrote:

> hi there,
>
> we have a 12-node cluster for managing KVM-based virtual machines.
> we are using fedora 12 for the node systems with pacemaker
> (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
> (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).
>
> we had a crash of heartbeat with SIGXCPU
>
> Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
> 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
> 25702 dumped core
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
> Beginning communications restart process for comm channel 0.
> Jan  2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
> 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
> Jan  2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
> for channel 0 have died.  Restarting.
> Jan  2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
> heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
> (ttl=1 loop=0)
> Jan  2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
> succeeded.
> Jan  2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
> 0xffffffff
> Jan  2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
> 0xffffffff
>
> we figured out that if debug mode is turned on, heartbeat sets a
> max cpu time limit of 4143 (you can see that with
> cat /proc/<heartbeat-pid>/limits). if debug mode is turned off, you
> don't have that limit.
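
A minimal sketch for checking that limit from a script, assuming the usual Linux layout of /proc/<pid>/limits (a header row, then columns separated by runs of spaces). The helper names parse_limits and cpu_time_limit are hypothetical, not part of heartbeat or pacemaker:

```python
import os
import re


def parse_limits(text):
    """Parse the contents of a /proc/<pid>/limits file into a dict.

    Maps limit name -> (soft, hard) as strings, e.g. "unlimited" or "4143".
    """
    limits = {}
    for line in text.splitlines()[1:]:  # skip the header row
        # Columns are separated by runs of two or more spaces;
        # the limit names themselves contain only single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def cpu_time_limit(pid):
    """Return the (soft, hard) 'Max cpu time' limit of a running process."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_limits(fh.read())["Max cpu time"]


if __name__ == "__main__":
    # Hypothetical usage: inspects this script's own process. Substitute
    # the heartbeat PID to check the daemon on a cluster node.
    print(cpu_time_limit(os.getpid()))
```

Running this against the heartbeat PID with debug mode on and then off should show whether the soft limit changes between the two.
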
>
> directly after the heartbeat crash the pacemaker ping RA stops
> working; it produces only syntax errors:
>
> Jan  2 01:21:24 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:24 node09 attrd_updater: [22148]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
> Jan  2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan  2 01:21:38 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:38 node09 attrd_updater: [22172]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
> Jan  2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan  2 01:21:52 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan  2 01:21:52 node09 attrd_updater: [22191]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
>
> on every machine that had that SIGXCPU crash ping RA is not working any
> more.
>
> my questions are:
> - do we have to turn debug mode off to get rid of the max cpu time
> limit? is that the right thing to do, or are we using too much cpu time
> for the heartbeat process?
> - how do we fix the ping RA? is my cluster somehow screwed up, given
> that the ping RA is not working any more?
>
> bests
>
> daniel
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
