I have the same problem (on Ubuntu). Very interested in an answer.

On Fri, Jan 7, 2011 at 5:12 AM, Daniel Krambrock <[email protected]> wrote:

> hi there,
>
> we have got an 12 node cluster for managing KVM based virtual machines.
> we are using fedora 12 for the node systems with pacemaker
> (pacemaker-1.0.7-1.fc12.x86_64) and heartbeat
> (heartbeat-3.0.0-0.7.0daab7da36a8.hg.fc12.x86_64).
>
> we had a crash of heartbeat with SIGXCPU
>
> Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBREAD process
> 25702 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Managed HBREAD process
> 25702 dumped core
> Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: HBREAD process died.
> Beginning communications restart process for comm channel 0.
> Jan 2 01:21:11 node09 heartbeat: [31328]: WARN: Managed HBWRITE process
> 25701 killed by signal 9 [SIGKILL - Kill, unblockable].
> Jan 2 01:21:11 node09 heartbeat: [31328]: ERROR: Both comm processes
> for channel 0 have died. Restarting.
> Jan 2 01:21:11 node09 heartbeat: [31328]: info: glib: UDP multicast
> heartbeat started for group 239.0.0.4 port 694 interface br_vlan1040
> (ttl=1 loop=0)
> Jan 2 01:21:11 node09 heartbeat: [31328]: info: Communications restart
> succeeded.
> Jan 2 01:21:12 node09 heartbeat: [22135]: info: Stack hogger failed
> 0xffffffff
> Jan 2 01:21:12 node09 heartbeat: [22136]: info: Stack hogger failed
> 0xffffffff
>
> we figured out that if debug mode is turned on, heartbeat is setting a
> max cpu time limit to 4143 (you can see that in the
> cat /proc/<heartbeat-pid>/limits file). if debug mode is turned off you
> dont have that limit.
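
For anyone who wants to check their own nodes for the CPU time limit described above, here is a minimal sketch in Python. It assumes the usual Linux layout of /proc/<pid>/limits (a "Max cpu time" row followed by soft limit, hard limit and units). The function names are my own and are not part of heartbeat or pacemaker.

```python
def parse_cpu_limit(limits_text):
    """Return (soft, hard) from the 'Max cpu time' row of a /proc/<pid>/limits dump.

    Each value is an int (seconds), or None when the kernel reports 'unlimited'.
    """
    label = "Max cpu time"
    for line in limits_text.splitlines():
        if line.startswith(label):
            # The remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(label):].split()[:2]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max cpu time' row found")


def cpu_limit_for_pid(pid):
    """Read and parse the limits file of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as fh:
        return parse_cpu_limit(fh.read())
```

Calling cpu_limit_for_pid() with the PID of the heartbeat master process should show whether a finite limit is set on that node. Reading another user's process will normally require root.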
>
> directly after the heartbeat crash the pacemaker ping RA is not working
> any more, it is producing only syntax errors:
>
> Jan 2 01:21:24 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan 2 01:21:24 node09 attrd_updater: [22148]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
> Jan 2 01:21:24 node09 attrd_updater: [22148]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan 2 01:21:38 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan 2 01:21:38 node09 attrd_updater: [22172]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
> Jan 2 01:21:38 node09 attrd_updater: [22172]: info: attrd_lazy_update:
> Connecting to cluster... 5 retries remaining
> Jan 2 01:21:52 node09 lrmd: [31341]: info: RA output:
> (pingd_stornet:8:monitor:stderr) expr: syntax error
> Jan 2 01:21:52 node09 attrd_updater: [22191]: info: Invoked:
> attrd_updater -n pingd_stornet -v -d 5s
>
> on every machine that had that SIGXCPU crash ping RA is not working any
> more.
>
> my questions are:
> - do we have to turn debug mode off to get rid of the max cpu time
> limit? is that the right thing to do, or are we using to much cpu time
> for the heartbeat process?
> - how to fix the ping RA? is my cluster somehow screwed up, that ping RA
> not working any more?
>
> bests
>
> daniel
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
