On Sun, Dec 26, 2010 at 08:56:13AM -0600, Igor Chudov wrote:
> As you guys recall, I have set up a heartbeat/drbd based system to replace
> an aging drbd solution.
>
> While it sits there, it has not been activated.
>
> I have noticed (due to some self checking scripts) that heartbeat died on
> one machine.
>
> Looking in logs, I found this in ha-log.2:
>
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBREAD process
> 3279 killed by signal 24 [SIGXCPU - CPU limit exceeded].

The heartbeat read process was using too much CPU.

> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Managed HBREAD process
> 3279 dumped core
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: HBREAD process died.
> Beginning communications restart process for comm channel 0.
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBWRITE process
> 3278 killed by signal 9 [SIGKILL - Kill, unblockable].
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Both comm processes for
> channel 0 have died. Restarting.
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat started on port 12694 (12694) interface eth1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: Communications restart
> succeeded.
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown: Master
> Control process died.

heartbeat found out that MCP left. Nothing else in the logs?
Core files?

Thanks,

Dejan

> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 1243 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7247 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7248 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown(MCP
> dead): Killing ourselves.
>
> It looks like heartbeat had a couple of issues, one is dying from SIGXCPU,
> and another is dying from master control process. Any ideas as to why this
> could have happened?
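
To check whether anything else in the logs shows a managed process being killed, something along these lines could be run over the rotated ha-log files. It is only a rough sketch: the pattern is modelled on the two WARN lines quoted in this thread, the names (SignalDeath, find_signal_deaths, SAMPLE) are made up for illustration, and it assumes each entry sits on one line in the real log file rather than being wrapped as in this mail.

```python
import re
from typing import Iterable, List, NamedTuple


class SignalDeath(NamedTuple):
    """One 'Managed X process N killed by signal S' entry (hypothetical type)."""
    timestamp: str
    host: str
    process: str
    pid: int
    signal: int
    signal_name: str


# Pattern derived only from the WARN lines quoted in this thread;
# other heartbeat versions may word the message differently.
_KILLED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+heartbeat:\s+\[\d+\]:\s+"
    r"WARN:\s+Managed\s+(?P<proc>\S+)\s+process\s+(?P<pid>\d+)\s+"
    r"killed by signal\s+(?P<sig>\d+)\s+\[(?P<name>\w+)"
)


def find_signal_deaths(lines: Iterable[str]) -> List[SignalDeath]:
    """Return every managed-process signal death found in the given log lines."""
    deaths = []
    for line in lines:
        match = _KILLED.match(line)
        if match:
            deaths.append(
                SignalDeath(
                    timestamp=match.group("ts"),
                    host=match.group("host"),
                    process=match.group("proc"),
                    pid=int(match.group("pid")),
                    signal=int(match.group("sig")),
                    signal_name=match.group("name"),
                )
            )
    return deaths


# The two WARN entries from this thread, joined back onto single lines,
# plus one unrelated entry that must be ignored.
SAMPLE = [
    "Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBREAD process "
    "3279 killed by signal 24 [SIGXCPU - CPU limit exceeded].",
    "Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: HBREAD process died.",
    "Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBWRITE process "
    "3278 killed by signal 9 [SIGKILL - Kill, unblockable].",
]

if __name__ == "__main__":
    for death in find_signal_deaths(SAMPLE):
        print(death.timestamp, death.process, death.pid, death.signal_name)
```

For a real run, pass an open file instead of SAMPLE, e.g. find_signal_deaths(open(path)) with the path to ha-log.2 on the affected node. If SIGXCPU shows up repeatedly for the same process type, that would suggest a recurring CPU spike rather than a one-off.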
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
