On Sun, Dec 26, 2010 at 08:56:13AM -0600, Igor Chudov wrote:
> As you guys recall, I have set up a heartbeat/drbd based system to replace
> an aging drbd solution.
> 
> While it sits there, it has not been activated.
> 
> I have noticed (due to some self checking scripts) that heartbeat died on
> one machine.
> 
> Looking in logs, I found this in ha-log.2:
> 
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBREAD process
> 3279 killed by signal 24 [SIGXCPU - CPU limit exceeded].

SIGXCPU means the heartbeat read process (HBREAD) exceeded its CPU time
limit, i.e. it was using too much CPU.
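
For context, the kernel delivers SIGXCPU when a process goes over its soft
RLIMIT_CPU limit. A minimal Python sketch for looking at such a limit is
below. The helper name describe_cpu_limit is hypothetical, and the script
only inspects the limit of the Python process itself, not of a running
heartbeat process.

```python
import resource
import signal


def describe_cpu_limit(soft, hard):
    """Render an RLIMIT_CPU (soft, hard) pair as readable text."""
    def fmt(value):
        # RLIM_INFINITY is the sentinel for "no limit set".
        return "unlimited" if value == resource.RLIM_INFINITY else f"{value}s"
    return f"soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    # Limits of *this* process; exceeding the soft limit triggers SIGXCPU.
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    print("RLIMIT_CPU:", describe_cpu_limit(soft, hard))
    print("SIGXCPU number on this platform:", int(signal.SIGXCPU))
```

On Linux, /proc/<pid>/limits shows the same soft and hard values for another
process, which may be the quicker check for a live heartbeat.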

> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Managed HBREAD process
> 3279 dumped core
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: HBREAD process died.
>  Beginning communications restart process for comm channel 0.
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBWRITE process
> 3278 killed by signal 9 [SIGKILL - Kill, unblockable].
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: ERROR: Both comm processes for
> channel 0 have died.  Restarting.
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat started on port 12694 (12694) interface eth1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: glib: UDP Broadcast
> heartbeat closed on port 12694 interface eth1 - Status: 1
> Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: info: Communications restart
> succeeded.
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown: Master
> Control process died.

Here heartbeat found out that the master control process (MCP) was gone.
Is there nothing else in the logs? Any core files?
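
To check the remaining logs for similar events, something along these lines
might help. It is only a sketch: the regular expression is modelled on the
"killed by signal" lines quoted above, find_killed is a hypothetical helper
name, and other heartbeat versions may word the message differently.

```python
import re

# Pattern modelled on the quoted log lines (an assumption about the format),
# e.g. "... WARN: Managed HBREAD process 3279 killed by signal 24 [SIGXCPU ...]."
KILLED_RE = re.compile(
    r"Managed (?P<name>\w+) process\s+(?P<pid>\d+)\s+killed by signal (?P<signo>\d+)"
)


def find_killed(log_text):
    """Return (process name, pid, signal number) for each matching event."""
    return [
        (m.group("name"), int(m.group("pid")), int(m.group("signo")))
        for m in KILLED_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    sample = (
        "Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBREAD "
        "process 3279 killed by signal 24 [SIGXCPU - CPU limit exceeded].\n"
        "Dec 13 17:13:14 pfs-srv3 heartbeat: [1243]: WARN: Managed HBWRITE "
        "process 3278 killed by signal 9 [SIGKILL - Kill, unblockable].\n"
    )
    for name, pid, signo in find_killed(sample):
        print(name, pid, signo)
```

Running it over ha-log, ha-log.1, ha-log.2 and so on would show whether the
Dec 13 event was a one-off or keeps recurring.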

Thanks,

Dejan

> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 1243 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7247 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Killing pid 7248 with
> SIGTERM
> Dec 16 10:29:38 pfs-srv3 heartbeat: [1269]: CRIT: Emergency Shutdown(MCP
> dead): Killing ourselves.
> 
> It looks like heartbeat had a couple of issues, one is dying from SIGXCPU,
> and another is dying from master control process. Any ideas as to why this
> could have happened?
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems