Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

Igor Chudov Tue, 04 Jan 2011 07:48:05 -0800

Steve, here's some data.

The OS is Ubuntu 10.04.

~# apt-cache policy heartbeat
heartbeat:
  Installed: 1:3.0.3-1ubuntu1
  Candidate: 1:3.0.3-1ubuntu1
  Version table:
 *** 1:3.0.3-1ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
        100 /var/lib/dpkg/status

I agree that it should not use too much CPU, and I think that it does not.
But after a while it gets a SIGXCPU anyway.
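
One thing worth checking here: RLIMIT_CPU counts cumulative CPU seconds, not CPU percentage, so a process that is nearly idle can still reach a fixed soft limit if it runs long enough. Below is a minimal sketch for comparing the "Max cpu time" row of /proc/<pid>/limits against the CPU time the process has actually used so far. The function names (parse_cpu_limit, cpu_seconds_used, report) are made up for illustration, and it assumes a Linux /proc layout.

```python
import os
import re


def parse_cpu_limit(limits_text):
    """Return (soft, hard) from the 'Max cpu time' row; 'unlimited' becomes None."""
    match = re.search(r"^Max cpu time\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Max cpu time' row found")
    return tuple(None if value == "unlimited" else int(value)
                 for value in match.groups())


def cpu_seconds_used(stat_text, ticks_per_second):
    """Return utime + stime (fields 14 and 15 of /proc/<pid>/stat) in seconds."""
    # The command name (field 2) is parenthesised and may contain spaces,
    # so only the part after the last ')' is split on whitespace.
    fields = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return (utime + stime) / ticks_per_second


def report(pid):
    """Hypothetical helper: read both /proc files for one pid (Linux only)."""
    with open(f"/proc/{pid}/limits") as limits_file:
        soft, hard = parse_cpu_limit(limits_file.read())
    with open(f"/proc/{pid}/stat") as stat_file:
        used = cpu_seconds_used(stat_file.read(), os.sysconf("SC_CLK_TCK"))
    return soft, hard, used
```

Running report() on the heartbeat pids every minute or so should show whether the soft limit keeps moving ahead of the used CPU time, or whether the used time slowly catches up with it.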

It also seems to die from something else.

Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228
killed by signal 24 [SIGXCPU - CPU limit exceeded].
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process
1228 dumped core
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.
 Beginning communications restart process for comm channel 0.
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process
1227 killed by signal 9 [SIGKILL - Kill, unblockable].
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for
channel 0 have died.  Restarting.
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat started on port 12694 (12694) interface eth1
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart
succeeded.
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process
6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process
6729 dumped core
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.
 Beginning communications restart process for comm channel 0.
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process
6728 killed by signal 9 [SIGKILL - Kill, unblockable].
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for
channel 0 have died.  Restarting.
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat started on port 12694 (12694) interface eth1
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast
heartbeat closed on port 12694 interface eth1 - Status: 1
Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart
succeeded.
Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master
Control process died.
Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with
SIGTERM
Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with
SIGTERM
Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with
SIGTERM
Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP
dead): Killing ourselves.
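
For reference, the "killed by signal 24" lines are what any process looks like when it runs past its own soft RLIMIT_CPU. The following is only a sketch of that general mechanism, not heartbeat's actual code: a child process gives itself a 1 second soft limit (an arbitrary value for the demo), burns CPU, and the parent reports which signal killed it. It assumes a Unix system where Python's resource module is available.

```python
import signal
import subprocess
import sys

# Source of the child process. The 1 second soft limit is an arbitrary
# demo value, not anything taken from heartbeat.
CHILD = """
import resource
# Suppress the core file that SIGXCPU would otherwise produce.
_, core_hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (0, core_hard))
# Lower only the soft CPU limit; the hard limit is left unchanged.
_, cpu_hard = resource.getrlimit(resource.RLIMIT_CPU)
resource.setrlimit(resource.RLIMIT_CPU, (1, cpu_hard))
while True:  # burn CPU until the kernel delivers SIGXCPU
    pass
"""


def run_until_cpu_limit():
    """Run the self-limiting child and return the signal number that killed it."""
    proc = subprocess.run([sys.executable, "-c", CHILD])
    # A negative return code -N means the child was terminated by signal N.
    return -proc.returncode


if __name__ == "__main__":
    killer = run_until_cpu_limit()
    print("child killed by signal", killer, signal.Signals(killer).name)
```

The child dies after roughly one second of CPU time, however long that takes in wall-clock terms, which is why low average CPU usage does not by itself rule out a SIGXCPU.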

i

On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:

> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
> > Further reading indicates that heartbeat itself sets a limit for itself
> > every so often.
> >
> > Then it exceeds the limit (probably due to a bug). I am sure that that's
> > why whoever wrote heartbeat set a CPU limit, instead of fixing their bugs.
> >
> > Then it dies with SIGXCPU, leaving everything in an extremely messy
> > state, leading to split brain, destruction of shared resources (DRBD data).
> >
> > I was trying to be a little patient. A little forgiving. I must say that
> > my patience is rapidly running out.
> >
> > I absolutely cannot use this "solution" as a basis of a high reliability
> > cluster, because it is the opposite of reliability.
> >
> > We had an old cluster that works very well with heartbeat V1. But it is
> > getting old, the disks are wearing out, the fans are not getting newer,
> > etc. I set up a new cluster in summer, but never fully trusted it, and it
> > looks like I will not be able to trust it. We never completed a switchover.
> >
> > At this point I feel rather desperate. Perhaps I should give "pacemaker"
> > another go. I really have no idea and I am running out of options.
> >
> > i
> >
> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
> >
> >> A few weeks ago I reported that heartbeat died on one of the cluster
> >> machines, due to SIGXCPU.
> >>
> >> Well, it happened again. Heartbeat died, now both machines had the
> >> shared IP address up, what a god-awful mess!!!
> >>
> >> Now they have split brain and the whole nine yards!
> >>
> >> I  looked at /proc/<heartbeat_pid>/limits and found:
> >>
> >> Limit                     Soft Limit           Hard Limit           Units
> >> Max cpu time              43                   unlimited            seconds
> >>
> >>
> >> So, this process somehow has a limit set for it.
> >>
> >> Does anyone have ANY clue who would set a limit for this process??? WTF?
> >> Does it do it for itself or what?
> >>
>
> I cannot answer your question, but I suspect it might be useful if you
> mentioned which version of heartbeat and what resource manager you are
> using. Perhaps provide a copy of your heartbeat configuration.
>
> Is heartbeat using too much CPU? It should be pretty much idle
> relative to the rest of the system - If not, it is worth finding out
> why not.
>
> Regards,
> Steve
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
>
