Re: [Linux-HA] Heartbeat dies AGAIN with SIGXCPU, cluster screwed up again

Igor Chudov Mon, 10 Jan 2011 10:18:13 -0800

On Tue, Jan 4, 2011 at 10:22 AM, Serge Dubrouski <[email protected]> wrote:


> On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov <[email protected]> wrote:
> > Serge, I am not sure of anything, but the self-communication is supposed to
> > be taking place on a single crossover cable between the second network
> > cards of the servers (eth1).
>
> Agreed, yet something strange and pretty unique is going on with your
> setup. Could you publish your ha.cf and the output of ifconfig eth1
> and netstat -in?
>
>

It happened again. This time all I know from the logs is that the MCP
(Master Control Process) died.

My first question, which I want answered regardless of anything else, is how
to enable core dumps and debug the crash.

My second question is: can heartbeat be configured to restart itself after
such a failure?

My version is 3.0.3.
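
For reference on the core-dump question and the CPU limit quoted further down in this thread, here is a minimal sketch of one way to check both limits on a running process. It only parses text in the layout of /proc/<pid>/limits; it does not change any heartbeat setting. The names parse_limits, report and SAMPLE are illustrative and not part of heartbeat, and the "Max core file size" row in the sample is a hypothetical value added for the example.

```python
import re

# Sample text in the layout of /proc/<pid>/limits. The "Max cpu time" row is
# the one quoted in this thread; the "Max core file size" row is hypothetical.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              43                   unlimited            seconds
Max core file size        0                    unlimited            bytes
"""


def parse_limits(text):
    """Return {limit name: (soft, hard, units)} from /proc/<pid>/limits text."""
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip the header row and blank lines
        units = fields[3] if len(fields) > 3 else ""
        limits[fields[0]] = (fields[1], fields[2], units)
    return limits


def report(limits):
    """Describe the two limits relevant to a SIGXCPU death and its core file."""
    notes = []
    cpu_soft = limits["Max cpu time"][0]
    if cpu_soft != "unlimited":
        notes.append(f"CPU soft limit {cpu_soft}s: SIGXCPU is sent once it is used up")
    if limits["Max core file size"][0] == "0":
        notes.append("core soft limit 0: no core file can be written")
    return notes


if __name__ == "__main__":
    for note in report(parse_limits(SAMPLE)):
        print(note)
```

To check a live process instead of the sample, pass the contents of /proc/<heartbeat_pid>/limits to parse_limits(), reading the file as root if necessary.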

Anyway, here is the conf file.


use_logd on
udpport 12694
keepalive 1
warntime 15
deadtime 20
debug 1
initdead 30
bcast eth1
node pfs-srv3
node pfs-srv4
auto_failback on
crm off

>
> > Igor
> >
> > On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski <[email protected]> wrote:
> >
> >> Are you sure that everything is all right with your network? It looks
> >> like the processes that are responsible for UDP communications are
> >> taking too much CPU time.
> >>
> >> On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
> >> > Steve, here's some data.
> >> >
> >> > The OS is Ubuntu 10.04.
> >> >
> >> > ~# apt-cache policy heartbeat
> >> > heartbeat:
> >> >  Installed: 1:3.0.3-1ubuntu1
> >> >  Candidate: 1:3.0.3-1ubuntu1
> >> >  Version table:
> >> >  *** 1:3.0.3-1ubuntu1 0
> >> >        500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
> >> >        100 /var/lib/dpkg/status
> >> >
> >> > I agree that it should not use too much CPU, and I think that it does not.
> >> > But after a while it gets a SIGXCPU anyway.
> >> >
> >> > It also seems to die from something else.
> >> >
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 1228 dumped core
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.  Beginning communications restart process for comm channel 0.
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died.  Restarting.
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 6729 dumped core
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died.  Beginning communications restart process for comm channel 0.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died.  Restarting.
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
> >> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master Control process died.
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with SIGTERM
> >> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
> >> >
> >> > i
> >> >
> >> > On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
> >> >
> >> >> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
> >> >> > Further reading indicates that heartbeat sets a CPU limit for itself
> >> >> > every so often.
> >> >> >
> >> >> > Then it exceeds the limit (probably due to a bug). I am sure that's
> >> >> > why whoever wrote heartbeat set a CPU limit, instead of fixing their
> >> >> > bugs.
> >> >> >
> >> >> > Then it dies with SIGXCPU, leaving everything in an extremely messy
> >> >> > state, leading to split brain and destruction of shared resources
> >> >> > (DRBD data).
> >> >> >
> >> >> > I was trying to be a little patient. A little forgiving. I must say
> >> >> > that my patience is rapidly running out.
> >> >> >
> >> >> > I absolutely cannot use this "solution" as the basis of a
> >> >> > high-reliability cluster, because it is the opposite of reliability.
> >> >> >
> >> >> > We have an old cluster that works very well with heartbeat V1. But it
> >> >> > is getting old, the disks are wearing out, the fans are not getting
> >> >> > newer, etc. I set up a new cluster in the summer, but never fully
> >> >> > trusted it, and it looks like I will not be able to trust it. We never
> >> >> > completed a switchover.
> >> >> >
> >> >> > At this point I feel rather desperate. Perhaps I should give
> >> >> > "pacemaker" another go. I really have no idea and I am running out of
> >> >> > options.
> >> >> >
> >> >> > i
> >> >> >
> >> >> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
> >> >> >
> >> >> >> A few weeks ago I reported that heartbeat died on one of the cluster
> >> >> >> machines, due to SIGXCPU.
> >> >> >>
> >> >> >> Well, it happened again. Heartbeat died, and now both machines have
> >> >> >> the shared IP address up. What a god-awful mess!!!
> >> >> >>
> >> >> >> Now they have split brain and the whole nine yards!
> >> >> >>
> >> >> >> I looked at /proc/<heartbeat_pid>/limits and found:
> >> >> >>
> >> >> >> Limit                     Soft Limit           Hard Limit           Units
> >> >> >> Max cpu time              43                   unlimited            seconds
> >> >> >>
> >> >> >> So, this process somehow has a limit set for it.
> >> >> >>
> >> >> >> Does anyone have ANY clue who would set a limit for this process??? WTF?
> >> >> >> Does it do it for itself or what?
> >> >> >>
> >> >>
> >> >> I cannot answer your question, but I suspect it might be useful if you
> >> >> mentioned which version of heartbeat and what resource manager you are
> >> >> using. Perhaps provide a copy of your heartbeat configuration.
> >> >>
> >> >> Is heartbeat using too much CPU? It should be pretty much idle
> >> >> relative to the rest of the system. If not, it is worth finding out
> >> >> why not.
> >> >>
> >> >> Regards,
> >> >> Steve
> >> >> _______________________________________________
> >> >> Linux-HA mailing list
> >> >> [email protected]
> >> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >> >> See also: http://linux-ha.org/ReportingProblems
> >> >>
> >>
> >>
> >>
> >> --
> >> Serge Dubrouski.
>
>
>
> --
> Serge Dubrouski.
