On Tue, Jan 4, 2011 at 9:14 AM, Igor Chudov <[email protected]> wrote:
> Serge, I am not sure of anything, but the self-communication is supposed to
> be taking place on a single crossover cable between second network cards of
> the servers. (eth1).

Agreed, yet something strange and fairly unusual is going on with your
setup. Could you post your ha.cf and the output of ifconfig eth1 and
netstat -in?

>
> Igor
>
> On Tue, Jan 4, 2011 at 10:06 AM, Serge Dubrouski <[email protected]> wrote:
>
>> Are you sure that everything is all right with your network? It looks
>> like processes that are responsible for UDP communications are taking
>> too much of CPU time.
>>
>> On Tue, Jan 4, 2011 at 8:47 AM, Igor Chudov <[email protected]> wrote:
>> > Steve, here's some data.
>> >
>> > The OS is Ubuntu 10.04.
>> >
>> > ~# apt-cache policy heartbeat
>> > heartbeat:
>> >   Installed: 1:3.0.3-1ubuntu1
>> >   Candidate: 1:3.0.3-1ubuntu1
>> >   Version table:
>> >  *** 1:3.0.3-1ubuntu1 0
>> >         500 http://us.archive.ubuntu.com/ubuntu/ lucid/universe Packages
>> >         100 /var/lib/dpkg/status
>> >
>> > I agree that it should not use too much CPU, and I think that it does not.
>> > But after a while it gets a SIGXCPU anyway.
>> >
>> > It also seems to die from something else.
>> >
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 1228 killed by signal 24 [SIGXCPU - CPU limit exceeded].
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 1228 dumped core
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 1227 killed by signal 9 [SIGKILL - Kill, unblockable].
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 29 02:29:16 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBREAD process 6729 killed by signal 24 [SIGXCPU - CPU limit exceeded].
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Managed HBREAD process 6729 dumped core
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: HBREAD process died. Beginning communications restart process for comm channel 0.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: WARN: Managed HBWRITE process 6728 killed by signal 9 [SIGKILL - Kill, unblockable].
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: ERROR: Both comm processes for channel 0 have died. Restarting.
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat started on port 12694 (12694) interface eth1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: glib: UDP Broadcast heartbeat closed on port 12694 interface eth1 - Status: 1
>> > Dec 30 21:03:49 pfs-srv3 heartbeat: [1196]: info: Communications restart succeeded.
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown: Master Control process died.
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 1196 with SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9866 with SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Killing pid 9867 with SIGTERM
>> > Dec 31 13:58:22 pfs-srv3 heartbeat: [1226]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.
>> >
>> > i
>> >
>> > On Tue, Jan 4, 2011 at 9:33 AM, Steve Davies <[email protected]> wrote:
>> >
>> >> On 4 January 2011 13:47, Igor Chudov <[email protected]> wrote:
>> >> > Further reading indicates that heartbeat sets a CPU limit on itself
>> >> > every so often.
>> >> >
>> >> > Then it exceeds the limit (probably due to a bug). I am sure that is
>> >> > why whoever wrote heartbeat set a CPU limit instead of fixing their
>> >> > bugs.
>> >> >
>> >> > Then it dies with SIGXCPU, leaving everything in an extremely messy
>> >> > state, leading to split brain and destruction of shared resources
>> >> > (DRBD data).
>> >> >
>> >> > I was trying to be a little patient. A little forgiving. I must say
>> >> > that my patience is rapidly running out.
>> >> >
>> >> > I absolutely cannot use this "solution" as a basis of a high
>> >> > reliability cluster, because it is the opposite of reliability.
>> >> >
>> >> > We have an old cluster that works very well with heartbeat V1. But it
>> >> > is getting old, the disks are wearing out, the fans are not getting
>> >> > newer, etc. I set up a new cluster in the summer, but never fully
>> >> > trusted it, and it looks like I will not be able to trust it. We never
>> >> > completed a switchover.
>> >> >
>> >> > At this point I feel rather desperate. Perhaps I should give
>> >> > "pacemaker" another go. I really have no idea and I am running out of
>> >> > options.
>> >> >
>> >> > i
>> >> >
>> >> > On Tue, Jan 4, 2011 at 7:32 AM, Igor Chudov <[email protected]> wrote:
>> >> >
>> >> >> A few weeks ago I reported that heartbeat died on one of the cluster
>> >> >> machines, due to SIGXCPU.
>> >> >>
>> >> >> Well, it happened again. Heartbeat died, and now both machines had
>> >> >> the shared IP address up, what a god-awful mess!!!
>> >> >>
>> >> >> Now they have split brain and the whole nine yards!
>> >> >>
>> >> >> I looked at /proc/<heartbeat_pid>/limits and found:
>> >> >>
>> >> >> Limit                     Soft Limit           Hard Limit           Units
>> >> >> Max cpu time              43                   unlimited            seconds
>> >> >>
>> >> >> So, this process somehow has a limit set for it.
>> >> >>
>> >> >> Does anyone have ANY clue who would set a limit for this process??? WTF?
>> >> >> Does it do it for itself or what?
>> >> >>
>> >>
>> >> I cannot answer your question, but I suspect it might be useful if you
>> >> mentioned which version of heartbeat and what resource manager you are
>> >> using. Perhaps provide a copy of your heartbeat configuration.
>> >>
>> >> Is heartbeat using too much CPU? It should be pretty much idle
>> >> relative to the rest of the system - If not, it is worth finding out
>> >> why not.
>> >>
>> >> Regards,
>> >> Steve
>> >> _______________________________________________
>> >> Linux-HA mailing list
>> >> [email protected]
>> >> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> >> See also: http://linux-ha.org/ReportingProblems
>> >>
>>
>> --
>> Serge Dubrouski.

--
Serge Dubrouski.
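
For anyone comparing the two nodes: the soft CPU limit quoted earlier in the thread ("Max cpu time 43 unlimited seconds") can be pulled out of /proc/<pid>/limits with a few lines of Python. This is only a rough sketch, assuming Linux and Python 3. The function names parse_cpu_limit and read_cpu_limit are made up for illustration and are not part of heartbeat. It only reports the limit and says nothing about which process set it.

```python
import re
from typing import Optional, Tuple


def parse_cpu_limit(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) from the 'Max cpu time' row, or None if absent.

    Values are returned as strings because the kernel prints either a
    number of seconds or the word 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max cpu time"):
            # Columns in /proc/<pid>/limits are padded with runs of spaces;
            # the row label itself contains only single spaces.
            fields = re.split(r"\s{2,}", line.strip())
            return fields[1], fields[2]
    return None


def read_cpu_limit(pid: int) -> Optional[Tuple[str, str]]:
    """Read and parse /proc/<pid>/limits for the given process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_cpu_limit(fh.read())


# Hypothetical usage, with the PID of a running heartbeat process:
#   soft, hard = read_cpu_limit(1196)
```

Running read_cpu_limit against each heartbeat PID (master control, HBREAD, HBWRITE) on both nodes would show whether the finite soft limit appears on every process or only on some of them.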
