Thanks for the response,

I did do a Google search on both logs before posting to this mailing list.  
This is what has been tried so far:

   1. The service was stopped and started several times using
      */etc/init.d/heartbeat*, but each time the *stop* would hang
      and the *start* would not bring up all processes. On the first
      attempt the stop was given 30 minutes and did not complete; on
      another attempt it was left running for over an hour.  So
      waiting a long time for shutdown has also been tried.
   2. The server was rebooted.
   3. The iptables rules were "turned off" so that there was no firewall:
      *# service iptables status
      Firewall is stopped.*
   4. With heartbeat shut down, the files from */var/lib/heartbeat/crm*
      were moved to another location to leave that directory empty.  The
      processes still did not come up completely, so the configuration
      was not re-obtained from the working node in the cluster.  NOTE:
      The cibadmin command could not be used either without the other
      processes up.
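
To compare what is running after each attempt, a minimal sketch along these
lines could be used. It assumes the daemons show up in `ps` under the
/usr/lib/heartbeat/ path as in the process list quoted further down (the
directory may differ on other installs, e.g. a lib64 path); the function
names are hypothetical and not part of heartbeat.

```python
import subprocess

# Daemons named in the quoted message below as never coming up
# (command-line arguments such as "-r" or "-v" are ignored).
EXPECTED = ["ccm", "cib", "lrmd", "stonithd", "attrd", "crmd", "mgmtd", "cibmon"]

# Assumption: install directory as shown in this thread.
PREFIX = "/usr/lib/heartbeat/"


def missing_daemons(ps_lines):
    """Return the expected daemons that are absent from the given command lines."""
    running = set()
    for line in ps_lines:
        parts = line.split()
        # Only the executable path (first field) is compared.
        if parts and parts[0].startswith(PREFIX):
            running.add(parts[0][len(PREFIX):])
    return [name for name in EXPECTED if name not in running]


def current_command_lines():
    """Collect the command line of every process using `ps -eo args`."""
    result = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()[1:]  # drop the header line
```

Calling `missing_daemons(current_command_lines())` after a start attempt
returns the names that are still absent, which makes it easier to see
whether any attempt gets further than the others.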

Is there some other way to "kick the system" to try and get heartbeat going 
again?

I've looked at the ha.cf file and it looks fine.  Removing the crm files was 
another attempt, to rule out file corruption there.  As far as I can tell, the 
files in /usr/lib/heartbeat appear to be ok -- as in they were not recently 
changed.  Is there some other place to check for corruption that could possibly 
lead to this kind of behavior?
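
In case a summary of the log is useful, here is a rough sketch for counting
the repeated messages. It assumes each entry sits on one line in the
"name[pid]: date_time LEVEL: text" layout seen in the quoted logs below; the
function name and the digit-folding step are my own choices, not anything
heartbeat provides.

```python
import re
from collections import Counter

# Matches e.g.
# heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is ...
LINE_RE = re.compile(
    r"^(?P<proc>\S+)\[(?P<pid>\d+)\]: "
    r"(?P<stamp>\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+): (?P<msg>.*)$"
)


def summarize(lines):
    """Count log entries per (level, message text).

    Runs of digits in the message are replaced by "N" so that entries
    differing only in counters or timestamps are grouped together.
    Lines that do not match the expected layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        text = re.sub(r"\d+", "N", match.group("msg"))
        counts[(match.group("level"), text)] += 1
    return counts
```

Passing an open log file to `summarize()` and printing `.most_common(10)` on
the result lists the most frequent messages. The log location depends on the
logging settings in ha.cf, so the file name to open is left to the reader.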

Regards,
Bart

Hi,

On Thursday, 9 December 2010 at 15:16, Bart Pousson wrote:

> >  Hi,
> >
> >  I have a system with two nodes that had been running heartbeat for a
> >  while -- Linux HA 2.1.4.  One of the heartbeat processes went to 100%
> >  CPU usage and stayed there, with the following logs seen:
> >
> >  heartbeat[17464]: 2010/11/21_03:04:07 info: Gmain_timeout_dispatch:
> >  started at 3846010832 should have started at 3845570140
> >  heartbeat[17464]: 2010/11/21_03:04:08 WARN: Gmain_timeout_dispatch:
> >  Dispatch function for retransmit request took too long to execute: 400
> >  ms (>  10 ms) (GSource: 0x18254030)
> >
> >  I tried to shut down using /etc/init.d/heartbeat stop  -- the shutdown
> >  hung and ever since then the only way to stop the heartbeat processes is
> >  by doing a kill (or killall).
> >
> >  When the heartbeat processes are started, only the first few processes
> >  come up -- heartbeat never fully initializes. The following processes
> >  never come up:
> >
> >       /usr/lib/heartbeat/ccm
> >       /usr/lib/heartbeat/cib
> >       /usr/lib/heartbeat/lrmd -r
> >       /usr/lib/heartbeat/stonithd
> >       /usr/lib/heartbeat/attrd
> >       /usr/lib/heartbeat/crmd
> >       /usr/lib/heartbeat/mgmtd -v
> >       /usr/lib/heartbeat/cibmon -d
> >
> >  These logs are now seen every time a start is attempted:
> >
> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
> >  filling up (500 messages in queue)
> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
> >  filling up (500 messages in queue)
> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
> >  filling up (500 messages in queue)
> >
> >  So, I've gotten heartbeat into a state where it will not start up all
> >  the processes, and when trying to stop it hangs.  I'm not sure what else
> >  to look at.  Has anyone seen this kind of behavior before?
- Yes, sure; have you already tried to "google" for:
"Message hist queue is filling up"

- See, for example, this thread:
http://www.gossamer-threads.com/lists/linuxha/users/43024

HTH

Nikita Michalko


> >
> >  Thanks,
> >  Bart
> >  _______________________________________________
> >  Linux-HA mailing list
> >  [email protected]
> >  http://lists.linux-ha.org/mailman/listinfo/linux-ha
> >  See also:http://linux-ha.org/ReportingProblems