Re: [Linux-HA] Can no longer start/stop heartbeat properly

Andrew Beekhof Fri, 10 Dec 2010 00:36:05 -0800

On Thu, Dec 9, 2010 at 6:48 PM, Bart Pousson
<[email protected]> wrote:
> Thanks for the response,
>
> I did do a Google search on both logs before posting to this mailing list.  
> This is what has been tried so far:
>
>   1. Several times the service was stopped and started using
>      */etc/init.d/heartbeat*, but each time the *stop* would hang
>      and the *start* would not start up all processes. Initially, the
>      stop was allowed to attempt completion for 30 minutes and did not
>      complete; another attempt let it run for over an hour.  So,
>      waiting a long time for shutdown has also been attempted.
>   2. The server was rebooted.
>   3. The iptable rules were "turned off" such that there was no firewall:
>      *# service iptables status
>      Firewall is stopped.*


Call me paranoid, but I prefer to disable iptables before rebooting the machine.
I.e.  chkconfig --del iptables

That way you can be sure there's no residual from having iptables active.
Make sure to do this on both nodes (the one filling up the hist queue
isn't always the one with the problem).
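
To compare the two nodes, it can help to count how often each heartbeat
process logs the hist queue error. Below is a rough Python sketch; it assumes
each log entry sits on a single line in the same format as the lines quoted
in this thread. The function name and the command-line usage are made up for
illustration, and the log file location is whatever your ha.cf points at.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the log lines quoted in this thread, e.g.
# heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
# filling up (500 messages in queue)
# Assumption: in the real log file each entry is on one line.
HIST_RE = re.compile(
    r"heartbeat\[(?P<pid>\d+)\]: (?P<ts>\S+) ERROR: "
    r"Message hist queue is filling up \((?P<n>\d+) messages in queue\)"
)


def count_hist_errors(lines):
    """Return a Counter mapping heartbeat PID (as a string) to error count."""
    counts = Counter()
    for line in lines:
        match = HIST_RE.search(line)
        if match:
            counts[match.group("pid")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python hist_count.py <path-to-heartbeat-log>
    with open(sys.argv[1], errors="replace") as log_file:
        for pid, total in count_hist_errors(log_file).items():
            print(f"heartbeat[{pid}]: {total} hist queue errors")
```

Run it against the log on each node. A node that logs few or none of these
errors is not automatically the healthy one, so check both.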

>   4. With heartbeat shutdown, the files from */var/lib/heartbeat/crm*
>      were moved to another location to leave that directory empty.  The
>      processes still did not come up completely, so the configuration
>      was not re-obtained from the working node in the cluster.  NOTE:
>      The cibadmin command could not be used either without the other
>      processes up.
>
> Is there some other way to "kick the system" to try and get heartbeat going 
> again?
>
> I've looked at the ha.cf file and it looks fine, removing the crm files was 
> another attempt to see if there was file corruption.  As far as I can tell, 
> the files in /usr/lib/heartbeat appear to be ok -- as in they were not 
> recently changed.  Is there some other place to check for corruption that 
> could possibly lead to this kind of behavior?
>
> Regards,
> Bart
>
> Hi,
>
> On Thursday, 9 December 2010 at 15:16, Bart Pousson wrote:
>
>> >  Hi,
>> >
>> >  I have a system with two nodes that had been running heartbeat for a
>> >  while -- Linux HA 2.1.4.  One of the heartbeat processes went to 100%
>> >  CPU usage and stayed there, with the following logs seen:
>> >
>> >  heartbeat[17464]: 2010/11/21_03:04:07 info: Gmain_timeout_dispatch:
>> >  started at 3846010832 should have started at 3845570140
>> >  heartbeat[17464]: 2010/11/21_03:04:08 WARN: Gmain_timeout_dispatch:
>> >  Dispatch function for retransmit request took too long to execute: 400
>> >  ms (>  10 ms) (GSource: 0x18254030)
>> >
>> >  I tried to shutdown using /etc/init.d/heartbeat stop  -- the shutdown
>> >  hung and ever since then the only way to stop the heartbeat processes is
>> >  by doing a kill (or killall).
>> >
>> >  When the heartbeat processes are started, only the first few processes
>> >  come up -- heartbeat never fully initializes. The following processes
>> >  never come up:
>> >
>> >       /usr/lib/heartbeat/ccm
>> >       /usr/lib/heartbeat/cib
>> >       /usr/lib/heartbeat/lrmd -r
>> >       /usr/lib/heartbeat/stonithd
>> >       /usr/lib/heartbeat/attrd
>> >       /usr/lib/heartbeat/crmd
>> >       /usr/lib/heartbeat/mgmtd -v
>> >       /usr/lib/heartbeat/cibmon -d
>> >
>> >  These logs are now seen every time a start is attempted:
>> >
>> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
>> >  filling up (500 messages in queue)
>> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
>> >  filling up (500 messages in queue)
>> >  heartbeat[12339]: 2010/12/08_16:20:23 ERROR: Message hist queue is
>> >  filling up (500 messages in queue)
>> >
>> >  So, I've gotten heartbeat into a state where it will not start up all
>> >  the processes, and when trying to stop it hangs.  I'm not sure what else
>> >  to look at.  Has anyone seen this kind of behavior before?
> - Yes, sure; have you already tried to "google" for:
> "Message hist queue is
> filling up"
>
> - Look, for example, at this:
> http://www.gossamer-threads.com/lists/linuxha/users/43024
>
> HTH
>
> Nikita Michalko
>
>
>> >
>> >  Thanks,
>> >  Bart
>> >  _______________________________________________
>> >  Linux-HA mailing list
>> >  [email protected]
>> >  http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> >  See also:http://linux-ha.org/ReportingProblems
