Re: [Linux-HA] Antw: General question about heartbeat tokens and node overloaded.

Ulrich Windl Wed, 02 Oct 2013 04:40:47 -0700

>>> Lars Marowsky-Bree <[email protected]> wrote on 02.10.2013 at 09:48 in
>>> message
<[email protected]>:
> On 2013-10-02T09:36:14, Ulrich Windl <[email protected]> 
> wrote:
> 
>> In general I'm afraid you cannot handle this situation in a perfect way:
>> 
>> You have two types of problems:
>> 1) A node, resource, or monitor is hanging, but a long timeout prevents this
>> from being recognized in time
>> 2) A node, resource, or monitor is performing slower than usual, but a short
>> timeout causes the cluster to think there is a problem with the
>> node/resource/monitor
> 
> Yes, or to summarize, timeouts suck for failure detection, but for many
> cases, we don't have anything better. Digging out my age old post:
> http://advogato.org/person/lmb/diary/108.html 
> 
> A massively overloaded system is indistinguishable from a failing or
> hung one. On the plus side, if a system is *that* overloaded that
> corosync isn't being scheduled and its rather limited network traffic
> presents a problem, it is likely so FUBAR'ed that fencing it doesn't
> make things worse. So the misdiagnosis isn't necessarily a problem.


Hi!

There is one notable exception: If you have shared storage (SAN, NAS, NFS), the
cause of the slowness may be external to the systems being monitored, so
fencing those systems will most likely not improve the situation.
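
To make that concrete, here is a minimal sketch of a timeout-based check. The
node names, latencies, stall duration, and timeout are all made-up numbers for
illustration and do not come from any real cluster. It only shows that a stall
on shared storage pushes every node over the timeout at once, so the check
alone cannot tell which node (if any) is actually at fault.

```python
def timed_out(latency_ms, timeout_ms):
    """Return the nodes whose monitor latency exceeds the timeout."""
    return sorted(node for node, ms in latency_ms.items() if ms > timeout_ms)


# Hypothetical per-node monitor latencies (milliseconds) on a healthy day.
local = {"node1": 40, "node2": 55}

# Hypothetical stall on the shared storage that every node has to wait for.
san_stall_ms = 5000
with_stall = {node: ms + san_stall_ms for node, ms in local.items()}

timeout_ms = 1000
healthy_suspects = timed_out(local, timeout_ms)
stalled_suspects = timed_out(with_stall, timeout_ms)

print("suspects without stall:", healthy_suspects)
print("suspects with stall:   ", stalled_suspects)
```

With the stall added, both nodes are flagged even though neither changed its
own behaviour, and fencing either one would not remove the stall.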

> 
>> BTW: We had experienced hanging I/O when one of our SAN devices had a
>> problem, but the others did not. Still the SLES11 SP2 kernel saw
>> stalled I/Os for obviously unaffected devices. The problem is being
>> investigated...
> 
> FC can be weird like that if it is routed through the same HBA or
> switch. It's not always a kernel problem, the fabric isn't trivial
> either. Good luck with finding the root cause :-/

You are arguing that a shared medium (like the Internet) may cause one server
to be slow if the other server is slow. That would only be plausible if the
client waits for a request to the slow server to complete before starting a
request to the faster server. If that is the case for disks instead of servers,
with an FC SAN as the shared medium, the OS really has a problem (and not the
shared medium).
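
A small simulation of that argument, as a sketch only: the device names and
service times below are invented, and the two functions are a toy model, not a
description of how the SLES kernel or any HBA driver actually queues I/O. The
model compares a client that issues requests strictly one after another with
one that issues them independently.

```python
def serial_completion(requests):
    """Each request starts only after the previous one has finished."""
    done = {}
    clock_ms = 0
    for name, service_ms in requests:
        clock_ms += service_ms
        done[name] = clock_ms
    return done


def concurrent_completion(requests):
    """All requests start at time 0 and do not wait for each other."""
    return {name: service_ms for name, service_ms in requests}


# Hypothetical service times in milliseconds: one stalled device, one healthy.
requests = [("slow_disk", 30000), ("fast_disk", 10)]

serial = serial_completion(requests)
concurrent = concurrent_completion(requests)

print("serial:    ", serial)
print("concurrent:", concurrent)
```

Only in the serial case does the healthy device appear slow (30010 ms instead
of 10 ms), which is why stalled I/O on unaffected devices points at whoever is
doing the serializing rather than at the shared medium as such.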

Regards,
Ulrich


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
