>>> Lars Marowsky-Bree <[email protected]> wrote on 02.10.2013 at 09:48 in
>>> message <[email protected]>:
> On 2013-10-02T09:36:14, Ulrich Windl <[email protected]>
> wrote:
>
>> In general I'm afraid you cannot handle this situation in a perfect way:
>>
>> You have two types of problems:
>> 1) A node, resource, or monitor is hanging, but a long timeout prevents
>> this from being recognized in time
>> 2) A node, resource, or monitor is performing slower than usual, but a short
>> timeout causes the cluster to think there is a problem with the
>> node/resource/monitor
>
> Yes, or to summarize, timeouts suck for failure detection, but for many
> cases, we don't have anything better. Digging out my age-old post:
> http://advogato.org/person/lmb/diary/108.html
>
> A massively overloaded system is indistinguishable from a failing or
> hung one. On the plus side, if a system is *that* overloaded that
> corosync isn't being scheduled and its rather limited network traffic
> presents a problem, it is likely so FUBAR'ed that fencing it doesn't
> make things worse. So the misdiagnosis isn't necessarily a problem.

Hi!

There is one notable exception: If you have shared storage (SAN, NAS,
NFS), the cause of the slowness may be external to the systems being
monitored, so fencing those systems will most likely not improve the
situation.

>
>> BTW: We had experienced hanging I/O when one of our SAN devices had a
>> problem, but the others did not. Still the SLES11 SP2 kernel saw
>> stalled I/Os for obviously unaffected devices. The problem is being
>> investigated...
>
> FC can be weird like that if it is routed through the same HBA or
> switch. It's not always a kernel problem, the fabric isn't trivial
> either. Good luck with finding the root cause :-/

You are arguing that a shared medium (like the Internet) may cause one
server to be slow if another server is slow. That would only be
plausible if the client waits for a request to the slow server to
complete before starting a request to the faster server. If that is the
case for disks instead of servers, with an FC SAN as the shared medium,
then the OS really has a problem (and not the shared medium).

Regards,
Ulrich

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
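
The serialization argument in the reply above can be sketched with a toy model. This is only an illustration under stated assumptions: the device names `slow_lun` and `fast_lun`, the service times, and both helper functions are hypothetical, and nothing here reflects how the SLES kernel or an FC fabric actually schedules I/O. It only shows that a fast device appears slow when its request has to wait behind a request to a slow device, and not when each device has its own queue.

```python
# Toy model, not a real I/O stack. Each request is
# (device_name, service_time_in_milliseconds); all values are made up.
# Assumes at most one request per device.

def completion_times_serialized(requests):
    """Single shared queue: a request starts only after the previous one finished."""
    clock_ms = 0
    done = {}
    for device, service_ms in requests:
        clock_ms += service_ms
        done[device] = clock_ms
    return done


def completion_times_independent(requests):
    """One queue per device: every request starts at time 0."""
    return {device: service_ms for device, service_ms in requests}


# Hypothetical workload: one stalled device, one healthy device.
requests = [("slow_lun", 30000), ("fast_lun", 10)]

serialized = completion_times_serialized(requests)
independent = completion_times_independent(requests)

print("serialized: ", serialized)
print("independent:", independent)
```

In the serialized case the request to `fast_lun` completes only after the 30-second stall on `slow_lun`, so a monitor with a short timeout would blame the healthy device as well. With independent queues, only `slow_lun` exceeds the timeout.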
