Hi,

On Fri, Feb 05, 2010 at 08:59:50AM +0100, Dominik Klein wrote:
> > But generally I believe this test case is invalid.
>
> I might agree here that this test case does not necessarily reproduce
> what happened on my production system (unfortunately I do not know for
> sure what happened there; the dev who caused this just tells me he used
> some stupid SQL statement and even executed it several times in
> parallel), but I do not think the test case is invalid. If there is an
> OOM situation on a node and therefore the local pacemaker can't do its
> job anymore (I base this statement on the various lrmd "cannot allocate
> memory" logs), this is a case the cluster should be able to recover from.
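(As an aside, purely for reproducing that kind of condition on a throwaway
test node: a crude allocator loop like the sketch below is usually enough
to starve a box of memory. It is only an illustration, not what was run on
your system, and on a default Linux overcommit setting the OOM killer will
typically pick off some process before malloc ever returns NULL.)

  /* memhog.c - crude memory-pressure generator, sketch only.
   * Allocates and touches memory in 64 MB chunks; run this on a
   * disposable test node only. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  #define CHUNK (64UL * 1024 * 1024)

  int main(void)
  {
      unsigned long mb = 0;

      for (;;) {
          char *p = malloc(CHUNK);
          if (p == NULL) {
              fprintf(stderr, "malloc failed after %lu MB\n", mb);
              break;
          }
          memset(p, 0xff, CHUNK);  /* touch the pages so they are really backed */
          mb += 64;
          fprintf(stderr, "allocated %lu MB\n", mb);
      }
      pause();  /* keep the memory pinned so the pressure persists */
      return 0;
  }
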
Yes, I'd say the cluster should be able to deal with a node which is in
just about any state. This time, or at least so it seems, the problem was
that corosync ran as a realtime process while crmd did not. Perhaps
corosync should watch the local processes, i.e. there should be some kind
of IPC heartbeat ...

> What I saw while doing this test was that the bad node discovered
> failures on the running ip and mysql resources, scheduled the recovery,
> but never managed to recover.
>
> I think it was lmb who suggested "periodic health-checks" on the
> pacemaker layer. If pacemaker on $good had periodically tried to talk to
> pacemaker on $bad, then it might have seen that $bad does not respond
> and might have done something about it. Just my theory though.

... or the higher-level heartbeats you suggest here. There is still,
however, the problem of false positives. At any rate, the user should
have a way to specify when a node is to be considered no longer usable.

Thanks,

Dejan

> Opinions?
>
> Regards
> Dominik

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker