I wrote:
>>> I've written up a brief document entitled "STONITH Deathmatch Explained
>>> (and Some Hints for Resource Agent Authors and Systems Engineers)":
>>>
>>>   http://ourobengr.com/ha
>>>
>>> ...

Then Dejan Muhamedagic wrote:
>> ...
>>
>> - in "Causes ..." you missed to mention split-brain (no
>>   communication channels working) and, at the same time, to
>>   stress how important it is to have redundant communications :)
>>
>> - even though you mention that in the title, I'd still move the
>>   resource agent intricacies into another document; they are all
>>   very valuable advice, but of no concern to cluster
>>   administrators; it's also good to keep the focus on our little
>>   problem; then you'll have to find other "Things You Didn't
>>   Think Of" (or just keep the title and leave the section empty:
>>   it is important; or insert another illustration)
>>
>> - devote more space/thought to the part on how to avoid a
>>   "deathmatch"; there's only a mention on chkconfig within
>>   "Debugging ..." (or one can also use the "poweroff" fencing
>>   operation); also, note that this occurs only in cases reboot
>>   doesn't fix a problem (e.g. split-brain)

And Joe Armstrong wrote:
> ...You might want to also add a possibility
> to avoid the situation.  Don't allow heartbeat to be started by
> the RC scripts.  Once a machine has been STONITH'd you can consider
> that it is untrustworthy until the admin inspects the reason for
> the failure and manually allows the node back into the cluster.
> This same thinking is why I hate auto-failback...

For the record, I've made a couple of minor updates based on the above:

- Split-brain is added as a cause of STONITH.

- There's now a small section "Avoiding STONITH Deathmatch", which
  mentions ensuring redundant comms, not starting the cluster at
  boot time, and trying stonith-action=poweroff.

- There's a mention of the document still being applicable if you're
  using OpenAIS instead of Heartbeat.

I haven't moved RA specifics into another document yet.
I have a nasty feeling this might result in something larger that
rattles on about the importance of ensuring correct semantics for all
operations (e.g.: the "start" op shouldn't return success if the
resource isn't really, truly, actually, completely started yet, or you
can wind up in one of those wacky
start[ok]->monitor[fail]->stop->start[ok]->monitor[fail]->stop cycles).

Tim

--
t...@wirejunkie.com
http://www.wirejunkie.com/

_______________________________________________
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
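
To make the point about "start" semantics concrete, here is a minimal
sketch of one way an agent could avoid the start[ok]->monitor[fail]
cycle: have "start" poll the same check that "monitor" uses and return
success only once that check passes. Everything named demo_* is
hypothetical, the "service" is simulated with a temporary file, and the
ten-try limit is an arbitrary assumption (a real agent might simply
loop and let the cluster's operation timeout decide). Only the numeric
return codes are taken from the OCF resource agent API.

```shell
#!/bin/sh
# OCF-style return codes (numeric values as defined by the OCF RA API).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical stand-in for a real service: it counts as "up" only once
# this file exists. A real agent would probe the actual service instead.
READY_FILE="${TMPDIR:-/tmp}/demo_resource.$$.ready"

demo_launch() {
    # Simulate a daemon that needs about a second before it is usable.
    ( sleep 1; : > "$READY_FILE" ) >/dev/null 2>&1 &
}

demo_monitor() {
    [ -e "$READY_FILE" ] && return $OCF_SUCCESS
    return $OCF_NOT_RUNNING
}

demo_start() {
    # Already running: report success without launching a second copy.
    demo_monitor && return $OCF_SUCCESS
    demo_launch
    # Do not report success yet. Poll the monitor check and return
    # success only when it passes (assumed limit: 10 tries, 1s apart).
    tries=0
    while [ "$tries" -lt 10 ]; do
        demo_monitor && return $OCF_SUCCESS
        sleep 1
        tries=$((tries + 1))
    done
    return $OCF_ERR_GENERIC
}

demo_stop() {
    rm -f "$READY_FILE"
    return $OCF_SUCCESS
}
```

If demo_start returned right after demo_launch instead, a monitor that
ran within the first second would see "not running", which is the sort
of spurious failure that feeds the stop/start loop described above.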