On 15/10/13 22:36, 邢立明 wrote:
> Hello dear Heartbeat team:
>
> Thank you very much for your reply,I still have the following two
> questions:
>
> 1、How to get the heart line disconnected, Heartbeat triggered by events?
> 2、Heartbeat is disconnected, how to set only one machine provides service?

Corosync uses the totem protocol for "heartbeat"-like monitoring of the
other node's health. A token is passed around to each node, the node
does some work (like acknowledging old messages and sending new ones),
and then it passes the token on to the next node. This goes around and
around all the time. Should a node not pass its token on after a short
timeout period, the token is declared lost, an error count goes up and a
new token is sent. If too many tokens are lost in a row, the node is
declared lost/dead.

Once the node is declared lost, the remaining nodes reform a new
cluster. If enough nodes are left to form quorum (simple majority), then
the new cluster will continue to provide services. In two-node clusters,
quorum is disabled so each node can work on its own.

Corosync itself only cares about cluster membership, message passing and
quorum (as of corosync v2+). What happens after the cluster reforms is
up to the cluster resource manager. In this case, that would be
pacemaker.

When pacemaker is told that membership has changed because a node died,
it looks to see what services might have been lost. Once it knows what
was lost, it looks at the rules it's been given and decides what to do.
Generally, the first thing it does is "stonith" the lost node. This is a
process where the lost node is either powered off, called power fencing,
or cut off from the network/storage, called fabric fencing. In either
case, the idea is to make sure that the lost node is in a known state.
If this is skipped, the node could recover later and try to provide
cluster services, not having realized that it was removed from the
cluster. This could cause problems ranging from confused switches to
corrupted data.

In two-node clusters, there is also a chance of a "split-brain". Because
quorum has to be disabled, it is possible for both nodes to think the
other node is dead and both try to provide the same cluster services.

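To make the token-loss and quorum rules above a bit more concrete, here
is a rough sketch in Python. It is only an illustration of the idea, not
how corosync is actually implemented. The names (has_quorum,
TokenMonitor, max_lost_in_a_row) and the threshold of 4 are made up for
this example.

```python
def has_quorum(live_nodes, total_nodes, two_node=False):
    """Simple-majority quorum check (hypothetical helper).

    two_node=True models a two-node cluster where quorum is
    disabled, so a single surviving node may keep running.
    """
    if two_node and total_nodes == 2:
        return live_nodes >= 1
    return live_nodes > total_nodes // 2


class TokenMonitor:
    """Counts consecutive lost tokens for one peer (illustrative only)."""

    def __init__(self, max_lost_in_a_row=4):
        # Assumed threshold; real clusters take this from their config.
        self.max_lost_in_a_row = max_lost_in_a_row
        self.lost_in_a_row = 0
        self.node_lost = False

    def token_passed(self):
        # The peer passed the token on in time, so reset the error count.
        self.lost_in_a_row = 0

    def token_timed_out(self):
        # The token was not passed on before the timeout expired.
        self.lost_in_a_row += 1
        if self.lost_in_a_row >= self.max_lost_in_a_row:
            self.node_lost = True
        return self.node_lost


if __name__ == "__main__":
    peer = TokenMonitor()
    for _ in range(4):
        peer.token_timed_out()
    print("peer declared lost:", peer.node_lost)
    print("3-node cluster, 2 left:", has_quorum(2, 3))
    print("2-node cluster, 1 left:", has_quorum(1, 2))
    print("2-node, quorum disabled:", has_quorum(1, 2, two_node=True))
```

Note the last two calls: with a plain majority rule, one node out of two
can never have quorum, which is why quorum gets disabled in two-node
clusters, and also why both nodes can end up believing they are allowed
to run services.
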
By using stonith, after the nodes break from one another (which could
happen with a network failure, for example), neither node will offer
services until one of them has stonith'ed the other. The faster node
will win and the slower node will shut down (or be isolated). The
survivor can then run services safely without risking a split-brain.

Once the dead node has been stonithed, pacemaker then decides what to do
with the lost services. Generally, this means "restart the service here
that had been running on the dead node". The details of this, though,
are decided by you when you configure the resources in pacemaker.

Hope this helps! It's pretty high-level and simplifies a few things, but
hopefully it helps you understand the mechanics. :)

digimer

PS - Please reply to the mailing list. Discussions like this can help
others by being public and stored in archives.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
