On Wed, Jan 16, 2013 at 2:01 PM, Digimer <[email protected]> wrote:

> Welcome to the addiction^h^h^hcommunity.
Hello Digimer! Thanks for the speedy reply. And the warning about the addiction. :)

> In case it matters; Red Hat supports corosync + cman + rgmanager in RHEL
> 6.x. Pacemaker is scheduled to replace cman/rgmanager in RHEL 7, but
> until then, it's in tech-preview only and doesn't get updates between
> y-stream updates.

That's useful to know, thanks. It's because of Red Hat's plan to move to Pacemaker that I'm comfortable investing the time in learning it now, rather than setting things up with cman/rgmanager. We won't be asking Red Hat to support this cluster.

> Also; Why 6.2 when 6.3 has been out for a long time?

It's just the DVD that I have to hand. :) But okay, I'll download 6.3, thanks for the nudge. :)

> The docs are right; Fencing is really really important. I'd go so far as
> to say that your cluster is fatally flawed without proper fencing.

Yes, I have seen the various subtle RED HAT WILL NOT SUPPORT CLUSTERS WITHOUT FENCING notes along the way in my reading. :)

> I personally avoid this by using Active/Passive bonds, with each link in
> a different switch, plus two fence devices. I put IPMI on the first
> switch and PDU fence devices on the other switch. This way, at least one
> fence device is available, no matter what.

Ah, so clearly Pacemaker is clever enough to cycle through all of the STONITH devices that apply to a node until one works. That makes sense, but I wasn't sure; most (if not all) of the examples I've seen mention only a single STONITH agent. That's a useful example, thanks.

> I understand this is outside your current resources, but I would still
> implement fencing.

Yes. I may be constrained to just a single network/switch, though, which is why I'm still dubious.

> You would have to lose the switch. The lose of one link alone, say the
> network cable or interface used by corosync on one node dies, the other
> node will still successfully reach the failed node's fence device and
> kill it.
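In case it helps anyone following along: the "cycle through fence devices until one works" behaviour you describe maps onto Pacemaker's fencing topology, where each node gets an ordered list of fence devices and the next level is tried only if the previous one fails. A minimal sketch for one node in crm shell syntax, where the node name, IP addresses, credentials and PDU outlet number are all made up for illustration and would need to match the real hardware:

```
# IPMI fence device for node1 (tried first)
primitive fence-node1-ipmi stonith:fence_ipmilan \
    params pcmk_host_list="node1" ipaddr="10.0.0.11" \
           login="admin" passwd="secret" action="reboot"

# Switched-PDU fence device for node1 (fallback), outlet 1
primitive fence-node1-pdu stonith:fence_apc_snmp \
    params pcmk_host_list="node1" ipaddr="10.0.0.21" port="1"

# Level 1 = IPMI; level 2 = PDU, used only if IPMI fencing fails
fencing_topology \
    node1: fence-node1-ipmi fence-node1-pdu
```

A second node would get its own pair of primitives and its own entry on the `fencing_topology` line.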
> This is why fence devices must exist outside of the target node.

Right. But with VMs, the node VMs and the hypervisor that does the killing are on the same network, aren't they? My concern about STONITH for my little cluster remains: with a single network, if it fails then I'm going to have a case of both nodes trying to kill the other ... and from what you told me earlier ("a failed fence action will leave the cluster hung"), that means the application will be left down on both.

What exactly do you mean by "cluster hung"? Will the nodes suicide, cycle with never-ending attempts to kill each other (before proceeding to run resources), or just go into some 'suspended' state?

> "Largely" read-only is not entirely read-only. If it was truly
> read-only, then why use a cluster at all?

True. That's why I described it as "largely read-only". Our situation is this: in normal operation the cluster application will be read/write. During an outage it will be used mainly to provide information for recovery purposes, i.e. mainly in a 'read' capacity. If any changes are made during the outage (by accident or otherwise) then we're content to throw those changes away if necessary. It's much more important for the application to be running during the outage itself.

> If your nodes are KVM VMs, you can use either 'fence_xvm' or 'fence_virsh'.
>
> ...
>
> Take a look at fence_xvm as well, it's multicast-based.

I gather, from reading the various man pages, that fence_xvm is a 'front end' to the fence_virtd daemon, using multicast to send fencing requests to that daemon. But I would have to configure fence_virtd to use its 'libvirt-qpid' back-end to actually send/receive QMF/AMQP requests via qpid to reboot the bad KVM node. The configuration necessary for that is currently a mystery to me.

Thank you for your help, I appreciate it!
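One possible simplification, if I'm reading the man pages correctly: fence_virtd also has a plain 'libvirt' back-end that talks to the local libvirtd directly, which would avoid the qpid/QMF machinery entirely when all the guest nodes live on a single hypervisor (libvirt-qpid seems to matter mainly for multi-host setups). A sketch of what /etc/fence_virt.conf might look like under that assumption; the bridge name here is a guess and has to match the bridge the guests are attached to, and the key file must be copied to /etc/cluster/ on each guest:

```
fence_virtd {
    listener = "multicast";
    backend = "libvirt";
}

listeners {
    multicast {
        # Defaults from the man page; interface is site-specific
        interface = "br0";
        address = "225.0.0.12";
        port = "1229";
        key_file = "/etc/cluster/fence_xvm.key";
    }
}

backends {
    libvirt {
        uri = "qemu:///system";
    }
}
```

With fence_virtd running on the host, `fence_xvm -o list` from inside a guest should then list the domains it can fence, which is a handy sanity check before wiring it into the cluster.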
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
