On Wed, 2013-02-06 at 11:24 +0100, Michael Schwartzkopff wrote:
> On Wednesday, 6 February 2013, 11:06:23, Roman Haefeli wrote:
> > Hi all
> >
> > We are running a pacemaker/corosync cluster with three nodes that
> > manages ~30 OpenVZ containers.
> >
> > We recently had the situation where one node fenced the other two
> > nodes (sbd is configured as a stonith device). In the system logs I was
> > able to spot the line where the node gives the death pill to the others.
> > However, I have difficulties finding the original reason for the
> > decision to fence the other nodes.
> >
> > Before I spam the list with logs, I'd like to ask if there is something
> > particular I should look for. Is there any advice on how to proceed
> > in such a situation?
> >
> > Many thanks in advance.
> >
> > Roman
>
> The reason should be in the logs above the fencing event. Something like:
>
> corosync: lost connection.
>
> If you want help from the list, paste your logs (the relevant parts
> only!) to pastebin and mail the link.
I wasn't sure which parts were relevant. In the meantime, however, we were able to explain the situation. As is often the case, it was a whole chain of circumstances that eventually led to the fencing of the nodes. Here is the whole story (for those interested):

Each node has two NICs that form a network bond. On this bond, two VLANs are configured: one for the DMZ and one for internal use (corosync ring and NFS traffic). Some containers have their virtual eth devices bridged to the internal VLAN for NFS access. Whenever a container starts or stops, its veth device joins or leaves the bridge on the internal VLAN.

This wouldn't generally be a problem, but it is with the Debian OpenVZ kernel: with this kernel, a bridge always uses the numerically smallest MAC address among its member ports as its own. When the container's MAC address is smaller than that of the physical NIC, the MAC address of the bridge changes whenever that container is started or stopped. This MAC flapping caused network lags on the bridge to which the corosync ring is also connected. That finally broke the corosync ring, which in turn led to the fencing of the two nodes.

Any one of these would have prevented it:
* a non-Debian OpenVZ kernel (different scheme for assigning MACs to a bridge)
* giving a higher MAC address to the container's veth
* running the corosync ring on its own VLAN

Roman

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
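PS: the "smallest MAC wins" behaviour described above can be illustrated with a small sketch. The MAC addresses below are made up for the example; for same-format MAC strings, a plain lexical sort reproduces the numeric comparison the bridge performs:

```shell
# Hypothetical MACs of the two bridge ports:
nic_mac="00:25:90:aa:bb:cc"    # physical NIC (bond VLAN interface)
veth_mac="00:18:51:11:22:33"   # OpenVZ container veth

# The bridge adopts the numerically smallest port MAC as its own.
# A lexical sort of equal-length hex strings gives the same ordering:
bridge_mac=$(printf '%s\n%s\n' "$nic_mac" "$veth_mac" | sort | head -n1)

echo "$bridge_mac"   # the veth's MAC wins, so the bridge MAC flaps
                     # every time the container starts or stops
```

If you cannot change the veth MACs, pinning the bridge MAC explicitly (e.g. with the `hwaddress ether` option in Debian's /etc/network/interfaces, or `ip link set dev <bridge> address <mac>`) also keeps it stable regardless of which ports join or leave.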
