First off, I'm going to preface this with the realization that what I am explaining makes no sense, doesn't follow normal logic and I'm not a complete idiot. I've beaten my head against a wall with this issue for two days, and have made no progress, yet we've had a couple of production system outages because of it.

Environment is a pair of IBM x-series systems in a DMZ connected to an ASA5500. Each IBM box has two interfaces in a mode=4 bond connected to two switches, which connect to the pri/sec firewall and are interconnected - poor man's redundancy, I suppose. Both boxes run RHEL6.3 and Pacemaker 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14. The ASA has an ARP table timeout of 4 hours.

There are about a dozen IPAddr resources in a group which are configured with meta ordered="false" collocated="false" - each is independent from a service perspective, but the group makes it easy to manage them. Each box runs LVS with mangle rules that assign fwm values for routing within LVS - for whatever reason, this still requires the IP to be on the box receiving the packet through LVS, even if the mangle rule is triggered.

We've had a couple of instances for two IPs in this configuration where Pacemaker (and syslog) indicate the IP is assigned to box 01, yet the firewall receives an ARP reply from box 02. Didn't believe it at first until we grabbed packets from a SPAN on the switches. Correct IP address in reply, MAC of one of the bonded interfaces on box 02, yet the IP isn't on it.
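
For anyone wanting to run the same check against their own capture, here is a rough sketch of the comparison we did by hand: decode each ARP reply and flag it when the sender MAC does not belong to the box that Pacemaker says holds the IP. It assumes untagged Ethernet frames (a SPAN port may add an 802.1Q tag, which this does not handle), and the function names, the 192.0.2.10 address and the MACs are made up for illustration, not taken from our environment.

```python
import struct

ARP_ETHERTYPE = 0x0806
ARP_REPLY = 2  # ARP opcode for a reply


def format_mac(raw):
    """Render 6 raw bytes as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{b:02x}" for b in raw)


def format_ip(raw):
    """Render 4 raw bytes as a dotted-quad IPv4 address."""
    return ".".join(str(b) for b in raw)


def parse_arp(frame):
    """Decode an untagged Ethernet frame carrying IPv4-over-Ethernet ARP."""
    if len(frame) < 42:
        raise ValueError("frame too short for Ethernet + ARP")
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ARP_ETHERTYPE:
        raise ValueError("not an ARP frame")
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not IPv4-over-Ethernet ARP")
    sha, spa, tha, tpa = struct.unpack("!6s4s6s4s", frame[22:42])
    return {
        "oper": oper,
        "sender_mac": format_mac(sha),
        "sender_ip": format_ip(spa),
        "target_mac": format_mac(tha),
        "target_ip": format_ip(tpa),
    }


def is_unexpected_reply(frame, expected_owner_macs):
    """True if an ARP reply claims an IP from a MAC outside its expected owner.

    expected_owner_macs maps an IP to the set of MACs of the node that the
    cluster reports as holding it (hypothetical input, built by hand).
    """
    arp = parse_arp(frame)
    if arp["oper"] != ARP_REPLY:
        return False
    expected = expected_owner_macs.get(arp["sender_ip"])
    if expected is None:
        return False  # not an IP we are tracking
    return arp["sender_mac"] not in expected
```

Feeding it the raw frames from the capture along with a hand-built map of IP to box 01's MACs would flag exactly the replies described above: right IP, wrong box.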

We've experienced both 01 answering ARP for an IP on 02, and 02 answering ARP for an IP on 01. Last night when we had the issue, an IP was on 02, 01 answered the ARP for it, and I tcpdumped on 01 and saw SYN packets coming in for that IP on 01 - makes sense, but doesn't explain why the box answered the ARP in the first place.

I realize this likely isn't a Pacemaker issue, but I was hoping someone else might have experienced a similar issue, or can at least point me in the right direction. We have a far more complex Pacemaker/LVS environment on our inside network (which isn't link-local to the ASA - goes through an inside router) which works flawlessly, so I'm open to the fact that something is totally screwed up in our DMZ.

Sorry that was long. :)

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
