I'm trying to understand an odd behavior during carp failover where one uplink goes numb until the demarc equipment is power cycled.
Consider the following:

    ISP1-demarc     ISP2-demarc
         |               |
    SW1 (Net1)      SW2 (Net2) ----- C
         |\             /|
         |       X       |
         |/             \|
       FW-A      -      FW-B
         |\             /|
         |       X       |
         |/             \|
    SW3 (Net3)      SW4 (Net4)
     (no NAT)         (NAT)
                         |
                         H4

ISP1-demarc and ISP2-demarc are the respective ISPs' equipment (outside of my control, other than power cycling them). SWn are all unmanaged switches. FW-A, FW-B, and C are all OpenBSD boxes. FW-A and FW-B, in particular, are running 5.7-STABLE in a master/slave carp configuration.

Things are set up so that traffic to/from Net3 is sent via ISP1 (no NAT) and traffic to/from Net4 is sent via ISP2 (using NAT on FW-A and FW-B). H4 is a host sitting on Net4 in private address space. Static IPs are used throughout, including on both the SW1 and SW2 subnets. FW-n are routers, not bridges. Pfsync is running via a crossover cable between FW-A and FW-B.

Behavior:

In normal operations everything works as expected. During a carp failover, everything for Net3 via ISP1 also works as expected. However, during a failover I lose connectivity on Net4, in a qualified manner (see below), until ISP2-demarc is power cycled.

The obvious first answer is that ISP2-demarc (which is a Motorola cable modem) probably has a limited number of MAC slots available to it. However, that doesn't seem quite right. More details ...

Before failover, I set up a 'ping -n' running on H4 and going to a host elsewhere on the Internet (call it EXT). I also set up a 'ping -n' on C going to the carp IP of FW-A and FW-B on Net2 (let's call that Carp2).

Now comes the weird part. If I shut down the master, FW-A, I see the following:

1. the running pings from C to Carp2 continue to work until ^C
2. the running pings from H4 to EXT continue to work until ^C
3. a concurrent newly created ping from C to Carp2 fails
4. a concurrent newly created ping from H4 to EXT fails
5. all other outbound traffic from Net4 fails (this is just a generalization of (4))

If I power cycle ISP2-demarc, sanity returns.
That is, until FW-A comes back up and FW-B is demoted again. Then I get the same type of failures until ISP2-demarc is power cycled again. Power cycling switch SW2 instead of ISP2-demarc does not affect the outcome.

Ok, so how about the MACs? On Net2 we have the following MACs:

- ISP2-demarc-mac (on ISP2-demarc)
- C-mac (on C)
- FW-A-mac (physical MAC on FW-A)
- FW-B-mac (physical MAC on FW-B)
- Carp2-mac (the virtual MAC used by Carp2, which I've verified to be the same for both FW-A and FW-B when they are respectively running as master)

One wart here, and a difference between Net1 and Net2, is that on Net1 both firewalls have their own IPs in addition to the Carp1 IP. However, on Net2 both firewalls' hostname.if files contain only the 'up' keyword; no IP is used on that network until the machine becomes the carp master. So that means that when H4 is pinging EXT, the pings are being NAT'd to use the Carp2 IP. Therefore I wouldn't expect a failover to cause the modem's MAC slots to overflow.

But the *really* weird part is what is happening with C; why would C not be able to ping Carp2 until ISP2-demarc is power-cycled, especially with SW2 isolating the latter from Carp2 and C?

And the story with C gets better. If I set up a tcpdump on FW-B's Net2 interface, I see the following sequence of events:

- before killing FW-A, I see arp requests and CARPv2 advertisements from FW-A (based on the skew), and that's about it (as expected)
- upon shutting down FW-A, I see a CARPv2 packet from FW-B, and then start seeing the ping request/reply pairs coming in from C (as expected)
- upon killing and restarting C's ping to Carp2, I no longer see the response on C, but I'm seeing both the request and response in FW-B's tcpdump. On C, I see only the echo response. (NOT expected)

Does this last bit point the finger at SW2 being the culprit (perhaps not forwarding packets to the appropriate port), even though power cycling SW2 isn't sufficient to fix the problem?

Any other thoughts?
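
P.S. For anyone wanting to sanity-check the Carp2-mac entry in the MAC list above: as far as I know, carp derives its IPv4 virtual MAC from the vhid alone, using the VRRP prefix 00:00:5e:00:01 with the vhid as the last octet. That is why it should be identical on FW-A and FW-B. Below is a minimal sketch of that derivation. The function name and the vhid value of 2 are made up for illustration; my actual vhid isn't given in this mail.

```python
def carp_virtual_mac(vhid: int) -> str:
    """Return the IPv4 virtual MAC a carp group would use for a given vhid.

    Assumption: the MAC is the VRRP-style prefix 00:00:5e:00:01 followed
    by the vhid as a single octet.
    """
    if not 1 <= vhid <= 255:
        raise ValueError("vhid must be in the range 1..255")
    return "00:00:5e:00:01:%02x" % vhid


if __name__ == "__main__":
    # vhid 2 is a hypothetical value, used only as an example.
    print(carp_virtual_mac(2))
```

If the value shown by ifconfig on whichever box is currently master doesn't match this pattern, then something other than carp is answering for that IP.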
Devin