I should have mentioned that the situation described below occurred with
the pfsync 'defer' option enabled.
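
For reference, a sketch of how that option is turned on. The first
command is the one given in pfsync(4); the hostname.pfsync0 line is an
assumed persistent equivalent (with em1 as the sync device, going by the
pf.conf comment quoted below), not a copy of the file actually in use:

-------------------------
# ifconfig pfsync0 defer

# cat /etc/hostname.pfsync0
up syncdev em1 defer
-------------------------
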

I think you are right that the problem lies with the TCP sequence number
checks. I tried the PF rule with 'keep state (sloppy)' and that "fixes"
the problem (or, better put, makes the symptoms disappear). It seems to
be a highly discouraged option, and I don't fully understand the
security implications. I would appreciate any insights anyone could
offer on that.

Could it be that, since my upstream has a strong preference for FW1,
everything goes fine for a while (about 30 seconds in most tests), and
then upstream sends a couple of packets (maybe even a single one)
directly to FW2, which botches the sequence number check on FW1? Even
with this theory I still don't understand why PF has no problem
accepting traffic on the outside interface (vlan1604) and only starts to
have a problem when trying to send it out on the inside interface
(vlan1003).

The cable is actually just a normal cable; old habits, I guess... ;-)

> On 20 October 2016 at 20:21, Stuart Henderson <s...@spacehopper.org> wrote:
>
> For this config where you can't predict which firewall receives the
> packet from upstream, and especially if you end up with packets from
> your "inside" machine going through a different firewall as the one
> receiving external packets, you can run into problems with the TCP
> sequence number checking that PF (and some other stateful firewalls)
> does on TCP packets.
>
> Try "ifconfig pfsync0 defer" first - from pfsync(4):
>
> Where more than one firewall might actively handle packets, e.g. with
> certain ospfd(8), bgpd(8) or carp(4) configurations, it is beneficial to
> defer transmission of the initial packet of a connection. The pfsync state
> insert message is sent immediately; the packet is queued until either this
> message is acknowledged by another system, or a timeout has expired. This
> behaviour is enabled with the defer parameter to ifconfig(8).
>
> On 2016-10-20, Jasper Siepkes <siep...@serviceplanet.nl> wrote:
> > Hi list!
> >
> > I've run into a situation with PF which I don't quite understand.
> >
> > The situation is as follows; I have 2 OpenBSD firewalls connected to an
> > upstream provider which forwards traffic to us via equal cost multi
> > path routing (ECMP). The firewalls are connected via a crossover cable
>
> Incidentally, there's no need for crossover cables with gigabit nics.
>
> > over which pfsync is configured. On the inside the firewalls are each
> > connected with 2 cables (with LACP) to 2 different switches which
> > are in an MLAG configuration (so these 2 switches function as 1 switch).
> > The OpenBSD firewalls are running OpenBSD 6.0 with all patches applied.
> >
> > It looks like this (public IPs changed):
> >
> >                OUTSIDE / UPSTREAM
> >
> >      GW: 192.168.116.21   GW: 192.168.216.21
> >                 +               ^
> >                 |               |
> >        vlan1604 |               | vlan2604
> >  192.168.116.22 |               | 192.168.216.22
> >                 |               |
> >             +---v---+      +----+--+
> >             | FW 1  +------+ FW 2  |
> >             +---+---+      +----+--+
> >        vlan1003 |               ^ vlan1003
> >    17.214.19.49 |               | 17.214.19.50
> >                 +---------------+
> >
> >                      INSIDE
> >
> > Now on both firewalls I have this really simple ruleset:
> >
> > -------------------------
> > # cat /etc/pf.conf
> >
> > set skip on lo0
> > # Interface connected with crossover cable to other firewall for
> > # pfsync.
> > set skip on em1
> >
> > block log
> >
> > pass log quick proto tcp to port 22
> > -------------------------
> >
> > Which results in the following PF rules:
> > -------------------------
> > # pfctl -sr
> >
> > block drop log all
> > pass log quick proto tcp from any to any port = 22 flags S/SA
> > -------------------------
> >
> > Now when I SSH from the outside world to 17.214.19.50 the traffic flows
> > as indicated in the diagram (although it's ECMP, upstream seems to
> > prefer FW 1 so traffic always ends up there):
> >
> > [Internet] Me (62.187.45.178)
> >      |
> >      V
> > [FW1]vlan1604
> >      |
> >      V
> > [FW1]vlan1003
> >      |
> >      V
> > [FW2]vlan1003
> >      |
> >      V
> > [FW2]vlan2604
> >      |
> >      V
> > [Internet] Me
> >
> > And this works. However, after about 30 seconds I lose the connection
> > to the 17.214.19.50 host because PF can't match the traffic on FW1
> > vlan1003 to the established state. I'm typing random stuff into the
> > SSH session to keep it active and then it just hangs. This looks like
> > this (public IPs changed):
> >
> > -------------------------
> > # tcpdump -nettti pflog0 port 22 and host 17.214.19.50
> > tcpdump: WARNING: snaplen raised from 116 to 160
> > tcpdump: listening on pflog0, link-type PFLOG
> > Oct 20 10:30:11.299997 rule 1/(match) pass in on vlan1604:
> >     62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0)
> >     win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
> > Oct 20 10:30:11.300026 rule 1/(match) pass out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0)
> >     win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
> >
> > Oct 20 10:30:44.330002 rule 0/(match) block out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: P 4112740387:4112740427(40)
> >     ack 2507834833 win 594 <nop,nop,timestamp 6484253 2782905123> (DF)
> >     [tos 0x10]
> > Oct 20 10:30:44.425886 rule 0/(match) block out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594
> >     <nop,nop,timestamp 6484349 2782905123> (DF) [tos 0x10]
> > Oct 20 10:30:44.436021 rule 0/(match) block out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594
> >     <nop,nop,timestamp 6484359 2782905123> (DF) [tos 0x10]
> > Oct 20 10:30:44.514107 rule 0/(match) block out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: P 80:120(40) ack 1 win 594
> >     <nop,nop,timestamp 6484437 2782905123> (DF) [tos 0x10]
> > Oct 20 10:30:44.618079 rule 0/(match) block out on vlan1003:
> >     62.187.45.178.64072 > 17.214.19.50.22: P 120:160(40) ack 1 win 594
> >     <nop,nop,timestamp 6484541 2782905123> (DF) [tos 0x10]
> > -------------------------
> >
> > It seems that PF all of a sudden doesn't see the SSH traffic as part
> > of the established connection anymore. The state table of PF shows that
> > the state was correctly added to the state table and synced between
> > the firewalls, and it is also still there:
> >
> > -----------------------------------
> > # pfctl -ss
> >
> > all carp 17.214.19.49 -> 17.214.19.50 SINGLE:NO_TRAFFIC
> > all carp 10.100.0.2 -> 10.100.0.3 SINGLE:NO_TRAFFIC
> > all carp 10.100.2.2 -> 10.100.2.3 SINGLE:NO_TRAFFIC
> > all tcp 17.214.19.49:22 <- 62.187.45.178:65149 ESTABLISHED:CLOSING
> > all tcp 17.214.19.49:22 <- 62.187.45.178:58883 ESTABLISHED:CLOSING
> > all tcp 17.214.19.49:22 <- 62.187.45.178:59505 ESTABLISHED:ESTABLISHED
> > all tcp 17.214.19.49:22 <- 62.187.45.178:63889 ESTABLISHED:FIN_WAIT_2
> > all tcp 17.214.19.49:22 <- 62.187.45.178:63963 ESTABLISHED:ESTABLISHED
> > all tcp 17.214.19.49:22 <- 62.187.45.178:63235 ESTABLISHED:ESTABLISHED
> > all tcp 17.214.19.50:22 <- 62.187.45.178:54705 FIN_WAIT_2:FIN_WAIT_2
> > all tcp 17.214.19.50:22 <- 62.187.45.178:64072 ESTABLISHED:ESTABLISHED
> > all tcp 17.214.19.50:22 <- 119.249.54.68:38527 TIME_WAIT:TIME_WAIT
> > all tcp 17.214.19.49:22 <- 221.194.47.224:60327 TIME_WAIT:TIME_WAIT
> > all tcp 17.214.19.50:22 <- 221.194.47.224:53897 TIME_WAIT:TIME_WAIT
> > -----------------------------------
> >
> > The relevant PF state here is (as identified in the pflog tcpdump
> > as the SSH session that disconnected):
> >
> > all tcp 17.214.19.50:22 <- 62.187.45.178:64072 ESTABLISHED:ESTABLISHED
> >
> > which seems okay.
> >
> > What I also find odd is that PF allows the packet to
> > traverse the vlan1604 (external) interface and then decides that it
> > can't traverse the vlan1003 (internal) interface. Why isn't it a
> > problem for the vlan1604 interface? It should be noted that the
> > vlan1003 interface sits on a trunk interface (trunk0, configured as
> > LACP). I don't see how, but this might be related.
> >
> > I'm at a loss here as I really can't explain the PF behavior I'm
> > seeing. Am I missing something? Could this be a bug?
> >
> > Regards,
> >
> > Jasper