In a setup like this, where you can't predict which firewall will
receive a packet from upstream, and especially if packets from your
"inside" machine go out through a different firewall than the one
receiving the external packets, you can run into problems with the TCP
sequence number checking that PF (and some other stateful firewalls)
does on TCP packets.
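
To check whether that is what you are hitting, the state-mismatch
counter and the verbose state listing are a reasonable place to start.
A rough sketch; the address is the one from your mail, and the grep
context length is arbitrary:

```shell
# The state-mismatch counter goes up when a packet belongs to a known
# state but fails PF's state checks (e.g. the sequence window check).
pfctl -si | grep state-mismatch

# Very verbose state listing; for TCP this includes the sequence
# numbers PF tracks for each side, plus the state's age and expiry.
pfctl -vvss | grep -A 3 '17.214.19.50:22'
```

Run both on each firewall while a session is stalling and compare the
results. If the counter does not move, the cause may lie elsewhere.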

Try "ifconfig pfsync0 defer" first - from pfsync(4):

   Where more than one firewall might actively handle packets, e.g. with
   certain ospfd(8), bgpd(8) or carp(4) configurations, it is beneficial to
   defer transmission of the initial packet of a connection.  The pfsync state
   insert message is sent immediately; the packet is queued until either this
   message is acknowledged by another system, or a timeout has expired.  This
   behaviour is enabled with the defer parameter to ifconfig(8).
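
In practice that might look like the following. This is only a sketch:
em1 as the syncdev is an assumption based on the pf.conf comment quoted
below, so substitute whatever your pfsync0 is actually configured with,
and do the same on both firewalls.

```shell
# Turn deferral on for the running system (as root).
ifconfig pfsync0 defer

# Look at the interface settings afterwards to confirm.
ifconfig pfsync0

# To keep it across reboots, add "defer" to the existing line in
# /etc/hostname.pfsync0, e.g. (assumed contents):
#   up syncdev em1 defer

# To back the change out again later:
#   ifconfig pfsync0 -defer
```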




On 2016-10-20, Jasper Siepkes <siep...@serviceplanet.nl> wrote:
> Hi list!
>
> I've run into a situation with PF which I don't quite understand.
>
> The situation is as follows: I have 2 OpenBSD firewalls connected to an
> upstream provider which forwards traffic to us via equal cost multi
> path routing (ECMP). The firewalls are connected via a crossover cable

Incidentally, there's no need for crossover cables with gigabit NICs;
in practice they work out the pairs themselves (auto MDI-X).

> over which pfsync is configured. On the inside the firewalls are each
> connected with 2 cables (with LACP) to 2 different switches which 
> are in an MLAG configuration (so these 2 switches function as 1 switch).
> The OpenBSD firewalls are running OpenBSD 6.0 with all patches applied.
>
> It looks like this (public IPs changed):
>
>                  OUTSIDE / UPSTREAM                
>
>   GW: 192.168.116.21      GW: 192.168.216.21
>                +               ^
>                |               |
>       vlan1604 |               | vlan2604
> 192.168.116.22 |               | 192.168.216.22
>                |               |
>            +---v---+      +----+--+
>            | FW 1  +------+ FW 2  |
>            +---+---+      +----+--+
>      vlan1003  |               ^   vlan1003
>  17.214.19.49  |               |   17.214.19.50
>                +---------------+
>
>                     INSIDE
>
> Now on both firewalls I have this really simple ruleset:
>
> -------------------------
> # cat /etc/pf.conf
>
> set skip on lo0
> # Interface connected with crossover cable to other firewall for
> # pfsync.
> set skip on em1
>
> block log
>
> pass log quick proto tcp to port 22
> -------------------------
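
If defer alone doesn't cure it, pf.conf(5) also documents a "sloppy"
state option, which skips the sequence number checks and is intended
for cases where a firewall doesn't see every packet of a connection.
It weakens the protection that state tracking normally gives you, so
treat this as a fallback to experiment with rather than a
recommendation. A sketch based on your pass rule:

```shell
# In /etc/pf.conf, change the existing pass rule to (sketch):
#   pass log quick proto tcp to port 22 keep state (sloppy)

# Syntax-check the file without loading it, then load it for real.
pfctl -nf /etc/pf.conf
pfctl -f /etc/pf.conf
```
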
>
> Which results in the following PF rules:
> -------------------------
> # pfctl -sr
>
> block drop log all
> pass log quick proto tcp from any to any port = 22 flags S/SA
> -------------------------
>
> Now when I SSH from the outside world to 17.214.19.50 the traffic flows
> as indicated in the diagram (although it's ECMP, upstream seems to
> prefer FW 1, so traffic always ends up there):
>
> [Internet] Me (62.187.45.178)
>      |
>      V
> [FW1]vlan1604 
>      |
>      V
> [FW1]vlan1003
>      |
>      V
> [FW2]vlan1003 
>      |
>      V
> [FW2]vlan2604 
>      |
>      V
> [Internet] Me 
>
> And this works. However, after about 30 seconds I lose the connection
> to the 17.214.19.50 host because PF can't match the traffic on FW1
> vlan1003 to the established state. I'm typing random stuff into the SSH
> session to keep it active and then it just hangs. It looks like this
> (public IPs changed):
>
> -------------------------
> # tcpdump -nettti pflog0 port 22 and host 17.214.19.50 
> tcpdump: WARNING: snaplen raised from 116 to 160
> tcpdump: listening on pflog0, link-type PFLOG
> Oct 20 10:30:11.299997 rule 1/(match) pass in on vlan1604:
>     62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0)
>     win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
> Oct 20 10:30:11.300026 rule 1/(match) pass out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: S 4112726507:4112726507(0)
>     win 29200 <mss 1460,sackOK,timestamp 6451222 0,nop,wscale 7> (DF)
>
> Oct 20 10:30:44.330002 rule 0/(match) block out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: P 4112740387:4112740427(40)
>     ack 2507834833 win 594 <nop,nop,timestamp 6484253 2782905123> (DF)
>     [tos 0x10]
> Oct 20 10:30:44.425886 rule 0/(match) block out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594
>     <nop,nop,timestamp 6484349 2782905123> (DF) [tos 0x10]
> Oct 20 10:30:44.436021 rule 0/(match) block out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: P 40:80(40) ack 1 win 594
>     <nop,nop,timestamp 6484359 2782905123> (DF) [tos 0x10]
> Oct 20 10:30:44.514107 rule 0/(match) block out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: P 80:120(40) ack 1 win 594
>     <nop,nop,timestamp 6484437 2782905123> (DF) [tos 0x10]
> Oct 20 10:30:44.618079 rule 0/(match) block out on vlan1003:
>     62.187.45.178.64072 > 17.214.19.50.22: P 120:160(40) ack 1 win 594
>     <nop,nop,timestamp 6484541 2782905123> (DF) [tos 0x10]
> -------------------------
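
For orientation, the numbers in that log work out as follows. This is
just arithmetic on values copied from the output above (the variable
names are made up for the example); it shows how much client data had
already passed FW1 on this connection before the first block, i.e. the
state was being matched for a while before it stopped.

```shell
# Values copied from the pflog output above.
isn=4112726507       # sequence number of the SYN logged at 10:30:11
blocked=4112740387   # first sequence number of the segment blocked at 10:30:44

# The SYN itself consumes one sequence number, so data starts at isn + 1.
offset=$((blocked - isn - 1))

# Whole seconds between the two log entries (both within 10:30).
elapsed=$((44 - 11))

echo "client bytes sent before the first block: $offset"
echo "seconds between SYN and first block: $elapsed"
```
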
>
> It seems that PF all of a sudden doesn't see the SSH traffic as part
> of the established connection anymore. The PF state table shows that
> the state was correctly added and synced between the firewalls, and
> that it is still there:
>
> -----------------------------------
> # pfctl -ss
>
> all carp 17.214.19.49 -> 17.214.19.50           SINGLE:NO_TRAFFIC
> all carp 10.100.0.2 -> 10.100.0.3               SINGLE:NO_TRAFFIC
> all carp 10.100.2.2 -> 10.100.2.3               SINGLE:NO_TRAFFIC
> all tcp 17.214.19.49:22 <- 62.187.45.178:65149  ESTABLISHED:CLOSING
> all tcp 17.214.19.49:22 <- 62.187.45.178:58883  ESTABLISHED:CLOSING
> all tcp 17.214.19.49:22 <- 62.187.45.178:59505  ESTABLISHED:ESTABLISHED
> all tcp 17.214.19.49:22 <- 62.187.45.178:63889  ESTABLISHED:FIN_WAIT_2
> all tcp 17.214.19.49:22 <- 62.187.45.178:63963  ESTABLISHED:ESTABLISHED
> all tcp 17.214.19.49:22 <- 62.187.45.178:63235  ESTABLISHED:ESTABLISHED
> all tcp 17.214.19.50:22 <- 62.187.45.178:54705  FIN_WAIT_2:FIN_WAIT_2
> all tcp 17.214.19.50:22 <- 62.187.45.178:64072  ESTABLISHED:ESTABLISHED
> all tcp 17.214.19.50:22 <- 119.249.54.68:38527  TIME_WAIT:TIME_WAIT
> all tcp 17.214.19.49:22 <- 221.194.47.224:60327 TIME_WAIT:TIME_WAIT
> all tcp 17.214.19.50:22 <- 221.194.47.224:53897 TIME_WAIT:TIME_WAIT
> -----------------------------------
>
> The relevant PF state here is (as identified in the pflog tcpdump
> as the SSH session that disconnected):
>
> all tcp 17.214.19.50:22 <- 62.187.45.178:64072  ESTABLISHED:ESTABLISHED
>
> which seems okay. 
>
> What I also find odd is that PF allows the packet to
> traverse the vlan1604 (external) interface and then decides that it 
> can't traverse the vlan1003 (internal) interface. Why isn't it a
> problem for the vlan1604 interface? It should be noted that the
> vlan1003 interface sits on a trunk interface (trunk0, configured as
> LACP). I don't see how, but this might be related.
>
> I'm at a loss here as I really can't explain the PF behavior I'm
> seeing. Am I missing something? Could this be a bug?
>
> Regards,
>
> Jasper
