Firstly: this problem never occurred even once in ~6 months of operation with pf(4) disabled, and it never occurred in ~2 months of operation with pf(4) enabled, an accept-all ruleset and no pfsync. Now, with pfsync configured, it's happening about once a week.

My setup is complex enough that I expect I'm hitting some odd corner case... apologies for the dense description.

I've got two OpenBSD 5.6-STABLE (courtesy of M:Tier packages, thanks guys!) BGP routers running carp & pfsync between them for some of the "internal" interfaces. Yes, I probably should have done this using two routers, two firewalls & ECMP, but I didn't have enough hardware, so I collapsed the firewall function onto the routers and used CARP instead of ECMP for outbound traffic.

The problem is that one or the other router will start dropping traffic "randomly". Never both at the same time (so far). The first symptom I notice is usually that DNS lookups suddenly start to fail. Rebooting the problem router always fixes the issue... but sometimes I pick the wrong router to reboot and have to reboot both. This is, of course, a crappy solution in the first place - the issue isn't that I'm not sure which one to reboot, it's that I have to reboot it at all.

I *believe* the dropped packets are inbound replies; I run two BGP sessions with my upstream, so traffic is stochastically (I think) split between the two routers.

There's enough traffic running through them that leaving tcpdump(8) running on both is not feasible. The pf(4) ruleset is trivial, and should never be able to block DNS traffic to or from my workstation - the rule that hits (or should, anyway) is "pass all flags any keep state (sloppy, pflow) allow-opts"!

If it matters, pfsync0 and all the routing interfaces are vlan(4) interfaces on top of trunk(4) LACP interfaces. The pfsync0/vlan8 is a dedicated VLAN that only exists on these two trunk ports, and I'm using private IPv4 address space with syncpeer to set up pfsync0.
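
For concreteness, the pfsync side looks roughly like the sketch below. Only pfsync0, vlan8, LACP and the use of syncpeer are as described above; the trunk0/em0/em1 names and the 10.8.8.x addresses are placeholders I've made up for illustration, not my real values.

```
# /etc/hostname.trunk0  (trunk0, em0 and em1 are placeholder names)
trunkproto lacp trunkport em0 trunkport em1 up

# /etc/hostname.vlan8  (dedicated sync VLAN; placeholder private address)
inet 10.8.8.1 255.255.255.252 NONE vlan 8 vlandev trunk0

# /etc/hostname.pfsync0  (peer address is a placeholder; the other router mirrors this)
up syncdev vlan8 syncpeer 10.8.8.2
```

With syncpeer set, pfsync state updates go unicast to the other router rather than being multicast on the sync VLAN.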


None of my customers have complained yet, but since it affects my own workstation, I must assume it's only a matter of time...

I don't see anything unusual in /var/log or dmesg, nor in netstat -s output, but I'm not sure I know what to look for.

With apologies for suppressing part of the data, the *entire* pf ruleset ("pfctl -s rules") on each router is:
pass all flags any keep state (sloppy, pflow) allow-opts
block drop inet from any to 198.xxx.xxx.xxx/28
pass inet from 198.yyy.yyy.yyy/25 to 198.xxx.xxx.xxx/28 flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)

My workstation (where I see the effect of this problem most immediately) and my local DNS resolvers all live in that 198.yyy.yyy.yyy/25 subnet; I don't know whether this is relevant.


So... at this point, what problem indicators (counters? log messages?) should I be looking at or monitoring?

--
-Adam Thompson
 athom...@athompso.net
 +1 (204) 291-7950 - cell
 +1 (204) 489-6515 - fax
