where to start troubleshooting pfsync?

Adam Thompson Fri, 13 Feb 2015 09:59:21 -0800

Firstly: this problem never occurred even once in ~6 months of operation with pf(4) disabled; it never occurred in ~2 months of operation with pf(4) enabled, an accept-all ruleset and no pfsync, and now with pfsync configured it's happening about once a week.

My setup is complex enough that I expect I'm hitting some odd corner case... apologies for the dense description.

I've got two OpenBSD 5.6-STABLE (courtesy of M:Tier packages, thanks guys!) BGP routers running carp & pfsync between them for some of the "internal" interfaces. Yes, I probably should have done this using two routers, two firewalls & ECMP, but I didn't have enough hardware, so I collapsed the firewall function onto the routers and used CARP instead of ECMP for outbound traffic.

The problem is that one or the other router will start dropping traffic "randomly". Never both at the same time (so far). The first symptom I notice is usually that DNS lookups suddenly start to fail. Rebooting the problem router always fixes the issue... but sometimes I pick the wrong router to reboot and have to reboot both. This is, of course, a crappy solution in the first place - the issue isn't that I'm not sure which one to reboot, it's that I have to reboot it at all.

I *believe* the dropped packets are inbound replies; I run two BGP sessions with my upstream, so traffic is stochastically (I think) split between the two routers.

There's enough traffic running through them that leaving tcpdump(8) running on both is not feasible. The pf(4) ruleset is trivial, and should never be able to block DNS traffic to or from my workstation - the rule that hits (or should, anyway) is "pass all flags any keep state (sloppy, pflow) allow-opts"!

If it matters, pfsync0 and all the routing interfaces are vlan(4) interfaces on top of trunk(4) LACP interfaces. The pfsync0/vlan8 is a dedicated VLAN that only exists on these two trunk ports, and I'm using private IPv4 address space with syncpeer to set up pfsync0.

This problem never occurred even once in many months of operation with pf(4) disabled; it never occurred in about two months of operation with pf(4) enabled, an accept-all ruleset and no pfsync, and now with pfsync configured it's happening about once a week.

None of my customers have complained yet, but since it affects my own workstation, I must assume it's only a matter of time...

I don't see anything unusual in /var/log or dmesg, and I don't see anything unusual in netstat -s output either - but I'm not sure I know what to look for.

With apologies for suppressing part of the data, the *entire* pf ruleset ("pfctl -s rules") on each router is:

pass all flags any keep state (sloppy, pflow) allow-opts
block drop inet from any to 198.xxx.xxx.xxx/28
pass inet from 198.yyy.yyy.yyy/25 to 198.xxx.xxx.xxx/28 flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)
pass log (matches) inet proto tcp from any to 198.xxx.xxx.xxx port = aaaa flags S/SA keep state (sloppy, pflow)

My workstation - where I see the effect of this problem most immediately - and my local DNS resolvers - all live in that 198.yyy.yyy.yyy/25 subnet; I don't know if this is relevant or not.

So... at this point, what problem indicators (counters? log messages?) should I be looking at or monitoring?
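
For illustration only, here is a minimal sketch of how counters of this kind could be captured on each router and compared between a "good" and a "bad" moment, so that only the values that moved stand out. The command list is an assumption (pfctl and netstat statistics for pf, pfsync and carp); which counters actually matter for this problem is not established here, and the helper names snapshot and changed_lines are made up for the example.

```python
import difflib
import subprocess

# Assumed set of statistics commands; adjust as needed. Nothing here is
# known to pinpoint the fault, it only makes before/after comparison easy.
COMMANDS = [
    ["pfctl", "-s", "info"],            # pf state table and counters
    ["pfctl", "-s", "memory"],          # pf memory pool limits
    ["netstat", "-s", "-p", "pfsync"],  # pfsync protocol statistics
    ["netstat", "-s", "-p", "carp"],    # carp protocol statistics
]


def snapshot(commands=COMMANDS):
    """Run each command and return {command string: captured output}."""
    result = {}
    for cmd in commands:
        name = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
            result[name] = proc.stdout + proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            # Record the failure instead of aborting the whole snapshot.
            result[name] = f"<failed: {exc}>\n"
    return result


def changed_lines(before, after):
    """Return unified-diff lines for every command whose output changed."""
    out = []
    for name, old in before.items():
        new = after.get(name, "")
        out.extend(difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            "before: " + name, "after: " + name,
            n=0, lineterm=""))
    return out
```

Taking one snapshot() while traffic is flowing normally and another while lookups are failing, then printing changed_lines(first, second), would show which counters changed in between. Running the same thing on both routers might also help decide which one is misbehaving before rebooting anything.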


--
-Adam Thompson
 athom...@athompso.net
 +1 (204) 291-7950 - cell
 +1 (204) 489-6515 - fax
