For over a year now we have been seeing instability on our firewalls that seems to kick in when our state tables approach 200K entries. The number varies, but it's a safe bet that once we cross the 180K threshold, the machines start getting cranky. At 200K+, performance visibly degrades, often leading to a complete lockup of the network stack or a spontaneous reboot.

The symptoms are varied, but the earliest indication is that interactive response at the shell prompt gets stuttery. As it progresses, network traffic stops flowing and the network stack eventually just locks up. We also see the occasional:

    pmap_unwire: wiring for pmap 0xfffffd8e8a946528 va 0xc000d4d000 didn't change!

logged on the console.

The machines are not hurting for resources:

    load averages:  1.06,  1.12,  1.12    xxxxxxxxxxxxxxxxxxxxxxx 17:53:08
    48 processes: 47 idle, 1 on processor                         up  6:06
    CPU0:  0.0% user,  0.0% nice, 22.0% sys,  0.8% spin,  5.8% intr, 71.5% idle
    CPU1:  0.0% user,  0.0% nice, 27.7% sys,  1.2% spin,  5.2% intr, 65.9% idle
    CPU2:  0.0% user,  0.0% nice, 40.5% sys,  0.6% spin,  4.4% intr, 54.5% idle
    CPU3:  0.0% user,  0.0% nice,  1.4% sys,  0.0% spin,  6.8% intr, 91.8% idle
    Memory: Real: 110M/1722M act/tot Free: 60G Cache: 851M Swap: 0K/21G

Our pf settings are pretty simple:

    set optimization normal
    set ruleset-optimization basic
    set limit states 400000
    set limit src-nodes 100000
    set loginterface none
    set skip on lo
    set reassemble yes
    # Reduce the number of state table entries in FIN_WAIT_2 state.
    set timeout tcp.finwait 4

(Note that the limit states 400000 is a holdover from the 6.x days, where the default value was too small to handle our load.)

vmstat reports this for pf state table memory usage:

    pfstate   320 58417177  0 202558 135845 117730 18115 25210  0  8  0
    pfstkey   112 58417177  0 179214  35152  29744  5408  7208  0  8  0
    pfstitem   24 58417122  0 179214   6952   5811  1141  1520  0  8  0

At this moment we're running with 210K state table entries. There seem to be an awful lot (>40%) of those in FIN_WAIT_2:FIN_WAIT_2 state -- I'm still trying to puzzle that one out.

But my immediate (and only -- please do NOT start a bikeshed on ruleset design!) question is: Is there a practical limit on the number of states pf can handle? Our experience says there is, and the number is around 180K.

Prior to release 7.1 we didn't see anything like this at all.
This started happening with the 7.1 release, and we noticed a real escalation in instability in 7.2. Enough so that we rolled the affected firewalls back to 7.1. That worked around the problem, until last night, when the firewall rebooted itself (at the time of least traffic load?!). Because of all this we have been avoiding upgrading any of the firewalls beyond 7.1 as we cannot afford the resulting downtime.

Even carp didn't save us. We've had a couple of incidents where one firewall panics, carp fails over, then the 2nd firewall locks up. And this points out another issue: when the network stack freezes, the carp interfaces do not flip. I haven't figured that one out yet, either.

Okay, so what's the point of all this blathering? I guess there are two things I'm wondering:

1) are there known limitations in the pf code that would explain this?

2) has anyone else seen this sort of behaviour on their firewalls?

Thanks!

--lyndon