For over a year now we have been seeing instability on our firewalls
that seems to kick in when our state tables approach 200K entries.
The number varies, but it's a safe bet that once we cross the 180K
threshold, the machines start getting cranky.  At 200K+ performance
visibly degrades, often leading to a complete lockup of the network
stack, or a spontaneous reboot.

The symptoms vary, but the earliest indication is that interactive
response at the shell prompt gets stuttery.  As it progresses, network
traffic stops flowing and the network stack eventually just locks up.

We also see the occasional:

  pmap_unwire: wiring for pmap 0xfffffd8e8a946528 va 0xc000d4d000 didn't change!

logged on the console.

The machines are not hurting for resources:

load averages:  1.06,  1.12,  1.12             xxxxxxxxxxxxxxxxxxxxxxx 17:53:08
48 processes: 47 idle, 1 on processor                                  up  6:06
CPU0:  0.0% user,  0.0% nice, 22.0% sys,  0.8% spin,  5.8% intr, 71.5% idle
CPU1:  0.0% user,  0.0% nice, 27.7% sys,  1.2% spin,  5.2% intr, 65.9% idle
CPU2:  0.0% user,  0.0% nice, 40.5% sys,  0.6% spin,  4.4% intr, 54.5% idle
CPU3:  0.0% user,  0.0% nice,  1.4% sys,  0.0% spin,  6.8% intr, 91.8% idle
Memory: Real: 110M/1722M act/tot Free: 60G Cache: 851M Swap: 0K/21G

Our pf settings are pretty simple:

  set optimization normal
  set ruleset-optimization basic
  set limit states 400000
  set limit src-nodes 100000
  set loginterface none
  set skip on lo
  set reassemble yes
  
  # Reduce the number of state table entries in FIN_WAIT_2 state.
  set timeout tcp.finwait 4

(Note that the limit states 400000 is a holdover from the 6.x
days, where the default value was too small to handle our load.)

vmstat reports this for pf state table memory usage:

pfstate      320 58417177    0   202558 135845 117730 18115 25210   0     8    0
pfstkey      112 58417177    0   179214 35152 29744  5408  7208     0     8    0
pfstitem      24 58417122    0   179214  6952  5811  1141  1520     0     8    0
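
As a rough sanity check on those three lines, here is a minimal Python
sketch that multiplies each pool's item size by its in-use count.  It
assumes the columns follow the usual pool layout of vmstat -m (Name,
Size, Requests, Fail, InUse, ...); if the column order on your release
differs, the field indexes need adjusting.  The function name
pool_usage is made up for illustration.

```python
# Assumption: columns are Name Size Requests Fail InUse Pgreq Pgrel ...
# as printed in the pool section of `vmstat -m`.
POOL_LINES = """\
pfstate      320 58417177    0   202558 135845 117730 18115 25210   0     8    0
pfstkey      112 58417177    0   179214 35152 29744  5408  7208     0     8    0
pfstitem      24 58417122    0   179214  6952  5811  1141  1520     0     8    0
"""


def pool_usage(text):
    """Return {pool name: bytes held by in-use items}, i.e. Size * InUse."""
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        name, size, in_use = fields[0], int(fields[1]), int(fields[4])
        usage[name] = size * in_use
    return usage


usage = pool_usage(POOL_LINES)
total = sum(usage.values())
for name, nbytes in usage.items():
    print(f"{name:10s} {nbytes / 2**20:8.1f} MiB")
print(f"{'total':10s} {total / 2**20:8.1f} MiB")
```

Under that assumption the three pools together hold about 85 MiB of
in-use items, which is tiny next to the 60G shown free above.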

At this moment we're running with 210K state table entries.  There
seem to be an awful lot (>40%) of those in FIN_WAIT_2:FIN_WAIT_2
state -- I'm still trying to puzzle that one out.
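
For anyone who wants to compare the state mix on their own firewalls,
a breakdown like the >40% figure can be produced by tallying the state
pair at the end of each line of pfctl -ss output.  The sketch below is
only that: it assumes each TCP entry looks like "all tcp ... STATE:STATE",
the sample addresses are invented documentation addresses, and the
function name tally_states is hypothetical.

```python
from collections import Counter

# Invented sample in the assumed shape of `pfctl -ss` output.
SAMPLE = """\
all tcp 192.0.2.10:443 <- 198.51.100.7:51234       FIN_WAIT_2:FIN_WAIT_2
all tcp 192.0.2.10:443 <- 198.51.100.8:40022       FIN_WAIT_2:FIN_WAIT_2
all tcp 192.0.2.10:22 <- 203.0.113.5:60111       ESTABLISHED:ESTABLISHED
all udp 192.0.2.53:53 <- 203.0.113.9:41000       SINGLE:MULTIPLE
"""


def tally_states(pfctl_output):
    """Count TCP state pairs, taken as the last field of each line."""
    counts = Counter()
    for line in pfctl_output.splitlines():
        fields = line.split()
        # Keep only TCP entries whose last field looks like SRC:DST state.
        if len(fields) >= 3 and fields[1] == "tcp" and ":" in fields[-1]:
            counts[fields[-1]] += 1
    return counts


counts = tally_states(SAMPLE)
tcp_total = sum(counts.values())
for pair, n in counts.most_common():
    print(f"{pair:28s} {n:6d} {100 * n / tcp_total:5.1f}%")
```

On a live machine the SAMPLE string would be replaced by the captured
output of pfctl -ss; with 200K+ states that is a lot of text, so it is
probably best run against a saved file rather than interactively.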

But my immediate (and only -- please do NOT start a bikeshed on
ruleset design!) question is:

        Is there a practical limit on the number of states pf can handle?

Our experience says there is, and the number is around 180K.

Prior to release 7.1 we didn't see anything like this at all.  This
started happening with the 7.1 release, and we noticed a real
escalation in instability in 7.2.  Enough so that we rolled the
affected firewalls back to 7.1.  That worked around the problem,
until last night, when the firewall rebooted itself (at the time
of least traffic load?!).

Because of all this we have been avoiding upgrading any of the
firewalls beyond 7.1 as we cannot afford the resulting downtime.

Even carp didn't save us.  We've had a couple of incidents where
one firewall panics, carp fails over, then the 2nd firewall locks
up.

And this points out another issue.  When the network stack freezes,
the carp interfaces do not flip.  I haven't figured that one out
yet, either.

Okay, so what's the point of all this blathering?  I guess there
are two things I'm wondering:

1) are there known limitations in the pf code that would explain this?

2) has anyone else seen this sort of behaviour on their firewalls?

Thanks!

--lyndon
