Hi, this should have gone into the thread 'spurious "need to frag" messages'. Sorry for opening a new thread.
On Wed, 31.03.2010 at 13:36:48 +0200, Toni Mueller <openbsd-m...@oeko.net> wrote:
> recently, a problem with OpenBSD has popped up over here that manifests
> itself in "random" connection failures after some time. Network
> diagram:
>
> workstation (1) --- (3b) firewall (3a) --- Internet --- www.example.com (2)
>
> You surf from your workstation to www.example.com. On the firewall, you
> can see packets flowing, on the exterior interface.
>
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
>
> and so on. Everything works just fine. Now, with nothing changed except
> for the firewall being up some days (currently: 13 days), and having
> pushed some traffic already, connections start to fail:
>
> On (3a), you see "almost" the same packet sequence like shown above,
> shortened for brevity:
>
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1) <- point where the connection fails
> (2) -> (1)
> (2) -> (1)
> (2) -> (1)
> (2) -> (1)
>
> but on (3b), you see:
>
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
> (2) -> (1)
> (1) -> (2)
>
> and then nothing more, like if the web server on the other side had
> stopped sending packets. I can't see the packets on pflog0, either, and
> using slightly different networking to "bypass" the firewall,
> everything works still fine, but "fixing" the problem involves powering
> down the firewall. Simply rebooting it w/o powering it down, does not
> fix the problem.

Investigating further, I found that the firewall starts to send ICMP
packets on (3a) to the remote side (2), claiming:

  1.2.3.4 > 92.122.217.187: icmp: 1.2.3.24 unreachable - need to frag (mtu 1420)

This claim is wrong.
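For anyone who wants to watch for these messages from the outside, here is a minimal sketch in Python. It assumes tcpdump prints the ICMP line exactly in the format quoted above, and that 1500 bytes is the MTU the path should support. The names `NEED_FRAG`, `EXPECTED_MTU` and `suspicious_need_frag` are made up for this example and are not part of any tool.

```python
import re

# Matches tcpdump's one-line ICMP summary as quoted in this message, e.g.
# "1.2.3.4 > 92.122.217.187: icmp: 1.2.3.24 unreachable - need to frag (mtu 1420)"
# Assumption: other tcpdump versions may format this line differently.
NEED_FRAG = re.compile(
    r"(?P<src>\S+) > (?P<dst>\S+): icmp: (?P<target>\S+) unreachable"
    r" - need to frag \(mtu (?P<mtu>\d+)\)"
)

EXPECTED_MTU = 1500  # assumption: the path MTU that was verified by hand


def suspicious_need_frag(lines, expected_mtu=EXPECTED_MTU):
    """Return (src, dst, target, mtu) for every "need to frag" line
    whose advertised MTU is below expected_mtu."""
    hits = []
    for line in lines:
        m = NEED_FRAG.search(line)
        if m is None:
            continue  # not a "need to frag" line
        mtu = int(m.group("mtu"))
        if mtu < expected_mtu:
            hits.append((m.group("src"), m.group("dst"), m.group("target"), mtu))
    return hits
```

The function takes any iterable of text lines, so it could be fed saved tcpdump output or a live capture read line by line. It only flags the bogus messages; it does nothing to explain why the firewall emits them.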
I've verified that the complete path supports an MTU of 1500 bytes.
FWIW, the machines 1.2.3.4 and 1.2.3.24 are connected via 100 Mbit/s
Ethernet (machine - switch - machine, << 10 m of cabling). I've also
checked the pf configuration, and there's nothing in there that lowers
the MTU:

# grep -v '^#' pf* |grep -F 1420
pf.os:4096:64:0:44:M1420: NewtonOS:2.1::NewtonOS 2.1
#

> This problem first occurred for us with 4.6-stable on both i386 and
> amd64, and now also occurred on -current with kernel 448 on i386. I'm
> underway trying to get yet-more-recent stuff installed to see whether
> the problem is fixed.

Experimenting with several variations of OpenBSD, like 4.6-stable and
some snapshots, shows that the problem seems to get worse with ever more
recent versions of -current: the time until it appears went down from
almost two weeks with kernel #448 on i386 to less than one week with
kernel #148 on amd64. There was almost no change in configuration,
except for adding a few more packet filter rules, currently around
500-600 after replacing many macros with tables. The machine has moved
some 300-400 million packets since booting about six days ago. I'm
probably going to try yesterday's snapshot, too.

Since I also swapped in a brand new, bigger machine to do the job, I am
confident that there is no hardware problem involved. My remaining
suspicion is some kind of kernel memory corruption. Unfortunately, so
far I can only detect the problem from outside, by observing the
packets emitted by the machine(s) in question. I have failed to find a
way to diagnose the problem from within the machine(s) themselves
(except for using tcpdump, which is a bit beside the point).

--
Kind regards,
--Toni++