Hi,

this should have gone into the thread 'spurious "need to frag"
messages'. Sorry for opening a new thread.

On Wed, 31.03.2010 at 13:36:48 +0200, Toni Mueller <openbsd-m...@oeko.net> 
wrote:
> recently, a problem with OpenBSD has popped up over here that manifests
> itself in "random" connection failures after some time. Network
> diagram:
> 
>  workstation (1) --- (3b) firewall (3a) --- Internet --- www.example.com (2)
> 
> You surf from your workstation to www.example.com. On the firewall, you
> can see packets flowing, on the exterior interface.
> 
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
> 
> and so on. Everything works just fine. Now, with nothing changed except
> for the firewall being up some days (currently: 13 days), and having
> pushed some traffic already, connections start to fail:
> 
> On (3a), you see "almost" the same packet sequence as shown above,
> shortened for brevity:
> 
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)    <- point where the connection fails
>  (2) -> (1)
>  (2) -> (1)
>  (2) -> (1)
>  (2) -> (1)
> 
> but on (3b), you see:
> 
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
>  (2) -> (1)
>  (1) -> (2)
> 
> and then nothing more, as if the web server on the other side had
> stopped sending packets. I can't see the packets on pflog0, either.
> Using slightly different networking to "bypass" the firewall,
> everything still works fine, but "fixing" the problem involves powering
> down the firewall. Simply rebooting it without powering it down does
> not fix the problem.

Investigating further, I found that the firewall starts to send ICMP
packets on (3a) to the remote side (2), claiming that

1.2.3.4 > 92.122.217.187: icmp: 1.2.3.24 unreachable - need to frag (mtu 1420)

This is wrong. I've verified that the complete path supports an MTU of
1500 bytes. FWIW, the machines 1.2.3.4 and 1.2.3.24 are connected via
100 Mbit/s Ethernet (machine - switch - machine, << 10 m of cabling).
I've also checked the pf configuration, and there's nothing in there
that lowers the MTU:

# grep -v '^#' pf* |grep -F 1420
pf.os:4096:64:0:44:M1420:               NewtonOS:2.1::NewtonOS 2.1
#
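
To keep an eye on how often these messages show up, the tcpdump output
can be filtered for "need to frag" lines that advertise less than the
expected MTU. This is only a minimal sketch: the function name
spurious_frag and the variable EXPECTED_MTU are made up for
illustration, and it assumes tcpdump prints one packet per line in the
format shown above.

```shell
#!/bin/sh
# Assumption: the whole path supports this MTU, so any smaller value
# advertised in a "need to frag" message is considered spurious.
EXPECTED_MTU=1500

# Hypothetical helper: reads tcpdump text output on stdin and prints
# "<advertised mtu> <original line>" for every "need to frag" message
# whose advertised MTU is below EXPECTED_MTU.
spurious_frag() {
    awk -v want="$EXPECTED_MTU" '
        /need to frag \(mtu [0-9]+\)/ {
            mtu = $0
            sub(/.*need to frag \(mtu /, "", mtu)   # drop everything before the number
            sub(/\).*/, "", mtu)                    # drop the closing paren and the rest
            if (mtu + 0 < want + 0) print mtu, $0
        }'
}
```

It could be fed from something like "tcpdump -l -n -i em0 icmp |
spurious_frag", where em0 is just a placeholder for the exterior
interface (3a).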

> This problem first occurred for us with 4.6-stable on both i386 and
> amd64, and has now also occurred on -current with kernel #448 on i386.
> I'm currently trying to get even more recent builds installed to see
> whether the problem is fixed.

Experimenting with several versions of OpenBSD (4.6-stable and some
snapshots) shows that the problem gets worse with more recent versions
of -current: the time until failure went down from almost two weeks with
kernel #448 on i386 to less than one week with kernel #148 on amd64.
The configuration is almost unchanged, except for a few more packet
filter rules, currently around 500-600 after replacing many macros with
tables. The machine has moved some 300-400 million packets since booting
about six days ago.

I'm probably going to try yesterday's snapshot, too.

Since I also swapped in a brand new, bigger machine to do the job, I am
confident that no hardware problem is involved. My remaining suspicion
is some kind of kernel memory corruption.

Unfortunately, so far I can only detect the problem from outside the
machine(s) in question, by observing the packets they emit. I have not
found a way to diagnose the problem from within the machine(s)
themselves (except for using tcpdump, which is a bit beside the
point).


-- 
Kind regards,
--Toni++
