At 08:27 AM 6/25/2007, Bill Moran wrote:
In response to Bill Moran <[EMAIL PROTECTED]>:
> In response to Adam McDougall <[EMAIL PROTECTED]>:
>
> > On Tue, Jun 12, 2007 at 10:19:49AM -0400, Bill Moran wrote:
> >
> >
> > This one has got me pretty befuddled.
> >
> > We're seeing some really odd behaviour with FreeBSD ignoring SYN packets.
> > I've been trying to diagnose this for a couple of weeks now, and my current
> > guess is that there's something wrong with the em driver. Here's a
> > narrowed-down list of what I've ruled out:
> > *) I've done my best to eliminate other network components as the problem.
> > My theory at this point is that it can't possibly be any other network
> > hardware, based on the tcpdump shown below.
> > *) The problem occurred on both FreeBSD 6.1 and FreeBSD 6.2-p3.
> > *) The problem does not appear to be tied to CPU usage -- the CPU is nearly
> > idle when the problem occurs.
> > *) I can now reproduce it pretty easily, so I'll know when it's fixed.
> > *) The system exhibiting the problem is running 15 jails, but they are
> > idle 95% of the time. The problem initially occurred inside one of
> > the jails, but I just recreated it outside the jail (on the host) and
> > it's _easier_ to reproduce outside the jail.
> > *) The problem occurred with both the GENERIC and the SMP kernel (this is a
> > dual-CPU, hyperthreaded system)
> > *) I've tested and the behavior occurs with both a dynamically generated
> > file (from PHP) and a static file.
> >
> > The nature of the beast is that we've got a SOAP application running under
> > Apache and PHP. This application is subject to many requests in rapid
> > succession, such that load can be simulated by the following loop:
> >
> > while true; do fetch http://192.168.121.250/test.php; done
> >
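For anyone trying to reproduce this, here is a minimal sketch of the same loop that also reports slow requests. The helper names (time_cmd, watch_fetch) and the 3-second threshold are assumptions of this sketch, not part of the original test; the threshold is chosen only because the first retransmitted SYN in the capture later in this message arrives about 3 seconds after the original. Timing uses date +%s, so resolution is one second.

```sh
#!/bin/sh
# Hypothetical helper: run a command, discard its output, and print the
# number of whole seconds it took.
time_cmd() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# URL is the one from the loop above. THRESHOLD is an assumed value.
URL=${URL:-http://192.168.121.250/test.php}
THRESHOLD=${THRESHOLD:-3}

# Same endless loop as above, but print a timestamped line whenever a
# single fetch takes THRESHOLD seconds or longer.
watch_fetch() {
    while true; do
        elapsed=$(time_cmd fetch -q -o /dev/null "$URL")
        if [ "$elapsed" -ge "$THRESHOLD" ]; then
            echo "$(date '+%H:%M:%S') slow fetch: ${elapsed}s"
        fi
    done
}
```

Running watch_fetch on the client while tcpdump runs on the server should make it easier to line up slow requests with unanswered SYNs.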
> > The problem is that occasionally, the Apache server machine just ignores
> > SYN packets. Take the following tcpdump output for example:
> >
> > 13:34:17.312296 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690201156 0,sackOK,eol>
> > 13:34:20.312398 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690204156 0,sackOK,eol>
> > 13:34:23.512626 IP web04-v100.cust00.pitbpa1.priv.collaborativefusion.com.54808 > anchor-is00.is.pitbpa1.priv.collaborativefusion.com.http: S 2645061726:2645061726(0) win 65535 <mss 1380,nop,wscale 1,nop,nop,timestamp 2690207356 0,sackOK,eol>
> >
> > This is the _only_ traffic on port 80 during the test. It looks like the
> > kernel has ignored the initial SYN packet and two duplicates. I've seen it
> > take as long as 45 seconds to establish a connection, and this causes
> > ugly performance problems, as well as frequent timeouts on the client end.
> > The only clue I've found so far is this output from netstat -s.
> >
> >
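A rough sketch for spotting this pattern in a longer capture: it counts SYNs that repeat with the same source, destination and sequence number, which is what the three packets quoted above look like. It assumes tcpdump's older one-line text format exactly as shown in this thread (flags in the sixth field, sequence range in the seventh); newer tcpdump versions print "Flags [S], seq ..." instead and would need different field positions. The function name syn_retransmits is made up for this example.

```sh
#!/bin/sh
# Read tcpdump text output (one packet per line) on stdin and print
# "<count> <src> <dst> <seq>" for every SYN seen more than once.
syn_retransmits() {
    awk '$2 == "IP" && $6 == "S" {
             # Same source, destination and sequence number means
             # the client is retransmitting the same SYN.
             key = $3 " " $5 " " $7
             count[key]++
         }
         END {
             for (k in count)
                 if (count[k] > 1)
                     print count[k], k
         }'
}
```

Feeding it the saved text of a capture (for example, syn_retransmits < capture.txt, where capture.txt is a hypothetical file name) lists each connection attempt whose SYN had to be resent, along with how many times it was seen.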
> > Does the Apache server have a firewall of any sort? (It could be making
> > unexpected decisions there, even if they're not part of a fw rule.)
> >
> > Try net.inet.ip.portrange.randomized=0 on the client? (If this is the problem,
> > we would probably see a reused port in a tcpdump of a few minutes,
> > if started after waiting for several minutes of "silence".)
> >
> > Are both systems on the same subnet? If not, can/have you tried that?
>
> No, they aren't. My ability to test on the same subnet is limited and
> the results inconclusive.
>
> > Can you show tcpdump output using -e on the requests that aren't answered
> > as well as an example that IS answered? (I have seen routers mess up the MAC
> > addresses for the source and destination and if I kept staring at layer 3
> > data all day I might never have seen the problem)
> >
> > Better yet, can you post files containing tcpdump output using -w of an entire
> > session that ideally contains failed attempts that eventually work? That way
> > people could look at a broader picture and perhaps pick up on something subtle.
> > It's worth comparing a SYN that works directly with a SYN that doesn't work.
>
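In case it helps someone searching later, the two captures suggested above might look roughly like this. It is only a sketch: the interface name em0, the default output path, and the function names are assumptions, and the filter is limited to port 80 because that is the port in this thread. The -s 0 option asks for full packets rather than a truncated snapshot.

```sh
#!/bin/sh
# Assumed interface name; change to match the host.
IFACE=${IFACE:-em0}

# Match only packets with the SYN bit set on port 80.
SYN_FILTER='tcp port 80 and tcp[tcpflags] & tcp-syn != 0'

# Print link-level (MAC) headers for SYNs, so answered and unanswered
# ones can be compared side by side. Needs root.
capture_headers() {
    tcpdump -e -n -i "$IFACE" "$SYN_FILTER"
}

# Save a whole session to a file for others to inspect. The default
# path is hypothetical. Needs root.
capture_file() {
    tcpdump -n -s 0 -i "$IFACE" -w "${1:-/tmp/syn-debug.pcap}" 'tcp port 80'
}
```

The saved file can be read back later with tcpdump -e -n -r on the same path.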
> We've decided to swap the card out on Friday and see if that resolves the
> problem. We have similar units that don't exhibit the problem, so I'm
> getting pretty suspicious that this might be a flaky NIC. If the new
> card doesn't solve the problem, I'll post more details on Monday.

Just in case someone was curious as to the result, or finds this on a web
search: the behaviour was apparently hardware related. We swapped the NIC
out and can no longer reproduce the problem.

To follow up on my situation: over the weekend I took the Soekris
box that demonstrated the bad TCP checksums, wiped it, and
reinstalled the same vintage CURRENT, and the problem disappeared. I
used the same kernel config in both cases.
Thanks to those who replied...
dave c
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "[EMAIL PROTECTED]"