Boris B. Zhmurov wrote:
Hello, Jesse Brandeburg.

On 06.04.2006 04:42 you said the following:

I built and tested the driver with patches on 2.6.16, with PCI-X adapters. I removed some workarounds for PCIe adapters, but I don't think anyone having this problem has a PCIe adapter anyway. I saw no TX hangs and ran some bi-directional tests, so I think the driver should work okay. Just a warning: I did only minimal testing.

*********************
e1000: transmit the old fashioned way

It seems back in the day of 2.6.11, there were no sk_forward_alloc
assertions. Forward port that transmit code to see if it fixes the issues
in today's kernel.  Unfortunately it doesn't have all the bug fixes that
the current code has, but if we get transmit timeouts we can add in
workarounds appropriately.

this changes only the e1000_tso function

With this one I am still seeing:

TCP: Treason uncloaked! Peer 80.72.16.78:11460/80 shrinks window 2223569515:2223569516. Repaired.
KERNEL: assertion (!sk->sk_forward_alloc) failed at net/core/stream.c (283)
KERNEL: assertion (!sk->sk_forward_alloc) failed at net/ipv4/af_inet.c (150)

This is a very important result. It shows that the changes to the driver to call pskb_expand_head for TSO operations are not the cause of this problem.

We also have some new data from the last couple of days. First, I think that this problem is likely not just e1000's fault. We have multiple reports, both in bugzilla.kernel.org and from a distro, showing that this problem has occurred on (at least) tg3-driven adapters as well as e1000.

I've been able to reliably reproduce this issue in house (finally) thanks to one of our testers. The test is using the tbench application from the dbench package at samba.org.

on the server, start tbench_srv
on the machine you're trying to repro the issue on, start tbench 500 <server ip>
on another client, start tbench 50 <server ip>

I've seen sk_forward_alloc assertions on both server and client, both running 2.6.16. We're trying to figure out where there might be a stale pointer to an sk that accesses the data after it is freed. Maybe something writes ff ff ff ff 00 00 00 00 to memory after it is freed?
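The reproduction steps above can be sketched as a small shell helper. This is only an illustration: the repro_cmd function and the SERVER_IP variable are hypothetical names, and 192.0.2.1 is a documentation placeholder address rather than anything from the report. The helper only prints the command for each role, so nothing is started by accident.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original report.
# SERVER_IP is a placeholder; 192.0.2.1 is a documentation-only address.
SERVER_IP="${SERVER_IP:-192.0.2.1}"

# Print (do not run) the command for one of the three roles in the test.
repro_cmd() {
    case "$1" in
        server) echo "tbench_srv" ;;               # machine running the tbench server
        target) echo "tbench 500 $SERVER_IP" ;;    # machine under test, heavy load
        client) echo "tbench 50 $SERVER_IP" ;;     # additional, lighter client
        *)      echo "usage: repro_cmd server|target|client" >&2
                return 1 ;;
    esac
}
```

For example, repro_cmd target prints the tbench invocation to run on the machine where the assertion is expected.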

It does seem that the load (the 500 threads) is important to this failure. I've also seen a report that a memory poisoning kernel caught the failure.
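To compare runs at different loads, it may help to count how many assertion lines show up in the kernel log. A minimal sketch, assuming the log text is fed on stdin (for example from dmesg) and that each message sits on its own line; count_sk_asserts is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: count sk_forward_alloc assertion lines read from stdin.
count_sk_asserts() {
    # -F matches the text literally, -c prints the number of matching lines.
    # Note that grep exits non-zero when the count is 0.
    grep -cF 'assertion (!sk->sk_forward_alloc) failed'
}
```

Typical use would be something like: dmesg | count_sk_asserts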

Any investigation hints for me?

e1000: implement old xmit_frame

It seems back in the day of 2.6.11, there were no sk_forward_alloc
assertions. Forward port that transmit code to see if it fixes the issues
in today's kernel.  Unfortunately it doesn't have all the bug fixes that
the current code has, but if we get transmit timeouts we can add in
workarounds appropriately.

this changes the e1000_xmit_frame function, and some ancillary functions

Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]>



Can't apply this one:

[EMAIL PROTECTED] linux-2.6.16]$ patch -p1 < ../../../SOURCES/linux-2.6.16-e1000-implement_old_xmit_frame.patch
patching file drivers/net/e1000/e1000_main.c
Hunk #1 succeeded at 2620 (offset -105 lines).
Hunk #2 FAILED at 2695.
Hunk #4 FAILED at 2837.
Hunk #5 FAILED at 2868.
Hunk #6 FAILED at 2899.
4 out of 6 hunks FAILED -- saving rejects to file drivers/net/e1000/e1000_main.c.rej

Well, that seems kind of lame, but I think we got the data that we needed from the first patch.
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
