Michael Chan wrote:
Philip Molter wrote:

Is there any additional information that I can give to help get some more work targeted at this bug? I've been getting this lockup three or four times a week per server (I have four of them exhibiting this behavior).

The network setup is fairly complicated and, unfortunately, these are production machines pushing multi-gigabit traffic loads. We're using VLANs on top of bonding on top of anywhere from two to six Broadcom NICs, but it appears that the problem is unrelated to the bonding and VLANs, as others are reporting similar problems without those enabled.

Any assistance would be appreciated. I've left the original information below for reference.

Since you're using a rather old version of tg3, I suggest that you
upgrade to a newer version first.  Your problem is probably
different from Bernd Schubert's since he has ASF enabled and you
don't.

If anyone could even explain what this error means, that would be helpful. Maybe we can change something to work around it.


The stop_block error messages are not too important.  The important
thing is that you're getting a transmit timeout.  It means that
the tx queue is getting full because the NIC is no longer getting
interrupts.  When this condition is detected, the NIC will get reset
which should normally bring the NIC back to life.  It seems that
in your case, it doesn't come back.  Do you get these timeouts on
both ports at the same time?
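
To make the sequence described above concrete (completion interrupts stop, the tx queue fills, a watchdog notices and resets the NIC), here is a minimal toy model in Python. It is only a sketch and not the tg3 driver or the kernel's watchdog code: the class ToyNic, the ring size, the timeout value and the tick-based clock are all made up for illustration, and the toy reset always succeeds, which is exactly what does not seem to happen in the case reported here.

```python
from dataclasses import dataclass


@dataclass
class ToyNic:
    """Toy model of a NIC tx ring; every name and value here is hypothetical."""
    ring_size: int = 4            # assumed number of tx descriptors
    timeout_ticks: int = 5        # assumed watchdog timeout, in simulated ticks
    interrupts_working: bool = True
    in_flight: int = 0            # packets queued but not yet completed
    last_tx_tick: int = 0         # tick of the last successful transmit
    resets: int = 0

    def queue_stopped(self) -> bool:
        # The queue is stopped once every descriptor is in use.
        return self.in_flight >= self.ring_size

    def xmit(self, now: int) -> bool:
        # Queue one packet if there is room; record when it was queued.
        if self.queue_stopped():
            return False
        self.in_flight += 1
        self.last_tx_tick = now
        return True

    def interrupt(self) -> None:
        # A completion interrupt reclaims all descriptors, if it arrives at all.
        if self.interrupts_working:
            self.in_flight = 0

    def watchdog(self, now: int) -> bool:
        # Transmit timeout: queue stopped and nothing sent for too long.
        if self.queue_stopped() and now - self.last_tx_tick > self.timeout_ticks:
            self.reset()
            return True
        return False

    def reset(self) -> None:
        # In this toy model a reset always brings the NIC back to life.
        self.in_flight = 0
        self.interrupts_working = True
        self.resets += 1


def simulate(ticks: int, fail_at: int):
    """Send one packet per tick; interrupts stop arriving at tick `fail_at`."""
    nic = ToyNic()
    timeouts = []
    for now in range(ticks):
        if now == fail_at:
            nic.interrupts_working = False
        nic.xmit(now)
        nic.interrupt()
        if nic.watchdog(now):
            timeouts.append(now)
    return nic, timeouts


if __name__ == "__main__":
    nic, timeouts = simulate(ticks=20, fail_at=3)
    print("timeouts at ticks:", timeouts, "resets:", nic.resets)
```

With these assumed values, interrupts stop at tick 3, the ring fills by tick 6, and the watchdog fires once at tick 12, after which the toy NIC recovers. A real lockup like the one reported would correspond to the reset not restoring interrupts, which this sketch deliberately does not model.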

It's hard to tell. When the error gets logged, it doesn't say which interface it's happening on. The box is locked up by the time we get to it, but I think it's happening on both.

I've had NICs lock up with queue issues before, but I've never had one lock up a box completely, to the point of being unresponsive even on the console. Normally, the network just breaks, and sure, it requires a reboot, but at least we can do a controlled reboot.

This only started happening when we moved these NICs to jumbo frames. We've used the exact same hardware in less demanding applications (up to 500 Mbit/s vs. 750+ Mbit/s) with jumbo frames without issue, but these particular machines, these pushers, only started locking up when we switched to jumbo frames.

Please try the latest driver.  If you still get the timeouts, I'll
need to send you some debug patches to dump the state when these
timeouts occur.

Will do.