I recently rebuilt a SunBlade 2000 system, moving it from Solaris 8 to Gentoo 2006.0. The system sports a Sun RIO GEM NIC and worked quite well for the first few days, although we didn't hit it hard during that period. The system's primary task is to be our source repository, so it needs working network access.

The system was initially set up on 3/9/2006 and ran fine until 3/15/2006, when we started getting the error messages below:

Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0: transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out, resetting
Mar 15 15:39:25 tsdfft1 eth0: TX_STATE[003ffc05:00000001:00000019]
Mar 15 15:39:25 tsdfft1 eth0: RX_STATE[0100c805:00000001:00000021]
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.
Mar 15 15:39:25 tsdfft1 eth0: Pause is disabled

And:

Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.

We're presently using the 2.6.16 kernel (vanilla) with sungem driver version 0.98. We have also seen this issue with the 2.6.15.6 kernel (vanilla) and the 2.4.32_r2 kernel (provided by Gentoo 2006.0).

The first error (the transmit timeout) is sporadic, but happens from time to time. (Same error message every time, save date & time.) The second (the xmit underrun) is the most reproducible: all I have to do is try to pull down source from the repository (hosted on Apache2 via WebDAV), and after about 6 MiB of data transfer the link will die until an ifconfig down/up is done. After that it will go for a while longer and then require a system reboot.

In researching the issue, I concluded that one of two things is at play: either the card is going bad, or there is a driver problem. I found a link to an xmit underrun issue for Solaris, but was unable to access it because it is locked under sunsolve.sun.com, so I have no guarantee that going back to Solaris would solve the issue either. I have had a hard time finding reports of an xmit underrun issue under Linux; most searches turn up references to where the message is generated rather than users trying to find solutions to the problem. I did, however, notice that there was a similar problem with overflows on the RX portion of the chip, which was solved by resetting the chip's RX unit via gem_rxmac_reset().
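
In case it helps anyone compare notes, here is a minimal sketch (Python) for tallying the two messages above from a syslog file. The match strings are copied from the messages quoted above; the function name, the labels, and the sample input are made up for illustration, and none of this is part of the patches.

```python
import re
from collections import Counter

# Match strings come from the log messages quoted above.
# The labels ("tx_timeout", "tx_underrun") are illustrative names only.
PATTERNS = {
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: (\w+): transmit timed out"),
    "tx_underrun": re.compile(r"(\w+): TX MAC xmit underrun"),
}


def tally_errors(lines):
    """Count watchdog timeouts and xmit underruns per interface."""
    counts = Counter()
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                # Key on (interface, error label), e.g. ("eth0", "tx_underrun").
                counts[(match.group(1), label)] += 1
    return counts


# Sample input built from the messages quoted above.
SAMPLE = """\
Mar 15 15:39:25 tsdfft1 NETDEV WATCHDOG: eth0: transmit timed out
Mar 15 15:39:25 tsdfft1 eth0: transmit timed out, resetting
Mar 15 15:39:25 tsdfft1 eth0: Link is up at 100 Mbps, half-duplex.
Mar 15 16:11:58 tsdfft1 eth0: TX MAC xmit underrun.
""".splitlines()

if __name__ == "__main__":
    for (iface, label), n in sorted(tally_errors(SAMPLE).items()):
        print(iface, label, n)
```

To run it against a real log, pass an open file instead of SAMPLE, for example tally_errors(open("/var/log/messages")). That path is an assumption and depends on how syslog is configured on the machine.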

My first attempt at a fix was to modify the driver at the point where the underrun is reported so that it schedules a reset, based on code elsewhere in the driver. (See sungem-fix1.patch.txt) At first this patch did not seem to work; however, I have been running the kernel with it for about a week now, and at least SSH and Apache seem to keep running. So I do think it helped improve the situation, but it does not solve the problem on the Subversion side (Apache/WebDAV), which still dies once the errors appear (I just tested to make sure).

I then tried building a solution based on gem_rxmac_reset() and the various init functions, and produced gem_txmac_reset(). However, my first use of it locked up the kernel. It might just be that I tried to take a lock when I shouldn't have (I tried to take both the driver's lock and tx_lock), but I am not sure that I did it correctly.

I would very much appreciate it if someone who is more familiar with the sungem driver would look at the patches and verify that (a) it is the correct thing to do, and (b) I did it correctly.

I am aware that the network the system is running on is supposed to be full-duplex, 100 Mbps. However, I have noticed that the card/driver seems to think it is half-duplex. Could this simply be a duplex mismatch? I have no control over the settings of the switch it is plugged into, and I have not been able to find a way to get ifconfig to force the card to full-duplex. (We've typically built the driver into the kernel.)

If there is any information that I missed which would be helpful, please let me know and I will be glad to pass on what I can. Patches and additional error log information on eth0 are available at the following URL:

http://tinyurl.com/hxfbp

Summary of system information:

System: Sun Microsystems SunBlade 2000
Purchased: roughly 11/03.
Processor: UltraSparcIII+/cheetah+/sparc64
NIC: Sun RIO GEM 10/100, built-in on SunBlade 2000
Linux Distro: Gentoo 2006.0
Kernel Versions: 2.6.16, 2.6.15.6, Gentoo's 2.4.32_r2
Specific error:

NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out, resetting
eth0: TX_STATE[003ffc05:00000001:00000019]
eth0: RX_STATE[0100c805:00000001:00000021]
eth0: Link is up at 100 Mbps, half-duplex.
eth0: Pause is disabled
...
eth0: TX MAC xmit underrun.

Any advice, help, etc. would be greatly appreciated.

TIA,

Benjamen R. Meyer