In days of yore (Mon, 15 Apr 2024), Jamie thus quoth:

> So there is a very nasty bug in the e1000e network card
> driver.

https://www.intel.com/content/www/us/en/support/articles/000005480/ethernet-products.html
notes that MSI interrupts may be problematic on some systems. Worth
digging into whether that is an issue on this system of yours.

I am not sure Debian can resolve this problem with the driver, but
upstream kernel folks might. Unsure whether Intel still helps maintain
this driver as it is quite old (I dealt with support issues on this
driver some 15-16 years ago). The Intel page states this is upstream
kernel only at this point, so going to SourceForge for their out-of-tree
driver is no longer an option.

> I am running Debian 12 Bookworm.
>
> You will get the message "Detected Hardware Unit Hang" and then
> the network card just stops working.

[snip]

> This is a gigabit network card as I said it is a built in NIC I believe it
> is an Intel NIC.

It is an Intel NIC. Most of the NIC drivers beginning with an 'e'
followed by numbers are Intel as far as I know. These NICs were very
common as on-board NICs in OEM systems as Intel provided them in large
volumes. They are not the best, but they usually do their job.

[snip]

> This seems to happen when you are actually pushing a bit of traffic
> though it not a lot but just even a little bit. It isn't network overload
> or anything I am barely doing anything really but it will do this.

If it is a hardware hang, it may be the NIC firmware getting its
knickers in a twist, and that is not something the kernel or the driver
can do much about.

> I have already tried the following
>
> ethtool -K eth1 tx off rx off
> ethtool -K eth1 tso off gso off
> ethtool -K eth1 gso off gro off tso off tx off rx off rxvlan off txvlan
> off sg off

All worthwhile things to try. You can also try reducing the RX buffers
from the default 4096 to 2048 if you are not running a lot of traffic.
It might not help, but worth trying.

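For the RX buffer suggestion, here is a minimal sketch of how one might
inspect and shrink the RX ring. It assumes ethtool is installed and that
the interface is eth1, as in the quoted commands. The helper names
(current_rx, shrink_rx) are made up for illustration, and the sizes your
NIC reports may differ from the figures mentioned above, so look at the
output of "ethtool -g" before changing anything:

```shell
#!/bin/sh
# Hypothetical helpers, not part of ethtool or any Debian package.

# Read `ethtool -g` output on stdin and print the RX value listed
# under "Current hardware settings".
current_rx() {
    awk '/^Current hardware settings/ { cur = 1 }
         cur && /^RX:/ { print $2; exit }'
}

# Shrink the RX ring of interface $1 to $2 descriptors (needs root).
shrink_rx() {
    iface="$1"
    new_rx="$2"
    echo "RX ring on $iface before: $(ethtool -g "$iface" | current_rx)"
    ethtool -G "$iface" rx "$new_rx" || return 1
    echo "RX ring on $iface after:  $(ethtool -g "$iface" | current_rx)"
}

# Example invocation (interface name and size are assumptions):
#   shrink_rx eth1 2048
```

A change made with "ethtool -G" does not survive a reboot, so if it
turns out to help it would still need to be made persistent.
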
> I have disabled all power management in the bios as well including the one
> for ASPM
>
> I added the following to grub
>
> pcie_aspm=off e1000e.SmartPowerDownEnable=0
>
>
> This is in /etc/default/grub
> GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off
> e1000e.SmartPowerDownEnable=0"

Good thinking about power management. :)

> Then I did an update-grub as well.
>
> None of this has worked in fixing this problem. I am still getting the
> same issue.

Best bet at this point would be to scout the Linux Kernel Mailing List
archives to see if anyone else has run into the same problem, and then
review the kernel maintainers list to find someone who works on the
e1000e driver and strike up a direct dialogue with them.

> Can you please fix this issue this is a really nasty problem with Debian
> 12 (Bookworm)
>
> I am seeing this being reported back in Kernel 5.3.x but i am not seeing any
> reports for 6.1.x about this issue.
>
> Debian Bug report logs - #945912
> Kernel 5.3 e100e Detected Hardware Unit Hang
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=945912

If it has been reported before and is still present now, one of two
things is likely true.

1) the problem was intermittent and could not be reliably reproduced in
   order to debug and resolve
2) the problem was related to the hardware itself, and there was no way
   to fix it in either driver or firmware

It has been known to happen that drivers implement workarounds for
issues in the hardware itself, so that hardware bugs do not get tripped
(or are tripped less often).

> Please reply back and confirm that you got this email and that you are
> looking into this problem please.

To state the obvious, I am not a kernel maintainer for Debian and do not
speak on behalf of the Debian project. I work for a Linux company you
may have heard of and have done so for almost eighteen years, a decade
of which was in support.

15 years ago, I would have known exactly who to go to with this problem,
but he now works for Broadcom and has probably forgotten all about the
e1000/e1000e drivers. The upstream driver maintainer would be the best
bet IMHO. If this driver is community support only (i.e. if Intel no
longer participates in driver maintenance), I would say that all bets
are off.

With only one data point (your system and your NIC), it is not possible
to rule out that the NIC itself is bad. :-/

> -- This email message, including any attachments, is for the intended
> recipient(s) only and may contain information that is privileged,
> confidential and/or exempt from disclosure under applicable law. If you
> have received this message in error, or are obviously not one of the
> intended recipients, please immediately notify the sender by reply email
> and delete this email message, including any attachments. All
> information in this email including any attachment(s) is to be kept in
> strict confidence and is not to be released to anyone without my prior
> written consent.

You may want to discard these blurbs when posting to a mailing list.

--
Kind regards,
/S