Hi Alexander,

On 12/18/19 9:24 PM, Alexander Dahl wrote:
>
> meanwhile I'm running bullseye with kernel v5.3, but the problem
> persists and my Xen system is annoyingly unstable due to this bug. I
> attach some more logs from the last days and add the debian xen devel
> list in Cc. Maybe someone over there has an idea how to fix this. After
> all the log shows plenty of hints it could have something to do with Xen.

I think the Xen parts you see in the stack trace listings are ordinary
calls. They show a domU asking dom0, via the hypervisor, to do some disk
reads/writes or to send data over the network (the 'upcall').

https://wiki.xen.org/wiki/Event_Channel_Internals

So, after getting that request, the dom0 Linux kernel tries to execute
it, e.g. by calling the enqueue function to hand a network packet to the
physical network interface.

The first error we see is the "transmit queue 0 timed out". This looks
like the Linux kernel handing the packet to the network port hardware
and expecting it to accept the packet, deal with it and put it on the
wire. When this does not happen, and the network port hardware seems
frozen and times out, it's forcibly reset (I don't know if the thing is
resetting itself because it crashed, or if the Linux kernel does
something to reset it). "Reset adapter unexpectedly" gives me the
feeling that the firmware inside the network card crashed and something
inside there also reset it.

> Anyone care to help debug this? I have no idea where to start. Can
> kernel or xen generate coredumps one could analyze? Or is the log output
> the only thing?
>
> (If you look at the logs, the strange thing is the system does not crash
> and reboot immediately, but later after lots of errors with storage, but
> comes back fine after reboot.)

The ata errors (disk fails to process a command) only start after all of
the above has happened. Usually disk errors that look like this point at
broken disk hardware or bugs in the firmware in the disk. However, if
they consistently show up 6 to 7 seconds after the network card
disaster, they might be a symptom of that network card problem instead.

The first thing I would recommend is disabling transmit segmentation
offloading to the network card in dom0 (ethtool -K enp1s0 tso off) and
seeing if that prevents the network card from choking on some kind of
input. If not, play with more settings like transmit checksum offloading
(ethtool -K enp1s0 tx off).

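In case it helps to check whether the ata errors really follow the
network timeout at a consistent interval, here is a rough sketch in
Python, not something I have run against your logs. It assumes the
kernel log lines carry the usual "[seconds.micros]" printk timestamps,
that the timeout lines contain the text "transmit queue ... timed out",
and that the disk errors start with an "ataN:" or "ataN.NN:" prefix. The
function name and the 60 second window are made up for illustration.

```python
import re

# Assumed line layout: an optional syslog prefix, then a printk timestamp
# such as "[12345.678901]", then the kernel message itself.
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")
# Assumed prefix of libata messages, e.g. "ata1:" or "ata1.00:".
ATA_RE = re.compile(r"ata\d+(\.\d+)?:")


def delays_after_watchdog(lines, window=60.0):
    """Return the delay in seconds from each transmit queue timeout
    to the first ata message that follows it within `window` seconds.

    If several timeouts occur before an ata message, only the most
    recent timeout is used as the reference point.
    """
    delays = []
    last_timeout = None
    for line in lines:
        m = TS_RE.search(line)
        if not m:
            continue  # no printk timestamp on this line, skip it
        ts, msg = float(m.group(1)), m.group(2)
        if "transmit queue" in msg and "timed out" in msg:
            last_timeout = ts
        elif last_timeout is not None and ATA_RE.match(msg):
            if ts - last_timeout <= window:
                delays.append(round(ts - last_timeout, 3))
            # Count only the first ata message per timeout.
            last_timeout = None
    return delays
```

Feeding it the saved log, for example
delays_after_watchdog(open("kern.log")) with whatever file name the log
was saved under, gives a list of delays. If they all cluster around 6 to
7 seconds, that supports the idea that the disk errors are a follow-up
effect of the network card problem.
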
If disabling these offloads does not help, we can start asking some Xen
developers if they have an idea how we can help with debugging and what
we should do. (I help maintain the Xen packages in Debian; my knowledge
of its internals is mostly limited to all the been-there-done-thats from
years of using it as a user.)

I expect the problem to be related to Linux and the hardware, and not
specifically to Xen. Knowing whether the same happens when just booting
Linux without Xen would be valuable debugging info. However, I realize
that in that case it's likely a bit complicated to trigger the problem,
since you would have to generate the same workload that's now coming
from the domUs.

Curious to hear what happens,

Thanks,
Hans van Kranenburg