Hi all,

I have a weird bug on systems that use the Haswell architecture and "real" serial ports (/dev/ttyS*).

Hardware: an embedded device with an "Intel(R) Celeron(R) 2980U @ 1.60GHz"; I tried microcode 0x23 and 0x24. It also happens on an HP Elite 840 G1. Both are Haswell. I can plug a different CPU module into the embedded device, which then has an "Intel(R) Atom(TM) CPU N455 @ 1.66GHz", obviously not Haswell. With the identical kernel, I don't get the error there.

Kernel: happens with distro kernels (Debian, Ubuntu, Fedora). The common factor seems to be that the kernels are >= 4.9.x. It also happens with upstream stable kernels; I used 4.13.x, 4.14.x and 4.18.x, up to 4.18.16.

The embedded device also behaves strangely in other ways (e.g. I once had MCEs with a 32-bit kernel, which went away when using a 64-bit kernel). We also sometimes get an error in AUFS with the same timestamp as the do_IRQ message. I don't understand what AUFS has to do with hardware interrupts. However, I don't want to concentrate on this yet; I think such a strange message in a mainline kernel is in itself worth tracking down. If some interrupt goes haywire, there is certainly a chance that some memory gets corrupted. Then again, this might be something totally different, because the HP Elite doesn't show it, and the MCEs went away after switching from a 32-bit to a 64-bit kernel.

So, let's return to the more reproducible "do_IRQ: 0.39 No irq handler for vector". I'm happy that I found a way to reproduce it: the message triggers when I close the serial port. printk()s indicate that the error happens after the IER is cleared, and even after synchronize_irq() in serial8250_do_shutdown(). Sometimes even a "stty </dev/ttyS1" is enough, because it already opens and closes the port. But that triggers it only sometimes.

A better way is to use a tool called "stress-ng", which has various stressors. Some newer versions (e.g. the one in Debian, 0.07.16-1) just open all files in /dev, run an fstat() on them, and close them again, all of this in a very fast loop. This has the side effect that the /dev/ttyS* ports are opened and closed very rapidly.
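
The open/fstat()/close cycle described above can be approximated without stress-ng. What follows is only a minimal sketch in Python, not stress-ng's actual code: the function name hammer_ports and its parameters are made up for illustration, and it is restricted to /dev/ttyS* instead of walking all of /dev. Whether it triggers the message as readily as stress-ng is untested.

```python
import glob
import os


def hammer_ports(pattern="/dev/ttyS*", rounds=1000):
    """Open, fstat() and close every path matching `pattern`, `rounds` times.

    Hypothetical helper for illustration only. Returns the number of
    completed open/fstat/close cycles.
    """
    cycles = 0
    for _ in range(rounds):
        for path in glob.glob(pattern):
            try:
                # O_NONBLOCK: do not block in open() waiting for carrier.
                # O_NOCTTY: never let the port become our controlling tty.
                fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK | os.O_NOCTTY)
            except OSError:
                continue  # port missing, busy, or permission denied
            try:
                os.fstat(fd)
            finally:
                # Closing the port is the step of interest here.
                os.close(fd)
            cycles += 1
    return cycles
```

Calling hammer_ports() with its defaults from a user that may open the serial ports, while watching dmesg in a second terminal, would be one way to check whether the serial ports alone are sufficient.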

Running stress-ng this way shows the error message easily:

[    6.558244] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
[   17.048154] fuse init (API version 7.27)
[   17.248215] do_IRQ: 0.39 No irq handler for vector
[   17.249622] do_IRQ: 0.39 No irq handler for vector
[   17.252415] do_IRQ: 0.39 No irq handler for vector
[   17.253698] do_IRQ: 0.39 No irq handler for vector
[   18.528774] do_IRQ: 0.39 No irq handler for vector
[   18.532305] do_IRQ: 0.39 No irq handler for vector
[   18.532540] do_IRQ: 0.39 No irq handler for vector
[   18.606916] do_IRQ: 0.39 No irq handler for vector
[   20.227241] random: crng init done

Here I ran stress-ng for just a few seconds. Unfortunately, from time to time the exact same setup makes the error scarce, e.g. it can happen that we don't see the error for 15 minutes. So when running this overnight I had between 1500 and 30000 of these messages in my dmesg/journal.

One thing that I noticed is that "noapic=1" makes the error go away. Using the Atom CPU with the older architecture also makes the error go away, but that one is now EOL. :-(

Any advice on how to proceed further?

Greetings,
Holger