On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mah...@linux.ibm.com> wrote:
>
> *snip*
>
> This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog). On timeouts, network driver go into ffdc capture mode
> and reset path assuming the PCI device is in fatal condition. This causes
> EEH recovery to fail and sometimes it leads to system hang or crash.

Whatever is causing that crash is the real issue IMO. PCI error reporting is fundamentally asynchronous, and a driver always has to tolerate some amount of latency between an error occurring and that error being reported. Six seconds is admittedly an eternity, but it should not cause a system crash under any circumstances. Printing a warning due to a timeout is annoying, but it's not the end of the world.
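
To make the point about tolerating reporting latency concrete, here is a rough user-space sketch, not kernel code and not taken from any real driver. Every name in it (`sim_dev`, `sim_read_reg`, `sim_tx_timeout`) is hypothetical. It assumes that reads from a frozen device come back as all ones, which is how MMIO reads from an EEH-frozen device typically appear, and it only illustrates one possible policy: on a timeout, warn and wait if an error may be pending instead of resetting the device out from under the recovery code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not a real kernel structure. */
struct sim_dev {
    bool frozen;          /* hardware has faulted; may not be reported yet */
    bool error_reported;  /* the platform has notified the driver */
    int  resets_issued;   /* resets the driver started on its own */
    int  warnings;        /* timeout warnings logged */
};

enum timeout_action { TIMEOUT_WARN_AND_WAIT, TIMEOUT_RESET };

/* Assumption: a frozen device returns all ones on every register read. */
static uint32_t sim_read_reg(const struct sim_dev *dev)
{
    return dev->frozen ? UINT32_MAX : 0x1u;
}

/*
 * Hypothetical timeout handler. If an error has been reported, or the
 * device looks frozen, only log a warning and leave recovery to the
 * error-handling path. Otherwise fall back to a driver-initiated reset.
 */
enum timeout_action sim_tx_timeout(struct sim_dev *dev)
{
    assert(dev != NULL);

    if (dev->error_reported || sim_read_reg(dev) == UINT32_MAX) {
        dev->warnings++;
        return TIMEOUT_WARN_AND_WAIT;
    }

    dev->resets_issued++;
    return TIMEOUT_RESET;
}
```

In this sketch a timeout that lands inside the reporting window costs one warning and nothing else. A real driver would need its own way to tell a pending error apart from a genuine hang.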