Hi,

we have a Cisco UCS infrastructure with fnic Fibre Channel adapters that we expose to guests. The UCS infrastructure allows creating virtual HBAs that can be exposed to a host, so it's possible to have quite a lot of them.

We ran into a strange issue when we started having more than one vServer with a Fibre Channel adapter passed through with vfio-pci. When a hypervisor shuts such a vServer down, the kernel sees the following error:

pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00200000/00100000
pcieport 0000:00:07.0:    [21] Unknown Error Bit          (First)
pcieport 0000:00:07.0: broadcast error_detected message
pcieport 0000:00:07.0: AER: Device recovery failed

Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe root port on that system.

This wouldn't be a big problem in itself, although I would like to find out what causes the ACS Violation. The real problem is that all other vfio-pci cards on that root port get notified of this error, and the connected vServers are suspended with RUN_STATE_INTERNAL_ERROR.

Any ideas for working around this, other than hacking qemu to not register an error handler or modifying vfio_err_notifier_handler to not suspend the vServer?

Is it correct that all children of a root port are notified?

Should qemu distinguish between fatal and non-fatal errors when suspending a vServer?

Thanks,
Peter
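
P.S. For anyone who wants to check the bit decoding: below is a minimal sketch of how the status/mask value from the log above can be decoded. It assumes the bit layout of the AER Uncorrectable Error Status register as defined in the PCIe specification; the table is abbreviated, and the names decode_aer and AER_UNCOR_BITS are made up for this illustration (they are not kernel or qemu symbols).

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register
# (abbreviated table; bit names follow the PCIe specification).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
}


def decode_aer(status, mask):
    """Return a list of (bit, name, masked) for every bit set in status.

    'masked' is True when the same bit is also set in the mask register,
    i.e. the error would not be reported.
    """
    result = []
    for bit in range(32):
        if status & (1 << bit):
            name = AER_UNCOR_BITS.get(bit, "Unknown Error Bit")
            masked = bool(mask & (1 << bit))
            result.append((bit, name, masked))
    return result


if __name__ == "__main__":
    # Values taken from the log line "error status/mask=00200000/00100000".
    print(decode_aer(0x00200000, 0x00100000))
    # prints [(21, 'ACS Violation', False)]
```

With this table, the status word has only bit 21 set and it is not masked, while the mask word has only bit 20 (Unsupported Request) set. The kernel on this host apparently does not have a name for bit 21 in its own table, hence the "Unknown Error Bit" in the log.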