Re: [Qemu-devel] vfio-pci issues with multiple devices on the same root port

Alex Williamson Mon, 15 Dec 2014 07:12:52 -0800

On Sat, 2014-12-13 at 21:43 +0100, Peter Lieven wrote:
> Am 12.12.2014 um 23:21 schrieb Alex Williamson:
> > On Fri, 2014-12-12 at 22:38 +0100, Peter Lieven wrote:
> >> Hi,
> >>
> >> we have a Cisco UCS infrastructure where we have fnic Fibre Channel 
> >> adapters that we expose to guests. The UCS
> >> infrastructure allows creating virtual HBAs that can be exposed to a host, 
> >> so it's possible to have quite a lot of them.
> >>
> >> We ran into a strange issue when we started having more than one vServer 
> >> with a Fibre Channel adapter passed
> >> through with vfio-pci.
> >>
> >> When a hypervisor shuts one down, the kernel sees the following error:
> >>
> >>  pcieport 0000:00:07.0: AER: Uncorrected (Non-Fatal) error received: id=0038
> >>  pcieport 0000:00:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0038(Receiver ID)
> >>  pcieport 0000:00:07.0:   device [8086:340e] error status/mask=00200000/00100000
> >>  pcieport 0000:00:07.0:    [21] Unknown Error Bit (First)
> >>  pcieport 0000:00:07.0: broadcast error_detected message
> >>  pcieport 0000:00:07.0: AER: Device recovery failed
> >>
> >> Bit 21 seems to be ACS Violation, and 0000:00:07.0 is the PCIe Root Port 
> >> on that system.
> >>
> >> This wouldn't be a big problem, although I would like to find out what 
> >> causes the ACS Violation.
> >>
> >> The real problem is that all other vfio-pci cards on that root port get 
> >> notified of this error and the connected vServers are suspended
> >> with RUN_STATE_INTERNAL_ERROR.
> >>
> >> Any ideas to work around this other than hacking qemu to not register an 
> >> error handler or modifying vfio_err_notifier_handler
> >> to not suspend the vServer?
> > You could set bit 21 in the AER uncorrected error mask register to avoid
> > the root port signaling the error.  Is bit 21 already clear in the
> > severity register to make this non-fatal?
> >
> >> Is it correct that all children of a root port are notified? Should qemu 
> >> distinguish between fatal and non-fatal errors when
> >> suspending a vServer?
> > Yes, each child is notified.  QEMU only gets an eventfd signal, which is
> > supposed to occur only for fatal errors.  I don't quite understand why
> > this apparently non-fatal error is getting through.  The kernel-side
> > VFIO code is where filtering of fatal vs non-fatal should occur.
> 
> I had a look at vfio-pci.c from master. I can't see any filtering of fatal 
> vs. non-fatal errors there.


I'm under the impression that fatal vs non-fatal would be determined
somewhere in the PCI layers and the driver would only be notified for
uncorrected/fatal.  Are we missing that filtering?  Thanks,

Alex
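
For reference, the status/mask pair from the quoted log (00200000/00100000) can be decoded by hand. The sketch below is a standalone user-space illustration, not kernel code: the macro and function names are made up for this example, and only the bit positions (bit 20 = Unsupported Request, bit 21 = ACS Violation in the AER uncorrectable error registers) come from the PCIe AER register layout. It shows that bit 21 is set in the status and not masked, and what the mask would look like with bit 21 set as suggested earlier in the thread.

```c
#include <stdint.h>

/* Bit positions in the AER Uncorrectable Error Status/Mask/Severity
 * registers. The macro names are hypothetical, chosen for this sketch. */
#define AER_UNC_UNSUPPORTED_REQ  (UINT32_C(1) << 20)  /* Unsupported Request */
#define AER_UNC_ACS_VIOLATION    (UINT32_C(1) << 21)  /* ACS Violation */

/* An uncorrectable error is signaled only if its status bit is set
 * and the corresponding mask bit is clear. */
uint32_t aer_unmasked_errors(uint32_t status, uint32_t mask)
{
    return status & ~mask;
}

/* Hypothetical helper: the mask value that would also suppress
 * ACS Violation, leaving all other mask bits unchanged. */
uint32_t aer_mask_acs_violation(uint32_t mask)
{
    return mask | AER_UNC_ACS_VIOLATION;
}
```

With the values from the log, aer_unmasked_errors(0x00200000, 0x00100000) leaves bit 21 set, so the error is reported; the existing mask only covers bit 20. Writing the value returned by aer_mask_acs_violation(0x00100000) to the root port's uncorrectable error mask register should stop the ACS Violation from being signaled, though it hides the symptom rather than explaining the cause.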

> static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>                                                 pci_channel_state_t state)
> {
>       struct vfio_pci_device *vdev;
>       struct vfio_device *device;
> 
>       device = vfio_device_get_from_dev(&pdev->dev);
>       if (device == NULL)
>               return PCI_ERS_RESULT_DISCONNECT;
> 
>       vdev = vfio_device_data(device);
>       if (vdev == NULL) {
>               vfio_device_put(device);
>               return PCI_ERS_RESULT_DISCONNECT;
>       }
> 
>       mutex_lock(&vdev->igate);
> 
>       if (vdev->err_trigger)
>               eventfd_signal(vdev->err_trigger, 1);
> 
>       mutex_unlock(&vdev->igate);
> 
>       vfio_device_put(device);
> 
>       return PCI_ERS_RESULT_CAN_RECOVER;
> }
> 
> static struct pci_error_handlers vfio_err_handlers = {
>       .error_detected = vfio_pci_aer_err_detected,
> };
> 
> static struct pci_driver vfio_pci_driver = {
>       .name           = "vfio-pci",
>       .id_table       = NULL, /* only dynamic ids */
>       .probe          = vfio_pci_probe,
>       .remove         = vfio_pci_remove,
>       .err_handler    = &vfio_err_handlers,
> };
> 
> Peter
> 
>
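
The handler quoted above signals the eventfd regardless of its "state" argument. A minimal sketch of what severity filtering at that point could look like follows. It is a user-space mock under stated assumptions, not the vfio-pci implementation: the enum stands in for the kernel's pci_channel_state_t, should_signal_user() is a hypothetical helper, and it assumes the AER core passes the "normal" channel state for non-fatal errors and "frozen" for fatal ones, which is how the kernel's PCI error recovery documentation describes it.

```c
#include <stdbool.h>

/* Stand-in for the kernel's pci_channel_state_t so that this sketch
 * compiles in user space. The names are hypothetical. */
enum channel_state {
    CHANNEL_IO_NORMAL = 1,        /* I/O still works: non-fatal error */
    CHANNEL_IO_FROZEN = 2,        /* I/O is blocked: fatal error */
    CHANNEL_IO_PERM_FAILURE = 3,  /* device has failed permanently */
};

/* Hypothetical policy: only wake user space (QEMU) when the channel is
 * no longer usable, so that a non-fatal error on a shared root port
 * does not stop unrelated guests. */
bool should_signal_user(enum channel_state state)
{
    return state != CHANNEL_IO_NORMAL;
}
```

In the quoted handler, such a check would guard the eventfd_signal() call. Whether dropping non-fatal notifications entirely is the right policy is a separate question; forwarding the severity to user space and letting QEMU decide would be another option.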
