On Wed, 2015-01-28 at 16:37 +0800, Chen Fan wrote:
> when the vfio device encounters an uncorrectable error in host,
> the vfio_pci driver will signal the eventfd registered by this
> vfio device, the results in the qemu eventfd handler getting
> invoked.
>
> this patch is to pass the error to guest and have the guest driver
> recover from the error.
>
> Signed-off-by: Chen Fan <chen.fan.f...@cn.fujitsu.com>
> ---
>  hw/vfio/pci.c | 34 ++++++++++++++++++++++++++++------
>  1 file changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 2072261..8c81bb3 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -3141,18 +3141,40 @@ static void vfio_put_device(VFIOPCIDevice *vdev)
>  static void vfio_err_notifier_handler(void *opaque)
>  {
>      VFIOPCIDevice *vdev = opaque;
> +    PCIDevice *dev = &vdev->pdev;
> +    PCIEAERMsg msg = {
> +        .severity = 0,
> +        .source_id = (pci_bus_num(dev->bus) << 8) | dev->devfn,
> +    };
>
>      if (!event_notifier_test_and_clear(&vdev->err_notifier)) {
>          return;
>      }
>
> +    /* we should read the error details from the real hardware
> +     * configuration spaces, here we only need to do is signaling
> +     * to guest an uncorrectable error has occurred.
> +     */
> +    if (dev->exp.aer_cap) {
> +        uint8_t *aer_cap = dev->config + dev->exp.aer_cap;
> +        uint32_t uncor_status;
> +        bool isfatal;
> +
> +        uncor_status = vfio_pci_read_config(dev,
> +                           dev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> +
> +        isfatal = uncor_status & pci_get_long(aer_cap + PCI_ERR_UNCOR_SEVER);
> +
> +        msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> +                                 PCI_ERR_ROOT_CMD_NONFATAL_EN;
> +
> +        pcie_aer_msg(dev, &msg);
> +        return;
> +    }

What if the guest machine type is 440FX?  We've just killed the existing
vm_stop functionality for the majority of users.

> +
>      /*
> -     * TBD. Retrieve the error details and decide what action
> -     * needs to be taken. One of the actions could be to pass
> -     * the error to guest and have the guest driver recover
> -     * from the error. This requires that PCIe capabilities be
> -     * exposed to the guest. For now, we just terminate the
> -     * guest to contain the error.
> +     * If the aer capability is not exposed to the guest. we just
> +     * terminate the guest to contain the error.

Just because it's exposed doesn't mean the guest chipset allows access to
it, right?

>       */
>
>      error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected. "
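
A minimal sketch, modelled outside QEMU, of the decision the review points toward: forward an AER message only when the AER capability is exposed *and* the guest topology has a PCIe root port that can deliver it; otherwise keep the existing vm_stop fallback instead of returning early. It also restates the patch's requester ID encoding (bus in bits 15:8, devfn in bits 7:0) and its fatal/non-fatal test (any status bit also set in the severity register). The names struct guest_view, enum err_action, aer_source_id() and choose_action() are hypothetical and are not QEMU APIs; how QEMU would actually detect a 440FX guest is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcomes of the error handler (not QEMU identifiers). */
enum err_action {
    ERR_ACTION_STOP_VM,      /* existing behaviour: stop the guest */
    ERR_ACTION_AER_NONFATAL, /* inject a non-fatal AER message */
    ERR_ACTION_AER_FATAL,    /* inject a fatal AER message */
};

/* Hypothetical summary of what the host-side handler knows. */
struct guest_view {
    bool aer_cap_exposed;  /* corresponds to dev->exp.aer_cap != 0 in the patch */
    bool pcie_root_port;   /* assumption: false on a 440FX (conventional PCI) guest */
    uint32_t uncor_status; /* Uncorrectable Error Status read from hardware */
    uint32_t uncor_sever;  /* Uncorrectable Error Severity from config space */
};

/* Requester ID as built in the patch: bus number in bits 15:8, devfn in 7:0. */
static uint16_t aer_source_id(unsigned int bus, unsigned int devfn)
{
    assert(bus <= 0xff && devfn <= 0xff);
    return (uint16_t)((bus << 8) | devfn);
}

static enum err_action choose_action(const struct guest_view *g)
{
    /* An exposed capability alone is not enough: without a root port the
     * guest cannot receive the message, so fall back to stopping the VM. */
    if (!g->aer_cap_exposed || !g->pcie_root_port) {
        return ERR_ACTION_STOP_VM;
    }
    /* Fatal if any reported error is marked fatal in the severity mask. */
    return (g->uncor_status & g->uncor_sever) ? ERR_ACTION_AER_FATAL
                                              : ERR_ACTION_AER_NONFATAL;
}
```

For example, a device at bus 0x01 with devfn 0x08 yields source ID 0x0108, and a guest without a root port always maps to ERR_ACTION_STOP_VM regardless of the severity bits, which preserves the pre-patch behaviour for 440FX users.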