On Thu, 19 May 2016 09:49:00 +0800 Zhou Jie <zhoujie2...@cn.fujitsu.com> wrote:
> On 2016/5/19 2:26, Alex Williamson wrote: > > On Wed, 18 May 2016 11:31:09 +0800 > > Zhou Jie <zhoujie2...@cn.fujitsu.com> wrote: > > > >> From: Chen Fan <chen.fan.f...@cn.fujitsu.com> > >> > >> For supporting aer recovery, host and guest would run the same aer > >> recovery code, that would do the secondary bus reset if the error > >> is fatal, the aer recovery process: > >> 1. error_detected > >> 2. reset_link (if fatal) > >> 3. slot_reset/mmio_enabled > >> 4. resume > >> > >> It indicates that host will do secondary bus reset to reset > >> the physical devices under bus in step 2, that would cause > >> devices in D3 status in a short time. But in qemu, we register > >> an error detected handler, that would be invoked as host broadcasts > >> the error-detected event in step 1, in order to avoid guest do > >> reset_link when host do reset_link simultaneously. it may cause > >> fatal error. we introduce a resmue notifier to assure host reset > >> completely. > >> In qemu, the aer recovery process: > >> 1. Detect support for resume notification > >> If host vfio driver does not support for resume notification, > >> directly fail to boot up VM as with aer enabled. > >> 2. Immediately notify the VM on error detected. > >> 3. Stall any access to the device until resume is signaled. > > > > The code below doesn't actually do this, mmaps are disabled, but that > > just means any device access will use the slow path through QEMU. That > > path is still fully enabled and so is config space access. > I will modify the code and find other way to > stall any access to the device. > > >> 4. Delay the guest directed bus reset. > > > > It's delayed, but not the way I expected. The guest goes on executing > > and then we do the reset at some point later? More comments below... > > > >> 5. Wait for resume notification. > >> If we don't get the resume notification from the host after > >> some timeout, we would abort the guest directed bus reset > >> altogether and unplug of the device to prevent it from further > >> interacting with the VM. > >> 6. After get the resume notification reset bus. > >> > >> Signed-off-by: Chen Fan <chen.fan.f...@cn.fujitsu.com> > >> Signed-off-by: Zhou Jie <zhoujie2...@cn.fujitsu.com> > >> --- > >> hw/vfio/pci.c | 182 > >> ++++++++++++++++++++++++++++++++++++++++++++- > >> hw/vfio/pci.h | 5 ++ > >> linux-headers/linux/vfio.h | 1 + > >> 3 files changed, 186 insertions(+), 2 deletions(-) > >> > >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c > >> index 6877a3d..39a9a9f 100644 > >> --- a/hw/vfio/pci.c > >> +++ b/hw/vfio/pci.c > >> @@ -35,6 +35,7 @@ > >> #include "trace.h" > >> > >> #define MSIX_CAP_LENGTH 12 > >> +#define VFIO_RESET_TIMEOUT 1000 > > > > It deserves at least a comment as to why this value was chosen. > OK. I will add a comment here. > > >> static void vfio_disable_interrupts(VFIOPCIDevice *vdev); > >> static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled); > >> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice > >> *vdev, Error **errp) > >> VFIOGroup *group; > >> int ret, i, devfn, range_limit; > >> > >> + if (!vdev->vfio_resume_cap) { > >> + error_setg(errp, "vfio: Cannot enable AER for device %s," > >> + " host vfio driver does not support for" > >> + " resume notification", > >> + vdev->vbasedev.name); > >> + return; > >> + } > >> + > >> ret = vfio_get_hot_reset_info(vdev, &info); > >> if (ret) { > >> error_setg(errp, "vfio: Cannot enable AER for device %s," > >> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev) > >> vbasedev->name); > >> } > >> > >> + irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX; > >> + > >> + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); > >> + if (ret) { > >> + /* This can fail for an old kernel or legacy PCI dev */ > >> + trace_vfio_populate_device_get_irq_info_failure(); > >> + ret = 0; > >> + } else if (irq_info.count == 1) { > >> + vdev->vfio_resume_cap = true; > > > > "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical > > name might be pci_aer_has_resume. > OK. > > >> + } else { > >> + error_report("vfio: %s " > >> + "Could not enable error recovery for the device," > >> + " because host vfio driver does not support for" > >> + " resume notification", > >> + vbasedev->name); > >> + } > > > > This error_report makes sense for ERR_IRQ because halt-on-AER is setup > > transparently, but I don't think it makes sense here. If the user has > > specified to enable AER then it should either work or they should get > > an error message. If they have not specified to enable AER, why does > > the user care if there's an inconsistency here? > OK. I will delete the error report at here. > > >> + > >> error: > >> return ret; > >> } > >> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque) > >> msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN : > >> PCI_ERR_ROOT_CMD_NONFATAL_EN; > >> > >> + if (isfatal) { > >> + PCIDevice *dev_0 = pci_get_function_0(dev); > >> + vdev->vfio_resume_wait = true; > >> + vdev->vfio_resume_arrived = false; > > > > Possible names: > > > > pci_aer_error_signaled > > pci_aer_resume_signaled > OK. > > >> + vfio_mmap_set_enabled(vdev, false); > >> + if (dev_0 != dev) { > >> + VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, > >> dev_0); > >> + vdev_0->vfio_resume_wait = true; > >> + vdev_0->vfio_resume_arrived = false; > >> + } > > > > Why is function 0 special here? Don't we expect that it will also get > > an ERR_IRQ? > I tested in this condition. > The device is multifunction. > And function 0 is OK, function 1 occured an error. > function 0 got an ERR_IRQ, but returned at following code. > if (!(uncor_status & ~0UL)) { > return; > } > If don't set the function 0 at here, it will not wait in vfio_pci_reset. > > And I also tested the nofatal error. > The aer will send the nofatal error notification to qemu, > but the guest will not invoke vfio_pci_reset. > So, I need know whether the error is fatal, > and then set the pci_aer_error_signaled. Do we need to worry about some functions getting a resume while others don't? Should the reset stall until all functions that received a fatal error receive a resume? Maybe all tracking should be done with an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and reset is blocked until the mask of all functions receiving a fatal error is equal to the mask of all functions receiving a resume. > >> + } > >> pcie_aer_msg(dev, &msg); > >> return; > >> } > >> @@ -2756,6 +2793,96 @@ static void > >> vfio_unregister_err_notifier(VFIOPCIDevice *vdev) > >> event_notifier_cleanup(&vdev->err_notifier); > >> } > >> > >> +static void vfio_resume_notifier_handler(void *opaque) > > > > Please use "aer" in the name, otherwise resume might refer to > > suspend-resume. > OK. > > >> +{ > >> + VFIOPCIDevice *vdev = opaque; > >> + > >> + if (!event_notifier_test_and_clear(&vdev->resume_notifier)) { > >> + return; > >> + } > >> + > >> + vdev->vfio_resume_arrived = true; > >> + if (vdev->reset_timer != NULL) { > > > > pci_aer_reset_blocked_timer > OK > > >> + timer_mod(vdev->reset_timer, > >> qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)); > >> + } > >> +} > >> + > >> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev) > >> +{ > >> + int ret; > >> + int argsz; > >> + struct vfio_irq_set *irq_set; > >> + int32_t *pfd; > >> + > >> + if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) || > >> + !vdev->vfio_resume_cap) { > >> + return; > >> + } > >> + > >> + if (event_notifier_init(&vdev->resume_notifier, 0)) { > >> + error_report("vfio: Unable to init event notifier for" > >> + " resume notification"); > >> + vdev->vfio_resume_cap = false; > >> + return; > >> + } > >> + > >> + argsz = sizeof(*irq_set) + sizeof(*pfd); > >> + > >> + irq_set = g_malloc0(argsz); > >> + irq_set->argsz = argsz; > >> + irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | > >> + VFIO_IRQ_SET_ACTION_TRIGGER; > >> + irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX; > >> + irq_set->start = 0; > >> + irq_set->count = 1; > >> + pfd = (int32_t *)&irq_set->data; > >> + > >> + *pfd = event_notifier_get_fd(&vdev->resume_notifier); > >> + qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev); > >> + > >> + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); > >> + if (ret) { > >> + error_report("vfio: Failed to set up resume notification"); > >> + qemu_set_fd_handler(*pfd, NULL, NULL, vdev); > >> + event_notifier_cleanup(&vdev->resume_notifier); > >> + vdev->vfio_resume_cap = false; > >> + } > >> + g_free(irq_set); > >> +} > >> + > >> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev) > >> +{ > >> + int argsz; > >> + struct vfio_irq_set *irq_set; > >> + int32_t *pfd; > >> + int ret; > >> + > >> + if (!vdev->vfio_resume_cap) { > >> + return; > >> + } > >> + > >> + argsz = sizeof(*irq_set) + sizeof(*pfd); > >> + > >> + irq_set = g_malloc0(argsz); > >> + irq_set->argsz = argsz; > >> + irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | > >> + VFIO_IRQ_SET_ACTION_TRIGGER; > >> + irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX; > >> + irq_set->start = 0; > >> + irq_set->count = 1; > >> + pfd = (int32_t *)&irq_set->data; > >> + *pfd = -1; > >> + > >> + ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set); > >> + if (ret) { > >> + error_report("vfio: Failed to de-assign error fd: %m"); > >> + } > >> + g_free(irq_set); > >> + qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier), > >> + NULL, NULL, vdev); > >> + event_notifier_cleanup(&vdev->resume_notifier); > >> +} > >> + > >> static void vfio_req_notifier_handler(void *opaque) > >> { > >> VFIOPCIDevice *vdev = opaque; > >> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev) > >> } > >> > >> vfio_register_err_notifier(vdev); > >> + vfio_register_aer_resume_notifier(vdev); > >> vfio_register_req_notifier(vdev); > >> vfio_setup_resetfn_quirk(vdev); > >> > >> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev) > >> VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev); > >> > >> vfio_unregister_req_notifier(vdev); > >> + vfio_unregister_aer_resume_notifier(vdev); > >> vfio_unregister_err_notifier(vdev); > >> pci_device_set_intx_routing_notifier(&vdev->pdev, NULL); > >> vfio_disable_interrupts(vdev); > >> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev) > >> vfio_bars_exit(vdev); > >> } > >> > >> +static void vfio_pci_delayed_reset(void *opaque) > >> +{ > >> + VFIOPCIDevice *vdev = opaque; > >> + PCIDevice *pdev = &vdev->pdev; > >> + > >> + timer_free(vdev->reset_timer); > >> + vdev->reset_timer = NULL; > > > > Racy, the timer gets freed before set to NULL so the test above could > > see it non-NULL as it's being freed, assuming QEMU supports that sort > > of concurrency. > I will use sem to avoid race condition. > More comments below... Seems like you could just reverse the order, cache vdev->reset_timer, set it to NULL, then call timer_free() on the cached value. But as I question below, it'd be more simple to not have a timer. > >> + > >> + if (!vdev->vfio_resume_wait) { > >> + return; > >> + } > >> + vdev->vfio_resume_wait = false; > >> + > >> + if (vdev->vfio_resume_arrived) { > >> + vfio_mmap_set_enabled(vdev, true); > > > > How do you know if mmap was enabled when you started? This could > > interfere with other cases where mmaps get disabled. > Yes, I will modify the code. > > >> + if (pci_get_function_0(pdev) == pdev) { > >> + vfio_pci_hot_reset(vdev, vdev->single_depend_dev); > >> + } > >> + } else { > >> + uint32_t uncor_status; > >> + uncor_status = vfio_pci_read_config(pdev, > >> + pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4); > >> + if (uncor_status & ~0UL) { > >> + qdev_unplug(&vdev->pdev.qdev, NULL); > >> + } > >> + } > >> +} > >> + > >> static void vfio_pci_reset(DeviceState *dev) > >> { > >> PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev); > >> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev) > >> > >> if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) & > >> PCI_BRIDGE_CTL_BUS_RESET)) { > >> - if (pci_get_function_0(pdev) == pdev) { > >> - vfio_pci_hot_reset(vdev, vdev->single_depend_dev); > >> + if (!vdev->vfio_resume_wait) { > >> + if (pci_get_function_0(pdev) == pdev) { > >> + vfio_pci_hot_reset(vdev, vdev->single_depend_dev); > >> + } > >> + } else { > >> + if (vdev->vfio_resume_arrived) { > >> + vdev->vfio_resume_wait = false; > >> + vfio_mmap_set_enabled(vdev, true); > > > > mmap is getting restored in too many places, it should be disabled on > > ERR_IRQ and re-enabled on ERR_RESUME, no more. > OK. > > >> + if (pci_get_function_0(pdev) == pdev) { > >> + vfio_pci_hot_reset(vdev, vdev->single_depend_dev); > >> + } > >> + } else { > >> + vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL, > >> + vfio_pci_delayed_reset, vdev); > >> + timer_mod(vdev->reset_timer, > >> + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) > >> + + VFIO_RESET_TIMEOUT); > >> + } > > > > Wait, so we just set a timer and return pretending the reset occurred > > when actually it might occur at some point in the future? How is that > > supposed to work? I thought the plan was to block here. > How about invoke sem_post at vfio_resume_notifier_handler, > and wait here use sem_timedwait? Do we even need a timer? What if we simply spin? for (i = 0; i < 1000; i++) { if (vdev->pci_aer_resume_signaled) { break; } g_usleep(1000); } if (i == 1000) { /* We failed */ } else { /* Proceed with reset */ } Does QEMU have enough concurrency to do this? Thanks, Alex > >> } > >> return; > >> } > >> } > >> > >> + if (vdev->vfio_resume_wait) { > >> + vdev->vfio_resume_wait = false; > >> + vfio_mmap_set_enabled(vdev, true); > >> + } > > > > Again, these are getting changed in too many places, the state machine > > is too complicated. Thanks, > In my test this code will never be invoked. > But I add this code to clear vfio_resume_wait if the guest don't > reset the bus. > > Sincerely, > Zhou Jie > > > > > > Alex > > > >> + > >> vfio_pci_pre_reset(vdev); > >> > >> if (vdev->resetfn && !vdev->resetfn(vdev)) { > >> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h > >> index 9fb0206..49e28d8 100644 > >> --- a/hw/vfio/pci.h > >> +++ b/hw/vfio/pci.h > >> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice { > >> VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */ > >> PCIHostDeviceAddress host; > >> EventNotifier err_notifier; > >> + EventNotifier resume_notifier; > >> EventNotifier req_notifier; > >> int (*resetfn)(struct VFIOPCIDevice *); > >> uint32_t vendor_id; > >> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice { > >> bool no_kvm_msi; > >> bool no_kvm_msix; > >> bool single_depend_dev; > >> + bool vfio_resume_cap; > >> + bool vfio_resume_wait; > >> + bool vfio_resume_arrived; > >> + QEMUTimer *reset_timer; > >> } VFIOPCIDevice; > >> > >> uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); > >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h > >> index 759b850..01dfd5d 100644 > >> --- a/linux-headers/linux/vfio.h > >> +++ b/linux-headers/linux/vfio.h > >> @@ -433,6 +433,7 @@ enum { > >> VFIO_PCI_MSIX_IRQ_INDEX, > >> VFIO_PCI_ERR_IRQ_INDEX, > >> VFIO_PCI_REQ_IRQ_INDEX, > >> + VFIO_PCI_RESUME_IRQ_INDEX, > >> VFIO_PCI_NUM_IRQS > >> }; > >> > > > > > > > > . > > > >