Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume

Alex Williamson Wed, 18 May 2016 19:20:22 -0700

On Thu, 19 May 2016 09:49:00 +0800
Zhou Jie <zhoujie2...@cn.fujitsu.com> wrote:


> On 2016/5/19 2:26, Alex Williamson wrote:
> > On Wed, 18 May 2016 11:31:09 +0800
> > Zhou Jie <zhoujie2...@cn.fujitsu.com> wrote:
> >  
> >> From: Chen Fan <chen.fan.f...@cn.fujitsu.com>
> >>
> >> For supporting aer recovery, host and guest would run the same aer
> >> recovery code, that would do the secondary bus reset if the error
> >> is fatal, the aer recovery process:
> >>   1. error_detected
> >>   2. reset_link (if fatal)
> >>   3. slot_reset/mmio_enabled
> >>   4. resume
> >>
> >> It indicates that host will do secondary bus reset to reset
> >> the physical devices under bus in step 2, that would cause
> >> devices in D3 status in a short time. But in qemu, we register
> >> an error detected handler, that would be invoked as host broadcasts
> >> the error-detected event in step 1, in order to avoid guest do
> >> reset_link when host do reset_link simultaneously. it may cause
> >> fatal error. we introduce a resmue notifier to assure host reset
> >> completely.
> >> In qemu, the aer recovery process:
> >>   1. Detect support for resume notification
> >>      If host vfio driver does not support for resume notification,
> >>      directly fail to boot up VM as with aer enabled.
> >>   2. Immediately notify the VM on error detected.
> >>   3. Stall any access to the device until resume is signaled.  
> >
> > The code below doesn't actually do this, mmaps are disabled, but that
> > just means any device access will use the slow path through QEMU.  That
> > path is still fully enabled and so is config space access.  
> I will modify the code and find other way to
> stall any access to the device.
> 
> >>   4. Delay the guest directed bus reset.  
> >
> > It's delayed, but not the way I expected.  The guest goes on executing
> > and then we do the reset at some point later?  More comments below...
> >  
> >>   5. Wait for resume notification.
> >>      If we don't get the resume notification from the host after
> >>      some timeout, we would abort the guest directed bus reset
> >>      altogether and unplug of the device to prevent it from further
> >>      interacting with the VM.
> >>   6. After get the resume notification reset bus.
> >>
> >> Signed-off-by: Chen Fan <chen.fan.f...@cn.fujitsu.com>
> >> Signed-off-by: Zhou Jie <zhoujie2...@cn.fujitsu.com>
> >> ---
> >>  hw/vfio/pci.c              | 182 
> >> ++++++++++++++++++++++++++++++++++++++++++++-
> >>  hw/vfio/pci.h              |   5 ++
> >>  linux-headers/linux/vfio.h |   1 +
> >>  3 files changed, 186 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 6877a3d..39a9a9f 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -35,6 +35,7 @@
> >>  #include "trace.h"
> >>
> >>  #define MSIX_CAP_LENGTH 12
> >> +#define VFIO_RESET_TIMEOUT 1000  
> >
> > It deserves at least a comment as to why this value was chosen.  
> OK. I will add a comment here.
> 
> >>  static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
> >>  static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled);
> >> @@ -1937,6 +1938,14 @@ static void vfio_check_hot_bus_reset(VFIOPCIDevice 
> >> *vdev, Error **errp)
> >>      VFIOGroup *group;
> >>      int ret, i, devfn, range_limit;
> >>
> >> +    if (!vdev->vfio_resume_cap) {
> >> +        error_setg(errp, "vfio: Cannot enable AER for device %s,"
> >> +                   " host vfio driver does not support for"
> >> +                   " resume notification",
> >> +                   vdev->vbasedev.name);
> >> +        return;
> >> +    }
> >> +
> >>      ret = vfio_get_hot_reset_info(vdev, &info);
> >>      if (ret) {
> >>          error_setg(errp, "vfio: Cannot enable AER for device %s,"
> >> @@ -2594,6 +2603,23 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
> >>                       vbasedev->name);
> >>      }
> >>
> >> +    irq_info.index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
> >> +    if (ret) {
> >> +        /* This can fail for an old kernel or legacy PCI dev */
> >> +        trace_vfio_populate_device_get_irq_info_failure();
> >> +        ret = 0;
> >> +    } else if (irq_info.count == 1) {
> >> +        vdev->vfio_resume_cap = true;  
> >
> > "vfio" in the name is redundant, ERR_IRQ uses pci_aer, so a logical
> > name might be pci_aer_has_resume.  
> OK.
> 
> >> +    } else {
> >> +        error_report("vfio: %s "
> >> +                     "Could not enable error recovery for the device,"
> >> +                     " because host vfio driver does not support for"
> >> +                     " resume notification",
> >> +                     vbasedev->name);
> >> +    }  
> >
> > This error_report makes sense for ERR_IRQ because halt-on-AER is setup
> > transparently, but I don't think it makes sense here.  If the user has
> > specified to enable AER then it should either work or they should get
> > an error message.  If they have not specified to enable AER, why does
> > the user care if there's an inconsistency here?  
> OK. I will delete the error report at here.
> 
> >> +
> >>  error:
> >>      return ret;
> >>  }
> >> @@ -2661,6 +2687,17 @@ static void vfio_err_notifier_handler(void *opaque)
> >>          msg.severity = isfatal ? PCI_ERR_ROOT_CMD_FATAL_EN :
> >>                                   PCI_ERR_ROOT_CMD_NONFATAL_EN;
> >>
> >> +        if (isfatal) {
> >> +            PCIDevice *dev_0 = pci_get_function_0(dev);
> >> +            vdev->vfio_resume_wait = true;
> >> +            vdev->vfio_resume_arrived = false;  
> >
> > Possible names:
> >
> > pci_aer_error_signaled
> > pci_aer_resume_signaled  
> OK.
> 
> >> +            vfio_mmap_set_enabled(vdev, false);
> >> +            if (dev_0 != dev) {
> >> +                VFIOPCIDevice *vdev_0 = DO_UPCAST(VFIOPCIDevice, pdev, 
> >> dev_0);
> >> +                vdev_0->vfio_resume_wait = true;
> >> +                vdev_0->vfio_resume_arrived = false;
> >> +            }  
> >
> > Why is function 0 special here?  Don't we expect that it will also get
> > an ERR_IRQ?  
> I tested in this condition.
> The device is multifunction.
> And function 0 is OK, function 1 occured an error.
> function 0 got an ERR_IRQ, but returned at following code.
> if (!(uncor_status & ~0UL)) {
>      return;
> }
> If don't set the function 0 at here, it will not wait in vfio_pci_reset.
> 
> And I also tested the nofatal error.
> The aer will send the nofatal error notification to qemu,
> but the guest will not invoke vfio_pci_reset.
> So, I need know whether the error is fatal,
> and then set the pci_aer_error_signaled.

Do we need to worry about some functions getting a resume while others
don't?  Should the reset stall until all functions that received a
fatal error receive a resume?  Maybe all tracking should be done with
an 8-byte (256-bit to support ARI) atomic bitmap on function 0 and
reset is blocked until the mask of all functions receiving a fatal
error is equal to the mask of all functions receiving a resume.
 
> >> +        }
> >>          pcie_aer_msg(dev, &msg);
> >>          return;
> >>      }
> >> @@ -2756,6 +2793,96 @@ static void 
> >> vfio_unregister_err_notifier(VFIOPCIDevice *vdev)
> >>      event_notifier_cleanup(&vdev->err_notifier);
> >>  }
> >>
> >> +static void vfio_resume_notifier_handler(void *opaque)  
> >
> > Please use "aer" in the name, otherwise resume might refer to
> > suspend-resume.  
> OK.
> 
> >> +{
> >> +    VFIOPCIDevice *vdev = opaque;
> >> +
> >> +    if (!event_notifier_test_and_clear(&vdev->resume_notifier)) {
> >> +        return;
> >> +    }
> >> +
> >> +    vdev->vfio_resume_arrived = true;
> >> +    if (vdev->reset_timer != NULL) {  
> >
> > pci_aer_reset_blocked_timer  
> OK
> 
> >> +        timer_mod(vdev->reset_timer, 
> >> qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL));
> >> +    }
> >> +}
> >> +
> >> +static void vfio_register_aer_resume_notifier(VFIOPCIDevice *vdev)
> >> +{
> >> +    int ret;
> >> +    int argsz;
> >> +    struct vfio_irq_set *irq_set;
> >> +    int32_t *pfd;
> >> +
> >> +    if (!(vdev->features & VFIO_FEATURE_ENABLE_AER) ||
> >> +        !vdev->vfio_resume_cap) {
> >> +        return;
> >> +    }
> >> +
> >> +    if (event_notifier_init(&vdev->resume_notifier, 0)) {
> >> +        error_report("vfio: Unable to init event notifier for"
> >> +                     " resume notification");
> >> +        vdev->vfio_resume_cap = false;
> >> +        return;
> >> +    }
> >> +
> >> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> >> +
> >> +    irq_set = g_malloc0(argsz);
> >> +    irq_set->argsz = argsz;
> >> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> >> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> >> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +    irq_set->start = 0;
> >> +    irq_set->count = 1;
> >> +    pfd = (int32_t *)&irq_set->data;
> >> +
> >> +    *pfd = event_notifier_get_fd(&vdev->resume_notifier);
> >> +    qemu_set_fd_handler(*pfd, vfio_resume_notifier_handler, NULL, vdev);
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    if (ret) {
> >> +        error_report("vfio: Failed to set up resume notification");
> >> +        qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
> >> +        event_notifier_cleanup(&vdev->resume_notifier);
> >> +        vdev->vfio_resume_cap = false;
> >> +    }
> >> +    g_free(irq_set);
> >> +}
> >> +
> >> +static void vfio_unregister_aer_resume_notifier(VFIOPCIDevice *vdev)
> >> +{
> >> +    int argsz;
> >> +    struct vfio_irq_set *irq_set;
> >> +    int32_t *pfd;
> >> +    int ret;
> >> +
> >> +    if (!vdev->vfio_resume_cap) {
> >> +        return;
> >> +    }
> >> +
> >> +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> >> +
> >> +    irq_set = g_malloc0(argsz);
> >> +    irq_set->argsz = argsz;
> >> +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> >> +                     VFIO_IRQ_SET_ACTION_TRIGGER;
> >> +    irq_set->index = VFIO_PCI_RESUME_IRQ_INDEX;
> >> +    irq_set->start = 0;
> >> +    irq_set->count = 1;
> >> +    pfd = (int32_t *)&irq_set->data;
> >> +    *pfd = -1;
> >> +
> >> +    ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >> +    if (ret) {
> >> +        error_report("vfio: Failed to de-assign error fd: %m");
> >> +    }
> >> +    g_free(irq_set);
> >> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->resume_notifier),
> >> +                        NULL, NULL, vdev);
> >> +    event_notifier_cleanup(&vdev->resume_notifier);
> >> +}
> >> +
> >>  static void vfio_req_notifier_handler(void *opaque)
> >>  {
> >>      VFIOPCIDevice *vdev = opaque;
> >> @@ -3061,6 +3188,7 @@ static int vfio_initfn(PCIDevice *pdev)
> >>      }
> >>
> >>      vfio_register_err_notifier(vdev);
> >> +    vfio_register_aer_resume_notifier(vdev);
> >>      vfio_register_req_notifier(vdev);
> >>      vfio_setup_resetfn_quirk(vdev);
> >>
> >> @@ -3091,6 +3219,7 @@ static void vfio_exitfn(PCIDevice *pdev)
> >>      VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
> >>
> >>      vfio_unregister_req_notifier(vdev);
> >> +    vfio_unregister_aer_resume_notifier(vdev);
> >>      vfio_unregister_err_notifier(vdev);
> >>      pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
> >>      vfio_disable_interrupts(vdev);
> >> @@ -3101,6 +3230,34 @@ static void vfio_exitfn(PCIDevice *pdev)
> >>      vfio_bars_exit(vdev);
> >>  }
> >>
> >> +static void vfio_pci_delayed_reset(void *opaque)
> >> +{
> >> +    VFIOPCIDevice *vdev = opaque;
> >> +    PCIDevice *pdev = &vdev->pdev;
> >> +
> >> +    timer_free(vdev->reset_timer);
> >> +    vdev->reset_timer = NULL;  
> >
> > Racy, the timer gets freed before set to NULL so the test above could
> > see it non-NULL as it's being freed, assuming QEMU supports that sort
> > of concurrency.  
> I will use sem to avoid race condition.
> More comments below...

Seems like you could just reverse the order, cache vdev->reset_timer,
set it to NULL, then call timer_free() on the cached value.  But as I
question below, it'd be more simple to not have a timer.
 
> >> +
> >> +    if (!vdev->vfio_resume_wait) {
> >> +        return;
> >> +    }
> >> +    vdev->vfio_resume_wait = false;
> >> +
> >> +    if (vdev->vfio_resume_arrived) {
> >> +        vfio_mmap_set_enabled(vdev, true);  
> >
> > How do you know if mmap was enabled when you started?  This could
> > interfere with other cases where mmaps get disabled.  
> Yes, I will modify the code.
> 
> >> +        if (pci_get_function_0(pdev) == pdev) {
> >> +            vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +        }
> >> +    } else {
> >> +        uint32_t uncor_status;
> >> +        uncor_status = vfio_pci_read_config(pdev,
> >> +                           pdev->exp.aer_cap + PCI_ERR_UNCOR_STATUS, 4);
> >> +        if (uncor_status & ~0UL) {
> >> +            qdev_unplug(&vdev->pdev.qdev, NULL);
> >> +        }
> >> +    }
> >> +}
> >> +
> >>  static void vfio_pci_reset(DeviceState *dev)
> >>  {
> >>      PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev);
> >> @@ -3113,13 +3270,34 @@ static void vfio_pci_reset(DeviceState *dev)
> >>
> >>          if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
> >>               PCI_BRIDGE_CTL_BUS_RESET)) {
> >> -            if (pci_get_function_0(pdev) == pdev) {
> >> -                vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +            if (!vdev->vfio_resume_wait) {
> >> +                if (pci_get_function_0(pdev) == pdev) {
> >> +                    vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +                }
> >> +            } else {
> >> +                if (vdev->vfio_resume_arrived) {
> >> +                    vdev->vfio_resume_wait = false;
> >> +                    vfio_mmap_set_enabled(vdev, true);  
> >
> > mmap is getting restored in too many places, it should be disabled on
> > ERR_IRQ and re-enabled on ERR_RESUME, no more.  
> OK.
> 
> >> +                    if (pci_get_function_0(pdev) == pdev) {
> >> +                        vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
> >> +                    }
> >> +                } else {
> >> +                    vdev->reset_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
> >> +                                            vfio_pci_delayed_reset, vdev);
> >> +                    timer_mod(vdev->reset_timer,
> >> +                              qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL)
> >> +                              + VFIO_RESET_TIMEOUT);
> >> +                }  
> >
> > Wait, so we just set a timer and return pretending the reset occurred
> > when actually it might occur at some point in the future?  How is that
> > supposed to work?  I thought the plan was to block here.  
> How about invoke sem_post at vfio_resume_notifier_handler,
> and wait here use sem_timedwait?

Do we even need a timer?  What if we simply spin?

for (i = 0; i < 1000; i++) {
    if (vdev->pci_aer_resume_signaled) {
        break;
    }
    g_usleep(1000);
}

if (i == 1000) {
    /* We failed */
} else {
    /* Proceed with reset */
}

Does QEMU have enough concurrency to do this?  Thanks,

Alex

> >>              }
> >>              return;
> >>          }
> >>      }
> >>
> >> +    if (vdev->vfio_resume_wait) {
> >> +        vdev->vfio_resume_wait = false;
> >> +        vfio_mmap_set_enabled(vdev, true);
> >> +    }  
> >
> > Again, these are getting changed in too many places, the state machine
> > is too complicated.  Thanks,  
> In my test this code will never be invoked.
> But I add this code to clear vfio_resume_wait if the guest don't
>   reset the bus.
> 
> Sincerely,
> Zhou Jie
> 
> 
> >
> > Alex
> >  
> >> +
> >>      vfio_pci_pre_reset(vdev);
> >>
> >>      if (vdev->resetfn && !vdev->resetfn(vdev)) {
> >> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> >> index 9fb0206..49e28d8 100644
> >> --- a/hw/vfio/pci.h
> >> +++ b/hw/vfio/pci.h
> >> @@ -119,6 +119,7 @@ typedef struct VFIOPCIDevice {
> >>      VFIOVGA *vga; /* 0xa0000, 0x3b0, 0x3c0 */
> >>      PCIHostDeviceAddress host;
> >>      EventNotifier err_notifier;
> >> +    EventNotifier resume_notifier;
> >>      EventNotifier req_notifier;
> >>      int (*resetfn)(struct VFIOPCIDevice *);
> >>      uint32_t vendor_id;
> >> @@ -144,6 +145,10 @@ typedef struct VFIOPCIDevice {
> >>      bool no_kvm_msi;
> >>      bool no_kvm_msix;
> >>      bool single_depend_dev;
> >> +    bool vfio_resume_cap;
> >> +    bool vfio_resume_wait;
> >> +    bool vfio_resume_arrived;
> >> +    QEMUTimer *reset_timer;
> >>  } VFIOPCIDevice;
> >>
> >>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> >> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> >> index 759b850..01dfd5d 100644
> >> --- a/linux-headers/linux/vfio.h
> >> +++ b/linux-headers/linux/vfio.h
> >> @@ -433,6 +433,7 @@ enum {
> >>    VFIO_PCI_MSIX_IRQ_INDEX,
> >>    VFIO_PCI_ERR_IRQ_INDEX,
> >>    VFIO_PCI_REQ_IRQ_INDEX,
> >> +        VFIO_PCI_RESUME_IRQ_INDEX,
> >>    VFIO_PCI_NUM_IRQS
> >>  };
> >>  
> >
> >
> >
> > .
> >  
> 
>

Re: [Qemu-devel] [PATCH 11/12] vfio: register aer resume notification handler for aer resume

Reply via email to