On Fri, Mar 20, 2015 at 03:01:43PM +1100, Gavin Shan wrote:
>On Wed, Mar 18, 2015 at 03:54:09PM +1100, Gavin Shan wrote:
>>On Tue, Mar 17, 2015 at 03:16:46PM -0600, Alex Williamson wrote:
>>>On Tue, 2015-03-17 at 03:31 +1100, Gavin Shan wrote:
>>>> When the Linux guest recovers from an EEH error on the following Emulex
>>>> adapter, the MSIx interrupts are disabled and the INTx emulation
>>>> is enabled. One INTx interrupt is injected into the guest by the host
>>>> because of detected pending INTx interrupts on the adapter. QEMU
>>>> disables the mmap'ed BAR regions in the INTx interrupt handler and
>>>> starts a timer to re-enable those regions at a later point.
>>>> Unfortunately, "VFIOPCIDevice->intx.pending" isn't cleared, meaning
>>>> those disabled mmap'ed BAR regions won't be re-enabled properly. This
>>>> leads to EEH recovery failure on the guest side because of hung MMIO
>>>> accesses.
>>>>
>>>>   # lspci | grep Emulex
>>>>   0000:01:00.0 Ethernet controller: Emulex Corporation \
>>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>>   0000:01:00.1 Ethernet controller: Emulex Corporation \
>>>>                OneConnect 10Gb NIC (be3) (rev 02)
>>>>
>>>> The patch disables the INTx interrupt before doing the EEH reset to
>>>> avoid the issue.
>>>>
>>>> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>>> ---
>>>>  hw/vfio/pci.c | 13 +++++++++++++
>>>>  1 file changed, 13 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index fca1edc..bfa3d0c 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -3340,6 +3340,14 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>       * disable it so that it can be reenabled properly. Also,
>>>>       * the cached MSIx table should be cleared as it's not
>>>>       * reflecting the contents in hardware.
>>>> +     *
>>>> +     * We might have an INTx interrupt whose handler disables the
>>>> +     * memory mapped BARs. The INTx pending state can't be
>>>> +     * cleared with a memory BAR access in the slow path.
>>>> +     * The timer kicked by the INTx interrupt handler won't
>>>> +     * re-enable those disabled memory mapped BARs, which leads to
>>>> +     * hung MMIO register accesses and EEH recovery failure. We
>>>> +     * simply disable INTx if it has been enabled.
>>>>       */
>>>
>>>This feels like a quick hack for a problem we don't really understand.
>>>Why is INTx being fired through QEMU rather than KVM? Why isn't the
>>>INTx re-enabling happening, since this is exactly the scenario where it's
>>>supposed to work (ie. INTx occurs, BAR mmap disabled, guest accesses
>>>BAR, mmap re-enabled, INTx unmasked)?
>>>
>>
>>Indeed. It's a quick hack before finding the root cause of why the slow
>>path doesn't work when the fast path is disabled. I'm still tracing it and
>>hopefully I can find something soon. Note that KVM IRQFD isn't enabled
>>on the system I was doing experiments on.
>>
>
>I'm not sure if there have been further replies to this thread since then.
>My mailbox has been broken for quite a while. Also, I don't know how Ben
>was dropped from the cc list. Adding him, as he might have more thoughts.
>Sorry for the mail flooding, Ben :-)
>
>The root cause of the problem would be specific to how MMIO regions are
>supported on KVM for book3s_hv, where invalid page table entries (PTEs)
>are built to track the address mapping between the guest virtual address
>and the host physical address. Since they're marked as "invalid",
>accessing the MMIO region causes a page fault in the host kernel; the PTE
>table is then searched to locate the address mapping, and the access is
>propagated to QEMU for emulation if the address mapping exists in the PTE
>table. That's how the "slow path" works.
>
>For the "fast path", KVM creates a memory slot and updates the PTE table
>with the mappings upon page fault. Those PTEs are "valid" and are seen by
>the CPU, which does the address translation automatically. When the memory
>region for the "fast path" is disabled from QEMU, the host purges the
>TLB entries/PTEs for the mappings.
>So when the "fast path" regions are disabled, the PTE table no longer has
>the entries required for "slow path" accesses. When the guest later tries
>to access the region, the host just doesn't know how to handle the case.
>
>I already have a couple of patches introducing an additional flag to
>identify the "fast path" region, which has properties of both RAM and
>MMIO regions. Once the region is disabled, the host purges its mappings
>from the PTEs/TLB and then updates the PTEs with the mappings required by
>the "slow path" region. With that, I didn't see the problem. Both the
>host and QEMU need changes. I'll remove this patch from the series and
>send out a new revision. The patches to address the "fast path" issue
>will be sent separately after checking with Ben or Paul whether they're
>worth sending out.
>
Checked with Paul Mackerras. He already has a patch fixing the issue in a
different way :-)

Thanks,
Gavin

>
>>
>>>>      QLIST_FOREACH(vbasedev, &group->device_list, next) {
>>>>          vdev = container_of(vbasedev, VFIOPCIDevice, vbasedev);
>>>> @@ -3348,6 +3356,11 @@ int vfio_container_eeh_event(AddressSpace *as, int32_t groupid,
>>>>          }
>>>>
>>>>          msix_reset(&vdev->pdev);
>>>> +
>>>> +        /* Disable INTx */
>>>> +        if (vdev->interrupt == VFIO_INT_INTx) {
>>>> +            vfio_disable_intx(vdev);
>>>> +        }
>>>>      }
>>>>
>>>>      break;
>>>
>>>
>>>
>>>