On Mon, 22 Sep 2025 12:18:34 -0500 (CDT) Timothy Pearson <tpear...@raptorengineering.com> wrote:
> PCI devices prior to PCI 2.3 both use level interrupts and do not support > interrupt masking, leading to a failure when passed through to a KVM guest on > at least the ppc64 platform. This failure manifests as receiving and > acknowledging a single interrupt in the guest, while the device continues to > assert the level interrupt indicating a need for further servicing. > > When lazy IRQ masking is used on DisINTx- (non-PCI 2.3) hardware, the > following > sequence occurs: > > * Level IRQ assertion on device > * IRQ marked disabled in kernel > * Host interrupt handler exits without clearing the interrupt on the device > * Eventfd is delivered to userspace > * Host interrupt controller sees still-active INTX, reasserts IRQ > * Host kernel ignores disabled IRQ > * Guest processes IRQ and clears device interrupt > * Software mask removed by VFIO driver This isn't the sequence that was previously identified as the issue. An interrupt controller that reasserts the IRQ when it remains active is not susceptible to the issue, and is what I think we generally expect on x86. I understand that we believe this issue manifests exactly because the interrupt controller does not reassert an interrupt that remains active. I think the sequence is: * device asserts INTx * vfio_intx_handler() calls disable_irq_nosync() to mark IRQ disabled * interrupt delivered to userspace via eventfd * userspace driver/VM services interrupt * device de-asserts INTx * device re-asserts INTx * interrupt received while IRQ disabled is masked at controller * VMM performs EOI via unmask ioctl, enable_irq() clears disable and unmasks IRQ * interrupt controller does not reassert interrupt to the host The fix then, aiui, is that disabling the unlazy mode masks the interrupt at disable_irq_nosync(), the same sequence of de-asserting and re-asserting the interrupt occurs at the controller, but since the controller was masked at the new rising edge, it will send the interrupt when umasked. > The behavior is now platform-dependent. Some platforms (amd64) will continue > to spew IRQs for as long as the INTX line remains asserted, therefore the IRQ > will be handled by the host as soon as the mask is dropped. Others (ppc64) > will > only send the one request, and if it is not handled no further interrupts will > be sent. The former behavior theoretically leaves the system vulnerable to > interrupt storm, and the latter will result in the device stalling after > receiving exactly one interrupt in the guest. > > Work around this by disabling lazy IRQ masking for DisINTx- INTx devices. > --- > drivers/vfio/pci/vfio_pci_intrs.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c > b/drivers/vfio/pci/vfio_pci_intrs.c > index 123298a4dc8f..d8637b53d051 100644 > --- a/drivers/vfio/pci/vfio_pci_intrs.c > +++ b/drivers/vfio/pci/vfio_pci_intrs.c > @@ -304,6 +304,9 @@ static int vfio_intx_enable(struct vfio_pci_core_device > *vdev, > > vdev->irq_type = VFIO_PCI_INTX_IRQ_INDEX; > > + if (!vdev->pci_2_3) > + irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY); > + > ret = request_irq(pdev->irq, vfio_intx_handler, > irqflags, ctx->name, ctx); > if (ret) { This branch is an example of where we're not clearing the flag on error. Thanks, Alex > @@ -352,6 +355,8 @@ static void vfio_intx_disable(struct vfio_pci_core_device > *vdev) > vfio_virqfd_disable(&ctx->unmask); > vfio_virqfd_disable(&ctx->mask); > free_irq(pdev->irq, ctx); > + if (!vdev->pci_2_3) > + irq_clear_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY); > if (ctx->trigger) > eventfd_ctx_put(ctx->trigger); > kfree(ctx->name);