On Thu, May 14, 2026 at 3:48 PM Alex Williamson <[email protected]> wrote:
>
> On Thu, 14 May 2026 14:49:28 -0700
> Josh Hilke <[email protected]> wrote:
>
> > On Thu, May 14, 2026 at 9:28 AM Alex Williamson <[email protected]> wrote:
> > >
> > > On Wed, 13 May 2026 23:33:02 +0000
> > > David Matlack <[email protected]> wrote:
> > >
> > > > On 2026-05-13 11:49 AM, Josh Hilke wrote:
> > > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <[email protected]> wrote:
> > > > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <[email protected]> wrote:
> > > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote:
> > > >
> > > > > > > > +     retries = 100;
> > > > > > > > +     while (retries-- > 0) {
> > > > > > > > +             if (rx->wb.status_error & 1)
> > > > > > > > +                     break;
> > > > > > > > +             usleep(10);
> > > > > > > > +     }
> > > > > > >
> > > > > > > > Why bail after a certain timeout? The test may have kicked off a large
> > > > > > > > count of memcpys. Is this for error detection?
> > > > > >
> > > > > > The bailout was intended to detect errors during development.
> > > > > > Shouldn't need it anymore. I'll remove it in v2.
> > > > >
> > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the
> > > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test
> > > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU
> > > > > aborts the DMA operation. However, the QEMU IGB implementation does
> > > > > not set an error bit, so timing out is our only method for error
> > > > > detection.
> > > >
> > > > Hm... that's going to be tricky then. This means we would have to set
> > > > the timeout to longer than the longest possible memcpy duration to avoid
> > > > false negatives? That means we'll have to set the timeout to quite long.
> > >
> > > FWIW, I had AI churn on trying to make this work on a physical 82576 as
> > > I have several of these in my local machines as sort of the de facto,
> > > readily available SR-IOV NIC.  The AI got up to 30/35 tests passing but
> > > is currently stuck that the queues stall in the mix-and-match test when
> > > it's trying to DMA from an unmapped IOVA.  So far none of the in-band
> > > methods to kick the queues seem to work, I'm not sure if we'll need to
> > > resort to an FLR.
> > >
> > > I'd be happy to send the changes it's made so far if you want to
> > > validate and incorporate, or have any thoughts to kicking it after the
> > > IOMMU fault.  Some of the changes are related to timeouts, where QEMU
> > > loopback is actually faster than bare metal since the physical queues
> > > run at 1Gbps even in loopback mode.
> > >
> > > I'll also plant the seed that if we do have outstanding issues for a
> > > driver that binds to a real world device, but only works on the
> > > emulated version of that device... how do we handle that?  In part, I
> > > think it's emulated in QEMU because it is so ubiquitous.  I'm also
> > > hoping to use the same device for the new SR-IOV selftests.  Thanks,
> > >
> > > Alex
> >
> > I'm glad you're interested in this as well!
> >
> > Unfortunately (and ironically) I don't have access to a physical device.
> >
> > Regarding driver support for the real vs. emulated device, I think we
> > should prioritize supporting the emulated version. This approach
> > unlocks the ability for anyone to run VFIO/Live Update tests without
> > needing specific hardware. Once that's done, other folks can add
> > patches to update the driver if they want to use the physical device.
> > What do you think?
> >
> > I plan to add SR-IOV support to this driver in a separate series after
> > the driver merges.
>
> I've got it working now, I'll need to do some cleanup and verification,
> but it's not too bad (imo), several places where QEMU emulation only
> supports one mode and we don't fully configure the device or don't
> account for physical hardware limitations or timing.  The most
> significant change is in resolving the issue above, that once the queue
> gets wedged from DMA errors, VFIO_DEVICE_RESET unfortunately seems to
> be the most straightforward mechanism to get it unstuck.  That may
> result in some initialization refactoring and plumbing to restore the
> interrupt state. Thanks,
>
> Alex

Alex, how do you want to incorporate your changes? Do you want to send
out a v2 of this driver? Or should I continue implementing my v2, and
then you send a follow-up patch? I'm fine with either option.
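
For reference while we sort that out, the completion timeout discussed
above could be expressed as a wall-clock deadline instead of a fixed
retry count, so it can be tuned separately for QEMU and for 1Gbps
hardware. This is only a rough sketch, not the code from the patch:
wait_for_desc_done(), now_ns(), and the timeout_ns parameter are
hypothetical names, and the done bit is passed in by the caller rather
than hard-coded.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: poll a descriptor write-back status word until
 * done_bit is set or timeout_ns elapses.
 *
 * Returns 0 once the bit is observed, or -ETIMEDOUT if the deadline
 * passes first. When the device reports no error bit, the timeout is
 * the only signal that the DMA was aborted.
 */
static int wait_for_desc_done(const volatile uint32_t *status,
			      uint32_t done_bit, uint64_t timeout_ns)
{
	/* Sleep 10us between polls, matching the interval in the patch. */
	const struct timespec delay = { .tv_sec = 0, .tv_nsec = 10 * 1000 };
	uint64_t deadline;

	assert(status);
	deadline = now_ns() + timeout_ns;

	for (;;) {
		if (*status & done_bit)
			return 0;
		if (now_ns() >= deadline)
			return -ETIMEDOUT;
		nanosleep(&delay, NULL);
	}
}
```

A caller would pass the write-back status field and bit 0, e.g.
wait_for_desc_done(&status_word, 1, timeout_ns), and treat -ETIMEDOUT
as "the DMA did not complete". The timeout still has to exceed the
longest legitimate memcpy batch, so this does not remove the concern
David raised; it only makes the bound explicit.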
