On Wed, 13 May 2026 23:33:02 +0000
David Matlack <[email protected]> wrote:

> On 2026-05-13 11:49 AM, Josh Hilke wrote:
> > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <[email protected]> wrote:  
> > > On Mon, May 11, 2026 at 4:45 PM David Matlack <[email protected]> 
> > > wrote:  
> > > > On 2026-05-11 09:18 PM, Josh Hilke wrote:  
> 
> > > > > +     retries = 100;
> > > > > +     while (retries-- > 0) {
> > > > > +             if (rx->wb.status_error & 1)
> > > > > +                     break;
> > > > > +             usleep(10);
> > > > > +     }  
> > > >
> > > > Why bail after a certain timeout? The test may have kicked off a large
> > > > number of memcpys. Is this for error detection?  
> > >
> > > The bailout was intended to detect errors during development.
> > > Shouldn't need it anymore. I'll remove it in v2.  
> > 
> > Sorry, I forgot: we need the timeout to detect DMA errors for the
> > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test
> > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU
> > aborts the DMA operation. However, the QEMU IGB implementation does
> > not set an error bit, so timing out is our only method for error
> > detection.  
> 
> Hm... that's going to be tricky then. It means we'd have to set the
> timeout longer than the longest possible memcpy duration to avoid
> false negatives, which could make the timeout quite long.

FWIW, I had AI churn on trying to make this work on a physical 82576,
as I have several of these in my local machines as sort of the de
facto, readily available SR-IOV NIC.  The AI got up to 30/35 tests
passing but is currently stuck: the queues stall in the mix-and-match
test when it tries to DMA from an unmapped IOVA.  So far none of the
in-band methods to kick the queues seem to work; I'm not sure if we'll
need to resort to an FLR.

I'd be happy to send the changes it's made so far if you want to
validate and incorporate them, or if you have any thoughts on kicking
the queues after the IOMMU fault.  Some of the changes are related to
timeouts, where QEMU loopback is actually faster than bare metal since
the physical queues run at 1Gbps even in loopback mode.

I'll also plant the seed: if we have outstanding issues for a driver
that binds to a real-world device but only works on the emulated
version of that device... how do we handle that?  In part, I think
it's emulated in QEMU because it is so ubiquitous.  I'm also hoping to
use the same device for the new SR-IOV selftests.  Thanks,

Alex
