On Thu, May 14, 2026 at 3:48 PM Alex Williamson <[email protected]> wrote: > > On Thu, 14 May 2026 14:49:28 -0700 > Josh Hilke <[email protected]> wrote: > > > On Thu, May 14, 2026 at 9:28 AM Alex Williamson <[email protected]> wrote: > > > > > > On Wed, 13 May 2026 23:33:02 +0000 > > > David Matlack <[email protected]> wrote: > > > > > > > On 2026-05-13 11:49 AM, Josh Hilke wrote: > > > > > On Tue, May 12, 2026 at 7:12 PM Josh Hilke <[email protected]> wrote: > > > > > > On Mon, May 11, 2026 at 4:45 PM David Matlack <[email protected]> > > > > > > wrote: > > > > > > > On 2026-05-11 09:18 PM, Josh Hilke wrote: > > > > > > > > > > > > + retries = 100; > > > > > > > > + while (retries-- > 0) { > > > > > > > > + if (rx->wb.status_error & 1) > > > > > > > > + break; > > > > > > > > + usleep(10); > > > > > > > > + } > > > > > > > > > > > > > > Why bail after a certain timeout? The test may have kicked off a > > > > > > > large > > > > > > > count of memcpys. Is this for error detection? > > > > > > > > > > > > The bailout was intended to detect errors during development. > > > > > > Shouldn't need it anymore. I'll remove it in v2. > > > > > > > > > > Sorry, I forgot: we need the timeout to detect DMA errors for the > > > > > memcpy_from_unmapped_iova test in vfio_pci_driver_test. The test > > > > > triggers an IOMMU fault because the IOVA is unmapped, and the IOMMU > > > > > aborts the DMA operation. However, the QEMU IGB implementation does > > > > > not set an error bit, so timing out is our only method for error > > > > > detection. > > > > > > > > Hm... that's going to be tricky then. This means we would have to set > > > > the timeout to longer than the longest possible memcpy duration to avoid > > > > false negatives? That means we'll have to set the timeout to quite long. > > > > > > FWIW, I had AI churn on trying to make this work on a physical 82576 as > > > I have several of these in my local machines as sort of the defacto, > > > readily available SR-IOV NIC. The AI got up to 30/35 tests passing but > > > is currently stuck that the queues stall in the mix-and-match test when > > > it's trying to DMA from an unmapped IOVA. So far none of the in-band > > > methods to kick the queues seem to work, I'm not sure if we'll need to > > > resort to an FLR. > > > > > > I'd be happy to send the changes it's made so far if you want to > > > validate and incorporate, or have any thoughts to kicking it after the > > > IOMMU fault. Some of the changes are related to timeouts, where QEMU > > > loopback is actually faster than bare metal since the physical queues > > > run at 1Gbps even in loopback mode. > > > > > > I'll also plant the seed that if we do have outstanding issues for a > > > driver that binds to a real world device, but only works on the > > > emulated version of that device... how do we handle that? In part, I > > > think it's emulated in QEMU because it is so ubiquitous. I'm also > > > hoping to use the same device for the new SR-IOV selftests. Thanks, > > > > > > Alex > > > > I'm glad you're interested in this as well! > > > > Unfortunately (and ironically) I don't have access to a physical device. > > > > Regarding driver support for the real vs. emulated device, I think we > > should prioritize supporting the emulated version. This approach > > unlocks the ability for anyone to run VFIO/Live Update tests without > > needing specific hardware. Once that's done, other folks can add > > patches to update the driver if they want to use the physical device. > > What do you think? > > > > I plan to add SR-IOV support to this driver in a separate series after > > the driver merges. > > I've got it working now, I'll need to do some cleanup and verification, > but it's not too bad (imo), several places where QEMU emulation only > supports one mode and we don't fully configure the device or don't > account for physical hardware limitations or timing. The most > significant change is in resolving the issue above, that once the queue > gets wedged from DMA errors, VFIO_DEVICE_RESET unfortunately seems to > be the most straightforward mechanism to get it unstuck. That may > result in some initialization refactoring and plumbing to restore the > interrupt state. Thanks, > > Alex
Alex, how do you want to incorporate your changes? Do you want to send out a v2 of this driver? Or should I continue implementing my v2, and then you send a follow-up patch? I'm fine with either option.

