Hi Christian,

On 13/03/2026 09:21, Christian König wrote:
> On 3/12/26 19:45, Matt Evans wrote:
>> Hi all,
>>
>>
>> There were various suggestions in the September 2025 thread "[TECH
>> TOPIC] vfio, iommufd: Enabling user space drivers to vend more
>> granular access to client processes" [0], and LPC discussions, around
>> improving the situation for multi-process userspace driver designs.
>> This RFC series implements some of these ideas.
>>
>> (Thanks for feedback on v1! Revised series, with changes noted
>> inline.)
>>
>> Background: Multi-process USDs
>> ==============================
>>
>> The userspace driver scenario discussed in that thread involves a
>> primary process driving a PCIe function through VFIO/iommufd, which
>> manages the function-wide ownership/lifecycle. The function is
>> designed to provide multiple distinct programming interfaces (for
>> example, several independent MMIO register frames in one function),
>> and the primary process delegates control of these interfaces to
>> multiple independent client processes (which do the actual work).
>> This scenario clearly relies on a HW design that provides appropriate
>> isolation between the programming interfaces.
>>
>> The two key needs are:
>>
>> 1. Mechanisms to safely delegate a subset of the device MMIO
>>    resources to a client process without over-sharing wider access
>>    (or influence over whole-device activities, such as reset).
>>
>> 2. Mechanisms to allow a client process to do its own iommufd
>>    management w.r.t. its address space, in a way that's isolated
>>    from DMA relating to other clients.
>>
>>
>> mmap() of VFIO DMABUFs
>> ======================
>>
>> This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
>> implementing the proposals in [0] to add mmap() support to the
>> existing VFIO DMABUF exporter.
>>
>> This enables a userspace driver to define DMABUF ranges corresponding
>> to sub-ranges of a BAR, and grant a given client (via a shared fd)
>> the capability to access (only) those sub-ranges. The VFIO device fds
>> would be kept private to the primary process. All the client can do
>> with that fd is map (or iomap via iommufd) that specific subset of
>> resources, and the impact of bugs/malice is contained.
>>
>> (We'll follow up on #2 separately, as a related-but-distinct problem.
>> PASIDs are one way to achieve per-client isolation of DMA; another
>> could be sharing of a single IOVA space via 'constrained' iommufds.)
>>
>>
>> New in v2: To achieve this, the existing VFIO BAR mmap() path is
>> converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
>> mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
>> suggested in the v1 discussion [3].
>>
>> This means:
>>
>> - Both regular and new DMABUF BAR mappings share the same vm_ops,
>>   i.e. mmap()ing DMABUFs is a smaller change on top of the existing
>>   mmap().
>>
>> - The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
>>   vfio_pci_zap_bars() originally paired with the _move()s can go
>>   away. Each DMABUF has a unique address_space.
>>
>> - It's a step towards future iommufd VFIO Type1 emulation
>>   implementing P2P, since iommufd can now get a DMABUF from a VA that
>>   it's mapping for IO; the VMAs' vm_file is that of the backing
>>   DMABUF.
>>
>>
>> Revocation/reclaim
>> ==================
>>
>> Mapping a BAR subset is useful, but the lifetime of access granted to
>> a client needs to be managed well. For example, a protocol between
>> the primary process and the client can indicate when the client is
>> done, and when it's safe to reuse the resources elsewhere, but cleanup
>> can't practically be cooperative.
>>
>> For robustness, we enable the driver to make the resources
>> guaranteed-inaccessible when it chooses, so that it can re-assign them
>> to other uses in future.
>>
>> "vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
>> device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE. This takes a DMABUF
>> fd parameter previously exported (from that device!) and permanently
>> revokes the DMABUF. This notifies/detaches importers, zaps PTEs for
>> any mappings, and guarantees no future attachment/import/map/access is
>> possible by any means.
>>
>> A primary driver process would use this operation when the client's
>> tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
>> the interfaces could be safely re-used.
>>
>> New in v2: ioctl() on VFIO driver fd, rather than DMABUF fd. A DMABUF
>> is revoked using code common to vfio_pci_dma_buf_move(), selectively
>> zapping mappings (after waiting for completion on the
>> dma_buf_invalidate_mappings() request).
>>
>>
>> BAR mapping access attributes
>> =============================
>>
>> Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
>> work in [1] with the goal of controlling CPU access attributes for
>> VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
>> attributes that are then used by a mapping's PTEs.
>>
>> I've proposed reserving a field in struct
>> vfio_device_feature_dma_buf's flags to specify an attribute for its
>> ranges. Although that keeps the (UAPI) struct unchanged, it means all
>> ranges in a DMABUF share the same attribute. I feel a single
>> attribute-to-mmap() relation is logical/reasonable. An application
>> can also create multiple DMABUFs to describe any BAR layout and mix of
>> attributes.
>>
>>
>> Tests
>> =====
>>
>> (Still sharing the [RFC ONLY] userspace test/demo program for context,
>> not for merge.)
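(Another aside for context: the reclaim rule from the revocation
section above, i.e. an interface may only be handed to a new client
once revoke has succeeded, is easy to state as bookkeeping in the
primary. A sketch follows; the actual ioctl request and argument
layout are whatever this series defines, so the call is stubbed out
behind do_revoke() and everything else is plain state tracking of my
own invention, not code from the series.)

```c
/* Sketch of the primary's loan bookkeeping around the proposed
 * VFIO_DEVICE_PCI_DMABUF_REVOKE operation.  do_revoke() stands in for
 * ioctl(device_fd, VFIO_DEVICE_PCI_DMABUF_REVOKE, ...); its exact
 * UAPI shape is defined by the series, so it's stubbed here. */

enum loan_state { LOAN_FREE, LOAN_OUT, LOAN_REVOKED };

struct loan {
	int dmabuf_fd;		/* fd previously exported for the client */
	enum loan_state state;
};

/* Placeholder for the real revoke ioctl; returns 0 on success. */
static int do_revoke(int device_fd, int dmabuf_fd)
{
	(void)device_fd;
	(void)dmabuf_fd;
	return 0;
}

/* The interface may be re-used only after revoke succeeds: from that
 * point no attach/import/map/access via the old fd is possible, so
 * re-assignment is safe even if the old client misbehaves. */
static int reclaim(int device_fd, struct loan *l)
{
	if (l->state != LOAN_OUT)
		return -1;
	if (do_revoke(device_fd, l->dmabuf_fd))
		return -1;
	l->state = LOAN_REVOKED;
	return 0;
}

static int reusable(const struct loan *l)
{
	return l->state == LOAN_REVOKED || l->state == LOAN_FREE;
}
```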
>>
>> It illustrates & tests various map/revoke cases, but doesn't use the
>> existing VFIO selftests and relies on a (tweaked) QEMU EDU function.
>> I'm (still) working on integrating the scenarios into the existing
>> VFIO selftests.
>>
>> This code has been tested with mapping DMABUFs of single/multiple
>> ranges, aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff >
>> 0, revocation, shutdown/cleanup scenarios, and hugepage mappings; all
>> of these seem to work correctly. I've also lightly tested WC mappings
>> (by observing that the resulting PTEs have the correct
>> attributes...).
>>
>>
>> Fin
>> ===
>>
>> v2 is based on next-20260310 (to build on Leon's recent series
>> "vfio: Wait for dma-buf invalidation to complete" [2]).
>>
>>
>> Please share your thoughts! I'd like to de-RFC if we feel this
>> approach is now fair.
>
> I only skimmed over it, but at least off-hand I couldn't find anything
> fundamentally wrong.
Thank you!

> The locking order seems to change in patch #6. In general I strongly
> recommend enabling lockdep while testing anyway, but especially when I
> see such changes.

I'll definitely +1 testing with lockdep. Note that patch #6 doesn't
[intend to] change the locking; the naming of the existing
vfio_pci_zap_and_down_write_memory_lock() is potentially confusing
because _really_ it's vfio_pci_down_write_memory_lock_and_zap().
Patch #6 replaces that with _just_ the existing
down_write(&memory_lock) part.

(FWIW, lockdep's happy when running the test scenarios on this series.)

> In addition to that, it might also be a good idea to have a lockdep
> initcall function which defines the locking order that all the VFIO
> code should follow.
>
> See function dma_resv_lockdep() for an example of how to do that.
> Especially with mmap support and all the locks involved there, it has
> proven to be good practice to have something like that.

That's a good suggestion; I'll investigate, and thanks for the pointer.
I spent time stepping through the locking, particularly in the revoke
path, and automation here would be pretty useful if possible.

Thanks and regards,

Matt

>
> Regards,
> Christian.
>
>>
>>
>> Many thanks,
>>
>>
>> Matt
>>
>>
>>
>> References:
>>
>> [0]:
>> https://lore.kernel.org/linux-iommu/[email protected]/
>> [1]: https://lore.kernel.org/all/[email protected]/
>> [2]:
>> https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a
>>
>> [3]: https://lore.kernel.org/all/[email protected]/
>>
>> --------------------------------------------------------------------------------
>> Changelog:
>>
>> v2: Respin based on the feedback/suggestions:
>>
>> - Transform the existing VFIO BAR mmap path to also use DMABUFs behind
>>   the scenes, and then simply share that code for explicitly-mapped
>>   DMABUFs.
>>
>> - The export itself is refactored out of vfio_pci_core_feature_dma_buf
>>   and shared by a new vfio_pci_core_mmap_prep_dmabuf helper used by
>>   the regular VFIO mmap to create a DMABUF.
>>
>> - Revoke buffers using a VFIO device fd ioctl.
>>
>> v1: https://lore.kernel.org/all/[email protected]/
>>
>>
>> Matt Evans (10):
>>   vfio/pci: Set up VFIO barmap before creating a DMABUF
>>   vfio/pci: Clean up DMABUFs before disabling function
>>   vfio/pci: Add helper to look up PFNs for DMABUFs
>>   vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
>>   vfio/pci: Convert BAR mmap() to use a DMABUF
>>   vfio/pci: Remove vfio_pci_zap_bars()
>>   vfio/pci: Support mmap() of a VFIO DMABUF
>>   vfio/pci: Permanently revoke a DMABUF on request
>>   vfio/pci: Add mmap() attributes to DMABUF feature
>>   [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
>>
>>  drivers/vfio/pci/Kconfig                      |   3 +-
>>  drivers/vfio/pci/Makefile                     |   3 +-
>>  drivers/vfio/pci/vfio_pci_config.c            |  18 +-
>>  drivers/vfio/pci/vfio_pci_core.c              | 123 +--
>>  drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
>>  drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
>>  include/uapi/linux/vfio.h                     |  42 +-
>>  tools/testing/selftests/vfio/Makefile         |   1 +
>>  .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
>>  9 files changed, 1339 insertions(+), 159 deletions(-)
>>  create mode 100644
>> tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
>>
>
