Hi Christian,

On 13/03/2026 09:21, Christian König wrote:
> On 3/12/26 19:45, Matt Evans wrote:
>> Hi all,
>>
>>
>> There were various suggestions in the September 2025 thread "[TECH
>> TOPIC] vfio, iommufd: Enabling user space drivers to vend more
>> granular access to client processes" [0], and LPC discussions, around
>> improving the situation for multi-process userspace driver designs.
>> This RFC series implements some of these ideas.
>>
>> (Thanks for feedback on v1!  Revised series, with changes noted
>> inline.)
>>
>> Background: Multi-process USDs
>> ==============================
>>
>> The userspace driver scenario discussed in that thread involves a
>> primary process driving a PCIe function through VFIO/iommufd, which
>> manages the function-wide ownership/lifecycle.  The function is
>> designed to provide multiple distinct programming interfaces (for
>> example, several independent MMIO register frames in one function),
>> and the primary process delegates control of these interfaces to
>> multiple independent client processes (which do the actual work).
>> This scenario clearly relies on a HW design that provides appropriate
>> isolation between the programming interfaces.
>>
>> The two key needs are:
>>
>>  1.  Mechanisms to safely delegate a subset of the device MMIO
>>      resources to a client process without over-sharing wider access
>>      (or influence over whole-device activities, such as reset).
>>
>>  2.  Mechanisms to allow a client process to do its own iommufd
>>      management w.r.t. its address space, in a way that's isolated
>>      from DMA relating to other clients.
>>
>>
>> mmap() of VFIO DMABUFs
>> ======================
>>
>> This RFC addresses #1 in "vfio/pci: Support mmap() of a VFIO DMABUF",
>> implementing the proposals in [0] to add mmap() support to the
>> existing VFIO DMABUF exporter.
>>
>> This enables a userspace driver to define DMABUF ranges corresponding
>> to sub-ranges of a BAR, and grant a given client (via a shared fd)
>> the capability to access (only) those sub-ranges.  The VFIO device fds
>> would be kept private to the primary process.  All the client can do
>> with that fd is map (or iomap via iommufd) that specific subset of
>> resources, and the impact of bugs/malice is contained.
>>
>>  (We'll follow up on #2 separately, as a related-but-distinct problem.
>>   PASIDs are one way to achieve per-client isolation of DMA; another
>>   could be sharing of a single IOVA space via 'constrained' iommufds.)
>>
>>
>> New in v2: To achieve this, the existing VFIO BAR mmap() path is
>> converted to use DMABUFs behind the scenes, in "vfio/pci: Convert BAR
>> mmap() to use a DMABUF" plus new helper functions, as Jason/Christian
>> suggested in the v1 discussion [3].
>>
>> This means:
>>
>>  - Both regular and new DMABUF BAR mappings share the same vm_ops,
>>    i.e.  mmap()ing DMABUFs is a smaller change on top of the existing
>>    mmap().
>>
>>  - The zapping of mappings occurs via vfio_pci_dma_buf_move(), and the
>>    vfio_pci_zap_bars() originally paired with the _move()s can go
>>    away.  Each DMABUF has a unique address_space.
>>
>>  - It's a step towards future iommufd VFIO Type1 emulation
>>    implementing P2P, since iommufd can now get a DMABUF from a VA that
>>    it's mapping for IO; the VMAs' vm_file is that of the backing
>>    DMABUF.
>>
>>
>> Revocation/reclaim
>> ==================
>>
>> Mapping a BAR subset is useful, but the lifetime of access granted to
>> a client needs to be managed carefully.  For example, a protocol
>> between the primary process and the client can indicate when the
>> client is done and when it's safe to reuse the resources elsewhere,
>> but cleanup can't practically rely on the client cooperating.
>>
>> For robustness, we enable the driver to make the resources
>> guaranteed-inaccessible when it chooses, so that it can re-assign them
>> to other uses in future.
>>
>> "vfio/pci: Permanently revoke a DMABUF on request" adds a new VFIO
>> device fd ioctl, VFIO_DEVICE_PCI_DMABUF_REVOKE.  This takes a DMABUF
>> fd parameter previously exported (from that device!) and permanently
>> revokes the DMABUF.  This notifies/detaches importers, zaps PTEs for
>> any mappings, and guarantees no future attachment/import/map/access is
>> possible by any means.
>>
>> A primary driver process would use this operation when the client's
>> tenure ends to reclaim "loaned-out" MMIO interfaces, at which point
>> the interfaces could be safely re-used.
>>
>> New in v2: ioctl() on VFIO driver fd, rather than DMABUF fd.  A DMABUF
>> is revoked using code common to vfio_pci_dma_buf_move(), selectively
>> zapping mappings (after waiting for completion on the
>> dma_buf_invalidate_mappings() request).
>>
>>
>> BAR mapping access attributes
>> =============================
>>
>> Inspired by Alex [Mastro] and Jason's comments in [0] and Mahmoud's
>> work in [1] with the goal of controlling CPU access attributes for
>> VFIO BAR mappings (e.g. WC), we can decorate DMABUFs with access
>> attributes that are then used by a mapping's PTEs.
>>
>> I've proposed reserving a field in struct
>> vfio_device_feature_dma_buf's flags to specify an attribute for its
>> ranges.  Although that keeps the (UAPI) struct unchanged, it means all
>> ranges in a DMABUF share the same attribute.  I feel a single
>> attribute-to-mmap() relation is logical/reasonable.  An application
>> can also create multiple DMABUFs to describe any BAR layout and mix of
>> attributes.
>>
>>
>> Tests
>> =====
>>
>> (Still sharing the [RFC ONLY] userspace test/demo program for context,
>> not for merge.)
>>
>> It illustrates & tests various map/revoke cases, but doesn't use the
>> existing VFIO selftests and relies on a (tweaked) QEMU EDU function.
>> I'm (still) working on integrating the scenarios into the existing
>> VFIO selftests.
>>
>> This code has been tested with DMABUFs of single and multiple ranges,
>> aliasing mmap()s, aliasing ranges across DMABUFs, vm_pgoff > 0,
>> revocation, and shutdown/cleanup scenarios, and hugepage mappings also
>> seem to work correctly.  I've lightly tested WC mappings too (by
>> observing that the resulting PTEs have the correct attributes).
>>
>>
>> Fin
>> ===
>>
>> v2 is based on next-20260310 (to build on Leon's recent series
>> "vfio: Wait for dma-buf invalidation to complete" [2]).
>>
>>
>> Please share your thoughts!  I'd like to de-RFC if we feel this
>> approach is now sound.
> 
> I only skimmed over it, but at least offhand I couldn't find anything
> fundamentally wrong.

Thank you!

> The locking order seems to change in patch #6. In general I strongly
> recommend enabling lockdep while testing anyway, but especially when I see
> such changes.

Definitely +1 on testing with lockdep.

Note that patch #6 doesn't [intend to] change the locking; the naming of
the existing vfio_pci_zap_and_down_write_memory_lock() is potentially
confusing because _really_ it's
vfio_pci_down_write_memory_lock_and_zap().  Patch #6 is replacing that
with _just_ the existing down_write(&memory_lock) part.

(FWIW, lockdep's happy when running the test scenarios on this series.)

> In addition to that, it might also be a good idea to have a lockdep initcall
> function which defines the locking order that all the VFIO code should
> follow.
> 
> See the function dma_resv_lockdep() for an example of how to do that.
> Especially with mmap support and all the locks involved there, it has proven
> to be good practice to have something like that.

That's a good suggestion; I'll investigate, and thanks for the pointer.
I spent time stepping through the locking particularly in the revoke
path, and automation here would be pretty useful if possible.


Thanks and regards,


Matt


> 
> Regards,
> Christian.
> 
>>
>>
>> Many thanks,
>>
>>
>> Matt
>>
>>
>>
>> References:
>>
>> [0]: https://lore.kernel.org/linux-iommu/[email protected]/
>> [1]: https://lore.kernel.org/all/[email protected]/
>> [2]: https://lore.kernel.org/linux-iommu/20260205-nocturnal-poetic-chamois-f566ad@houat/T/#m310cd07011e3a1461b6fda45e3f9b886ba76571a
>> [3]: https://lore.kernel.org/all/[email protected]/
>>
>> --------------------------------------------------------------------------------
>> Changelog:
>>
>> v2:  Respin based on the feedback/suggestions:
>>
>> - Transform the existing VFIO BAR mmap path to also use DMABUFs behind
>>   the scenes, and then simply share that code for explicitly-mapped
>>   DMABUFs.
>>
>> - Refactor the export itself out of vfio_pci_core_feature_dma_buf,
>>   sharing it via a new vfio_pci_core_mmap_prep_dmabuf helper used by
>>   the regular VFIO mmap to create a DMABUF.
>>
>> - Revoke buffers using a VFIO device fd ioctl
>>
>> v1: https://lore.kernel.org/all/[email protected]/ 
>>
>>
>> Matt Evans (10):
>>   vfio/pci: Set up VFIO barmap before creating a DMABUF
>>   vfio/pci: Clean up DMABUFs before disabling function
>>   vfio/pci: Add helper to look up PFNs for DMABUFs
>>   vfio/pci: Add a helper to create a DMABUF for a BAR-map VMA
>>   vfio/pci: Convert BAR mmap() to use a DMABUF
>>   vfio/pci: Remove vfio_pci_zap_bars()
>>   vfio/pci: Support mmap() of a VFIO DMABUF
>>   vfio/pci: Permanently revoke a DMABUF on request
>>   vfio/pci: Add mmap() attributes to DMABUF feature
>>   [RFC ONLY] selftests: vfio: Add standalone vfio_dmabuf_mmap_test
>>
>>  drivers/vfio/pci/Kconfig                      |   3 +-
>>  drivers/vfio/pci/Makefile                     |   3 +-
>>  drivers/vfio/pci/vfio_pci_config.c            |  18 +-
>>  drivers/vfio/pci/vfio_pci_core.c              | 123 +--
>>  drivers/vfio/pci/vfio_pci_dmabuf.c            | 425 +++++++--
>>  drivers/vfio/pci/vfio_pci_priv.h              |  46 +-
>>  include/uapi/linux/vfio.h                     |  42 +-
>>  tools/testing/selftests/vfio/Makefile         |   1 +
>>  .../vfio/standalone/vfio_dmabuf_mmap_test.c   | 837 ++++++++++++++++++
>>  9 files changed, 1339 insertions(+), 159 deletions(-)
>>  create mode 100644 
>> tools/testing/selftests/vfio/standalone/vfio_dmabuf_mmap_test.c
>>
> 
