On Mon, 17 Oct 2022 13:54:03 +0300 Andrey Ryabinin <a...@yandex-team.com> wrote:
> These patches add the possibility to pass a VFIO device to QEMU using
> file descriptors of the VFIO container/group, instead of having QEMU
> create those. This allows taking away permission to open /dev/vfio/*
> from QEMU and delegating that to a management layer like libvirt.
>
> The VFIO API doesn't allow passing just the fd of the device, since we
> also need the VFIO container and group. So these patches allow passing
> an already-created VFIO container/group to QEMU via command line/QMP,
> e.g. like this:
>
>     -object vfio-container,id=ct,fd=5 \
>     -object vfio-group,id=grp,fd=6,container=ct \
>     -device vfio-pci,host=05:00.0,group=grp

This suggests that management tools need to become intimately familiar
with container and group association restrictions for implicit
dependencies, such as device AddressSpace. We had considered this
before and intentionally chosen to allow QEMU to manage that
relationship. Things like PCI bus type and presence of a vIOMMU factor
into these relationships.

In the above example, what happens in a mixed environment, for example
if we then add '-device vfio-pci,host=06:00.0' to the command line?
Isn't QEMU still going to try to re-use the container if it exists in
the same address space? Potentially this device could also be a member
of the same group. How would the management tool know when to expect
the provided fds to be released?

We also have an outstanding RFC for iommufd that already proposes an fd
passing interface, where iommufd removes many of the issues of the vfio
container by supporting multiple address spaces within a single fd
context, avoiding the duplicate locked page accounting issues between
containers, and proposing a direct device fd interface for vfio. Why at
this point in time would we choose to expand the QEMU vfio interface in
this way? Thanks,

Alex

> A somewhat more detailed example can be found in the test:
> tests/avocado/vfio.py
>
> *Possible future steps*
>
> These patches could also be a step toward local migration (within one
> host) of a QEMU instance with VFIO devices.
> I've built a prototype on top of these patches to try this idea.
> In short, the scheme of such a migration is the following:
>  - migrate the source VM to a file;
>  - retrieve the fd numbers of the VFIO container/group/device via a new
>    property and the qom-get command;
>  - get the actual file descriptors via SCM_RIGHTS using a new QMP
>    command 'returnfd', which sends an fd from QEMU by its number:
>      { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shut down the source VM;
>  - launch the destination VM, plugging the VFIO devices using the
>    obtained file descriptors;
>  - the PCI device reset during device plug is avoided with the help of
>    a new parameter on the vfio-pci device.
>
> This is an alternative to the 'cpr-exec' migration scheme proposed here:
> https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sist...@oracle.com/
> Unlike cpr-exec, it doesn't require the new kernel flags
> VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR,
> and it doesn't require a new migration mode, just some additional steps
> from the management layer.
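For reference, the receiving half of the SCM_RIGHTS step described
above follows the standard recvmsg() ancillary-data pattern. Below is a
minimal C sketch of what the management side could look like; only the
cmsg handling reflects the actual kernel API, while the function name
and the one-byte payload framing are illustrative assumptions, not part
of the proposed 'returnfd' command:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Receive one fd sent with SCM_RIGHTS over a connected
     * Unix-domain socket; returns the fd, or -1 on error. */
    static int recv_fd(int sock)
    {
        char data;
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        union {
            /* union guarantees cmsghdr alignment of the buffer */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(sock, &msg, 0) < 0) {
            return -1;
        }
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
             cmsg = CMSG_NXTHDR(&msg, cmsg)) {
            if (cmsg->cmsg_level == SOL_SOCKET &&
                cmsg->cmsg_type == SCM_RIGHTS) {
                memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
            }
        }
        return fd;
    }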
>
> Andrey Ryabinin (4):
>   vfio: add vfio-container user createable object
>   vfio: add vfio-group user createable object
>   vfio: Add 'group' property to 'vfio-pci' device
>   tests/avocado/vfio: add test for vfio devices
>
>  hw/vfio/ap.c                  |   2 +-
>  hw/vfio/ccw.c                 |   2 +-
>  hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/platform.c            |   2 +-
>  hw/vfio/trace-events          |   4 +-
>  include/hw/vfio/vfio-common.h |  10 +-
>  qapi/qom.json                 |  29 +++
>  tests/avocado/vfio.py         | 156 +++++++++++
>  9 files changed, 525 insertions(+), 161 deletions(-)
>  create mode 100644 tests/avocado/vfio.py
>
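As background for the container/group discussion above, this is roughly
what a management layer would have to do to produce the fds consumed by
the proposed -object vfio-container/vfio-group options. The ioctl
sequence is the documented legacy VFIO (type1) setup flow from the
kernel's vfio documentation; the group number is a made-up example, and
where exactly the split between the management layer and QEMU falls
under this series (e.g. who calls VFIO_SET_IOMMU) is not asserted here:

    #include <fcntl.h>
    #include <linux/vfio.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(void)
    {
        /* Opening the container node is the privilege this series
         * wants to move out of QEMU. */
        int container = open("/dev/vfio/vfio", O_RDWR);

        /* The group number comes from the device's
         * /sys/bus/pci/devices/.../iommu_group link; 26 is made up. */
        int group = open("/dev/vfio/26", O_RDWR);

        if (container < 0 || group < 0) {
            return 1;
        }
        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION ||
            !ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
            return 1; /* unknown API or no type1 IOMMU support */
        }

        struct vfio_group_status status = { .argsz = sizeof(status) };
        if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
            !(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
            /* all devices in the group must be bound to vfio drivers */
            return 1;
        }

        /* Attach the group to the container; these two fds are what
         * the fd=5 / fd=6 example above would hand to QEMU. */
        if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0) {
            return 1;
        }

        printf("container fd: %d, group fd: %d\n", container, group);
        return 0;
    }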