On Mon, 17 Oct 2022 13:54:03 +0300 Andrey Ryabinin <a...@yandex-team.com> wrote:
> These patches add the possibility to pass a VFIO device to QEMU using
> file descriptors of the VFIO container/group, instead of having QEMU
> create those. This allows taking away permission to open /dev/vfio/*
> from QEMU and delegating that to a management layer like libvirt.
>
> The VFIO API doesn't allow passing just the fd of the device, since we
> also need the VFIO container and group. So these patches allow passing
> an already-created VFIO container/group to QEMU via command line/QMP,
> e.g. like this:
>
>     -object vfio-container,id=ct,fd=5 \
>     -object vfio-group,id=grp,fd=6,container=ct \
>     -device vfio-pci,host=05:00.0,group=grp

This suggests that management tools need to become intimately familiar
with container and group association restrictions for implicit
dependencies, such as device AddressSpace. We had considered this
before and intentionally chosen to allow QEMU to manage that
relationship. Things like PCI bus type and presence of a vIOMMU factor
into these relationships.

In the above example, what happens in a mixed environment, for example
if we then add '-device vfio-pci,host=06:00.0' to the command line?
Isn't QEMU still going to try to re-use the container if it exists in
the same address space? Potentially this device could also be a member
of the same group. How would the management tool know when to expect
the provided fds to be released?

We also have an outstanding RFC for iommufd that already proposes an fd
passing interface, where iommufd removes many of the issues of the vfio
container by supporting multiple address spaces within a single fd
context, avoiding the duplicate locked page accounting issues between
containers, and proposing a direct device fd interface for vfio. Why at
this point in time would we choose to expand the QEMU vfio interface in
this way? Thanks,

Alex

> A somewhat more detailed example can be found in the test:
> tests/avocado/vfio.py
>
> *Possible future steps*
>
> These patches could also be a step toward local migration (within one
> host) of a QEMU instance with VFIO devices.
> I've built a prototype on top of these patches to try this idea.
> In short, the scheme of such a migration is the following:
>  - migrate the source VM to a file;
>  - retrieve the fd numbers of the VFIO container/group/device via a new
>    property and the qom-get command;
>  - get the actual file descriptors via SCM_RIGHTS using a new QMP
>    command 'returnfd', which sends an fd from QEMU by its number:
>      { 'command': 'returnfd', 'data': {'fd': 'int'}}
>  - shut down the source VM;
>  - launch the destination VM, plugging the VFIO devices using the
>    obtained file descriptors;
>  - the PCI device reset during device plug is avoided with the help of
>    a new parameter on the vfio-pci device.
>
> This is an alternative to the 'cpr-exec' migration scheme proposed here:
> https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sist...@oracle.com/
> Unlike cpr-exec, it doesn't require the new kernel flags
> VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR,
> and it doesn't require a new migration mode, just some additional steps
> from the management layer.
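For reference, the receiving half of the SCM_RIGHTS step described
above follows the standard recvmsg() ancillary-data pattern. Below is a
minimal C sketch of what the management side could look like; only the
cmsg handling reflects the actual kernel API, while the function name
and the one-byte payload framing are illustrative assumptions, not part
of the proposed 'returnfd' command:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Receive one fd sent with SCM_RIGHTS over a connected
     * Unix-domain socket; returns the fd, or -1 on error. */
    static int recv_fd(int sock)
    {
        char data;
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        union {
            /* union guarantees cmsghdr alignment of the buffer */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cmsg;
        int fd = -1;

        if (recvmsg(sock, &msg, 0) < 0) {
            return -1;
        }
        for (cmsg = CMSG_FIRSTHDR(&msg); cmsg;
             cmsg = CMSG_NXTHDR(&msg, cmsg)) {
            if (cmsg->cmsg_level == SOL_SOCKET &&
                cmsg->cmsg_type == SCM_RIGHTS) {
                memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
            }
        }
        return fd;
    }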
>
> Andrey Ryabinin (4):
>   vfio: add vfio-container user createable object
>   vfio: add vfio-group user createable object
>   vfio: Add 'group' property to 'vfio-pci' device
>   tests/avocado/vfio: add test for vfio devices
>
>  hw/vfio/ap.c                  |   2 +-
>  hw/vfio/ccw.c                 |   2 +-
>  hw/vfio/common.c              | 471 +++++++++++++++++++++++-----------
>  hw/vfio/pci.c                 |  10 +-
>  hw/vfio/platform.c            |   2 +-
>  hw/vfio/trace-events          |   4 +-
>  include/hw/vfio/vfio-common.h |  10 +-
>  qapi/qom.json                 |  29 +++
>  tests/avocado/vfio.py         | 156 +++++++++++
>  9 files changed, 525 insertions(+), 161 deletions(-)
>  create mode 100644 tests/avocado/vfio.py
>
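As background for the container/group discussion above, this is roughly
what a management layer would have to do to produce the fds consumed by
the proposed -object vfio-container/vfio-group options. The ioctl
sequence is the documented legacy VFIO (type1) setup flow from the
kernel's vfio documentation; the group number is a made-up example, and
where exactly the split between the management layer and QEMU falls
under this series (e.g. who calls VFIO_SET_IOMMU) is not asserted here:

    #include <fcntl.h>
    #include <linux/vfio.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(void)
    {
        /* Opening the container node is the privilege this series
         * wants to move out of QEMU. */
        int container = open("/dev/vfio/vfio", O_RDWR);

        /* The group number comes from the device's
         * /sys/bus/pci/devices/.../iommu_group link; 26 is made up. */
        int group = open("/dev/vfio/26", O_RDWR);

        if (container < 0 || group < 0) {
            return 1;
        }
        if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION ||
            !ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
            return 1; /* unknown API or no type1 IOMMU support */
        }

        struct vfio_group_status status = { .argsz = sizeof(status) };
        if (ioctl(group, VFIO_GROUP_GET_STATUS, &status) < 0 ||
            !(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
            /* all devices in the group must be bound to vfio drivers */
            return 1;
        }

        /* Attach the group to the container; these two fds are what
         * the fd=5 / fd=6 example above would hand to QEMU. */
        if (ioctl(group, VFIO_GROUP_SET_CONTAINER, &container) < 0) {
            return 1;
        }

        printf("container fd: %d, group fd: %d\n", container, group);
        return 0;
    }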