group to QEMU

Daniel P . Berrangé Mon, 17 Oct 2022 04:07:37 -0700

On Mon, Oct 17, 2022 at 01:54:03PM +0300, Andrey Ryabinin wrote:
> These patches add possibility to pass VFIO device to QEMU using file
> descriptors of VFIO container/group, instead of creating those by QEMU.
> This allows to take away permissions to open /dev/vfio/* from QEMU and
> delegate that to managment layer like libvirt.
> 
> The VFIO API doen't allow to pass just fd of device, since we also need to 
> have
> VFIO container and group. So these patches allow to pass created VFIO 
> container/group
> to QEMU via command line/QMP, e.g. like this:
>             -object vfio-container,id=ct,fd=5 \
>             -object vfio-group,id=grp,fd=6,container=ct \
>             -device vfio-pci,host=05:00.0,group=grp
> 
> A bit more detailed example can be found in the test:
> tests/avocado/vfio.py
> 
>  *Possible future steps*
> 
> Also these patches could be a step for making local migration (within one 
> host)
> of the QEMU with VFIO devices.
> I've built some prototype on top of these patches to try such idea.
> In short the scheme of such migration is following:
>  - migrate source VM to file.
>  - retrieve fd numbers of VFIO container/group/device via new property and 
> qom-get command
>  - get the actual file descriptor via SCM_RIGHTS using new qmp command 
> 'returnfd' which
>    sends fd from QEMU by the number: { 'command': 'returnfd', 'data': {'fd': 
> 'int'}}
>  - shutdown source VM
>  - launch destination VM, plug VFIO devices using obtained file descriptors.
>  - PCI device reset duriing plugging the device avoided with the help of new 
> parameter
>     on vfio-pci device.


Is there a restriction by VFIO on how many processes can have the FD
open concurrently ? I guess it must be, as with SCM_RIGHTS, both src
QEMU and libvirt will have the FD open concurrently for at least a
short period, as you can't atomically close the FD at the exact same
time as SCM_RIGHTS sends it.

With migration it is *highly* desirable to never stop the source VM's
QEMU until the new QEMU has completed migration and got its vCPUs
running, in order to have best chance of successful rollback upon
failure

So assuming both QEMU's can have the FD open, provided they don't
both concurrently operate on it, could src QEMU just pass the FDs
to the target QEMU as part of the migration stream. eg use a UNIX
socket between the 2 QEMUs, and SCM_RIGHTS to pass the FDs across,
avoiding libvirt needing to be in the middle of the FD passing
dance. Since target QEMU gets the FDs as part of the migration
stream, it would inherantly know that it shold skip device reset
in that flow, without requiring any new param.


> This is alternative to 'cpr-exec' migration scheme proposed here:
>    
> https://lore.kernel.org/qemu-devel/1658851843-236870-1-git-send-email-steven.sist...@oracle.com/
> Unlike cpr-exec it doesn't require new kernel flags 
> VFIO_DMA_UNMAP_FLAG_VADDR/VFIO_DMA_MAP_FLAG_VADDR
> And doesn't require new migration mode, just some additional steps from 
> management layer.

Avoiding creating a whole new set of mgmt commands in QMP does
make this appealing as an option instead of cpr-exec. If we can
layer FD passing into the migration stream too, that'd be even
more compelling IMHO.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [PATCH 0/4] Allow to pass pre-created VFIO container/group to QEMU

Reply via email to