On Mon, 17 Aug 2020 15:44:03 -0600
Alex Williamson <alex.william...@redhat.com> wrote:
> On Mon, 17 Aug 2020 17:20:57 -0400
> Steven Sistare <steven.sist...@oracle.com> wrote:
> 
> > On 8/17/2020 4:48 PM, Alex Williamson wrote:
> > > On Mon, 17 Aug 2020 14:30:51 -0400
> > > Steven Sistare <steven.sist...@oracle.com> wrote:
> > > 
> > >> On 7/30/2020 11:14 AM, Steve Sistare wrote:
> > >>> Anonymous memory segments used by the guest are preserved across a re-exec
> > >>> of qemu, mapped at the same VA, via a proposed madvise(MADV_DOEXEC) option
> > >>> in the Linux kernel.  For the madvise patches, see:
> > >>>
> > >>> https://lore.kernel.org/lkml/1595869887-23307-1-git-send-email-anthony.yzn...@oracle.com/
> > >>>
> > >>> Signed-off-by: Steve Sistare <steven.sist...@oracle.com>
> > >>> ---
> > >>>  include/qemu/osdep.h | 7 +++++++
> > >>>  1 file changed, 7 insertions(+)
> > >>
> > >> Hi Alex,
> > >> The MADV_DOEXEC functionality, which is a pre-requisite for the entire qemu
> > >> live update series, is getting a chilly reception on lkml.  We could instead
> > >> create guest memory using memfd_create and preserve the fd across exec.
> > >> However, the subsequent mmap(fd) will return a different VA than was used
> > >> previously, which is a problem for memory that was registered with vfio, as
> > >> the original VA is remembered in the kernel struct vfio_dma and used in
> > >> various kernel functions, such as vfio_iommu_replay.
> > >>
> > >> To fix, we could provide a VFIO_IOMMU_REMAP_DMA ioctl taking iova, size, and
> > >> new_vaddr.  The implementation finds an exact match for (iova, size) and
> > >> replaces vaddr with new_vaddr.  Flags cannot be changed.
> > >>
> > >> memfd_create plus VFIO_IOMMU_REMAP_DMA would replace MADV_DOEXEC.
> > >> vfio on any form of shared memory (shm, dax, etc) could also be preserved
> > >> across exec with shmat/mmap plus VFIO_IOMMU_REMAP_DMA.
> > >>
> > >> What do you think?
> > > 
> > > Your new REMAP ioctl would have parameters identical to the MAP_DMA
> > > ioctl, so I think we should just use one of the flag bits on the
> > > existing MAP_DMA ioctl for this variant.
> > 
> > Sounds good.
> > 
> > > Reading through the discussion on the kernel side there seems to be
> > > some confusion around why vfio needs the vaddr beyond the user call to
> > > MAP_DMA though.  Originally this was used to test for virtually
> > > contiguous mappings for merging and splitting purposes.  This is
> > > defunct in the v2 interface, however the vaddr is now used largely for
> > > mdev devices.  If an mdev device is not backed by an IOMMU device and
> > > does not share a container with an IOMMU device, then a user MAP_DMA
> > > ioctl essentially just registers the translation within the vfio
> > > container.  The mdev vendor driver can then later either request pages
> > > to be pinned for device DMA or can perform copy_to/from_user() to
> > > simulate DMA via the CPU.
> > > 
> > > Therefore I don't see that there's a simple re-architecture of the vfio
> > > IOMMU backend that could drop vaddr use.
> > 
> > Yes.  I did not explain on lkml as you do here (thanks), but I reached the
> > same conclusion.
> > 
> > > I'm a bit concerned this new remap proposal also raises the question of
> > > how do we prevent userspace remapping vaddrs racing with asynchronous
> > > kernel use of the previous vaddrs.
> > 
> > Agreed.  After a quick glance at the code, holding iommu->lock during
> > remap might be sufficient, but it needs more study.
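
[For concreteness, a minimal userspace sketch of the flow under discussion:
guest RAM backed by a memfd that survives the exec, an mmap() that lands at a
new vaddr, and the proposed remap variant of VFIO_IOMMU_MAP_DMA.  Note that
VFIO_DMA_MAP_FLAG_REMAP and remap_guest_ram() are hypothetical placeholders
for the flag bit suggested above; only the struct layout and the READ/WRITE
flags exist in linux/vfio.h today.]

/*
 * Hypothetical sketch only: VFIO_DMA_MAP_FLAG_REMAP stands in for the
 * proposed "remap" flag bit on VFIO_IOMMU_MAP_DMA and does not exist.
 * container_fd, memfd, iova and size are assumed to have been preserved
 * across the exec (e.g. via the environment or a state file).
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

#define VFIO_DMA_MAP_FLAG_REMAP (1 << 3)   /* placeholder value */

static int remap_guest_ram(int container_fd, int memfd,
                           uint64_t iova, uint64_t size)
{
    /* mmap() after exec returns a new vaddr for the same memfd pages. */
    void *new_vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, memfd, 0);
    if (new_vaddr == MAP_FAILED) {
        return -1;
    }

    /* Ask the container to replace the stale vaddr for this (iova, size). */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE |
                 VFIO_DMA_MAP_FLAG_REMAP,
        .vaddr = (uint64_t)(uintptr_t)new_vaddr,
        .iova  = iova,
        .size  = size,
    };
    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}

[Whether the kernel can perform that vaddr replacement safely against
concurrent vendor-driver pins or accesses is exactly the locking question
raised above.]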
> 
> Unless you're suggesting an extended hold of the lock across the entire
> re-exec of QEMU, that's only going to prevent a race between a remap
> and a vendor driver pin or access, the time between the previous vaddr
> becoming invalid and the remap is unprotected.
> 
> > > Are we expecting guest drivers/agents to quiesce the device,
> > > or maybe relying on clearing bus-master, for PCI devices, to halt DMA?
> > 
> > No.  We want to support any guest, and the guest is not aware that qemu
> > live update is occurring.
> > 
> > > The vfio migration interface we've developed does have a mechanism to
> > > stop a device, would we need to use this here?  If we do have a
> > > mechanism to quiesce the device, is the only reason we're not unmapping
> > > everything and remapping it into the new address space the latency in
> > > performing that operation?  Thanks,
> > 
> > Same answer - we don't require that the guest has vfio migration support.
> 
> QEMU toggling the runstate of the device via the vfio migration
> interface could be done transparently to the guest, but if your
> intention is to support any device (where none currently support the
> migration interface) perhaps it's a moot point.  It seems like this
> scheme only works with IOMMU backed devices where the device can
> continue to operate against pinned pages, anything that might need to
> dynamically pin pages against the process vaddr as it's running async
> to the QEMU re-exec needs to be blocked or stopped.  Thanks,

Hmm, even if a device is running against pinned memory, how do we
handle device interrupts that occur during QEMU's downtime?  I see that
we reconfigure interrupts, but does QEMU need to drain the eventfd and
manually inject those missed interrupts or will setting up the irqfds
trigger a poll?  Thanks,

Alex
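
[To make the last question concrete, a hedged sketch of what "drain the
eventfd and manually inject" could mean in terms of plain eventfd semantics:
a read() returns the accumulated counter and resets it to zero, a write()
adds to it.  irq_eventfd is a hypothetical descriptor preserved across the
exec and assumed O_NONBLOCK; whether simply re-registering the irqfd would
pick up a pending count on its own is the open question above.]

#include <errno.h>
#include <stdint.h>
#include <unistd.h>

static void replay_missed_interrupts(int irq_eventfd)
{
    uint64_t count = 0;

    /* Drain: an eventfd read returns the accumulated count and resets it. */
    ssize_t n = read(irq_eventfd, &count, sizeof(count));
    if (n == sizeof(count) && count > 0) {
        /*
         * The device signalled while QEMU was down.  Re-signal so the
         * (now re-registered) irqfd consumer observes the pending interrupt.
         */
        if (write(irq_eventfd, &count, sizeof(count)) != sizeof(count)) {
            /* Real code would report the error. */
        }
    } else if (n < 0 && errno != EAGAIN) {
        /* EAGAIN just means nothing fired; anything else needs handling. */
    }
}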