> On Jun 2, 2020, at 8:06 AM, Alex Williamson <alex.william...@redhat.com>
> wrote:
>
> On Wed, 20 May 2020 17:45:13 -0700
> John G Johnson <john.g.john...@oracle.com> wrote:
>
>>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
>>> The former seems intended to provide the server with access to the
>>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
>>> those regions. Doesn't this break the basic isolation of a vIOMMU?
>>> This essentially says to me "here's all the guest memory, but please
>>> only access these regions for which we're providing DMA mappings".
>>> That invites abuse.
>>>
>>
>> The purpose behind separating QEMU into multiple processes is
>> to provide an additional layer of protection for the infrastructure against
>> a malign guest, not for the guest against itself, so preventing a server
>> that has been compromised by a guest from accessing all of guest memory
>> adds no additional benefit. We don’t even have an IOMMU in our current
>> guest model for this reason.
>
> One of the use cases we see a lot with vfio is nested assignment, ie.
> we assign a device to a VM where the VM includes a vIOMMU, such that
> the guest OS can then assign the device to userspace within the guest.
> This is safe to do AND provides isolation within the guest exactly
> because the device only has access to memory mapped to the device, not
> the entire guest address space. I don't think it's just the hypervisor
> you're trying to protect, we can't assume there are always trusted
> drivers managing the device.
>
We intend to support an IOMMU. The question seems to be whether
it’s implemented in the server or client. The current proposal has it
in the server, à la vhost-user, but we are fine with moving it.
>>
>> The implementation was stolen from vhost-user, with the exception
>> that we push IOTLB translations from client to server, as VFIO does,
>> rather than having the server pull them from the client, as vhost-user
>> does.
>
> It seems that vhost has numerous hacks forcing it to know whether a
> vIOMMU is present as a result of this; vfio has none.
>
I imagine the decision to implement the IOMMU on the server side was driven
by performance considerations. If the IOMMU is implemented on the client
side, the server must execute an mmap() or munmap() for every IOMMU
MAP/UNMAP message. If the IOMMU is implemented on the server side, the
server needs no system calls for these messages; it just adds a software
translation entry to its own table.
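
As a rough sketch of what the server side looks like under that model (the
flat table and the names here are made up purely for illustration; a real
implementation would use a smarter data structure):

    #include <stdint.h>

    /* Sketch of server-side IOMMU MAP handling: the guest memory regions
     * were already mmap()ed when they were added, so mapping an IOVA range
     * is only a table insertion; no further system calls are needed. */
    struct iova_entry {
        uint64_t iova;     /* DMA address the guest programmed */
        uint64_t gpa;      /* guest physical address it translates to */
        uint64_t size;
    };

    #define MAX_IOVA_ENTRIES 256        /* arbitrary for this sketch */
    static struct iova_entry iova_table[MAX_IOVA_ENTRIES];
    static unsigned int iova_entries;

    static int handle_iommu_map(uint64_t iova, uint64_t gpa, uint64_t size)
    {
        if (iova_entries == MAX_IOVA_ENTRIES)
            return -1;

        /* A client-side IOMMU would instead hand the server a file
         * descriptor here, forcing an mmap()/munmap() per MAP/UNMAP. */
        iova_table[iova_entries++] = (struct iova_entry){
            .iova = iova, .gpa = gpa, .size = size,
        };
        return 0;
    }
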
>> That said, neither the qemu-mp nor MUSER implementation uses an
>> IOMMU, so if you prefer another IOMMU model, we can consider it. One
>> option would be to send the guest memory file descriptors only with
>> IOMMU_MAP_DMA requests, although that would cost performance, since each
>> request would require the server to execute an mmap() system call.
>
> It would seem shortsighted to not fully enable a vIOMMU compatible
> implementation at this time.
>
>>> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
>>> "an array of file descriptors will be sent as part of the message
>>> meta-data" works. Also consider s/SUB/DEL/. Why is the Device ID in
>>> the table specified as 0? How does a client learn their Device ID?
>>>
>>
>> SCM_RIGHTS control messages allow sendmsg() to send an array of
>> file descriptors over a UNIX domain socket.
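
For reference, a minimal sketch of passing descriptors this way; the
send_with_fds() name and the 8-descriptor cap are arbitrary, and error
handling is omitted:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define MAX_SENT_FDS 8   /* arbitrary cap for this sketch */

    /* Send 'len' bytes of message data plus 'nfds' file descriptors over a
     * connected UNIX domain socket using an SCM_RIGHTS control message. */
    static ssize_t send_with_fds(int sock, const void *data, size_t len,
                                 const int *fds, int nfds)
    {
        struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
        union {
            char buf[CMSG_SPACE(MAX_SENT_FDS * sizeof(int))];
            struct cmsghdr align;       /* force correct alignment */
        } ctrl;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = ctrl.buf,
            .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        if (nfds > MAX_SENT_FDS)
            return -1;

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

        return sendmsg(sock, &msg, 0);
    }
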
>>
>> We’re only supporting one device per socket in this protocol
>> version, so the device ID will always be 0. This may change in a future
>> revision, so we included the field in the header to avoid a major version
>> change if device multiplexing is added later.
>>
>>
>>> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
>>> capability chain), the cap_offset and next pointers within the chain
>>> need to specify what their offset is relative to (ie. the start of the
>>> packet, the start of the vfio compatible data structure, etc). I
>>> assume the latter for client compatibility.
>>>
>>
>> Yes. We will attempt to make the language clearer.
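
For example, assuming the kernel's struct vfio_region_info and struct
vfio_info_cap_header layouts carry over unchanged and the offsets are
relative to the start of that structure, a client would walk the chain
along these lines (sketch only):

    #include <stdint.h>
    #include <linux/vfio.h>

    /* Walk a capability chain whose offsets are relative to the start of
     * the vfio_region_info; 'info' points at the full argsz-sized buffer. */
    static void walk_region_caps(struct vfio_region_info *info)
    {
        uint32_t off;

        if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
            return;

        for (off = info->cap_offset; off; ) {
            struct vfio_info_cap_header *hdr =
                (struct vfio_info_cap_header *)((char *)info + off);

            /* ... handle hdr->id, e.g. VFIO_REGION_INFO_CAP_SPARSE_MMAP ... */
            off = hdr->next;
        }
    }
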
>>
>>
>>> Also on REGION_INFO, offset is specified as "the base offset to be
>>> given to the mmap() call for regions with the MMAP attribute". Base
>>> offset from what? Is the mmap performed on the socket fd? Do we not
>>> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
>>> Why do we specify "MMIO" in those operations versus simply "REGION"?
>>> Are we arbitrarily excluding support for I/O port regions or device
>>> specific regions? If these commands replace direct read and write to
>>> an fd offset, how is PCI config space handled?
>>>
>>
>> The base offset applies to the sparse areas: each sparse area's offset
>> is added to the region's base offset to form the offset passed to mmap().
>> We will try to make the text clearer here as well.
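
In other words, assuming the kernel's sparse-mmap capability layout is
reused, the client would compute the mmap() offset roughly like this;
region_fd stands in for whichever fd the mapping is ultimately performed
against, which the spec still needs to spell out:

    #include <sys/mman.h>
    #include <linux/vfio.h>

    /* The offset handed to mmap() for a sparse area is the region's base
     * offset plus that area's offset within the region. */
    static void *map_sparse_area(int region_fd,
                                 const struct vfio_region_info *info,
                                 const struct vfio_region_sparse_mmap_area *area)
    {
        return mmap(NULL, area->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    region_fd, (off_t)(info->offset + area->offset));
    }
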
>>
>> MMIO was added to distinguish these operations from DMA operations.
>> I can see how this can cause confusion when the region refers to a port
>> range, so we can change the name to REGION_READ/WRITE.
>>
>>
>>> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
>>> will include the count specifying the amount of data read. How does
>>> the client specify how much data to read? Via message size?
>>>
>>
>> This is a bug in the doc. The count field in the request should specify
>> the amount of data to be read.
>>
>>
>>> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA? IMO the device
>>> should only ever have access via IOVA, which implies a DMA mapping
>>> exists for the device. Can you provide an example of why we need these
>>> commands since there seems little point to this interface if a device
>>> cannot directly interact with VM memory.
>>>
>>
>> It is a GPA. The device emulation code would only handle the DMA
>> addresses the guest programmed it with; the server infrastructure knows
>> whether an IOMMU exists, and whether the DMA address needs translation to
>> GPA or not.
>
> I'll re-iterate, a device should only ever issue DMAs in terms of IOVA.
> This is how vfio works.
>
>>> The IOMMU commands should be unnecessary, a vIOMMU should be
>>> transparent to the server by virtue that the device only knows about
>>> IOVA mappings accessible to the device. Requiring the client to expose
>>> all memory to the server implies that the server must always be trusted.
>>>
>>
>> The client and server are equally trusted; the guest is the untrusted
>> entity.
>
> And therefore the driver is untrusted and opening the client/sever
> window to expose all of guest memory presents a larger attack surface.
>
>>> Interrupt info format, s/type/index/, s/vector/subindex/
>>>
>>
>> ok
>>
>>
>>> In addition to the unused ioctls, the entire concept of groups and
>>> containers are not found in this specification. To some degree that
>>> makes sense; even mdevs, and typically SR-IOV VFs, have a 1:1
>>> device-to-group relationship. However, the container is very much involved in
>>> the development of migration support, where it's the container that
>>> provides dirty bitmaps. Since we're doing map and unmap without that
>>> container concept here, perhaps we'd equally apply those APIs to this
>>> same socket. Thanks,
>>
>> Groups and containers are host IOMMU concepts, and we don’t
>> interact with the host here. The kernel VFIO driver doesn’t even need
>> to exist for VFIO over socket. I think it’s fine to assume a 1-1
>> correspondence between containers, groups, and a VFIO over socket device.
>
> Obviously the kernel driver and host IOMMU are out of the picture here.
> The point I was trying to make is that we're building interfaces to
> support migration around concepts that don't exist in this model, so
> it's not clear how we'd map, for example, dirty page tracking on the
> container interface to this API. This seems more akin to the no-iommu
> model of vfio, which is a hack where we allow userspace to have access
> to a device using the vfio API, but they're on their own for DMA. We
> don't support that model in QEMU, and without those conceptual
> equivalencies, I wonder how much we'll be able to leverage existing
> QEMU code or co-develop and support future features. IOW, is this
> really just "a vfio-like device model over unix socket" rather than
> "vfio over unix socket"? Thanks,
>
In this model, each device is in its own IOMMU group and VFIO container.
We can add a note stating this to the spec.

VFIO does seem to support a QEMU guest with no vIOMMU: it uses
VFIO_IOMMU_MAP_DMA ioctl()s to establish a 1-1 device DMA address to guest
physical address mapping in the host IOMMU. The current MUSER prototypes
take advantage of this in their guests, which lack a vIOMMU.
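
For reference, a sketch of how that 1-1 mapping is established through the
kernel API (argument names are illustrative; container_fd is the VFIO
container and vaddr is where the guest RAM chunk lives in the VMM's own
address space):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map one chunk of guest RAM so that IOVA == GPA (no-vIOMMU case). */
    static int map_guest_ram_1to1(int container_fd, void *vaddr,
                                  uint64_t gpa, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)vaddr,
            .iova  = gpa,          /* device DMA address == GPA */
            .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }
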
Thanos & JJ