> On Jun 2, 2020, at 8:06 AM, Alex Williamson <alex.william...@redhat.com>
> wrote:
>
> On Wed, 20 May 2020 17:45:13 -0700
> John G Johnson <john.g.john...@oracle.com> wrote:
>
>>> I'm confused by VFIO_USER_ADD_MEMORY_REGION vs VFIO_USER_IOMMU_MAP_DMA.
>>> The former seems intended to provide the server with access to the
>>> entire GPA space, while the latter indicates an IOVA to GPA mapping of
>>> those regions. Doesn't this break the basic isolation of a vIOMMU?
>>> This essentially says to me "here's all the guest memory, but please
>>> only access these regions for which we're providing DMA mappings".
>>> That invites abuse.
>>>
>>
>> The purpose behind separating QEMU into multiple processes is
>> to provide an additional layer of protection for the infrastructure against
>> a malign guest, not for the guest against itself, so preventing a server
>> that has been compromised by a guest from accessing all of guest memory
>> adds no additional benefit. We don’t even have an IOMMU in our current
>> guest model for this reason.
>
> One of the use cases we see a lot with vfio is nested assignment, ie.
> we assign a device to a VM where the VM includes a vIOMMU, such that
> the guest OS can then assign the device to userspace within the guest.
> This is safe to do AND provides isolation within the guest exactly
> because the device only has access to memory mapped to the device, not
> the entire guest address space. I don't think it's just the hypervisor
> you're trying to protect, we can't assume there are always trusted
> drivers managing the device.
>
We intend to support an IOMMU. The question seems to be whether
it’s implemented in the server or client. The current proposal has it
in the server, à la vhost-user, but we are fine with moving it.
>>
>> The implementation was stolen from vhost-user, with the exception
>> that we push IOTLB translations from client to server, as VFIO does,
>> rather than having the server pull them from the client, as vhost-user
>> does.
>
> It seems that vhost has numerous hacks forcing it to know whether a
> vIOMMU is present as a result of this; vfio has none.
>
I imagine the decision to implement the IOMMU on the server side was driven
by performance considerations. If the IOMMU is implemented on the client
side, the server must execute an mmap() or munmap() for every IOMMU
MAP/UNMAP message. If the IOMMU is implemented on the server side, the
server needs no system calls for these messages; it just adds a software
translation entry to its own table.
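
As a rough sketch of what the server side looks like under that model (the
flat table and the names here are made up purely for illustration; a real
implementation would use a smarter data structure):

    #include <stdint.h>

    /* Sketch of server-side IOMMU MAP handling: the guest memory regions
     * were already mmap()ed when they were added, so mapping an IOVA range
     * is only a table insertion; no further system calls are needed. */
    struct iova_entry {
        uint64_t iova;     /* DMA address the guest programmed */
        uint64_t gpa;      /* guest physical address it translates to */
        uint64_t size;
    };

    #define MAX_IOVA_ENTRIES 256        /* arbitrary for this sketch */
    static struct iova_entry iova_table[MAX_IOVA_ENTRIES];
    static unsigned int iova_entries;

    static int handle_iommu_map(uint64_t iova, uint64_t gpa, uint64_t size)
    {
        if (iova_entries == MAX_IOVA_ENTRIES)
            return -1;

        /* A client-side IOMMU would instead hand the server a file
         * descriptor here, forcing an mmap()/munmap() per MAP/UNMAP. */
        iova_table[iova_entries++] = (struct iova_entry){
            .iova = iova, .gpa = gpa, .size = size,
        };
        return 0;
    }
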
>> That said, neither the qemu-mp nor MUSER implementation uses an
>> IOMMU, so if you prefer another IOMMU model, we can consider it. One
>> option would be to send the guest memory file descriptors only with
>> IOMMU_MAP_DMA requests, although that would cost performance, since each
>> request would require the server to execute an mmap() system call.
>
> It would seem shortsighted to not fully enable a vIOMMU compatible
> implementation at this time.
>
>>> Also regarding VFIO_USER_ADD_MEMORY_REGION, it's not clear to me how
>>> "an array of file descriptors will be sent as part of the message
>>> meta-data" works. Also consider s/SUB/DEL/. Why is the Device ID in
>>> the table specified as 0? How does a client learn their Device ID?
>>>
>>
>> SCM_RIGHTS control messages allow sendmsg() to send an array of
>> file descriptors over a UNIX domain socket.
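
For reference, a minimal sketch of passing descriptors this way; the
send_with_fds() name and the 8-descriptor cap are arbitrary, and error
handling is omitted:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define MAX_SENT_FDS 8   /* arbitrary cap for this sketch */

    /* Send 'len' bytes of message data plus 'nfds' file descriptors over a
     * connected UNIX domain socket using an SCM_RIGHTS control message. */
    static ssize_t send_with_fds(int sock, const void *data, size_t len,
                                 const int *fds, int nfds)
    {
        struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
        union {
            char buf[CMSG_SPACE(MAX_SENT_FDS * sizeof(int))];
            struct cmsghdr align;       /* force correct alignment */
        } ctrl;
        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = ctrl.buf,
            .msg_controllen = CMSG_SPACE(nfds * sizeof(int)),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        if (nfds > MAX_SENT_FDS)
            return -1;

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
        memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

        return sendmsg(sock, &msg, 0);
    }
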
>>
>> We’re only supporting one device per socket in this protocol
>> version, so the device ID will always be 0. This may change in a future
>> revision, so we included the field in the header to avoid a major version
>> change if device multiplexing is added later.
>>
>>
>>> VFIO_USER_DEVICE_GET_REGION_INFO (or anything else making use of a
>>> capability chain), the cap_offset and next pointers within the chain
>>> need to specify what their offset is relative to (ie. the start of the
>>> packet, the start of the vfio compatible data structure, etc). I
>>> assume the latter for client compatibility.
>>>
>>
>> Yes. We will attempt to make the language clearer.
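
For example, assuming the kernel's struct vfio_region_info and struct
vfio_info_cap_header layouts carry over unchanged and the offsets are
relative to the start of that structure, a client would walk the chain
along these lines (sketch only):

    #include <stdint.h>
    #include <linux/vfio.h>

    /* Walk a capability chain whose offsets are relative to the start of
     * the vfio_region_info; 'info' points at the full argsz-sized buffer. */
    static void walk_region_caps(struct vfio_region_info *info)
    {
        uint32_t off;

        if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS))
            return;

        for (off = info->cap_offset; off; ) {
            struct vfio_info_cap_header *hdr =
                (struct vfio_info_cap_header *)((char *)info + off);

            /* ... handle hdr->id, e.g. VFIO_REGION_INFO_CAP_SPARSE_MMAP ... */
            off = hdr->next;
        }
    }
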
>>
>>
>>> Also on REGION_INFO, offset is specified as "the base offset to be
>>> given to the mmap() call for regions with the MMAP attribute". Base
>>> offset from what? Is the mmap performed on the socket fd? Do we not
>>> allow read/write, we need to use VFIO_USER_MMIO_READ/WRITE instead?
>>> Why do we specify "MMIO" in those operations versus simply "REGION"?
>>> Are we arbitrarily excluding support for I/O port regions or device
>>> specific regions? If these commands replace direct read and write to
>>> an fd offset, how is PCI config space handled?
>>>
>>
>> The base offset applies to the sparse areas: each sparse area's offset
>> is added to the region's base offset to form the offset passed to mmap().
>> We will try to make the text clearer here as well.
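
In other words, assuming the kernel's sparse-mmap capability layout is
reused, the client would compute the mmap() offset roughly like this;
region_fd stands in for whichever fd the mapping is ultimately performed
against, which the spec still needs to spell out:

    #include <sys/mman.h>
    #include <linux/vfio.h>

    /* The offset handed to mmap() for a sparse area is the region's base
     * offset plus that area's offset within the region. */
    static void *map_sparse_area(int region_fd,
                                 const struct vfio_region_info *info,
                                 const struct vfio_region_sparse_mmap_area *area)
    {
        return mmap(NULL, area->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                    region_fd, (off_t)(info->offset + area->offset));
    }
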
>>
>> MMIO was added to distinguish these operations from DMA operations.
>> I can see how this can cause confusion when the region refers to a port
>> range, so we can change the name to REGION_READ/WRITE.
>>
>>
>>> VFIO_USER_MMIO_READ specifies the count field is zero and the reply
>>> will include the count specifying the amount of data read. How does
>>> the client specify how much data to read? Via message size?
>>>
>>
>> This is a bug in the doc. The count field in the request should specify
>> the amount of data to be read.
>>
>>
>>> VFIO_USER_DMA_READ/WRITE, is the address a GPA or IOVA? IMO the device
>>> should only ever have access via IOVA, which implies a DMA mapping
>>> exists for the device. Can you provide an example of why we need these
>>> commands since there seems little point to this interface if a device
>>> cannot directly interact with VM memory.
>>>
>>
>> It is a GPA. The device emulation code would only handle the DMA
>> addresses the guest programmed it with; the server infrastructure knows
>> whether an IOMMU exists, and whether the DMA address needs translation to
>> GPA or not.
>
> I'll re-iterate, a device should only ever issue DMAs in terms of IOVA.
> This is how vfio works.
>
>>> The IOMMU commands should be unnecessary, a vIOMMU should be
>>> transparent to the server by virtue that the device only knows about
>>> IOVA mappings accessible to the device. Requiring the client to expose
>>> all memory to the server implies that the server must always be trusted.
>>>
>>
>> The client and server are equally trusted; the guest is the untrusted
>> entity.
>
> And therefore the driver is untrusted and opening the client/sever
> window to expose all of guest memory presents a larger attack surface.
>
>>> Interrupt info format, s/type/index/, s/vector/subindex/
>>>
>>
>> ok
>>
>>
>>> In addition to the unused ioctls, the entire concept of groups and
>>> containers are not found in this specification. To some degree that
>>> makes sense; even mdevs, and typically SR-IOV VFs, have a 1:1
>>> device-to-group relationship. However, the container is very much involved in
>>> the development of migration support, where it's the container that
>>> provides dirty bitmaps. Since we're doing map and unmap without that
>>> container concept here, perhaps we'd equally apply those APIs to this
>>> same socket. Thanks,
>>
>> Groups and containers are host IOMMU concepts, and we don’t
>> interact with the host here. The kernel VFIO driver doesn’t even need
>> to exist for VFIO over socket. I think it’s fine to assume a 1-1
>> correspondence between containers, groups, and a VFIO over socket device.
>
> Obviously the kernel driver and host IOMMU are out of the picture here.
> The point I was trying to make is that we're building interfaces to
> support migration around concepts that don't exist in this model, so
> it's not clear how we'd map, for example, dirty page tracking on the
> container interface to this API. This seems more akin to the no-iommu
> model of vfio, which is a hack where we allow userspace to have access
> to a device using the vfio API, but they're on their own for DMA. We
> don't support that model in QEMU, and without those conceptual
> equivalencies, I wonder how much we'll be able to leverage existing
> QEMU code or co-develop and support future features. IOW, is this
> really just "a vfio-like device model over unix socket" rather than
> "vfio over unix socket"? Thanks,
>
In this model, each device is in its own IOMMU group and VFIO container.
We can add a note stating this to the spec.

VFIO does seem to support a QEMU guest with no vIOMMU: it uses
VFIO_IOMMU_MAP_DMA ioctl()s to establish a 1-1 device DMA address to guest
physical address mapping in the host IOMMU. The current MUSER prototypes
take advantage of this in their guests, which lack a vIOMMU.
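
For reference, a sketch of how that 1-1 mapping is established through the
kernel API (argument names are illustrative; container_fd is the VFIO
container and vaddr is where the guest RAM chunk lives in the VMM's own
address space):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map one chunk of guest RAM so that IOVA == GPA (no-vIOMMU case). */
    static int map_guest_ram_1to1(int container_fd, void *vaddr,
                                  uint64_t gpa, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (uintptr_t)vaddr,
            .iova  = gpa,          /* device DMA address == GPA */
            .size  = size,
        };

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
    }
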
Thanos & JJ