Hi Stefan,

On Wed, Jun 5, 2024 at 4:28 PM Stefan Hajnoczi <stefa...@redhat.com> wrote:
> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefa...@redhat.com> wrote:
> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > > >
> > > > This request allows backends to dynamically map
> > > > fds into a shared memory region indentified by
> > >
> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > > vhost-user spec, commit description, etc) so it's clear that this is not
> > > about vhost-user shared memory tables/regions.
> > >
> > > > its `shmid`. Then, the fd memory is advertised
> > > > to the frontend through a BAR+offset, so it can
> > > > be read by the driver while its valid.
> > >
> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > > report VIRTIO Shared Memory Regions to the driver.
> >
> > I will remove PCI BAR, as it is true that it depends on the
> > transport. I was trying to explain that the driver
> > will use the shm_base + shm_offset to access
> > the mapped memory.
> >
> > > > Then, the backend can munmap the memory range
> > > > in a given shared memory region (again, identified
> > > > by its `shmid`), to free it. After this, the
> > > > region becomes private and shall not be accessed
> > > > by the frontend anymore.
> > >
> > > What does "private" mean?
> > >
> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > > unrelated mmap(NULL, ...) might use that address range and the guest
> > > would have access to the host memory! This is a security issue and needs
> > > to be mentioned explicitly in the spec.
> >
> > I mentioned private because it changes the mapping from MAP_SHARED
> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>
> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> mean mmap flags or something like the memory range is no longer
> accessible to the driver.
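
Understood. For the next version I will reserve the whole VIRTIO Shared
Memory Region upfront, so that no unrelated mmap(NULL, ...) can land
inside its address range. Something along these lines (untested sketch):

  /* Sketch: reserve the full VIRTIO Shared Memory Region at
   * initialization time. The range stays inaccessible (PROT_NONE)
   * until a SHMEM_MAP request maps an fd over part of it with
   * MAP_FIXED. */
  void *reserved = mmap(NULL, region_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (reserved == MAP_FAILED) {
      return -errno;
  }

And SHMEM_UNMAP would mmap PROT_NONE over the range again (as the patch
already does) instead of calling munmap(2), so the reservation is never
lost. I will mention this explicitly in the spec.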
> > > > Initializing the memory region is reponsiblity
> > > > of the PCI device that will using it.
> > >
> > > What does this mean?
> >
> > The MemoryRegion is declared in `struct VirtIODevice`,
> > but it is uninitialized in this commit. So I was trying to say
> > that the initialization will happen in, e.g., vhost-user-gpu-pci.c
> > with something like `memory_region_init`, and later `pci_register_bar`.
>
> Okay. The device model needs to create MemoryRegion instances for the
> device's Shared Memory Regions and add them to the VirtIODevice.
>
> --device vhost-user-device will need to query the backend since, unlike
> vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
> device types. It will need to create MemoryRegions enumerated from the
> backend.
>
> By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
> Regions so this work should not be tied to PCI.
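
Right, I will make sure the MemoryRegion setup is not tied to PCI. For
the device-specific frontends, what I had in mind for the PCI case is
something like this in, e.g., vhost-user-gpu-pci.c (sketch only;
SHMEM_BAR_IDX, shmid and shmem_size are placeholder names):

  /* Sketch: device-specific initialization of a VIRTIO Shared Memory
   * Region for the PCI transport. The MMIO transport would expose the
   * region through its own registers instead of a BAR. */
  memory_region_init(&vdev->shmem_list[shmid], OBJECT(vdev),
                     "virtio-shmem", shmem_size);
  pci_register_bar(&vpci_dev->pci_dev, SHMEM_BAR_IDX,
                   PCI_BASE_ADDRESS_SPACE_MEMORY |
                   PCI_BASE_ADDRESS_MEM_PREFETCH |
                   PCI_BASE_ADDRESS_MEM_TYPE_64,
                   &vdev->shmem_list[shmid]);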
> > I am testing that part still.
> >
> > > > Signed-off-by: Albert Esteve <aest...@redhat.com>
> > > > ---
> > > >  docs/interop/vhost-user.rst |  23 ++++++++
> > > >  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
> > > >  hw/virtio/virtio.c          |   2 +
> > > >  include/hw/virtio/virtio.h  |   3 +
> > > >  4 files changed, 134 insertions(+)
> > >
> > > Two missing pieces:
> > >
> > > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> > > Shared Memory Regions from the vhost-user backend. vhost-user-device is
> > > a generic vhost-user frontend without knowledge of the device type, so
> > > it doesn't know what the valid shmids are and what size the regions
> > > have.
> >
> > Ok. I was assuming that if a device (backend) makes a request without a
> > valid shmid or not enough size in the region to perform the mmap, would
> > just fail in the VMM performing the actual mmap operation. So it would
> > not necessarily need to know about valid shmids or region sizes.
>
> But then --device vhost-user-device wouldn't be able to support VIRTIO
> Shared Memory Regions, which means this patch series is incomplete. New
> vhost-user features need to support both --device vhost-user-<type>-*
> and --device vhost-user-device.
>
> What's needed is:
> 1. New vhost-user messages so the frontend can query the shmids and
>    sizes from the backend.
> 2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
>    Memory Regions from the vhost-user backend and then creates
>    MemoryRegions for them.
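
Understood, I will add that. As a strawman for the query message (the
name, layout and VIRTIO_MAX_SHMEM_REGIONS limit below are invented here,
they are not in the spec):

  /* Hypothetical reply payload for a new front-end request, e.g.
   * VHOST_USER_GET_SHMEM_CONFIG, so that --device vhost-user-device
   * can enumerate the backend's VIRTIO Shared Memory Regions and
   * create a MemoryRegion for each of them. */
  typedef struct {
      /* Number of valid entries in memory_sizes[] */
      uint32_t nregions;
      uint32_t padding;
      /* memory_sizes[shmid] is the region size in bytes, 0 if unused */
      uint64_t memory_sizes[VIRTIO_MAX_SHMEM_REGIONS];
  } VhostUserShMemConfig;

Would something like that work for you?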
> > > 2. Other backends don't see these mappings. If the guest submits a vring
> > > descriptor referencing a mapping to another backend, then that backend
> > > won't be able to access this memory. David Gilbert hit this problem when
> > > working on the virtiofs DAX Window. Either the frontend needs to forward
> > > all SHMAP_MAP/UNMAP messages to the other backends (inefficient and
> > > maybe racy!) or a new "memcpy" message is needed as a fallback for when
> > > vhost-user memory region translation fails.
> >
> > Ok. In which scenario would another device want to access the same mapping?
> > If it is for a shared VIRTIO object, the device that receives the dmabuf fd
> > could just do a new mapping in its VIRTIO shared memory region.
>
> You can reproduce this with two virtiofs devices. Where device A has the
> DAX Window feature enabled. The test program mmaps the DAX file and then
> opens a new file on device B and performs an O_DIRECT write(2) syscall.
>
> I think that this issue must be address in this series since this issue
> is introduced by enabling VIRTIO Shared Memory Regions in vhost-user.
>
> A bit more detail:
>
> Device A has a VIRTIO Shared Memory Region. An application mmaps that
> memory (examples: guest userspace driver using Linux VFIO, a guest
> kernel driver that exposes the memory to userspace via mmap, or guest
> kernel DAX). The application passes that memory as an I/O buffer to
> device B (e.g. O_DIRECT disk I/O).
>
> The result is that device B's vhost-user backend receives a vring
> descriptor that points to a guest memory address in device A's VIRTIO
> Shared Memory Region. Since device B does not have this memory in its
> table, it cannot translate the address and the device breaks.

I am working on the next version, and I am struggling with how to solve
this situation. So IIUC, it is device B (in your example) that will
notice that it cannot translate, and then its backend will send a
message (let's say, SHMEM_COPY) to the frontend so that it copies all
mappings done on device A to device B. Something like that? Maybe
better only for the mapping that triggered the message? But the message
would have to point to device A so that the frontend knows the src and
dest of the memcpy? And indicate the memory range that we want to copy
(offset + len).
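
Something like this, maybe (purely hypothetical payload, just to check
my understanding):

  /* Hypothetical SHMEM_COPY payload sent by device B's backend when it
   * cannot translate a guest address that falls into another device's
   * VIRTIO Shared Memory Region. */
  typedef struct {
      /* Guest address that failed to translate on device B */
      uint64_t guest_addr;
      /* Length of the range to copy */
      uint64_t len;
      /* ...but what would identify device A here, so that the frontend
       * knows the source of the memcpy? */
  } VhostUserShMemCopy;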
In particular, I am not sure how the frontend would know that it needs
to copy from device A to device B after receiving the new message, and
how it would get a pointer to both devices (device B is easy as it is
the one that sent the message, but not device A). Or maybe I
misunderstood your suggested solution.

> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > index d8419fd2f1..3caf2a290c 100644
> > > > --- a/docs/interop/vhost-user.rst
> > > > +++ b/docs/interop/vhost-user.rst
> > > > @@ -1859,6 +1859,29 @@ is sent by the front-end.
> > > >    when the operation is successful, or non-zero otherwise. Note that if the
> > > >    operation fails, no fd is sent to the backend.
> > > >
> > > > +``VHOST_USER_BACKEND_SHMEM_MAP``
> > > > +  :id: 9
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: fd and ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends to advertise a new mapping
> > > > +  to be made in a given shared memory region. Upon receiving the message,
> > > > +  QEMU will mmap the given fd into the shared memory region with the
> > >
> > > s/QEMU/the frontend/
> > >
> > > > +  requested ``shmid``. A reply is generated indicating whether mapping
> > > > +  succeeded.
> > >
> > > Please document whether mapping over an existing mapping is allowed. I
> > > think it should be allowed because it might be useful to atomically
> > > update a mapping without a race where the driver sees unmapped memory.
> >
> > So in my understanding, the frontend (driver) in the guest would initiate the
> > mmap/munmap by sending request to the backend (vhost-user device).
> > Then the vhost-user device sends a request to the VMM to perform the
> > mapping. We could enforce an ACK to ensure that the mmap operation finished
> > before the vhost-user device responds to the driver, and thus avoid
> > races. This way, we support only the simple usecase of not allowing
> > mmaps over an already mapped region.
>
> I think it's fine to start with the stricter model where no overlapping
> mappings are allowed. If someone needs overlapping mappings in the
> future, a feature bit could be added to negotiate support.
>
> Please document that overlapping mappings are not supported.
>
> > > If mapping over an existing mapping is allowed, does the new mapping
> > > need to cover the old mapping exactly, or can it span multiple previous
> > > mappings or a subset of an existing mapping?
> > >
> > > From a security point of view we need to be careful here. A potentially
> > > untrusted backend process now has the ability to mmap an arbitrary file
> > > descriptor into the frontend process. The backend can cause
> > > denial of service by creating many small mappings to exhaust the OS
> > > limits on virtual memory areas. The backend can map memory to use as
> > > part of a security compromise, so we need to be sure the virtual memory
> > > addresses are not leaked to the backend and only read/write page
> > > permissions are available.
> >
> > Right, security from untrusted backends is usally the hardest part to me.
> > But vhost-user devices do only see shm_offset, so this should be safe, no?
>
> Yes, I think the current interface is safe, too.
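
To enforce the stricter model, I am considering some minimal per-region
bookkeeping on the frontend side, so that SHMEM_MAP requests that
overlap an existing mapping can be rejected. A rough sketch, not part
of this patch (overflow checks omitted):

  /* Sketch: track active mappings per VIRTIO Shared Memory Region and
   * reject a new (offset, len) range that intersects any of them. */
  typedef struct {
      uint64_t shm_offset;
      uint64_t len;
  } ShmemMapping;

  static bool shmem_range_overlaps(const ShmemMapping *maps, size_t n,
                                   uint64_t offset, uint64_t len)
  {
      for (size_t i = 0; i < n; i++) {
          /* Half-open ranges [a, a+la) and [b, b+lb) intersect iff
           * a < b + lb && b < a + la. */
          if (offset < maps[i].shm_offset + maps[i].len &&
              maps[i].shm_offset < offset + len) {
              return true;
          }
      }
      return false;
  }

SHMEM_UNMAP could use the same list to validate the range it is asked
to unmap.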
> > > > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> > > > +  :id: 10
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends so that QEMU un-mmap
> > >
> > > s/QEMU/the frontend/
> >
> > This is probably my bad, but I really thought the frontend is the driver.
> > So frontend/backend as alternative terms for vhost-user driver/device.
>
> "vhost-user driver" is not a term that's used, as far as I know. The
> vhost-user spec calls it the front-end (older code and docs may call it
> the vhost-user master).
>
> > And then here we would keep QEMU or use VMM instead?
>
> The vhost-user spec uses both front-end and QEMU to mean the same thing.
> VMM is not used in the vhost-user spec. I suggest using front-end in new
> spec text because QEMU is not the only application that implements this
> spec anymore.
>
> > > > +  a given range (``offset``, ``len``) in the shared memory region with the
> > > > +  requested ``shmid``.
> > >
> > > Does the range need to correspond to a previously-mapped VhostUserMMap
> > > or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
> > > etc?
> >
> > I would prefer to keep it simple and disallow mapping over a previously-mapped
> > region. The range need to correspond to a valid (within size bound), free,
> > memory region, or else the request will fail. I will add that to the text.
> >
> > Nonetheless, I am open to discuss other options.
>
> That sounds good. I'm not aware of any use cases that require anything
> fancy.
>
> > > > +  A reply is generated indicating whether unmapping succeeded.
> > > > +
> > > >  .. _reply_ack:
> > > >
> > > >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index cdf9af4a4b..9526b9d07f 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> > > > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> > > > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> > > >      VHOST_USER_BACKEND_MAX
> > > >  } VhostUserBackendRequest;
> > > >
> > > > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
> > > >      unsigned char uuid[16];
> > > >  } VhostUserShared;
> > > >
> > > > +/* For the flags field of VhostUserMMap */
> > > > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> > > > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> > > > +
> > > > +typedef struct {
> > > > +    /* Shared memory region ID */
> > > > +    uint8_t shmid;
> > >
> > > There is a hole (padding) in the struct since the following fields are
> > > naturally aligned to 8 bytes. I suggest moving shmid to the end to
> > > reduce the chance of information leaks if padding is not zeroed.
> >
> > I see. I can move it to the end of the struct or add a padding field in
> > between. I'll do what you suggested, as it sound like the simplest solution.
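
For the next version, the struct would then look something like this
(sketch):

  /* Sketch of the reordered struct: shmid moved to the end so the
   * 64-bit fields stay naturally aligned and there is no interior
   * hole. The explicit trailing padding must be zeroed by the
   * sender. */
  typedef struct {
      /* File offset */
      uint64_t fd_offset;
      /* Offset within the VIRTIO Shared Memory Region */
      uint64_t shm_offset;
      /* Size of the mapping */
      uint64_t len;
      /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
      uint64_t flags;
      /* VIRTIO Shared Memory Region ID */
      uint8_t shmid;
      uint8_t padding[7];
  } VhostUserMMap;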
> > > > +    /* File offset */
> > > > +    uint64_t fd_offset;
> > > > +    /* Offset within the shared memory region */
> > > > +    uint64_t shm_offset;
> > > > +    /* Size of region to map */
> > >
> > > To avoid giving "region" additional meanings:
> > >
> > > s/Size of the region to map/Size of the mapping/
> >
> > Ok, I will change it in the next drop. Probably will keep the RFC for
> > at least one more patch, seeing that there are a few things I wasn't
> > correctly considering.
> >
> > Thanks for all the feedback!
>
> Great, thanks!
>
> Stefan
>
> > > > +    uint64_t len;
> > > > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> > > > +    uint64_t flags;
> > > > +} VhostUserMMap;
> > > > +
> > > >  typedef struct {
> > > >      VhostUserRequest request;
> > > >
> > > > @@ -224,6 +243,7 @@ typedef union {
> > > >          VhostUserInflight inflight;
> > > >          VhostUserShared object;
> > > >          VhostUserTransferDeviceState transfer_state;
> > > > +        VhostUserMMap mmap;
> > > >  } VhostUserPayload;
> > > >
> > > >  typedef struct VhostUserMsg {
> > > > @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
> > > >      return 0;
> > > >  }
> > > >
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> > > > +                                    VhostUserMMap *vu_mmap,
> > > > +                                    int fd)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (fd < 0) {
> > > > +        error_report("Bad fd for map");
> > > > +        return -EBADF;
> > > > +    }
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> > > > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to mmap mem fd");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> > > > +                                      VhostUserMMap *vu_mmap)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
> > > > +
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to unmap memory");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > >  static void close_backend_channel(struct vhost_user *u)
> > > >  {
> > > >      g_source_destroy(u->backend_src);
> > > > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc, GIOCondition condition,
> > > >          ret = vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
> > > >                                                               &hdr, &payload);
> > > >          break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_MAP:
> > > > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> > > > +                                                  fd ? fd[0] : -1);
> > > > +        break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> > > > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> > > > +        break;
> > > >      default:
> > > >          error_report("Received unexpected msg type: %d.", hdr.request);
> > > >          ret = -EINVAL;
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > > index 893a072c9d..59596370ec 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
> > > >                       virtio_vmstate_change, vdev);
> > > >      vdev->device_endian = virtio_default_endian();
> > > >      vdev->use_guest_notifier_mask = true;
> > > > +    vdev->shmem_list = NULL;
> > > > +    vdev->n_shmem_regions = 0;
> > > >  }
> > > >
> > > >  /*
> > > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > > index 7d5ffdc145..34bec26593 100644
> > > > --- a/include/hw/virtio/virtio.h
> > > > +++ b/include/hw/virtio/virtio.h
> > > > @@ -165,6 +165,9 @@ struct VirtIODevice
> > > >       */
> > > >      EventNotifier config_notifier;
> > > >      bool device_iotlb_enabled;
> > > > +    /* Shared memory region for vhost-user mappings. */
> > > > +    MemoryRegion *shmem_list;
> > > > +    int n_shmem_regions;
> > > >  };
> > > >
> > > >  struct VirtioDeviceClass {
> > > > --
> > > > 2.44.0
> > > >

BR,
Albert.