Hi Stefan,

On Wed, Jun 5, 2024 at 4:28 PM Stefan Hajnoczi <stefa...@redhat.com> wrote:

> On Wed, Jun 05, 2024 at 10:13:32AM +0200, Albert Esteve wrote:
> > On Tue, Jun 4, 2024 at 8:54 PM Stefan Hajnoczi <stefa...@redhat.com>
> wrote:
> >
> > > On Thu, May 30, 2024 at 05:22:23PM +0200, Albert Esteve wrote:
> > > > Add SHMEM_MAP/UNMAP requests to vhost-user.
> > > >
> > > > This request allows backends to dynamically map
> > > > fds into a shared memory region identified by
> > >
> > > Please call this "VIRTIO Shared Memory Region" everywhere (code,
> > > vhost-user spec, commit description, etc) so it's clear that this is not
> > > about vhost-user shared memory tables/regions.
> > >
> > > > its `shmid`. Then, the fd memory is advertised
> > > > to the frontend through a BAR+offset, so it can
> > > > be read by the driver while it's valid.
> > >
> > > Why is a PCI BAR mentioned here? vhost-user does not know about the
> > > VIRTIO Transport (e.g. PCI) being used. It's the frontend's job to
> > > report VIRTIO Shared Memory Regions to the driver.
> > >
> > >
> > I will remove PCI BAR, as it is true that it depends on the
> > transport. I was trying to explain that the driver
> > will use the shm_base + shm_offset to access
> > the mapped memory.
> >
> >
> > > >
> > > > Then, the backend can munmap the memory range
> > > > in a given shared memory region (again, identified
> > > > by its `shmid`), to free it. After this, the
> > > > region becomes private and shall not be accessed
> > > > by the frontend anymore.
> > >
> > > What does "private" mean?
> > >
> > > The frontend must mmap PROT_NONE to reserve the virtual memory space
> > > when no fd is mapped in the VIRTIO Shared Memory Region. Otherwise an
> > > unrelated mmap(NULL, ...) might use that address range and the guest
> > > would have access to the host memory! This is a security issue and needs
> > > to be mentioned explicitly in the spec.
> > >
> >
> > I mentioned private because it changes the mapping from MAP_SHARED
> > to MAP_PRIVATE. I will highlight PROT_NONE instead.
>
> I see. Then "MAP_PRIVATE" would be clearer. I wasn't sure whether you
> mean mmap flags or something like the memory range is no longer
> accessible to the driver.
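
Got it. To make sure I write the right thing in the spec, the lifecycle I
have in mind for the front-end is roughly the following (untested sketch;
shmem_size, shm_offset, len and fd_offset are placeholders):

    /* At region setup: reserve the whole VIRTIO Shared Memory Region so an
     * unrelated mmap(NULL, ...) can never land inside it. */
    void *base = mmap(NULL, shmem_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    /* On SHMEM_MAP: map the backend's fd into the reservation
     * (PROT_* chosen from the request's R/W flags). */
    mmap((uint8_t *)base + shm_offset, len, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, fd_offset);

    /* On SHMEM_UNMAP: return the range to the PROT_NONE reservation. */
    mmap((uint8_t *)base + shm_offset, len, PROT_NONE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
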
>
> >
> >
> > >
> > > >
> > > > Initializing the memory region is the responsibility
> > > > of the PCI device that will be using it.
> > >
> > > What does this mean?
> > >
> >
> > The MemoryRegion is declared in `struct VirtIODevice`,
> > but it is uninitialized in this commit. So I was trying to say
> > that the initialization will happen in, e.g., vhost-user-gpu-pci.c
> > with something like `memory_region_init`, and later `pci_register_bar`.
>
> Okay. The device model needs to create MemoryRegion instances for the
> device's Shared Memory Regions and add them to the VirtIODevice.
>
> --device vhost-user-device will need to query the backend since, unlike
> vhost-user-gpu-pci and friends, it doesn't have knowledge of specific
> device types. It will need to create MemoryRegions enumerated from the
> backend.
>
> By the way, the VIRTIO MMIO Transport also supports VIRTIO Shared Memory
> Regions so this work should not be tied to PCI.
>
> >
> > I am testing that part still.
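
For reference, what I am testing for the device-specific case looks
roughly like this (very rough, untested sketch; shmem_size, pci_dev and
bar_idx are placeholders):

    vdev->shmem_list = g_new0(MemoryRegion, 1);
    vdev->n_shmem_regions = 1;
    memory_region_init(&vdev->shmem_list[0], OBJECT(vdev),
                       "vhost-user-shm", shmem_size);
    /* PCI transport only: expose the region to the guest through a BAR. */
    pci_register_bar(pci_dev, bar_idx,
                     PCI_BASE_ADDRESS_SPACE_MEMORY |
                     PCI_BASE_ADDRESS_MEM_TYPE_64 |
                     PCI_BASE_ADDRESS_MEM_PREFETCH,
                     &vdev->shmem_list[0]);

I take your point that the MemoryRegion setup itself must not live in
PCI-specific code, since the MMIO transport needs it too.
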
> >
> >
> > >
> > > >
> > > > Signed-off-by: Albert Esteve <aest...@redhat.com>
> > > > ---
> > > >  docs/interop/vhost-user.rst |  23 ++++++++
> > > >  hw/virtio/vhost-user.c      | 106 ++++++++++++++++++++++++++++++++++++
> > > >  hw/virtio/virtio.c          |   2 +
> > > >  include/hw/virtio/virtio.h  |   3 +
> > > >  4 files changed, 134 insertions(+)
> > >
> > > Two missing pieces:
> > >
> > > 1. QEMU's --device vhost-user-device needs a way to enumerate VIRTIO
> > > Shared Memory Regions from the vhost-user backend. vhost-user-device is
> > > a generic vhost-user frontend without knowledge of the device type, so
> > > it doesn't know what the valid shmids are and what size the regions
> > > have.
> > >
> >
> > Ok. I was assuming that a request from a device (backend) without a
> > valid shmid, or without enough room in the region to perform the mmap,
> > would just fail when the VMM performs the actual mmap operation. So it
> > would not necessarily need to know about valid shmids or region sizes.
>
> But then --device vhost-user-device wouldn't be able to support VIRTIO
> Shared Memory Regions, which means this patch series is incomplete. New
> vhost-user features need to support both --device vhost-user-<type>-*
> and --device vhost-user-device.
>
> What's needed is:
> 1. New vhost-user messages so the frontend can query the shmids and
>    sizes from the backend.
> 2. QEMU --device vhost-user-device code that queries the VIRTIO Shared
>    Memory Regions from the vhost-user backend and then creates
>    MemoryRegions for them.
>
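
Understood. So, concretely, something like a new front-end request? The
name, ID and layout below are only my guess to check I got the idea,
nothing that exists today:

    /* Hypothetical VHOST_USER_GET_SHMEM_CONFIG (front-end -> back-end) */
    typedef struct {
        uint64_t nregions;                    /* VIRTIO Shared Memory Regions */
        uint64_t size[VHOST_USER_MAX_SHMEM];  /* size per shmid, 0 if unused */
    } VhostUserShMemConfig;

Then --device vhost-user-device would walk the reply and create a
MemoryRegion for every entry with a non-zero size.
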
> >
> >
> > >
> > > 2. Other backends don't see these mappings. If the guest submits a
> > > vring descriptor referencing a mapping to another backend, then that
> > > backend won't be able to access this memory. David Gilbert hit this
> > > problem when working on the virtiofs DAX Window. Either the frontend
> > > needs to forward all SHMEM_MAP/UNMAP messages to the other backends
> > > (inefficient and maybe racy!) or a new "memcpy" message is needed as a
> > > fallback for when vhost-user memory region translation fails.
> > >
> >
> > Ok. In which scenario would another device want to access the same
> > mapping? If it is for a shared VIRTIO object, the device that receives
> > the dmabuf fd could just do a new mapping in its VIRTIO shared memory
> > region.
>
> You can reproduce this with two virtiofs devices, where device A has the
> DAX Window feature enabled. The test program mmaps the DAX file and then
> opens a new file on device B and performs an O_DIRECT write(2) syscall.
>
> I think that this issue must be addressed in this series, since it is
> introduced by enabling VIRTIO Shared Memory Regions in vhost-user.
>
> A bit more detail:
>
> Device A has a VIRTIO Shared Memory Region. An application mmaps that
> memory (examples: guest userspace driver using Linux VFIO, a guest
> kernel driver that exposes the memory to userspace via mmap, or guest
> kernel DAX). The application passes that memory as an I/O buffer to
> device B (e.g. O_DIRECT disk I/O).
>
> The result is that device B's vhost-user backend receives a vring
> descriptor that points to a guest memory address in device A's VIRTIO
> Shared Memory Region. Since device B does not have this memory in its
> table, it cannot translate the address and the device breaks.
>
>
I am working on the next version, and I am struggling with how to solve
this situation.

So, IIUC, it is device B (in your example) that will notice it cannot
translate the address, and its backend will then send a message (let's
say, SHMEM_COPY) to the front-end so that the front-end copies the
mappings done on device A over to device B. Something like that? Or,
better, only the mapping that triggered the message?

But the message would have to point to device A, so that the front-end
knows the source and destination for the memcpy, and it would have to
indicate the memory range involved (offset + len). I am not sure how the
front-end would know that it needs to copy from device A to device B
after receiving the new message, nor how it would get a pointer into
both devices (device B is easy, as it is the one that sent the message,
but device A is not). Or maybe I misunderstood your suggested solution.
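
To make the question concrete, this is the kind of payload I was
imagining (purely hypothetical; the message name and fields are
invented):

    /* Sent by a back-end that cannot translate a guest address found in a
     * descriptor, asking the front-end to perform the copy on its behalf. */
    typedef struct {
        uint64_t guest_addr;  /* address device B failed to translate */
        uint64_t len;
        uint64_t flags;       /* direction of the copy (read/write) */
    } VhostUserMemCopy;

Or is the idea that, since the front-end performed every SHMEM_MAP, it
could resolve guest_addr to the right VIRTIO Shared Memory Region
(device A's) on its own, so device B never needs to identify device A?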

BR,
Albert.


> >
> >
> > >
> > > >
> > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
> > > > index d8419fd2f1..3caf2a290c 100644
> > > > --- a/docs/interop/vhost-user.rst
> > > > +++ b/docs/interop/vhost-user.rst
> > > > @@ -1859,6 +1859,29 @@ is sent by the front-end.
> > > >    when the operation is successful, or non-zero otherwise. Note that if the
> > > >    operation fails, no fd is sent to the backend.
> > > >
> > > > +``VHOST_USER_BACKEND_SHMEM_MAP``
> > > > +  :id: 9
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: fd and ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends to advertise a new mapping
> > > > +  to be made in a given shared memory region. Upon receiving the message,
> > > > +  QEMU will mmap the given fd into the shared memory region with the
> > >
> > > s/QEMU/the frontend/
> > >
> > > > +  requested ``shmid``. A reply is generated indicating whether mapping
> > > > +  succeeded.
> > >
> > > Please document whether mapping over an existing mapping is allowed. I
> > > think it should be allowed because it might be useful to atomically
> > > update a mapping without a race where the driver sees unmapped memory.
> > >
> > >
> > So in my understanding, the frontend (driver) in the guest would
> > initiate the mmap/munmap by sending a request to the backend (vhost-user
> > device). Then the vhost-user device sends a request to the VMM to
> > perform the mapping. We could enforce an ACK to ensure that the mmap
> > operation finished before the vhost-user device responds to the driver,
> > and thus avoid races. This way, we support only the simple use case of
> > not allowing mmaps over an already mapped region.
>
> I think it's fine to start with the stricter model where no overlapping
> mappings are allowed. If someone needs overlapping mappings in the
> future, a feature bit could be added to negotiate support.
>
> Please document that overlapping mappings are not supported.
>
> > > If mapping over an existing mapping is allowed, does the new mapping
> > > need to cover the old mapping exactly, or can it span multiple previous
> > > mappings or a subset of an existing mapping?
> > >
> > > From a security point of view we need to be careful here. A potentially
> > > untrusted backend process now has the ability to mmap an arbitrary file
> > > descriptor into the frontend process. The backend can cause
> > > denial of service by creating many small mappings to exhaust the OS
> > > limits on virtual memory areas. The backend can map memory to use as
> > > part of a security compromise, so we need to be sure the virtual memory
> > > addresses are not leaked to the backend and only read/write page
> > > permissions are available.
> > >
> >
> > Right, security from untrusted backends is usually the hardest part for
> > me. But vhost-user devices only see shm_offset, so this should be safe,
> > no?
>
> Yes, I think the current interface is safe, too.
>
> > >
> > > > +``VHOST_USER_BACKEND_SHMEM_UNMAP``
> > > > +  :id: 10
> > > > +  :equivalent ioctl: N/A
> > > > +  :request payload: ``struct VhostUserMMap``
> > > > +  :reply payload: N/A
> > > > +
> > > > +  This message can be submitted by the backends so that QEMU un-mmap
> > >
> > > s/QEMU/the frontend/
> > >
> >
> > This is probably my bad, but I really thought the frontend is the driver.
> > So frontend/backend as alternative terms for vhost-user driver/device.
>
> "vhost-user driver" is not a term that's used, as far as I know. The
> vhost-user spec calls it the front-end (older code and docs may call it
> the vhost-user master).
>
> > And then here we would keep QEMU or use VMM instead?
>
> The vhost-user spec uses both front-end and QEMU to mean the same thing.
> VMM is not used in the vhost-user spec. I suggest using front-end in new
> spec text because QEMU is not the only application that implements this
> spec anymore.
>
> >
> >
> > >
> > > > +  a given range (``offset``, ``len``) in the shared memory region with the
> > > > +  requested ``shmid``.
> > >
> > > Does the range need to correspond to a previously-mapped VhostUserMMap
> > > or can it cross multiple VhostUserMMaps, be a subset of a VhostUserMMap,
> > > etc?
> > >
> >
> > I would prefer to keep it simple and disallow mapping over a
> > previously-mapped region. The range needs to correspond to a valid
> > (within size bounds), free memory region, or else the request will fail.
> > I will add that to the text.
> >
> > Nonetheless, I am open to discuss other options.
>
> That sounds good. I'm not aware of any use cases that require anything
> fancy.
>
> >
> >
> > >
> > > > +  A reply is generated indicating whether unmapping succeeded.
> > > > +
> > > >  .. _reply_ack:
> > > >
> > > >  VHOST_USER_PROTOCOL_F_REPLY_ACK
> > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
> > > > index cdf9af4a4b..9526b9d07f 100644
> > > > --- a/hw/virtio/vhost-user.c
> > > > +++ b/hw/virtio/vhost-user.c
> > > > @@ -115,6 +115,8 @@ typedef enum VhostUserBackendRequest {
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_ADD = 6,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE = 7,
> > > >      VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP = 8,
> > > > +    VHOST_USER_BACKEND_SHMEM_MAP = 9,
> > > > +    VHOST_USER_BACKEND_SHMEM_UNMAP = 10,
> > > >      VHOST_USER_BACKEND_MAX
> > > >  }  VhostUserBackendRequest;
> > > >
> > > > @@ -192,6 +194,23 @@ typedef struct VhostUserShared {
> > > >      unsigned char uuid[16];
> > > >  } VhostUserShared;
> > > >
> > > > +/* For the flags field of VhostUserMMap */
> > > > +#define VHOST_USER_FLAG_MAP_R (1u << 0)
> > > > +#define VHOST_USER_FLAG_MAP_W (1u << 1)
> > > > +
> > > > +typedef struct {
> > > > +    /* Shared memory region ID */
> > > > +    uint8_t shmid;
> > >
> > > There is a hole (padding) in the struct since the following fields are
> > > naturally aligned to 8 bytes. I suggest moving shmid to the end to
> > > reduce the chance of information leaks if padding is not zeroed.
> > >
> >
> > I see. I can move it to the end of the struct or add a padding field in
> > between. I'll do what you suggested, as it sounds like the simplest
> > solution.
> >
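
So the struct would become something like this (just restating your
suggestion, with the trailing padding made explicit):

    typedef struct {
        /* File offset */
        uint64_t fd_offset;
        /* Offset within the VIRTIO Shared Memory Region */
        uint64_t shm_offset;
        /* Size of the mapping */
        uint64_t len;
        /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
        uint64_t flags;
        /* Shared memory region ID */
        uint8_t shmid;
        uint8_t padding[7];   /* must be zeroed */
    } VhostUserMMap;
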
> >
> > >
> > > > +    /* File offset */
> > > > +    uint64_t fd_offset;
> > > > +    /* Offset within the shared memory region */
> > > > +    uint64_t shm_offset;
> > > > +    /* Size of region to map */
> > >
> > > To avoid giving "region" additional meanings:
> > >
> > > s/Size of the region to map/Size of the mapping/
> > >
> >
> > Ok, I will change it in the next drop. Probably will keep the RFC for
> > at least one more patch, seeing that there are a few things I wasn't
> > correctly considering.
> >
> > Thanks for all the feedback!
>
> Great, thanks!
>
> Stefan
>
> >
> >
> > >
> > > > +    uint64_t len;
> > > > +    /* Flags for the mmap operation, from VHOST_USER_FLAG_* */
> > > > +    uint64_t flags;
> > > > +} VhostUserMMap;
> > > > +
> > > >  typedef struct {
> > > >      VhostUserRequest request;
> > > >
> > > > @@ -224,6 +243,7 @@ typedef union {
> > > >          VhostUserInflight inflight;
> > > >          VhostUserShared object;
> > > >          VhostUserTransferDeviceState transfer_state;
> > > > +        VhostUserMMap mmap;
> > > >  } VhostUserPayload;
> > > >
> > > >  typedef struct VhostUserMsg {
> > > > @@ -1748,6 +1768,85 @@ vhost_user_backend_handle_shared_object_lookup(struct vhost_user *u,
> > > >      return 0;
> > > >  }
> > > >
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_map(struct vhost_dev *dev,
> > > > +                                  VhostUserMMap *vu_mmap,
> > > > +                                  int fd)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (fd < 0) {
> > > > +        error_report("Bad fd for map");
> > > > +        return -EBADF;
> > > > +    }
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_R) ? PROT_READ : 0) |
> > > > +        ((vu_mmap->flags & VHOST_USER_FLAG_MAP_W) ? PROT_WRITE : 0),
> > > > +        MAP_SHARED | MAP_FIXED, fd, vu_mmap->fd_offset);
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to mmap mem fd");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > > +static int
> > > > +vhost_user_backend_handle_shmem_unmap(struct vhost_dev *dev,
> > > > +                                    VhostUserMMap *vu_mmap)
> > > > +{
> > > > +    void *addr = 0;
> > > > +    MemoryRegion *mr = NULL;
> > > > +
> > > > +    if (!dev->vdev->shmem_list ||
> > > > +        dev->vdev->n_shmem_regions <= vu_mmap->shmid) {
> > > > +        error_report("Shared memory region at "
> > > > +                     "ID %d unitialized", vu_mmap->shmid);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    mr = &dev->vdev->shmem_list[vu_mmap->shmid];
> > > > +
> > > > +    if ((vu_mmap->shm_offset + vu_mmap->len) < vu_mmap->len ||
> > > > +        (vu_mmap->shm_offset + vu_mmap->len) > mr->size) {
> > > > +        error_report("Bad offset/len for mmap %" PRIx64 "+%" PRIx64,
> > > > +                     vu_mmap->shm_offset, vu_mmap->len);
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    void *shmem_ptr = memory_region_get_ram_ptr(mr);
> > > > +
> > > > +    addr = mmap(shmem_ptr + vu_mmap->shm_offset, vu_mmap->len,
> > > > +                PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
> > > > +
> > > > +    if (addr == MAP_FAILED) {
> > > > +        error_report("Failed to unmap memory");
> > > > +        return -EFAULT;
> > > > +    }
> > > > +
> > > > +    return 0;
> > > > +}
> > > > +
> > > >  static void close_backend_channel(struct vhost_user *u)
> > > >  {
> > > >      g_source_destroy(u->backend_src);
> > > > @@ -1816,6 +1915,13 @@ static gboolean backend_read(QIOChannel *ioc, GIOCondition condition,
> > > >          ret = vhost_user_backend_handle_shared_object_lookup(dev->opaque, ioc,
> > > >                                                               &hdr, &payload);
> > > >          break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_MAP:
> > > > +        ret = vhost_user_backend_handle_shmem_map(dev, &payload.mmap,
> > > > +                                                  fd ? fd[0] : -1);
> > > > +        break;
> > > > +    case VHOST_USER_BACKEND_SHMEM_UNMAP:
> > > > +        ret = vhost_user_backend_handle_shmem_unmap(dev, &payload.mmap);
> > > > +        break;
> > > >      default:
> > > >          error_report("Received unexpected msg type: %d.", hdr.request);
> > > >          ret = -EINVAL;
> > > > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > > > index 893a072c9d..59596370ec 100644
> > > > --- a/hw/virtio/virtio.c
> > > > +++ b/hw/virtio/virtio.c
> > > > @@ -3264,6 +3264,8 @@ void virtio_init(VirtIODevice *vdev, uint16_t device_id, size_t config_size)
> > > >              virtio_vmstate_change, vdev);
> > > >      vdev->device_endian = virtio_default_endian();
> > > >      vdev->use_guest_notifier_mask = true;
> > > > +    vdev->shmem_list = NULL;
> > > > +    vdev->n_shmem_regions = 0;
> > > >  }
> > > >
> > > >  /*
> > > > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
> > > > index 7d5ffdc145..34bec26593 100644
> > > > --- a/include/hw/virtio/virtio.h
> > > > +++ b/include/hw/virtio/virtio.h
> > > > @@ -165,6 +165,9 @@ struct VirtIODevice
> > > >       */
> > > >      EventNotifier config_notifier;
> > > >      bool device_iotlb_enabled;
> > > > +    /* Shared memory region for vhost-user mappings. */
> > > > +    MemoryRegion *shmem_list;
> > > > +    int n_shmem_regions;
> > > >  };
> > > >
> > > >  struct VirtioDeviceClass {
> > > > --
> > > > 2.44.0
> > > >
> > >
>
