On Tue, Jan 14, 2020 at 02:12:46AM -0500, Michael S. Tsirkin wrote: > > On Mon, Dec 09, 2019 at 02:00:47AM -0500, Raphael Norwitz wrote: > > The current vhost-user implementation in Qemu imposes a limit on the > > maximum number of memory slots exposed to a VM using a vhost-user > > device. This change provides a new protocol feature > > VHOST_USER_F_CONFIGURE_SLOTS which, when enabled, lifts this limit > > and allows a VM with a vhost-user device to expose a configurable > > number of memory slots, up to the maximum supported by the platform. > > Existing backends are unaffected. > > > > This feature works by using three new messages, > > VHOST_USER_GET_MAX_MEM_SLOTS, VHOST_USER_ADD_MEM_REG and > > VHOST_USER_REM_MEM_REG. VHOST_USER_GET_MAX_MEM_SLOTS gets the > > number of memory slots the backend is willing to accept. Then, > > when the memory tables are set or updated, a series of > > VHOST_USER_ADD_MEM_REG and VHOST_USER_REM_MEM_REG messages are sent > > to transmit the regions to map and/or unmap instead of trying to > > send all the regions in one fixed size VHOST_USER_SET_MEM_TABLE message. > > > > The vhost_user struct maintains a shadow state of the VM’s memory > > regions. When the memory tables are modified, the > > vhost_user_set_mem_table() function compares the new device memory state > > to the shadow state and only sends regions which need to be unmapped or > > mapped in. The regions which must be unmapped are sent first, followed > > by the new regions to be mapped in. After all the messages have been sent, > > the shadow state is set to the current virtual device state. > > > > The current feature implementation does not work with postcopy migration > > and cannot be enabled if the VHOST_USER_PROTOCOL_F_REPLY_ACK feature has > > also been negotiated. 
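The shadow-state reconciliation the commit message describes (diff the new device memory table against the cached copy, send removals first, then additions, then overwrite the cache) can be sketched in isolation. This is a simplified illustration with hypothetical types and helper names, not the actual QEMU code; real messages would be emitted where this sketch fills the `removed`/`added` arrays:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for VhostUserMemoryRegion. */
typedef struct {
    uint64_t guest_phys_addr;
    uint64_t userspace_addr;
    uint64_t memory_size;
} MemReg;

static bool reg_equal(const MemReg *a, const MemReg *b)
{
    return a->guest_phys_addr == b->guest_phys_addr &&
           a->userspace_addr == b->userspace_addr &&
           a->memory_size == b->memory_size;
}

/*
 * Compare the shadow state against the new device state: regions present
 * only in the shadow state land in 'removed' (these would each trigger a
 * VHOST_USER_REM_MEM_REG message), regions present only in the new state
 * land in 'added' (VHOST_USER_ADD_MEM_REG). Finally the shadow state is
 * overwritten with the new device state.
 */
static void sync_mem_table(MemReg *shadow, size_t *nshadow,
                           const MemReg *dev, size_t ndev,
                           MemReg *removed, size_t *nremoved,
                           MemReg *added, size_t *nadded)
{
    bool found[64] = { false };   /* which device regions matched */

    *nremoved = 0;
    *nadded = 0;

    for (size_t i = 0; i < *nshadow; i++) {
        bool matching = false;    /* reset for each shadow region */
        for (size_t j = 0; j < ndev; j++) {
            if (reg_equal(&shadow[i], &dev[j])) {
                matching = true;
                found[j] = true;
                break;
            }
        }
        if (!matching) {
            removed[(*nremoved)++] = shadow[i];
        }
    }

    for (size_t j = 0; j < ndev; j++) {
        if (!found[j]) {
            added[(*nadded)++] = dev[j];
        }
    }

    /* Make the shadow state match the updated device state. */
    memcpy(shadow, dev, ndev * sizeof(*dev));
    *nshadow = ndev;
}
```

A region that appears unchanged in both tables generates no message at all, which is the point of the scheme: a memory hot-add only costs one VHOST_USER_ADD_MEM_REG rather than a full VHOST_USER_SET_MEM_TABLE resend.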
> > > > Signed-off-by: Raphael Norwitz <raphael.norw...@nutanix.com> > > --- > > docs/interop/vhost-user.rst | 43 ++++++++ > > hw/virtio/vhost-user.c | 251 > > ++++++++++++++++++++++++++++++++++++++++---- > > 2 files changed, 273 insertions(+), 21 deletions(-) > > > > diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst > > index 7827b71..855a072 100644 > > --- a/docs/interop/vhost-user.rst > > +++ b/docs/interop/vhost-user.rst > > @@ -785,6 +785,7 @@ Protocol features > > #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10 > > #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11 > > #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12 > > + #define VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS 13 > > > > Master message types > > -------------------- > > @@ -1190,6 +1191,48 @@ Master message types > > ancillary data. The GPU protocol is used to inform the master of > > rendering state and updates. See vhost-user-gpu.rst for details. > > > > +``VHOST_USER_GET_MAX_MEM_SLOTS`` > > + :id: 34 > > + :equivalent ioctl: N/A > > + :slave payload: u64 > > + > > + When the VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol feature has been > > + successfully negotiated, this message is submitted by master to the > > + slave. The slave should return the message with a u64 payload > > + containing the maximum number of memory slots for QEMU to expose to > > + the guest. This message is not supported with postcopy migration or if > > + the VHOST_USER_PROTOCOL_F_REPLY_ACK feature has also been negotiated. > > + > > +``VHOST_USER_ADD_MEM_REG`` > > + :id: 35 > > + :equivalent ioctl: N/A > > + :slave payload: memory region > > + > > + When the VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol feature has been > > + successfully negotiated, this message is submitted by master to the > > slave. > > + The message payload contains a memory region descriptor struct, > > describing > > + a region of guest memory which the slave device must map in. 
When the > > + VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol feature has been > > successfully > > + negotiated, along with the VHOST_USER_REM_MEM_REG message, this message > > is > > + used to set and update the memory tables of the slave device. This > > message > > + is not supported with postcopy migration or if the > > + VHOST_USER_PROTOCOL_F_REPLY_ACK feature has also been negotiated. > > + > > +``VHOST_USER_REM_MEM_REG`` > > + :id: 36 > > + :equivalent ioctl: N/A > > + :slave payload: memory region > > + > > + When the VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol feature has been > > + successfully negotiated, this message is submitted by master to the > > slave. > > + The message payload contains a memory region descriptor struct, > > describing > > + a region of guest memory which the slave device must unmap. When the > > + VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol feature has been > > successfully > > + negotiated, along with the VHOST_USER_ADD_MEM_REG message, this message > > is > > + used to set and update the memory tables of the slave device. This > > message > > + is not supported with postcopy migration or if the > > + VHOST_USER_PROTOCOL_F_REPLY_ACK feature has also been negotiated. > > + > > Slave message types > > ------------------- > > > > diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c > > index 2134e81..3432462 100644 > > --- a/hw/virtio/vhost-user.c > > +++ b/hw/virtio/vhost-user.c > > @@ -35,11 +35,29 @@ > > #include <linux/userfaultfd.h> > > #endif > > > > -#define VHOST_MEMORY_MAX_NREGIONS 8 > > +#define VHOST_MEMORY_LEGACY_NREGIONS 8 > > #define VHOST_USER_F_PROTOCOL_FEATURES 30 > > #define VHOST_USER_SLAVE_MAX_FDS 8 > > > > /* > > + * Set maximum number of RAM slots supported to > > + * the maximum number supported by the target > > + * hardware platform.
> > + */ > > +#if defined(TARGET_X86) || defined(TARGET_X86_64) || \ > > + defined(TARGET_ARM) || defined(TARGET_ARM_64) > > +#include "hw/acpi/acpi.h" > > +#define VHOST_USER_MAX_RAM_SLOTS ACPI_MAX_RAM_SLOTS > > + > > +#elif defined(TARGET_PPC) || defined(TARGET_PPC_64) > > +#include "hw/ppc/spapr.h" > > +#define VHOST_USER_MAX_RAM_SLOTS SPAPR_MAX_RAM_SLOTS > > + > > +#else > > +#define VHOST_USER_MAX_RAM_SLOTS 512 > > +#endif > > + > > +/* > > * Maximum size of virtio device config space > > */ > > #define VHOST_USER_MAX_CONFIG_SIZE 256 > > @@ -58,6 +76,7 @@ enum VhostUserProtocolFeature { > > VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD = 10, > > VHOST_USER_PROTOCOL_F_HOST_NOTIFIER = 11, > > VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD = 12, > > + VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS = 13, > > VHOST_USER_PROTOCOL_F_MAX > > }; > > > > @@ -98,6 +117,9 @@ typedef enum VhostUserRequest { > > VHOST_USER_GET_INFLIGHT_FD = 31, > > VHOST_USER_SET_INFLIGHT_FD = 32, > > VHOST_USER_GPU_SET_SOCKET = 33, > > + VHOST_USER_GET_MAX_MEM_SLOTS = 34, > > + VHOST_USER_ADD_MEM_REG = 35, > > + VHOST_USER_REM_MEM_REG = 36, > > VHOST_USER_MAX > > } VhostUserRequest; > > > > @@ -119,9 +141,14 @@ typedef struct VhostUserMemoryRegion { > > typedef struct VhostUserMemory { > > uint32_t nregions; > > uint32_t padding; > > - VhostUserMemoryRegion regions[VHOST_MEMORY_MAX_NREGIONS]; > > + VhostUserMemoryRegion regions[VHOST_MEMORY_LEGACY_NREGIONS]; > > } VhostUserMemory; > > > > +typedef struct VhostUserMemRegMsg { > > + uint32_t padding; > > + VhostUserMemoryRegion region; > > +} VhostUserMemRegMsg; > > + > > typedef struct VhostUserLog { > > uint64_t mmap_size; > > uint64_t mmap_offset; > > @@ -180,6 +207,7 @@ typedef union { > > struct vhost_vring_state state; > > struct vhost_vring_addr addr; > > VhostUserMemory memory; > > + VhostUserMemRegMsg mem_reg; > > VhostUserLog log; > > struct vhost_iotlb_msg iotlb; > > VhostUserConfig config; > > @@ -208,7 +236,7 @@ struct vhost_user { > > int slave_fd; > > 
NotifierWithReturn postcopy_notifier; > > struct PostCopyFD postcopy_fd; > > - uint64_t postcopy_client_bases[VHOST_MEMORY_MAX_NREGIONS]; > > + uint64_t postcopy_client_bases[VHOST_USER_MAX_RAM_SLOTS]; > > /* Length of the region_rb and region_rb_offset arrays */ > > size_t region_rb_len; > > /* RAMBlock associated with a given region */ > > @@ -220,6 +248,10 @@ struct vhost_user { > > > > /* True once we've entered postcopy_listen */ > > bool postcopy_listen; > > + > > + /* Our current regions */ > > + int num_shadow_regions; > > + VhostUserMemoryRegion shadow_regions[VHOST_USER_MAX_RAM_SLOTS]; > > }; > > > > static bool ioeventfd_enabled(void) > > @@ -368,7 +400,7 @@ int vhost_user_gpu_set_socket(struct vhost_dev *dev, > > int fd) > > static int vhost_user_set_log_base(struct vhost_dev *dev, uint64_t base, > > struct vhost_log *log) > > { > > - int fds[VHOST_MEMORY_MAX_NREGIONS]; > > + int fds[VHOST_USER_MAX_RAM_SLOTS]; > > size_t fd_num = 0; > > bool shmfd = virtio_has_feature(dev->protocol_features, > > VHOST_USER_PROTOCOL_F_LOG_SHMFD); > > @@ -426,7 +458,7 @@ static int vhost_user_fill_set_mem_table_msg(struct > > vhost_user *u, > > &offset); > > fd = memory_region_get_fd(mr); > > if (fd > 0) { > > - if (*fd_num == VHOST_MEMORY_MAX_NREGIONS) { > > + if (*fd_num == VHOST_MEMORY_LEGACY_NREGIONS) { > > error_report("Failed preparing vhost-user memory table > > msg"); > > return -1; > > } > > @@ -472,7 +504,7 @@ static int vhost_user_set_mem_table_postcopy(struct > > vhost_dev *dev, > > struct vhost_memory *mem) > > { > > struct vhost_user *u = dev->opaque; > > - int fds[VHOST_MEMORY_MAX_NREGIONS]; > > + int fds[VHOST_MEMORY_LEGACY_NREGIONS]; > > size_t fd_num = 0; > > VhostUserMsg msg_reply; > > int region_i, msg_i; > > @@ -521,7 +553,7 @@ static int vhost_user_set_mem_table_postcopy(struct > > vhost_dev *dev, > > } > > > > memset(u->postcopy_client_bases, 0, > > - sizeof(uint64_t) * VHOST_MEMORY_MAX_NREGIONS); > > + sizeof(uint64_t) * VHOST_USER_MAX_RAM_SLOTS); 
> > > > /* They're in the same order as the regions that were sent > > * but some of the regions were skipped (above) if they > > @@ -562,18 +594,151 @@ static int vhost_user_set_mem_table_postcopy(struct > > vhost_dev *dev, > > return 0; > > } > > > > +static inline bool reg_equal(VhostUserMemoryRegion *shadow_reg, > > + struct vhost_memory_region *vdev_reg) > > +{ > > + if (shadow_reg->guest_phys_addr == vdev_reg->guest_phys_addr && > > + shadow_reg->userspace_addr == vdev_reg->userspace_addr && > > + shadow_reg->memory_size == vdev_reg->memory_size) { > > + return true; > > + } else { > > + return false; > > + } > > +} > > + > > +static int vhost_user_send_add_remove_regions(struct vhost_dev *dev, > > + struct vhost_memory *mem, > > + VhostUserMsg *msg) > > +{ > > + struct vhost_user *u = dev->opaque; > > + int i, j, fd; > > + bool found[VHOST_USER_MAX_RAM_SLOTS] = {}; > > + bool matching = false; > > + struct vhost_memory_region *reg; > > + ram_addr_t offset; > > + MemoryRegion *mr; > > + > > + /* > > + * Ensure the VHOST_USER_PROTOCOL_F_REPLY_ACK has not been > > + * negotiated and no postcopy migration is in progress. > > + */ > > + assert(!virtio_has_feature(dev->protocol_features, > > + VHOST_USER_PROTOCOL_F_REPLY_ACK)); > > + if (u->postcopy_listen && u->postcopy_fd.handler) { > > + error_report("Postcopy migration is not supported when the " > > + "VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS feature " > > + "has been negotiated"); > > + return -1; > > + } > > + > > + msg->hdr.size = sizeof(msg->payload.mem_reg.padding); > > + msg->hdr.size += sizeof(VhostUserMemoryRegion); > > + > > + /* > > + * Send VHOST_USER_REM_MEM_REG for memory regions in our shadow state > > + * which are not found in the device's memory state.
> > + */ > > + for (i = 0; i < u->num_shadow_regions; ++i) { > > + reg = dev->mem->regions; > > + > > + for (j = 0; j < dev->mem->nregions; j++) { > > + reg = dev->mem->regions + j; > > + > > + assert((uintptr_t)reg->userspace_addr == reg->userspace_addr); > > + mr = memory_region_from_host((void > > *)(uintptr_t)reg->userspace_addr, > > + &offset); > > + fd = memory_region_get_fd(mr); > > + > > + if (reg_equal(&u->shadow_regions[i], reg)) { > > + matching = true; > > + found[j] = true; > > + break; > > + } > > + } > > + > > + if (fd > 0 && !matching) { > > + msg->hdr.request = VHOST_USER_REM_MEM_REG; > > + msg->payload.mem_reg.region.userspace_addr = > > reg->userspace_addr; > > + msg->payload.mem_reg.region.memory_size = reg->memory_size; > > + msg->payload.mem_reg.region.guest_phys_addr = > > + reg->guest_phys_addr; > > + msg->payload.mem_reg.region.mmap_offset = offset; > > + > > + if (vhost_user_write(dev, msg, &fd, 1) < 0) { > > + return -1; > > + } > > + } > > + } > > + > > + /* > > + * Send messages to add regions present in the device which are not > > + * in our shadow state. > > + */ > > + for (i = 0; i < dev->mem->nregions; ++i) { > > + reg = dev->mem->regions + i; > > + > > + /* > > + * If the region was in both the shadow and vdev state we don't > > + * need to send a VHOST_USER_ADD_MEM_REG message for it. 
> > + */ > > + if (found[i]) { > > + continue; > > + } > > + > > + assert((uintptr_t)reg->userspace_addr == reg->userspace_addr); > > + mr = memory_region_from_host((void > > *)(uintptr_t)reg->userspace_addr, > > + &offset); > > + fd = memory_region_get_fd(mr); > > + > > + if (fd > 0) { > > + msg->hdr.request = VHOST_USER_ADD_MEM_REG; > > + msg->payload.mem_reg.region.userspace_addr = > > reg->userspace_addr; > > + msg->payload.mem_reg.region.memory_size = reg->memory_size; > > + msg->payload.mem_reg.region.guest_phys_addr = > > + reg->guest_phys_addr; > > + msg->payload.mem_reg.region.mmap_offset = offset; > > + > > + if (vhost_user_write(dev, msg, &fd, 1) < 0) { > > + return -1; > > + } > > + } > > + } > > + > > + /* Make our shadow state match the updated device state. */ > > + u->num_shadow_regions = dev->mem->nregions; > > + for (i = 0; i < dev->mem->nregions; ++i) { > > + reg = dev->mem->regions + i; > > + u->shadow_regions[i].guest_phys_addr = reg->guest_phys_addr; > > + u->shadow_regions[i].userspace_addr = reg->userspace_addr; > > + u->shadow_regions[i].memory_size = reg->memory_size; > > + } > > + > > + return 0; > > +} > > + > > static int vhost_user_set_mem_table(struct vhost_dev *dev, > > struct vhost_memory *mem) > > { > > struct vhost_user *u = dev->opaque; > > - int fds[VHOST_MEMORY_MAX_NREGIONS]; > > + int fds[VHOST_MEMORY_LEGACY_NREGIONS]; > > size_t fd_num = 0; > > bool do_postcopy = u->postcopy_listen && u->postcopy_fd.handler; > > bool reply_supported = virtio_has_feature(dev->protocol_features, > > > > VHOST_USER_PROTOCOL_F_REPLY_ACK); > > + bool config_slots = > > + virtio_has_feature(dev->protocol_features, > > + VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS); > > > > if (do_postcopy) { > > - /* Postcopy has enough differences that it's best done in it's own > > + if (config_slots) { > > + error_report("Postcopy migration not supported with " > > + "VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS feature " > > + "enabled."); > > + return -1; > > + } > > + > > 
+ /* > > + * Postcopy has enough differences that it's best done in its own > > * version > > */ > > return vhost_user_set_mem_table_postcopy(dev, mem); > > @@ -587,17 +752,22 @@ static int vhost_user_set_mem_table(struct vhost_dev > > *dev, > > msg.hdr.flags |= VHOST_USER_NEED_REPLY_MASK; > > } > > > > - if (vhost_user_fill_set_mem_table_msg(u, dev, &msg, fds, &fd_num, > > - false) < 0) { > > - return -1; > > - } > > - > > - if (vhost_user_write(dev, &msg, fds, fd_num) < 0) { > > - return -1; > > - } > > + if (config_slots && !reply_supported) { > > + if (vhost_user_send_add_remove_regions(dev, mem, &msg) < 0) { > > + return -1; > > + } > > + } else { > > + if (vhost_user_fill_set_mem_table_msg(u, dev, &msg, fds, &fd_num, > > + false) < 0) { > > + return -1; > > + } > > + if (vhost_user_write(dev, &msg, fds, fd_num) < 0) { > > + return -1; > > + } > > > > - if (reply_supported) { > > - return process_message_reply(dev, &msg); > > + if (reply_supported) { > > + return process_message_reply(dev, &msg); > > + } > > } > > > > return 0; > > @@ -762,7 +932,7 @@ static int vhost_set_vring_file(struct vhost_dev *dev, > > VhostUserRequest request, > > struct vhost_vring_file *file) > > { > > - int fds[VHOST_MEMORY_MAX_NREGIONS]; > > + int fds[VHOST_USER_MAX_RAM_SLOTS]; > > size_t fd_num = 0; > > VhostUserMsg msg = { > > .hdr.request = request, > > @@ -1496,7 +1666,46 @@ static int vhost_user_get_vq_index(struct vhost_dev > > *dev, int idx) > > > > static int vhost_user_memslots_limit(struct vhost_dev *dev) > > { > > - return VHOST_MEMORY_MAX_NREGIONS; > > + VhostUserMsg msg = { > > + .hdr.request = VHOST_USER_GET_MAX_MEM_SLOTS, > > + .hdr.flags = VHOST_USER_VERSION, > > + }; > > + > > + if (!virtio_has_feature(dev->protocol_features, > > + VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS)) { > > + return VHOST_MEMORY_LEGACY_NREGIONS; > > + } > > + > > + if (virtio_has_feature(dev->protocol_features, > > + VHOST_USER_PROTOCOL_F_REPLY_ACK)) { > > I'd just assume that
VHOST_USER_GET_MAX_MEM_SLOTS always > gets a response. > > Makes sense > > + error_report("The VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS protocol " > > + "feature is not supported when the " > > + "VHOST_USER_PROTOCOL_F_REPLY_ACK feature has also " > > + "been negotiated"); > > + return -1; > > + } > > + > > + if (vhost_user_write(dev, &msg, NULL, 0) < 0) { > > + return -1; > > + } > > This will send the message each time e.g. memory hotplug > triggers. Why not just get this value during init time? > Also, any way we can cut down number of round trips? > Can we combine this value in a single message with > some other stuff we are fetching on startup? >
Agreed, sending a VHOST_USER_GET_MAX_MEM_SLOTS message on every hot-add is unnecessary. I can think of a couple of ways to get the number of memslots without adding any additional round trips. I don’t like either of them though.

1. Only 14 of the 64 bits in the VHOST_USER_GET_FEATURES message are used to store feature flags. If CONFIGURE_SLOTS is enabled, we could use the remaining 64 - VHOST_USER_PROTOCOL_F_MAX bits to store the maximum number of memory slots. We would only need around 10 bits to express a reasonable number of slots, so that still leaves room for plenty of future features with VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS negotiated.

2. We could always have it sent from the backend with an existing VHOST_USER_GET_* message in vhost_user_backend_init(). The best option I see is using VHOST_USER_GET_QUEUE_NUM if VHOST_USER_PROTOCOL_F_MQ is negotiated. This has the obvious drawback of requiring VHOST_USER_PROTOCOL_F_MQ in order to negotiate VHOST_USER_PROTOCOL_F_CONFIGURE_SLOTS. I don’t see another option without such a limitation.

Both (1) and (2) seem hacky. I think it’s preferable to keep the VHOST_USER_GET_MAX_MEM_SLOTS message but send it once on init and cache the response in the vhost_user struct. Do you agree?

> > + > > + if (vhost_user_read(dev, &msg) < 0) { > > + return -1; > > + } > > + > > + if (msg.hdr.request != VHOST_USER_GET_MAX_MEM_SLOTS) { > > + error_report("Received unexpected msg type. Expected %d received > > %d", > > + VHOST_USER_GET_MAX_MEM_SLOTS, msg.hdr.request); > > + return -1; > > + } > > + > > + if (msg.hdr.size != sizeof(msg.payload.u64)) { > > + error_report("Received bad msg size"); > > + return -1; > > + } > > + > > + return MIN(MAX(msg.payload.u64, VHOST_MEMORY_LEGACY_NREGIONS), > > + VHOST_USER_MAX_RAM_SLOTS); > > What's this trying to do? Pls add code comments. > Should be MIN(msg.payload.u64, VHOST_USER_MAX_RAM_SLOTS) - will fix > > } > > > > static bool vhost_user_requires_shm_log(struct vhost_dev *dev) > > -- > > 1.8.3.1 > >
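The clamp the reviewer questioned, and the simpler form the author agrees to switch to, behave differently for backends that advertise fewer than the legacy 8 slots. A standalone sketch (constants match the patch; function names are illustrative, not from the code):

```c
#include <stdint.h>

#define VHOST_MEMORY_LEGACY_NREGIONS 8
#define VHOST_USER_MAX_RAM_SLOTS 512

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/*
 * As posted: the backend's answer is first raised to at least the legacy
 * 8 slots, then capped at the platform maximum. A backend advertising
 * fewer than 8 slots would be silently overridden.
 */
static uint64_t memslots_limit_posted(uint64_t backend_max)
{
    return MIN(MAX(backend_max, VHOST_MEMORY_LEGACY_NREGIONS),
               VHOST_USER_MAX_RAM_SLOTS);
}

/*
 * As the reviewer suggests ("will fix"): only cap at the platform
 * maximum, taking a small backend limit at its word.
 */
static uint64_t memslots_limit_fixed(uint64_t backend_max)
{
    return MIN(backend_max, VHOST_USER_MAX_RAM_SLOTS);
}
```

For any backend answer of 8 or more the two forms agree; they diverge only below the legacy limit, which is why the reviewer asked what the extra MAX() was for.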