On 26.02.2025 14:49, Cédric Le Goater wrote:
On 2/19/25 21:34, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
Since it's important to finish loading the device state transferred via the
main migration channel (via the save_live_iterate SaveVMHandler) before
starting to load the data asynchronously transferred via multifd, the thread
doing the actual loading of the multifd-transferred data is only started
from the switchover_start SaveVMHandler.

The switchover_start handler is called when the MIG_CMD_SWITCHOVER_START
sub-command of QEMU_VM_COMMAND is received via the main migration channel.
This sub-command is only sent after all save_live_iterate data have already
been posted, so it is safe to commence loading of the multifd-transferred
device state upon receiving it - loading of save_live_iterate data happens
synchronously in the main migration thread (much like the processing of
MIG_CMD_SWITCHOVER_START), so by the time MIG_CMD_SWITCHOVER_START is
processed all the preceding data must have already been loaded.
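(Illustration only: a minimal sketch of how a switchover_start handler can
hand off to the loader thread. The wrapper name vfio_load_bufs_thread_fn and
the plain QemuThread spawn are assumptions for the sketch; the series may use
a migration-core helper instead.)

/* Sketch - names marked (*) are hypothetical, not from the patch. */
static int vfio_switchover_start_sketch(VFIODevice *vbasedev)   /* (*) */
{
    VFIOMultifd *multifd = vbasedev->migration->multifd;

    /*
     * All save_live_iterate data was loaded synchronously in the main
     * migration thread before MIG_CMD_SWITCHOVER_START arrived, so the
     * multifd-transferred device state can be consumed safely now.
     */
    qemu_mutex_lock(&multifd->load_bufs_mutex);
    multifd->load_bufs_thread_running = true;
    qemu_thread_create(&multifd->load_bufs_thread, "vfio-load-bufs",
                       vfio_load_bufs_thread_fn /* (*) wrapper */, vbasedev,
                       QEMU_THREAD_JOINABLE);
    qemu_mutex_unlock(&multifd->load_bufs_mutex);

    return 0;
}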
Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
---
hw/vfio/migration-multifd.c | 225 ++++++++++++++++++++++++++++++++++++
hw/vfio/migration-multifd.h | 2 +
hw/vfio/migration.c | 12 ++
hw/vfio/trace-events | 5 +
4 files changed, 244 insertions(+)
diff --git a/hw/vfio/migration-multifd.c b/hw/vfio/migration-multifd.c
index 5d5ee1393674..b3a88c062769 100644
--- a/hw/vfio/migration-multifd.c
+++ b/hw/vfio/migration-multifd.c
@@ -42,8 +42,13 @@ typedef struct VFIOStateBuffer {
} VFIOStateBuffer;
typedef struct VFIOMultifd {
+ QemuThread load_bufs_thread;
+ bool load_bufs_thread_running;
+ bool load_bufs_thread_want_exit;
+
VFIOStateBuffers load_bufs;
QemuCond load_bufs_buffer_ready_cond;
+ QemuCond load_bufs_thread_finished_cond;
QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
uint32_t load_buf_idx;
uint32_t load_buf_idx_last;
@@ -179,6 +184,175 @@ bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
return true;
}
+static int vfio_load_bufs_thread_load_config(VFIODevice *vbasedev)
+{
+ return -EINVAL;
+}
please move to next patch.
As I wrote on the previous version of the patch set at
https://lore.kernel.org/qemu-devel/4f335de0-ba9f-4537-b230-2cf8af1c1...@maciej.szmigiero.name/:
The dummy implementation has to be there, otherwise the code wouldn't
compile at the time of the previous commit, since that
vfio_load_bufs_thread_load_config() call is a part of
vfio_load_bufs_thread().
This is an artifact of splitting the whole load operation into
multiple commits.
I think adding empty dummy implementations is the typical way
to do this - much like you asked today to leave
vfio_multifd_transfer_setup() returning true unconditionally
before being filled with the real implementation in a later patch.
See also my response at the end of this e-mail message, below
the call to vfio_load_bufs_thread_load_config().
+static VFIOStateBuffer *vfio_load_state_buffer_get(VFIOMultifd *multifd)
+{
+ VFIOStateBuffer *lb;
+ guint bufs_len;
guint: I guess it's ok to use here. It is not common practice in VFIO.
+
+ bufs_len = vfio_state_buffers_size_get(&multifd->load_bufs);
+ if (multifd->load_buf_idx >= bufs_len) {
+ assert(multifd->load_buf_idx == bufs_len);
+ return NULL;
+ }
+
+ lb = vfio_state_buffers_at(&multifd->load_bufs,
+ multifd->load_buf_idx);
Could be one line. minor.
+ if (!lb->is_present) {
+ return NULL;
+ }
+
+ return lb;
+}
+
+static bool vfio_load_state_buffer_write(VFIODevice *vbasedev,
+ VFIOStateBuffer *lb,
+ Error **errp)
+{
+ VFIOMigration *migration = vbasedev->migration;
+ VFIOMultifd *multifd = migration->multifd;
+ g_autofree char *buf = NULL;
+ char *buf_cur;
+ size_t buf_len;
+
+ if (!lb->len) {
+ return true;
+ }
+
+ trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
+ multifd->load_buf_idx);
I think we can move this trace event to vfio_load_bufs_thread().
It would get messy since we don't load empty buffers,
so we don't print this trace point (and its _end sibling)
for empty buffers.
If we printed this in vfio_load_bufs_thread() then it would
need to duplicate that !lb->len check.
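Roughly like this (hypothetical restructuring for illustration, not code
from the series):

    /* sketch: if the trace points were hoisted into vfio_load_bufs_thread(),
     * the buffer-emptiness check would have to be repeated around the call */
    if (lb->len) {
        trace_vfio_load_state_device_buffer_load_start(vbasedev->name,
                                                       multifd->load_buf_idx);
    }

    if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
        ret = false;
        goto ret_signal;
    }

    if (lb->len) {
        trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
                                                     multifd->load_buf_idx);
    }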
+ /* lb might become re-allocated when we drop the lock */
+ buf = g_steal_pointer(&lb->data);
+ buf_cur = buf;
+ buf_len = lb->len;
+ while (buf_len > 0) {
+ ssize_t wr_ret;
+ int errno_save;
+
+ /*
+ * Loading data to the device takes a while,
+ * drop the lock during this process.
+ */
+ qemu_mutex_unlock(&multifd->load_bufs_mutex);
+ wr_ret = write(migration->data_fd, buf_cur, buf_len);
+ errno_save = errno;
+ qemu_mutex_lock(&multifd->load_bufs_mutex);
+
+ if (wr_ret < 0) {
+ error_setg(errp,
+ "writing state buffer %" PRIu32 " failed: %d",
+ multifd->load_buf_idx, errno_save);
+ return false;
+ }
+
+ assert(wr_ret <= buf_len);
+ buf_len -= wr_ret;
+ buf_cur += wr_ret;
+ }
+
+ trace_vfio_load_state_device_buffer_load_end(vbasedev->name,
+ multifd->load_buf_idx);
and drop this trace event.
That's important data since it shows how long it took to load that
buffer (_end - _start).
It's not the same information as _start(next buffer) - _start(current buffer),
since the next buffer might not have arrived yet, so its loading won't
start immediately after the end of loading of the previous one.
In which case, we can modify the parameters of vfio_load_state_buffer_write()
to directly take a 'VFIOMultifd *multifd' and an fd instead of
"migration->data_fd".
+
+ return true;
+}
+
+static bool vfio_load_bufs_thread_want_exit(VFIOMultifd *multifd,
+ bool *should_quit)
+{
+ return multifd->load_bufs_thread_want_exit || qatomic_read(should_quit);
+}
+
+/*
+ * This thread is spawned by vfio_multifd_switchover_start() which gets
+ * called upon encountering the switchover point marker in the main
+ * migration stream.
+ *
+ * It exits after either:
+ * * completing loading the remaining device state and device config, OR:
+ * * encountering some error while doing the above, OR:
+ * * being forcefully aborted by the migration core by it setting should_quit
+ * or by vfio_load_cleanup_load_bufs_thread() setting
+ * multifd->load_bufs_thread_want_exit.
+ */
+static bool vfio_load_bufs_thread(void *opaque, bool *should_quit,
+ Error **errp)
+{
+ VFIODevice *vbasedev = opaque;
+ VFIOMigration *migration = vbasedev->migration;
+ VFIOMultifd *multifd = migration->multifd;
+ bool ret = true;
+ int config_ret;
Not needed IMO, see below.
+
+ assert(multifd);
+ QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+ assert(multifd->load_bufs_thread_running);
We could add a trace event for the start and the end of the thread.
Added vfio_load_bufs_thread_{start,end} trace events now.
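For reference, they could look like this in hw/vfio/trace-events (the format
strings here are a guess mirroring the neighboring events, not necessarily
the exact ones added):

vfio_load_bufs_thread_start(const char *name) " (%s)"
vfio_load_bufs_thread_end(const char *name) " (%s)"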
+ while (true) {
+ VFIOStateBuffer *lb;
+
+ /*
+ * Always check cancellation first after the buffer_ready wait below
+ * in case that cond was signalled by
+ * vfio_load_cleanup_load_bufs_thread().
+ */
+ if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
+ error_setg(errp, "operation cancelled");
+ ret = false;
+ goto ret_signal;
goto thread_exit ?
I'm not sure that I fully understand this comment.
Do you mean to rename the ret_signal label to thread_exit?
+ }
+
+ assert(multifd->load_buf_idx <= multifd->load_buf_idx_last);
+
+ lb = vfio_load_state_buffer_get(multifd);
+ if (!lb) {
+ trace_vfio_load_state_device_buffer_starved(vbasedev->name,
+ multifd->load_buf_idx);
+ qemu_cond_wait(&multifd->load_bufs_buffer_ready_cond,
+ &multifd->load_bufs_mutex);
+ continue;
+ }
+
+ if (multifd->load_buf_idx == multifd->load_buf_idx_last) {
+ break;
+ }
+
+ if (multifd->load_buf_idx == 0) {
+ trace_vfio_load_state_device_buffer_start(vbasedev->name);
+ }
+
+ if (!vfio_load_state_buffer_write(vbasedev, lb, errp)) {
+ ret = false;
+ goto ret_signal;
+ }
+
+ if (multifd->load_buf_idx == multifd->load_buf_idx_last - 1) {
+ trace_vfio_load_state_device_buffer_end(vbasedev->name);
+ }
+
+ multifd->load_buf_idx++;
+ }
If ret is assigned true here, the "ret = false" assignments can be dropped.
I've inverted the "ret" logic here now: ret is initialized to false
at its definition, the "ret = false" assignment is removed from every
failure/early-exit block, and "ret = true" is added just before the
"ret_signal" label.
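Abridged, the resulting control flow looks like this (elided parts are
marked in comments):

static bool vfio_load_bufs_thread(void *opaque, bool *should_quit,
                                  Error **errp)
{
    bool ret = false;  /* pessimistic default */

    /* ... setup and locking as in the patch ... */

    while (true) {
        if (vfio_load_bufs_thread_want_exit(multifd, should_quit)) {
            error_setg(errp, "operation cancelled");
            goto ret_signal;  /* no per-site "ret = false" needed anymore */
        }

        /* ... buffer wait / write / index advance as in the patch ... */
    }

    /* ... config load as in the next patch ... */

    ret = true;  /* the single success assignment */

ret_signal:
    /* ... thread-finished signalling as in the patch ... */
    return ret;
}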
+ config_ret = vfio_load_bufs_thread_load_config(vbasedev);
+ if (config_ret) {
+ error_setg(errp, "load config state failed: %d", config_ret);
+ ret = false;
+ }
please move to next patch. This is adding nothing to this patch
since it's returning -EINVAL.
That's the whole point - if someone were to accidentally enable this
(for example by forgetting to apply the next patch when backporting
the series) it would fail safely with EINVAL instead of having a
half-broken implementation.
Another option would be to simply integrate the next patch into this
one, as these are two parts of the same single operation, and I think
splitting them in two brings little value in the end.
Thanks,
C.
Thanks,
Maciej