On 2/14/25 21:58, Maciej S. Szmigiero wrote:
On 12.02.2025 14:47, Cédric Le Goater wrote:
On 1/30/25 11:08, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>
The multifd received data needs to be reassembled since device state
packets sent via different multifd channels can arrive out-of-order.
Therefore, each VFIO device state packet carries a header indicating its
position in the stream.
The raw device state data is saved into a VFIOStateBuffer for later
in-order loading into the device.
The last such VFIO device state packet should have
VFIO_DEVICE_STATE_CONFIG_STATE flag set and carry the device config state.
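For reference, the packet header mentioned above boils down to roughly the following (the field types are an assumption based on how the fields are used further down in the patch):

    typedef struct VFIODeviceStatePacket {
        uint32_t version;   /* format version, currently 0 */
        uint32_t idx;       /* position of this packet in the device state stream */
        uint32_t flags;     /* VFIO_DEVICE_STATE_CONFIG_STATE marks the final packet */
        uint8_t data[0];    /* raw device state bytes, reassembled in idx order */
    } VFIODeviceStatePacket;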
Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
---
hw/vfio/migration.c | 116 ++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 2 +
hw/vfio/trace-events | 1 +
include/hw/vfio/vfio-common.h | 1 +
4 files changed, 120 insertions(+)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index bcdf204d5cf4..0c0caec1bd64 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -301,6 +301,12 @@ typedef struct VFIOStateBuffer {
} VFIOStateBuffer;
typedef struct VFIOMultifd {
+    VFIOStateBuffers load_bufs;
+    QemuCond load_bufs_buffer_ready_cond;
+    QemuMutex load_bufs_mutex; /* Lock order: this lock -> BQL */
+    uint32_t load_buf_idx;
+    uint32_t load_buf_idx_last;
+    uint32_t load_buf_queued_pending_buffers;
} VFIOMultifd;
static void vfio_state_buffer_clear(gpointer data)
@@ -346,6 +352,103 @@ static VFIOStateBuffer *vfio_state_buffers_at(VFIOStateBuffers *bufs, guint idx)
return &g_array_index(bufs->array, VFIOStateBuffer, idx);
}
Each routine executed from a migration thread should have a preliminary
comment saying from which context it is called: migration or VFIO
Do you mean like whether it is called from the code in qemu/migration/
directory or the code in hw/vfio/ directory?
Threads are spawned from different subsystems: from the migration callbacks
(save) and from VFIO (load, or rather the switchover phase, not the load phase).
It would be good to provide hints to the reader.
I am struggling to understand how this works. Imagine a newcomer
looking at the code and at the git history in two years' time ... Check
vfio in QEMU 1.3 (one small file) and see what it has become today.
What about internal linkage ("static") functions?
There shouldn't be any static left when all multifd code is moved
to its own hw/vfio/migration-multifd.c file.
Do they need such a comment too? That would actually decrease the readability
of these one-or-two line helpers due to the high comment-to-code ratio.
I meant the higher level routines.
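For instance, something as short as this above each higher level routine would
already help (the wording below is just an example):

    /*
     * Called from a multifd channel receive thread, outside the BQL.
     */
    static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
                                       Error **errp)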
Tbh, this lacks tons of documentation, under docs, under each file,
for the properties, etc. This should be addressed before resend.
As far as I can see, pretty much no existing VFIO migration function
has such a comment.

+static bool vfio_load_state_buffer_insert(VFIODevice *vbasedev,
+                                          VFIODeviceStatePacket *packet,
+                                          size_t packet_total_size,
+                                          Error **errp)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIOStateBuffer *lb;
+
+    vfio_state_buffers_assert_init(&multifd->load_bufs);
+    if (packet->idx >= vfio_state_buffers_size_get(&multifd->load_bufs)) {
+        vfio_state_buffers_size_set(&multifd->load_bufs, packet->idx + 1);
+    }
+
+    lb = vfio_state_buffers_at(&multifd->load_bufs, packet->idx);
+    if (lb->is_present) {
+        error_setg(errp, "state buffer %" PRIu32 " already filled",
+                   packet->idx);
+        return false;
+    }
+
+    assert(packet->idx >= multifd->load_buf_idx);
+
+    multifd->load_buf_queued_pending_buffers++;
+    if (multifd->load_buf_queued_pending_buffers >
+        vbasedev->migration_max_queued_buffers) {
+        error_setg(errp,
+                   "queuing state buffer %" PRIu32 " would exceed the max of %" PRIu64,
+                   packet->idx, vbasedev->migration_max_queued_buffers);
+        return false;
+    }
AFAICT, attributes multifd->load_buf_queued_pending_buffers and
vbasedev->migration_max_queued_buffers are not strictly necessary.
They allow counting buffers and checking against an arbitrary limit, which
is UINT64_MAX today. It makes me wonder how useful they are.
You are right they aren't strictly necessary and in fact they weren't
there in early versions of this patch set.
This was introduced upon Peter's request since otherwise the source could
theoretically cause the target QEMU to allocate unlimited
amounts of memory for buffers-in-flight:
https://lore.kernel.org/qemu-devel/9e85016e-ac72-4207-8e69-8cba054ce...@maciej.szmigiero.name/
(scroll to the "Risk of OOM on unlimited VFIO buffering" section).
If that's an actual risk in someone's use case then that person
could lower that limit from UINT64_MAX to, for example, 10 buffers.
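For example, something like this on the destination (assuming the property keeps
the name discussed below and the host address is just a placeholder; the exact
syntax may differ in the final series) would cap the number of not-yet-loaded
buffers queued for that device at 10:

    -device vfio-pci,host=0000:01:00.0,x-migration-max-queued-buffers=10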
Please introduce them in a separate patch at the end of the series,
adding documentation on the "x-migration-max-queued-buffers" property
and also general documentation on why and how to use it.
I can certainly move it to the end of the series - done now.
Great. Please add the explanation above to the commit log. We will decide
if this is experimental or not.
Also, I wonder if this should be a global migration property.
+
+    lb->data = g_memdup2(&packet->data, packet_total_size - sizeof(*packet));
+    lb->len = packet_total_size - sizeof(*packet);
+    lb->is_present = true;
+
+    return true;
+}
+
+static bool vfio_load_state_buffer(void *opaque, char *data, size_t data_size,
+                                   Error **errp)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    VFIOMultifd *multifd = migration->multifd;
+    VFIODeviceStatePacket *packet = (VFIODeviceStatePacket *)data;
+
+    /*
+     * Holding BQL here would violate the lock order and can cause
+     * a deadlock once we attempt to lock load_bufs_mutex below.
+     */
+    assert(!bql_locked());
+
+    if (!migration->multifd_transfer) {
+        error_setg(errp,
+                   "got device state packet but not doing multifd transfer");
+        return false;
+    }
+
+    assert(multifd);
+
+    if (data_size < sizeof(*packet)) {
+        error_setg(errp, "packet too short at %zu (min is %zu)",
+                   data_size, sizeof(*packet));
+        return false;
+    }
+
+    if (packet->version != 0) {
Please add a define for version, even if 0.
I've introduced a new define VFIO_DEVICE_STATE_PACKET_VER_CURRENT.
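So the check now compares against that define instead of a bare 0 (the value
stays 0 for the current packet format), roughly:

    #define VFIO_DEVICE_STATE_PACKET_VER_CURRENT 0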
+ error_setg(errp, "packet has unknown version %" PRIu32,
+ packet->version);
+ return false;
+ }
+
+ if (packet->idx == UINT32_MAX) {
+ error_setg(errp, "packet has too high idx %" PRIu32,
+ packet->idx);
I don't think printing out packet->idx is useful here.
Yeah, it's unlikely that the value of UINT32_MAX will ever change :)
Removed now.
+        return false;
+    }
+
+    trace_vfio_load_state_device_buffer_incoming(vbasedev->name, packet->idx);
I wonder if we can add thread ids to trace events. It would be useful.
load_state_buffer is called from multifd channel receive threads, so passing
the multifd channel id there would require adding this multifd-specific
parameter to qemu_loadvm_load_state_buffer() and to the load_state_buffer
SaveVMHandler.
'-msg timestamp=on' should be enough. Having logical thread names would
be nice. It's another topic.
+
+    QEMU_LOCK_GUARD(&multifd->load_bufs_mutex);
+
+    /* config state packet should be the last one in the stream */
+    if (packet->flags & VFIO_DEVICE_STATE_CONFIG_STATE) {
+        multifd->load_buf_idx_last = packet->idx;
+    }
+
+    if (!vfio_load_state_buffer_insert(vbasedev, packet, data_size, errp)) {
So the migration thread calling multifd_device_state_recv() will
exit
The thread calling multifd_device_state_recv() is a multifd
channel receive thread, not the migration thread.
and the vfio thread loading the state into the device will
hang until it is aborted ?
In the normal (successful) migration flow the vfio_load_bufs_thread()
will exit after loading (write()'ing) all buffers into the device
and then loading its config state.
In the aborted/error/unsuccessful migration flow it will get
terminated from vfio_load_cleanup() -> vfio_multifd_free() ->
vfio_load_cleanup_load_bufs_thread().
vfio_load_cleanup_load_bufs_thread() will signal
load_bufs_buffer_ready_cond and load_bufs_iter_done_cond since
the load thread indeed could be waiting on them.
This sequence is expected to be called to release the VFIO thread:

    while (multifd->load_bufs_thread_running) {
        multifd->load_bufs_thread_want_exit = true;
        qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
        ...
    }
right ?
Right, that's a part of the code in vfio_load_cleanup_load_bufs_thread().
ok. So I think this lacks comments on the thread termination points.
Please try to comment these areas of the code a bit more. I will
check the next version more closely.
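For instance, annotating the termination sequence above roughly like this would
already help (just a sketch of the kind of comments I have in mind, not the
final wording):

    /* called from vfio_load_cleanup() -> vfio_multifd_free() to stop vfio_load_bufs_thread() */
    while (multifd->load_bufs_thread_running) {
        /* ask the load thread to exit at its next wake-up point */
        multifd->load_bufs_thread_want_exit = true;

        /* wake it up in case it is blocked waiting for the next buffer */
        qemu_cond_signal(&multifd->load_bufs_buffer_ready_cond);
        ...
    }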
The way the series is presented makes it a bit complex to follow the
proposition, especially regarding the creation and termination of
threads, something the reader should be aware of.
As an initial step in clarifying the design, I would have preferred
a series of patches introducing the various threads, migration threads
and VFIO threads, without any workload. Once the creation and termination
points are established, I would then introduce the workload for each
thread.
When I review anything more complex (though it's not usually
in QEMU) I mainly follow the final code flow as an operation is handled,
since looking top-down at individual commits rarely gives
enough context to see how all the parts interact.
But for this the reviewer needs to see the whole code for the logical
operation, rather than just a part of it.
and this is very problematic :/ Very, very hard to maintain in the
long run. I also don't have *time* to dig into all the context. So please
try to keep it as simple as possible.
I think that adding the load operation in parts doesn't really
help, since the reason why things are done a certain way in earlier patches
only becomes apparent in later patches, and the earlier parts don't
really make much sense on their own.
Not to mention the extra code churn when rebasing/reworking, which increases
the chance of a typo or a copy-paste mistake happening at some point.
I also see that in comments to a later patch you dislike that
a dummy vfio_load_bufs_thread_load_config() gets added in one patch
and then immediately replaced by the real implementation in the next patch.
Previously, you also said that vfio_load_config_after_iter() seems
to be unused in the patch that adds it - that's exactly the kind of
issue that bringing the complete operation in one patch avoids.
Maybe I did. Sorry, I have switched context many times already and this
got lost along the way. Again, please help the reviewer. Changes
should be made obvious.
I agree that, for example, the x-migration-load-config-after-iter feature
could be a separate patch, as it is a relatively simple change.
Same goes for x-migration-max-queued-buffers checking/enforcement,
compat changes, exporting existing settings (variables) as properties
or adding a g_autoptr() cleanup function for an existing type.
That's why originally the VFIO part of the series was divided into two
parts - receive and send, since these are two separate, yet internally
complete operations.
I am now asking to have a better understanding of how threads are
created/terminated. It's another sub split of the load part AFAICT.
If you prefer we can forget about the load thread first, like I
asked initially iirc. I would very much prefer that for QEMU 10.0.
I also export the whole series (including the current WiP state, with
code moved to migration-multifd.{c,h} files, etc.) as a git tree at
https://gitlab.com/maciejsszmigiero/qemu/-/commits/multifd-device-state-transfer-vfio
since this way it can easily be seen how the QEMU code looks after
the whole patch set, or after any subset of patches there.
Overall, I think this is making great progress. For such complex
work, I would imagine a couple of RFCs first and half a dozen normal
series. So ~10 iterations. We are only at v4. At least two more are
expected.
Thanks,
C.