On 19.09.2024 23:11, Peter Xu wrote:
On Thu, Sep 19, 2024 at 09:49:10PM +0200, Maciej S. Szmigiero wrote:
On 9.09.2024 22:03, Peter Xu wrote:
On Tue, Aug 27, 2024 at 07:54:27PM +0200, Maciej S. Szmigiero wrote:
From: "Maciej S. Szmigiero" <maciej.szmigi...@oracle.com>

load_finish SaveVMHandler allows migration code to poll whether
a device-specific asynchronous device state loading operation has finished.

In order to avoid calling this handler needlessly the device is supposed
to notify the migration code of its possible readiness via a call to
qemu_loadvm_load_finish_ready_broadcast() while holding
qemu_loadvm_load_finish_ready_lock.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigi...@oracle.com>
---
   include/migration/register.h | 21 +++++++++++++++
   migration/migration.c        |  6 +++++
   migration/migration.h        |  3 +++
   migration/savevm.c           | 52 ++++++++++++++++++++++++++++++++++++
   migration/savevm.h           |  4 +++
   5 files changed, 86 insertions(+)

diff --git a/include/migration/register.h b/include/migration/register.h
index 4a578f140713..44d8cf5192ae 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -278,6 +278,27 @@ typedef struct SaveVMHandlers {
       int (*load_state_buffer)(void *opaque, char *data, size_t data_size,
                                Error **errp);
+    /**
+     * @load_finish
+     *
+     * Poll whether all asynchronous device state loading has finished.
+     * Not called on the load failure path.
+     *
+     * Called while holding the qemu_loadvm_load_finish_ready_lock.
+     *
+     * If this method signals "not ready" then it might not be called
+     * again until qemu_loadvm_load_finish_ready_broadcast() is invoked
+     * while holding qemu_loadvm_load_finish_ready_lock.

[1]

+     *
+     * @opaque: data pointer passed to register_savevm_live()
+     * @is_finished: whether the loading has finished (output parameter)
+     * @errp: pointer to Error*, to store an error if it happens.
+     *
+     * Returns zero to indicate success and negative for error.
+     * It's not an error that the loading still hasn't finished.
+     */
+    int (*load_finish)(void *opaque, bool *is_finished, Error **errp);

The load_finish() semantics is a bit weird, especially above [1] on "only
allowed to be called once if ..." and also on the locks.

The point of this remark is that a driver needs to call
qemu_loadvm_load_finish_ready_broadcast() if it wants the migration
core to call its load_finish handler again.
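
For illustration, a minimal sketch of that pairing is below (ExampleDevState
and its load_done flag are hypothetical, and the lock/unlock/broadcast helper
names are assumed from the doc comment above rather than copied from the
patch):

    /*
     * Hypothetical load_finish implementation: it just reports a flag that
     * the device's own loading thread sets.  Per the doc comment, it is
     * called with the ready lock already held.
     */
    static int example_load_finish(void *opaque, bool *is_finished, Error **errp)
    {
        ExampleDevState *dev = opaque;

        *is_finished = dev->load_done;
        return 0;
    }

    /*
     * Loading-thread side: set the flag and wake the migration core so it
     * polls example_load_finish() again.  The unlock helper name is an
     * assumption.
     */
    static void example_loading_thread_done(ExampleDevState *dev)
    {
        qemu_loadvm_load_finish_ready_lock();
        dev->load_done = true;
        qemu_loadvm_load_finish_ready_broadcast();
        qemu_loadvm_load_finish_ready_unlock();
    }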

It looks to me that vfio_load_finish() also does the final load of the device.

I wonder whether that final load can be done in the threads,

Here, the problem is that the current VFIO VMState has to be loaded from the
main migration thread, as it internally calls QEMU core address space
modification methods which explode if called from other threads.

Ahh, I see.  I'm trying to make the dest QEMU loadvm run in a thread too and
yield the BQL where possible; when that's ready, then in your case here IIUC
you can simply take the BQL in whichever thread loads it.. but yeah, it's not
ready yet at least..

Yeah, long term we might want to work on making these QEMU core address space
modification methods somehow callable from multiple threads, but that's
definitely not something for the initial patch set.
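
(For concreteness, once the threaded dest loadvm work exists, the "take the
BQL in whichever thread loads it" idea could look roughly like the
hypothetical sketch below; bql_lock()/bql_unlock() are the current names of
the big QEMU lock helpers, everything else is made up.)

    /*
     * Hypothetical: a device loading thread doing the final,
     * address-space-touching part of the load under the BQL.
     */
    static void example_load_config_in_thread(ExampleDevState *dev)
    {
        bql_lock();
        /* ... load the device config state, which modifies address spaces ... */
        bql_unlock();
    }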

Would it be possible for vfio_save_complete_precopy_async_thread_config_state()
to be done in VFIO's save_live_complete_precopy() through the main channel
somehow?  IOW, does it rely on the iterative data being fetched from the kernel
first, or is it a completely separate state?

The device state data needs to be fully loaded before "activating"
the device by loading its config state.

And just curious: how large is it
normally (and I suppose this decides whether it's suitable to be sent via
the main channel at all..)?

Config data is *much* smaller than the device state data - as far as I remember
it was on the order of kilobytes.


then after
everything is loaded the device posts a semaphore telling the main thread to
continue.  See e.g.:

      if (migrate_switchover_ack()) {
          qemu_loadvm_state_switchover_ack_needed(mis);
      }

IIUC, VFIO can register a load_complete_ack similarly so it only sem_post()s
when all things are loaded?  We can then get rid of this slightly awkward
interface.  I have a feeling that things can be simplified (e.g., if the
thread will take care of loading the final vmstate then the mutex is also
not needed? etc.).
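
(A rough sketch of that suggestion, for illustration only - the handler and
the semaphore are hypothetical names; only qemu_sem_wait()/qemu_sem_post()
are existing primitives.)

    /*
     * Hypothetical "ack"-style handler: the main thread blocks here until
     * the device's loading thread has loaded everything...
     */
    static void example_load_complete_ack(void *opaque)
    {
        ExampleDevState *dev = opaque;

        qemu_sem_wait(&dev->load_complete_sem);
    }

    /* ...and the loading thread posts the semaphore once it is all done. */
    static void example_loading_thread_all_done(ExampleDevState *dev)
    {
        qemu_sem_post(&dev->load_complete_sem);
    }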

With just a single call to switchover_ack_needed per VFIO device, it would
need to do a blocking wait for the device buffers and config state load
to finish, thereby blocking other VFIO devices from loading their config
state even if they are ready to begin this operation earlier.

I am not sure I get you here: loading VFIO device states (I mean, the
non-iterable part) will need to be done sequentially IIUC due to what you
said, and should rely on the BQL, so I don't know how that could happen
concurrently for now.  But I think the BQL is indeed a problem.
Consider that we have two VFIO devices (A and B), with the following order
of switchover_ack_needed handler calls for them: first A gets this call, and
once the call for A finishes, B gets this call.

Now consider what happens if B has loaded all its buffers (in the loading
thread) and is ready for its config load before A finishes loading its
buffers.

B has to sit idle in this situation (even though it could already have been
loading its config), since the switchover_ack_needed handler for A won't
return until A is fully done.

So IMHO this recv side interface is so far the major pain point in the series
that I really want to avoid (compared to the rest).  Let's see whether we
can come up with something better..

One other (probably not pretty..) idea is that while waiting here the main
thread yields the BQL, then other threads can take it and load the final
chunk of VFIO data.  But I could be missing something else.
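
(Roughly, in hypothetical form - only bql_lock()/bql_unlock() and
qemu_sem_wait() exist here; the rest reuses the made-up names from the
sketch above.)

    /*
     * Hypothetical: the main migration thread drops the BQL while it waits,
     * so a loading thread can take the lock for the final config load.
     */
    static void example_wait_for_final_load(ExampleDevState *dev)
    {
        bql_unlock();
        qemu_sem_wait(&dev->load_complete_sem);
        bql_lock();
    }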


I think temporarily dropping the BQL deep inside migration code is similar
to running the QEMU event loop deep inside migration code (about which
people complained in my generic thread pool implementation): it's easy
to miss some subtle dependency/race somewhere and accidentally cause a rare,
hard-to-debug deadlock.

That's why I think it's ultimately probably better to make the QEMU core
address space modification methods thread-safe / re-entrant instead.

Thanks,
Maciej

