On 21/10/2024 19:54, Peter Xu wrote:
External email: Use caution opening links or attachments


On Mon, Oct 21, 2024 at 06:43:13PM +0200, Cédric Le Goater wrote:
Hello,

IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():

      object_ref(OBJECT(s));
Could the migration have failed before ? in migrate_fd_connect()
I just noticed it's a vm state change notifier..

Yep.
I stumbled upon this bug by accident when running on a buggy machine.
Migration wasn't involved, I just started the VM, shut it down and got the assert (as my VFIO device was faulty and errored on RUNNING->STOP state change).

You can repro it by forcefully triggering an error on *->STOP transition:

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 17199b73ae..d41cb7c634 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -831,7 +831,9 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
     }

     ret = vfio_migration_set_state_or_reset(vbasedev, new_state, &local_err);
-    if (ret) {
+    if (ret || new_state == VFIO_DEVICE_STATE_STOP) {
+        ret = -1;
+        error_setg(&local_err, "%s: forced error", vbasedev->name);
         /*
          * Migration should be aborted in this case, but vm_state_notify()
          * currently does not support reporting failures.


If so, maybe VFIO could refer to its internal states showing that it's
during a migration first (so as to guarantee the migration object is valid;
e.g., only after save_setup() but before save_cleanup() hooks are invoked).

It's an option, but I think it's a bit awkward as we'd need to check that VFIOMigration->data_buffer is set or add a new flag. Besides that, as Cedric pointed out, VFIO code calls migration_is_setup_or_active() which can also be unsafe, as it might be invoked after migration object has been freed.

Maybe a simpler solution would be to extend the the migration object lifetime? Looking at commit history, you can see that commit 1f8956041ad3 ("migration: finalize current_migration object") added migration object finalization at the very end of qemu cleanup. Then came commit 892ae715b6bc ("migration: Cleanup during exit") and moved the migration object finalization to the beginning of qemu cleanup (before stopping the VM etc.).

If so, the fix could be something like the below?

-------------8<-------------
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613b..5eb099349a 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -52,6 +52,7 @@ void dump_vmstate_json_to_file(FILE *out_fp);

 /* migration/migration.c */
 void migration_object_init(void);
+void migration_object_finalize(void);
 void migration_shutdown(void);
 bool migration_is_idle(void);
 bool migration_is_active(void);
diff --git a/migration/migration.c b/migration/migration.c
index 021faee2f3..9eaa7947bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -265,6 +265,11 @@ void migration_object_init(void)
     dirty_bitmap_mig_init();
 }

+void migration_object_finalize(void)
+{
+    object_unref(OBJECT(current_migration));
+}
+
 typedef struct {
     QEMUBH *bh;
     QEMUBHFunc *cb;
@@ -330,7 +335,6 @@ void migration_shutdown(void)
      * stop the migration using this structure
      */
     migration_cancel(NULL);
-    object_unref(OBJECT(current_migration));

     /*
      * Cancel outgoing migration of dirty bitmaps. It should
diff --git a/system/runstate.c b/system/runstate.c
index c2c9afa905..fa823f5e72 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -930,5 +930,6 @@ void qemu_cleanup(int status)
     monitor_cleanup();
     qemu_chr_cleanup();
     user_creatable_cleanup();
+    migration_object_finalize();
     /* TODO: unref root container, check all devices are ok */
 }
-------------8<-------------

Thanks.


Reply via email to