On 21/10/2024 19:54, Peter Xu wrote:
External email: Use caution opening links or attachments
On Mon, Oct 21, 2024 at 06:43:13PM +0200, Cédric Le Goater wrote:
Hello,
IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():
object_ref(OBJECT(s));
Could the migration have failed before ? in migrate_fd_connect()
I just noticed it's a vm state change notifier..
Yep.
I stumbled upon this bug by accident when running on a buggy machine.
Migration wasn't involved, I just started the VM, shut it down and got
the assert (as my VFIO device was faulty and errored on RUNNING->STOP
state change).
You can repro it by forcefully triggering an error on *->STOP transition:
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 17199b73ae..d41cb7c634 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -831,7 +831,9 @@ static void vfio_vmstate_change(void *opaque, bool
running, RunState state)
}
ret = vfio_migration_set_state_or_reset(vbasedev, new_state,
&local_err);
- if (ret) {
+ if (ret || new_state == VFIO_DEVICE_STATE_STOP) {
+ ret = -1;
+ error_setg(&local_err, "%s: forced error", vbasedev->name);
/*
* Migration should be aborted in this case, but vm_state_notify()
* currently does not support reporting failures.
If so, maybe VFIO could refer to its internal states showing that it's
during a migration first (so as to guarantee the migration object is valid;
e.g., only after save_setup() but before save_cleanup() hooks are invoked).
It's an option, but I think it's a bit awkward as we'd need to check
that VFIOMigration->data_buffer is set or add a new flag.
Besides that, as Cedric pointed out, VFIO code calls
migration_is_setup_or_active() which can also be unsafe, as it might be
invoked after migration object has been freed.
Maybe a simpler solution would be to extend the the migration object
lifetime?
Looking at commit history, you can see that commit 1f8956041ad3
("migration: finalize current_migration object") added migration object
finalization at the very end of qemu cleanup.
Then came commit 892ae715b6bc ("migration: Cleanup during exit") and
moved the migration object finalization to the beginning of qemu cleanup
(before stopping the VM etc.).
If so, the fix could be something like the below?
-------------8<-------------
diff --git a/include/migration/misc.h b/include/migration/misc.h
index bfadc5613b..5eb099349a 100644
--- a/include/migration/misc.h
+++ b/include/migration/misc.h
@@ -52,6 +52,7 @@ void dump_vmstate_json_to_file(FILE *out_fp);
/* migration/migration.c */
void migration_object_init(void);
+void migration_object_finalize(void);
void migration_shutdown(void);
bool migration_is_idle(void);
bool migration_is_active(void);
diff --git a/migration/migration.c b/migration/migration.c
index 021faee2f3..9eaa7947bc 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -265,6 +265,11 @@ void migration_object_init(void)
dirty_bitmap_mig_init();
}
+void migration_object_finalize(void)
+{
+ object_unref(OBJECT(current_migration));
+}
+
typedef struct {
QEMUBH *bh;
QEMUBHFunc *cb;
@@ -330,7 +335,6 @@ void migration_shutdown(void)
* stop the migration using this structure
*/
migration_cancel(NULL);
- object_unref(OBJECT(current_migration));
/*
* Cancel outgoing migration of dirty bitmaps. It should
diff --git a/system/runstate.c b/system/runstate.c
index c2c9afa905..fa823f5e72 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -930,5 +930,6 @@ void qemu_cleanup(int status)
monitor_cleanup();
qemu_chr_cleanup();
user_creatable_cleanup();
+ migration_object_finalize();
/* TODO: unref root container, check all devices are ok */
}
-------------8<-------------
Thanks.