Re: [RFC PATCH] Fix race in live migration failure path

Peter Xu Wed, 22 Jan 2025 08:41:25 -0800

Hi, Shivam,

On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote:
> There is one place where we set the migration status to FAILED but we don’t 
> set
> s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but
> not sure setting s->error in this case is possible (or required), as it seems 
> a
> different workflow to me.


Yes it's very different indeed.  I may not yet fully get the challenge on
how switching to s->error (implies FAILING) would affect this one, though.
I mean, in qemu_savevm_state() it tries to fetch qemufile errors with:

    ret = qemu_file_get_error(f);

IIUC we could also try to fetch s->error just like what migration thread
does, and if it sets means it's failing?

> 
> In addition, one potentially real problem that I see is this comment in
> migration_detect_error:
> /*
>  * For postcopy, we allow the network to be down for a
>  * while. After that, it can be continued by a
>  * recovery phase.
>  */
> Let's say if we set s->error at some place and there was a file error on 
> either
> source or destination (qemu_file_get_error_obj_any returns a positive value

This is trivial, but I suppose you meant s/positive/negative/ here.. as
qemufile's last_error should always be negative, iiuc.

> when called by migration_detect_error). We expect migration to fail in this
> case but migration will continue to run since post-copy migration is tolerant
> to file errors?

Yes it can halt at postcopy_pause().  I'm not yet understand why it's an
issue to using s->error, though.

In general, I'm thinking whether we could also check s->error in detect
error path like this:

===8<===
diff --git a/migration/migration.c b/migration/migration.c
index 2d1da917c7..fbd97395e0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3015,17 +3015,17 @@ static MigThrError 
migration_detect_error(MigrationState *s)
     ret = qemu_file_get_error_obj_any(s->to_dst_file,
                                       s->postcopy_qemufile_src,
                                       &local_error);
-    if (!ret) {
-        /* Everything is fine */
-        assert(!local_error);
-        return MIG_THR_ERR_NONE;
-    }
-
-    if (local_error) {
+    if (ret) {
+        /* Passover qemufile errors to s->error */
+        assert(local_error);
         migrate_set_error(s, local_error);
         error_free(local_error);
     }
 
+    if (!migrate_has_error(s)) {
+        return MIG_THR_ERR_NONE;
+    }
+
     if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) {
         /*
          * For postcopy, we allow the network to be down for a
@@ -3037,6 +3037,8 @@ static MigThrError migration_detect_error(MigrationState 
*s)
         /*
          * For precopy (or postcopy with error outside IO), we fail
          * with no time.
+         *
+         * TODO: update FAILED only until the end of migration in BH.
          */
         migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED);
         trace_migration_thread_file_err();
===8<===

I kept a TODO above, I would hope if you reworked everything to route
errors to s->error, then we can move this to the cleanup BH to avoid the
race.

Do you think that could work?

Thanks,

-- 
Peter Xu

Re: [RFC PATCH] Fix race in live migration failure path

Reply via email to