Hi, Shivam, On Wed, Jan 22, 2025 at 10:54:17AM +0000, Shivam Kumar wrote: > There is one place where we set the migration status to FAILED but we don’t > set > s->error, i.e. in save_snapshot workflow, please check qemu_savevm_state but > not sure setting s->error in this case is possible (or required), as it seems > a > different workflow to me.
Yes it's very different indeed. I may not yet fully get the challenge on how switching to s->error (implies FAILING) would affect this one, though. I mean, in qemu_savevm_state() it tries to fetch qemufile errors with: ret = qemu_file_get_error(f); IIUC we could also try to fetch s->error just like what migration thread does, and if it sets means it's failing? > > In addition, one potentially real problem that I see is this comment in > migration_detect_error: > /* > * For postcopy, we allow the network to be down for a > * while. After that, it can be continued by a > * recovery phase. > */ > Let's say if we set s->error at some place and there was a file error on > either > source or destination (qemu_file_get_error_obj_any returns a positive value This is trivial, but I suppose you meant s/positive/negative/ here.. as qemufile's last_error should always be negative, iiuc. > when called by migration_detect_error). We expect migration to fail in this > case but migration will continue to run since post-copy migration is tolerant > to file errors? Yes it can halt at postcopy_pause(). I'm not yet understand why it's an issue to using s->error, though. In general, I'm thinking whether we could also check s->error in detect error path like this: ===8<=== diff --git a/migration/migration.c b/migration/migration.c index 2d1da917c7..fbd97395e0 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3015,17 +3015,17 @@ static MigThrError migration_detect_error(MigrationState *s) ret = qemu_file_get_error_obj_any(s->to_dst_file, s->postcopy_qemufile_src, &local_error); - if (!ret) { - /* Everything is fine */ - assert(!local_error); - return MIG_THR_ERR_NONE; - } - - if (local_error) { + if (ret) { + /* Passover qemufile errors to s->error */ + assert(local_error); migrate_set_error(s, local_error); error_free(local_error); } + if (!migrate_has_error(s)) { + return MIG_THR_ERR_NONE; + } + if (state == MIGRATION_STATUS_POSTCOPY_ACTIVE && ret) { /* * For postcopy, we allow the network to be down for a @@ -3037,6 +3037,8 @@ static MigThrError migration_detect_error(MigrationState *s) /* * For precopy (or postcopy with error outside IO), we fail * with no time. + * + * TODO: update FAILED only until the end of migration in BH. */ migrate_set_state(&s->state, state, MIGRATION_STATUS_FAILED); trace_migration_thread_file_err(); ===8<=== I kept a TODO above, I would hope if you reworked everything to route errors to s->error, then we can move this to the cleanup BH to avoid the race. Do you think that could work? Thanks, -- Peter Xu