On Fri, Jan 10, 2025 at 10:09:38AM -0300, Fabiano Rosas wrote: > Shivam Kumar <shivam.kum...@nutanix.com> writes: > > > Even if a live migration fails due to some reason, migration status > > should not be set to MIGRATION_STATUS_FAILED until migrate fd cleanup > > is done, else the client can trigger another instance of migration > > before the cleanup is complete (as it would assume no migration is > > active) or reset migration capabilities affecting old migration's > > cleanup. Hence, set the status to 'failing' when a migration failure > > happens and once the cleanup is complete, set the migration status to > > MIGRATION_STATUS_FAILED. > > > > Signed-off-by: Shivam Kumar <shivam.kum...@nutanix.com> > > --- > > migration/migration.c | 49 +++++++++++++++++++++---------------------- > > migration/migration.h | 9 ++++++++ > > migration/multifd.c | 6 ++---- > > migration/savevm.c | 7 +++---- > > 4 files changed, 38 insertions(+), 33 deletions(-) > > > > diff --git a/migration/migration.c b/migration/migration.c > > index df61ca4e93..f084f54f6b 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -1143,8 +1143,9 @@ static bool migration_is_active(void) > > migration_is_running() is the one that gates qmp_migrate() and > qmp_migrate_set_capabilities(). > > > { > > MigrationState *s = current_migration; > > > > - return (s->state == MIGRATION_STATUS_ACTIVE || > > - s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE); > > + return ((s->state == MIGRATION_STATUS_ACTIVE || > > + s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) && > > + !qatomic_read(&s->failing)); > > } > > > > static bool migrate_show_downtime(MigrationState *s) > > @@ -1439,6 +1440,11 @@ static void migrate_fd_cleanup(MigrationState *s) > > MIGRATION_STATUS_CANCELLED); > > } > > > > + if (qatomic_xchg(&s->failing, 0)) { > > + migrate_set_state(&s->state, s->state, > > + MIGRATION_STATUS_FAILED); > > + } > > I hope you've verified that sure every place that used to set FAILED > will also reach migrate_fd_cleanup() eventually. > > Also, we probably still need the FAILING state. Otherwise, this will > trip code that expects a state change on failure. Anything that does: > > if (state != MIGRATION_STATUS_FOO) { > ... > } > > So I think the change overall should be > > -migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED); > +migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILING); > > void migrate_set_state(MigrationStatus *state, MigrationStatus old_state, > MigrationStatus new_state) > { > assert(new_state < MIGRATION_STATUS__MAX); > if (qatomic_cmpxchg(state, old_state, new_state) == old_state) { > trace_migrate_set_state(MigrationStatus_str(new_state)); > > + if (new_state == MIGRATION_STATUS_FAILING) { > + qatomic_set(&s->failing, 1); > + } > migrate_generate_event(new_state); > } > } > > And we should proably do the same for CANCELLING actually, but there the > (preexisting) issue is actual concurrency, while here it's just > inconsistency in the state.
Yes something like FAILING sounds reasonable. Though since we have s->error, I wonder whether that's a better place to represent a migration as "failing" in one place, because otherwise we need to set two places (both FAILING state, and the s->error) - whenever something fails, we'd better always update s->error so as to remember what failed, then reported via query-migrate. >From that POV, s->failing is probably never gonna be needed (due to s->error being present anyway)? So far, such Error* looks like the best single point to say that the migration is failing - it also enforces the Error to be provided whoever wants to set it to failing state. -- Peter Xu