On Tue, Feb 11, 2025 at 03:04:37PM -0300, Fabiano Rosas wrote: > Peter Xu <pet...@redhat.com> writes: > > > On Tue, Feb 11, 2025 at 12:01:33PM -0300, Fabiano Rosas wrote: > >> It's possible that the migration is cancelled during > >> migration_switchover_start(). In that case, don't set the migration > >> state FAILED in migration_completion(). > >> > >> Fixes: 3dde8fdbad ("migration: Merge precopy/postcopy on switchover start") > >> Signed-off-by: Fabiano Rosas <faro...@suse.de> > > > > I remember I paid some attention on this one when working on the commit, > > where it has: > > > > static bool migration_switchover_prepare(MigrationState *s) > > { > > /* Concurrent cancellation? Quit */ > > if (s->state == MIGRATION_STATUS_CANCELLING) { <================= [1] > > return false; > > } > > ... > > bql_unlock(); > > > > qemu_sem_wait(&s->pause_sem); > > > > bql_lock(); > > /* > > * After BQL released and retaken, the state can be CANCELLING if it > > * happend during sem_wait().. Only change the state if it's still > > * pre-switchover. > > */ > > migrate_set_state(&s->state, MIGRATION_STATUS_PRE_SWITCHOVER, <====== > > [2] > > MIGRATION_STATUS_DEVICE); > > > > return s->state == MIGRATION_STATUS_DEVICE; > > } > > > > So when holding BQL logically it can't change to CANCELLING, it'll check > > first [1] making sure no prior CANCELLING. Then after release and retake > > BQL it'll check again [2] (see the comment above [2], it's done by passing > > in explicit old_state to not change it if it's CANCELLING). > > Right, it doesn't change the state. But the function returns false and > someone else changes to FAILED. That's what both my patch and your > snippet below fix. > > > > > Any hint on how this could be triggered? > > > > OTOH, when looking at this.. I seem to have found a bug indeed (which could > > be another?), where I may have forgot to touch up the old_state in > > migrate_set_state() after switching to always use DEVICE.. > > > > diff --git a/migration/migration.c b/migration/migration.c > > index 74c50cc72c..513e5955cc 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -2793,8 +2793,9 @@ static int postcopy_start(MigrationState *ms, Error > > **errp) > > fail_closefb: > > qemu_fclose(fb); > > fail: > > - migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE, > > - MIGRATION_STATUS_FAILED); > > + if (ms->state != MIGRATION_STATUS_CANCELLING) { > > + migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED); > > + } > > Now that I think about it, we should probably just use the skip at > migrate_set_state() always. Isn't this^ the same as: > > migrate_set_state(&ms->state, MIGRATION_STATUS_DEVICE, > MIGRATION_STATUS_FAILED); > > Better to list the state explicitly, no?
There's one case where it can be in ACTIVE rather than DEVICE, unfortunately: ret = migration_stop_vm(ms, RUN_STATE_FINISH_MIGRATE); if (ret < 0) { error_setg_errno(errp, -ret, "%s: Failed to stop the VM", __func__); goto fail; } > > Or... do we want to incorporate this into migrate_set_state()? > > void migrate_set_state(MigrationStatus *state, MigrationStatus old_state, > MigrationStatus new_state) > { > assert(new_state < MIGRATION_STATUS__MAX); > > if (qatomic_read(state) == CANCELLING && new_state != CANCELLED) { > /* Once it's cancelling, there's no way back, it must finish cancel */ > return; > } > > if (qatomic_cmpxchg(state, old_state, new_state) == old_state) { > trace_migrate_set_state(MigrationStatus_str(new_state)); > migrate_generate_event(new_state); > } > } IMHO we'll need the original migrate_set_state() more or less, e.g. when setting CANCELLING->CANCELLED in migration_[fd_]cleanup(). So maybe it's slightly easier we keep it. Said that, maybe we could have a few helpers for the state transitions, like: migrate_set_state_failure(MigrationState *s) Which can consider CANCELLING. Also, we have a portion of such state transitions not caring about current state, so we could also have some helper for that, like: migrate_set_state_always(MigrationState *s, MigrationStatus status) Or rename old migrate_set_state() into migrate_set_state_atomic(), then make migrate_set_state() to ignore current state. > > > migration_block_activate(NULL); > > migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL); > > bql_unlock(); > > > > I'm not sure whether it's relevant to what you hit, though.. since you're > > looking at this, I'd rely on you help figuring it out before I do.. :) > > > >> --- > >> migration/migration.c | 4 +++- > >> 1 file changed, 3 insertions(+), 1 deletion(-) > >> > >> diff --git a/migration/migration.c b/migration/migration.c > >> index 375de6d460..5dc43bcdc0 100644 > >> --- a/migration/migration.c > >> +++ b/migration/migration.c > >> @@ -2986,7 +2986,9 @@ fail: > >> error_free(local_err); > >> } > >> > >> - migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED); > >> + if (s->state != MIGRATION_STATUS_CANCELLING) { > >> + migrate_set_state(&s->state, s->state, MIGRATION_STATUS_FAILED); > >> + } > >> } > >> > >> /** > >> -- > >> 2.35.3 > >> > -- Peter Xu