Cédric Le Goater <c...@redhat.com> writes: > In case of error, close_return_path_on_source() can perform a shutdown > to exit the return-path thread. However, in migrate_fd_cleanup(), > 'to_dst_file' is closed before calling close_return_path_on_source() > and the shutdown fails, leaving the source and destination waiting for > an event to occur.
Hi, Cédric Are you sure this is not caused by patch 13? That 'if (ms->to_dst_file' was there to avoid this sort of thing happening. Is there some reordering possibility that I'm not spotting in the code below? I think the data dependency on to_dst_file shouldn't allow it. migrate_fd_cleanup: qemu_mutex_lock(&s->qemu_file_lock); tmp = s->to_dst_file; s->to_dst_file = NULL; qemu_mutex_unlock(&s->qemu_file_lock); ... qemu_fclose(tmp); close_return_path_on_source: WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { if (ms->to_dst_file && ms->rp_state.from_dst_file && qemu_file_get_error(ms->to_dst_file)) { qemu_file_shutdown(ms->rp_state.from_dst_file); } } I'm thinking maybe the culprit is the close_return_path_on_source() at migration_completion(). It might be possible for it to race with the migrate_fd_cleanup_bh from migration_iteration_finish(). If that's the case, then I think that one possible fix would be to hold the BQL at migration_completion() so the BH doesn't get dispatched until we properly close the return path. > > Close the file after calling close_return_path_on_source() so that the > shutdown succeeds and the return-path thread exits. > > Signed-off-by: Cédric Le Goater <c...@redhat.com> > --- > > This is an RFC because the correct fix implies reworking the QEMUFile > construct, built on top of the QEMU I/O channel. > > migration/migration.c | 13 ++++++------- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index > 5f55af3d7624750ca416c4177781241b3e291e5d..de329f2c553288935d824748286e79e535929b8b > 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -1313,6 +1313,8 @@ void migrate_set_state(int *state, int old_state, int > new_state) > > static void migrate_fd_cleanup(MigrationState *s) > { > + QEMUFile *tmp = NULL; > + > g_free(s->hostname); > s->hostname = NULL; > json_writer_free(s->vmdesc); > @@ -1321,8 +1323,6 @@ static void migrate_fd_cleanup(MigrationState *s) > qemu_savevm_state_cleanup(); > > if (s->to_dst_file) { > - QEMUFile *tmp; > - > trace_migrate_fd_cleanup(); > bql_unlock(); > if (s->migration_thread_running) { > @@ -1341,15 +1341,14 @@ static void migrate_fd_cleanup(MigrationState *s) > * critical section won't block for long. > */ > migration_ioc_unregister_yank_from_file(tmp); > - qemu_fclose(tmp); > } > > - /* > - * We already cleaned up to_dst_file, so errors from the return > - * path might be due to that, ignore them. > - */ > close_return_path_on_source(s); > > + if (tmp) { > + qemu_fclose(tmp); > + } > + > assert(!migration_is_active(s)); > > if (s->state == MIGRATION_STATUS_CANCELLING) {