Cédric Le Goater <c...@redhat.com> writes: > On 2/8/24 14:07, Fabiano Rosas wrote: >> Cédric Le Goater <c...@redhat.com> writes: >> >>> close_return_path_on_source() retrieves the migration error from the >>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >>> shutdown is required to exit the return-path thread. However, in >>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >>> close_return_path_on_source() and the shutdown is never performed, >>> leaving the source and destination waiting for an event to occur. >>> >>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >>> >>> Suggested-by: Peter Xu <pet...@redhat.com> >>> Signed-off-by: Cédric Le Goater <c...@redhat.com> >>> --- >>> migration/migration.c | 3 +-- >>> 1 file changed, 1 insertion(+), 2 deletions(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index >>> d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d >>> 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -2372,8 +2372,7 @@ static bool >>> close_return_path_on_source(MigrationState *ms) >>> * cause it to unblock if it's stuck waiting for the destination. >>> */ >>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >>> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >>> - qemu_file_get_error(ms->to_dst_file)) { >>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>> } >>> } >> >> Hm, maybe Peter can help defend this, but this assumes that every >> function that takes an 'f' and sets the file error also sets >> migrate_set_error(). I'm not sure we have determined that, have we? > > How could we check all the code path ? I agree it is difficult when > looking at the code :/
It would help if the thing wasn't called 'f' for the most part of the code to begin with. Whenever there's a file error at to_dst_file there's the chance that the rp_state.from_dst_file got stuck. So we cannot ignore the file error. Would it work if we checked it earlier during cleanup as you did previously and then set the migration error?