Fabiano Rosas <faro...@suse.de> writes: > Peter Xu <pet...@redhat.com> writes: > >> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote: >>> When waiting for the return path (RP) thread to finish, there is >>> really nothing wrong in the RP if the destination end of the migration >>> stops responding, leaving it stuck. >>> >>> Stop returning an error at that point and leave it to other parts of >>> the code to catch. One such part is the very next routine run by >>> migration_completion() which checks 'to_dst_file' for an error and fails >>> the migration. Another is the RP thread itself when the recvmsg() >>> returns an error. >>> >>> With this we stop marking RP bad from outside of the thread and can >>> reuse await_return_path_close_on_source() in the next patches to wait >>> on the thread during a paused migration. >>> >>> Signed-off-by: Fabiano Rosas <faro...@suse.de> >>> --- >>> migration/migration.c | 1 - >>> 1 file changed, 1 deletion(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index 91bba630a8..051067f8c5 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -2049,7 +2049,6 @@ static int >>> await_return_path_close_on_source(MigrationState *ms) >>> * waiting for the destination. >>> */ >>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>> - mark_source_rp_bad(ms); >>> } >>> trace_await_return_path_close_on_source_joining(); >>> qemu_thread_join(&ms->rp_state.rp_thread); >> >> The retval of await_return_path_close_on_source() relies on >> ms->rp_state.error. If mark_source_rp_bad() is dropped, is it possible >> that it'll start to return succeed where it used to return failure? > > Yep, as described in the commit message, I think it's ok to do that. The > critical part is doing the shutdown. Other instances of > mark_source_rp_bad() continue existing and we continue returning > rp_state.error. > >> >> Maybe not a big deal: I see migration_completion() also has another >> qemu_file_get_error() later to catch errors, but I don't know how solid >> that is. > > That is the instance I refer to in the commit message. At > await_return_path_close_on_source() we only call mark_source_rp_bad() if > to_dst_file has an error. That will be caught by this > qemu_file_get_error() anyway.
Actually, I can do better, I can merge this shutdown() into migration_completion(). Then this dependency becomes explicit. Since you suggested moving await_return_path_close_on_source() into postcopy_pause(), it doesn't make sense to check to_dst_file anymore, because when pausing we clear that file.