On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
> When waiting for the return path (RP) thread to finish, there is
> really nothing wrong in the RP if the destination end of the migration
> stops responding, leaving it stuck.
>
> Stop returning an error at that point and leave it to other parts of
> the code to catch. One such part is the very next routine run by
> migration_completion() which checks 'to_dst_file' for an error and fails
> the migration. Another is the RP thread itself when the recvmsg()
> returns an error.
>
> With this we stop marking RP bad from outside of the thread and can
> reuse await_return_path_close_on_source() in the next patches to wait
> on the thread during a paused migration.
>
> Signed-off-by: Fabiano Rosas <faro...@suse.de>
> ---
>  migration/migration.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 91bba630a8..051067f8c5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2049,7 +2049,6 @@ static int await_return_path_close_on_source(MigrationState *ms)
>           * waiting for the destination.
>           */
>          qemu_file_shutdown(ms->rp_state.from_dst_file);
> -        mark_source_rp_bad(ms);
>      }
>      trace_await_return_path_close_on_source_joining();
>      qemu_thread_join(&ms->rp_state.rp_thread);

The return value of await_return_path_close_on_source() relies on
ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
that it'll start to return success where it used to return failure?

Maybe not a big deal: I see migration_completion() also has another
qemu_file_get_error() later to catch errors, but I don't know how solid
that is.

I think as long as after this patch we can still fail properly on e.g.
network failures for precopy when cap return-path=on, then we should be
good.

--
Peter Xu