On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
> When waiting for the return path (RP) thread to finish, there is
> really nothing wrong in the RP if the destination end of the migration
> stops responding, leaving it stuck.
> 
> Stop returning an error at that point and leave it to other parts of
> the code to catch. One such part is the very next routine run by
> migration_completion() which checks 'to_dst_file' for an error and fails
> the migration. Another is the RP thread itself when the recvmsg()
> returns an error.
> 
> With this we stop marking RP bad from outside of the thread and can
> reuse await_return_path_close_on_source() in the next patches to wait
> on the thread during a paused migration.
> 
> Signed-off-by: Fabiano Rosas <faro...@suse.de>
> ---
>  migration/migration.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 91bba630a8..051067f8c5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2049,7 +2049,6 @@ static int 
> await_return_path_close_on_source(MigrationState *ms)
>           * waiting for the destination.
>           */
>          qemu_file_shutdown(ms->rp_state.from_dst_file);
> -        mark_source_rp_bad(ms);
>      }
>      trace_await_return_path_close_on_source_joining();
>      qemu_thread_join(&ms->rp_state.rp_thread);

The retval of await_return_path_close_on_source() relies on
ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
that it'll start to return succeed where it used to return failure?

Maybe not a big deal: I see migration_completion() also has another
qemu_file_get_error() later to catch errors, but I don't know how solid
that is.

I think as long as after this patch we can fail properly on e.g. network
failures for precopy when cap return-path=on, then we should be good.

-- 
Peter Xu


Reply via email to