On Fri, Aug 11, 2023 at 12:08:33PM -0300, Fabiano Rosas wrote:
> Replace the return path retry logic with finishing and restarting the
> thread. This fixes a race when resuming the migration that leads to a
> segfault.
> 
> Currently when doing postcopy we consider that an IO error on the
> return path file could be due to a network intermittency. We then keep
> the thread alive but have it do cleanup of the 'from_dst_file' and
> wait on the 'postcopy_pause_rp' semaphore. When the user issues a
> migrate resume, a new return path is opened and the thread is allowed
> to continue.
> 
> There's a race condition in the above mechanism. It is possible for
> the new return path file to be setup *before* the cleanup code in the
> return path thread has had a chance to run, leading to the *new* file
> being closed and the pointer set to NULL. When the thread is released
> after the resume, it tries to dereference 'from_dst_file' and crashes:
> 
> Thread 7 "return path" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7fffd1dbf700 (LWP 9611)]
> 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at 
> ../migration/qemu-file.c:154
> 154         return f->last_error;
> 
> (gdb) bt
>  #0  0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at 
> ../migration/qemu-file.c:154
>  #1  0x00005555560e4983 in qemu_file_get_error (f=0x0) at 
> ../migration/qemu-file.c:206
>  #2  0x0000555555b9a1df in source_return_path_thread (opaque=0x555556e06000) 
> at ../migration/migration.c:1876
>  #3  0x000055555602e14f in qemu_thread_start (args=0x55555782e780) at 
> ../util/qemu-thread-posix.c:541
>  #4  0x00007ffff38d76ea in start_thread (arg=0x7fffd1dbf700) at 
> pthread_create.c:477
>  #5  0x00007ffff35efa6f in clone () at 
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> 
> Here's the race (important bit is open_return_path happening before
> migration_release_dst_files):
> 
> migration                 | qmp                         | return path
> --------------------------+-----------------------------+---------------------------------
>                           qmp_migrate_pause()
>                            shutdown(ms->to_dst_file)
>                             f->last_error = -EIO
> migrate_detect_error()
>  postcopy_pause()
>   set_state(PAUSED)
>   wait(postcopy_pause_sem)
>                           qmp_migrate(resume)
>                           migrate_fd_connect()
>                            resume = state == PAUSED
>                            open_return_path <-- TOO SOON!
>                            set_state(RECOVER)
>                            post(postcopy_pause_sem)
>                                                       (incoming closes 
> to_src_file)
>                                                       res = 
> qemu_file_get_error(rp)
>                                                       
> migration_release_dst_files()
>                                                       
> ms->rp_state.from_dst_file = NULL
>   post(postcopy_pause_rp_sem)
>                                                       
> postcopy_pause_return_path_thread()
>                                                         
> wait(postcopy_pause_rp_sem)
>                                                       rp = 
> ms->rp_state.from_dst_file
>                                                       goto retry
>                                                       qemu_file_get_error(rp)
>                                                       SIGSEGV
> -------------------------------------------------------------------------------------------
> 
> We can keep the retry logic without having the thread alive and
> waiting. The only piece of data used by it is the 'from_dst_file' and
> it is only allowed to proceed after a migrate resume is issued and the
> semaphore released at migrate_fd_connect().
> 
> Move the retry logic to outside the thread by waiting for the thread
> to finish before pausing the migration.
> 
> Signed-off-by: Fabiano Rosas <faro...@suse.de>

Reviewed-by: Peter Xu <pet...@redhat.com>

-- 
Peter Xu


Reply via email to