On Fri, Aug 11, 2023 at 12:08:33PM -0300, Fabiano Rosas wrote: > Replace the return path retry logic with finishing and restarting the > thread. This fixes a race when resuming the migration that leads to a > segfault. > > Currently when doing postcopy we consider that an IO error on the > return path file could be due to a network intermittency. We then keep > the thread alive but have it do cleanup of the 'from_dst_file' and > wait on the 'postcopy_pause_rp' semaphore. When the user issues a > migrate resume, a new return path is opened and the thread is allowed > to continue. > > There's a race condition in the above mechanism. It is possible for > the new return path file to be setup *before* the cleanup code in the > return path thread has had a chance to run, leading to the *new* file > being closed and the pointer set to NULL. When the thread is released > after the resume, it tries to dereference 'from_dst_file' and crashes: > > Thread 7 "return path" received signal SIGSEGV, Segmentation fault. > [Switching to Thread 0x7fffd1dbf700 (LWP 9611)] > 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at > ../migration/qemu-file.c:154 > 154 return f->last_error; > > (gdb) bt > #0 0x00005555560e4893 in qemu_file_get_error_obj (f=0x0, errp=0x0) at > ../migration/qemu-file.c:154 > #1 0x00005555560e4983 in qemu_file_get_error (f=0x0) at > ../migration/qemu-file.c:206 > #2 0x0000555555b9a1df in source_return_path_thread (opaque=0x555556e06000) > at ../migration/migration.c:1876 > #3 0x000055555602e14f in qemu_thread_start (args=0x55555782e780) at > ../util/qemu-thread-posix.c:541 > #4 0x00007ffff38d76ea in start_thread (arg=0x7fffd1dbf700) at > pthread_create.c:477 > #5 0x00007ffff35efa6f in clone () at > ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 > > Here's the race (important bit is open_return_path happening before > migration_release_dst_files): > > migration | qmp | return path > --------------------------+-----------------------------+--------------------------------- > qmp_migrate_pause() > shutdown(ms->to_dst_file) > f->last_error = -EIO > migrate_detect_error() > postcopy_pause() > set_state(PAUSED) > wait(postcopy_pause_sem) > qmp_migrate(resume) > migrate_fd_connect() > resume = state == PAUSED > open_return_path <-- TOO SOON! > set_state(RECOVER) > post(postcopy_pause_sem) > (incoming closes > to_src_file) > res = > qemu_file_get_error(rp) > > migration_release_dst_files() > > ms->rp_state.from_dst_file = NULL > post(postcopy_pause_rp_sem) > > postcopy_pause_return_path_thread() > > wait(postcopy_pause_rp_sem) > rp = > ms->rp_state.from_dst_file > goto retry > qemu_file_get_error(rp) > SIGSEGV > ------------------------------------------------------------------------------------------- > > We can keep the retry logic without having the thread alive and > waiting. The only piece of data used by it is the 'from_dst_file' and > it is only allowed to proceed after a migrate resume is issued and the > semaphore released at migrate_fd_connect(). > > Move the retry logic to outside the thread by waiting for the thread > to finish before pausing the migration. > > Signed-off-by: Fabiano Rosas <faro...@suse.de>
Reviewed-by: Peter Xu <pet...@redhat.com> -- Peter Xu