The /x86_64/migration/postcopy/preempt/recovery/plain test sometimes fails due to a segmentation fault on the migration return path. There is a race between the retry logic of the return path and the migration resume command.
The issue happens when the retry logic tries to clean up the current return path file, but ends up cleaning up the new one and then trying to use it right after. Tracing shows it clearly:

open_return_path_on_source            <-- at migration start
open_return_path_on_source_continue   <-- rp thread created
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (incoming)
postcopy_pause_fault_thread
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. (source)
postcopy_pause_incoming_continued
open_return_path_on_source            <-- NOK, too soon
postcopy_pause_continued
postcopy_pause_return_path            <-- too late, already operating on the new from_dst_file
postcopy_pause_return_path_continued  <-- will continue and crash
postcopy_pause_incoming
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued

We could solve this by adding some form of synchronization to ensure that we always do the cleanup before setting up the new file, but I find it more straightforward to move the retry logic outside of the thread, by letting the thread finish and starting a new one when resuming the migration.

More details in the commit messages.

CI run: https://gitlab.com/farosas/qemu/-/pipelines/947875609

Fabiano Rosas (3):
  migration: Stop marking RP bad after shutdown
  migration: Simplify calling of await_return_path_close_on_source
  migration: Replace the return path retry logic

 migration/migration.c  | 94 ++++++++++++++----------------------------
 migration/migration.h  |  1 -
 migration/trace-events |  3 +-
 3 files changed, 33 insertions(+), 65 deletions(-)

-- 
2.35.3
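
For readers who want the idea in isolation, here is a minimal sketch of the approach described above: the return path thread does not retry or clean up anything itself, it simply exits on error, and the resume path joins it and releases the old file before installing a new file and starting a new thread. This is not the code from the series. RpFile, MigState, open_return_path, close_return_path and resume_return_path are hypothetical names made up for the illustration (only from_dst_file is borrowed from the trace), and the IO failure is simulated with a flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the return path file; not a QEMU type. */
typedef struct {
    int id;
    bool broken;              /* simulates an IO failure on this file */
} RpFile;

/* Hypothetical, heavily reduced migration state. */
typedef struct {
    RpFile *from_dst_file;    /* field name borrowed from the trace */
    pthread_t rp_thread;
    bool rp_thread_created;
    int files_closed;         /* bookkeeping for the example only */
} MigState;

/*
 * Return path thread: no retry loop and no cleanup in here. On error it
 * reports a status and exits; whoever joins it owns the cleanup.
 */
static void *rp_thread_fn(void *opaque)
{
    MigState *s = opaque;
    intptr_t status = s->from_dst_file->broken ? -1 : 0;

    return (void *)status;
}

/* Install a new file and start a thread for it. */
static int open_return_path(MigState *s, int id, bool broken)
{
    RpFile *f = malloc(sizeof(*f));

    if (!f) {
        return -1;
    }
    f->id = id;
    f->broken = broken;
    s->from_dst_file = f;

    if (pthread_create(&s->rp_thread, NULL, rp_thread_fn, s) != 0) {
        free(f);
        s->from_dst_file = NULL;
        return -1;
    }
    s->rp_thread_created = true;
    return 0;
}

/* Join the thread first, then release its file. Returns the thread status. */
static int close_return_path(MigState *s)
{
    void *ret = NULL;

    if (!s->rp_thread_created) {
        return 0;
    }
    pthread_join(s->rp_thread, &ret);
    s->rp_thread_created = false;

    free(s->from_dst_file);
    s->from_dst_file = NULL;
    s->files_closed++;
    return (int)(intptr_t)ret;
}

/*
 * Resume: the old thread is gone and the old file is released before the
 * new file is installed, so nothing can clean up the new file by mistake.
 */
static int resume_return_path(MigState *s, int new_id)
{
    (void)close_return_path(s);   /* the old thread is expected to have failed */
    return open_return_path(s, new_id, false);
}
```

A caller would do open_return_path(&s, 1, true) to simulate a failing first connection, then resume_return_path(&s, 2), and finally close_return_path(&s) at the end of the migration. Depending on the toolchain this may need to be built with -pthread. The point of the ordering is that the cleanup and the setup of the new file happen in the same thread, one after the other, so no extra synchronization is needed between them.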