Peter Xu <pet...@redhat.com> writes: > v2: > - Patch "migration: Provide explicit error message for file shutdowns" > - Touched up qapi doc [Fabiano] > - Added Bugzilla link to commit which I didn't even notice that I was > fixing a bug.. but rightfully pointed out by Laszlo. > - Moved it to the 1st patch because it fixes a bug, please consider > review and merge it earlier. > > This is a small series that reworks error handling of postcopy return path > threads. > > We used to contain a bunch of error_report(), converting them into > error_setg() properly and deliver any of those errors to migration generic > error reports (via migrate_set_error()). Then these errors can also be > observed in query-migrate after postcopy is paused. > > Dropped the return-path specific error reporting: mark_source_rp_bad(), > because it's a duplication if we can always use migrate_set_error(). > > Please have a look, thanks. > > Peter Xu (7): > migration: Display error in query-migrate irrelevant of status > migration: Let migrate_set_error() take ownership > migration: Introduce migrate_has_error() > migration: Refactor error handling in source return path > migration: Deliver return path file error to migrate state too > qemufile: Always return a verbose error > migration: Provide explicit error message for file shutdowns > > qapi/migration.json | 5 +- > migration/migration.h | 8 +- > migration/ram.h | 5 +- > migration/channel.c | 1 - > migration/migration.c | 168 +++++++++++++++++++++++---------------- > migration/multifd.c | 10 +-- > migration/postcopy-ram.c | 1 - > migration/qemu-file.c | 20 ++++- > migration/ram.c | 42 +++++----- > migration/trace-events | 2 +- > 10 files changed, 149 insertions(+), 113 deletions(-)
Hi Peter, Were you aiming at solving any specific bug with this series? I'm seeing a bug on master (361d5397355) with the /x86_64/migration/postcopy/preempt/recovery/plain test around the areas that this series touches. It happens very rarely and I'm still investigating, but in case you have any thoughts: ==== It seems there's a race condition between postcopy resume and the return path cleanup. It is possible for open_return_path_on_source() to setup the new QEMUFile *before* the cleanup path at source_return_path_thread() has had a chance to run, so we end up calling migration_release_dst_files() on the new file and ms->rp_state.from_dst_file gets set to NULL again, leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL. Here's a trace from when it works (both src and dst): open_return_path_on_source open_return_path_on_source_continue postcopy_pause_incoming postcopy_pause_fast_load qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_fault_thread postcopy_pause_return_path <--- qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. open_return_path_on_source <--- OK postcopy_pause_continued <--- postcopy_pause_return_path_continued postcopy_pause_incoming_continued postcopy_pause_fault_thread_continued postcopy_pause_fast_load_continued versus when it fails: open_return_path_on_source open_return_path_on_source_continue postcopy_pause_incoming postcopy_pause_fast_load qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_fault_thread qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_incoming_continued open_return_path_on_source <--- NOK postcopy_pause_continued postcopy_pause_return_path <--- postcopy_pause_return_path_continued <--- postcopy_pause_incoming qemu-system-x86_64: Detected IO failure for postcopy. Migration paused. postcopy_pause_incoming_continued