Fabiano Rosas <faro...@suse.de> writes: > Peter Xu <pet...@redhat.com> writes: > >> v2: >> - Patch "migration: Provide explicit error message for file shutdowns" >> - Touched up qapi doc [Fabiano] >> - Added Bugzilla link to commit which I didn't even notice that I was >> fixing a bug.. but rightfully pointed out by Laszlo. >> - Moved it to the 1st patch because it fixes a bug, please consider >> review and merge it earlier. >> >> This is a small series that reworks error handling of postcopy return path >> threads. >> >> We used to contain a bunch of error_report(), converting them into >> error_setg() properly and deliver any of those errors to migration generic >> error reports (via migrate_set_error()). Then these errors can also be >> observed in query-migrate after postcopy is paused. >> >> Dropped the return-path specific error reporting: mark_source_rp_bad(), >> because it's a duplication if we can always use migrate_set_error(). >> >> Please have a look, thanks. >> >> Peter Xu (7): >> migration: Display error in query-migrate irrelevant of status >> migration: Let migrate_set_error() take ownership >> migration: Introduce migrate_has_error() >> migration: Refactor error handling in source return path >> migration: Deliver return path file error to migrate state too >> qemufile: Always return a verbose error >> migration: Provide explicit error message for file shutdowns >> >> qapi/migration.json | 5 +- >> migration/migration.h | 8 +- >> migration/ram.h | 5 +- >> migration/channel.c | 1 - >> migration/migration.c | 168 +++++++++++++++++++++++---------------- >> migration/multifd.c | 10 +-- >> migration/postcopy-ram.c | 1 - >> migration/qemu-file.c | 20 ++++- >> migration/ram.c | 42 +++++----- >> migration/trace-events | 2 +- >> 10 files changed, 149 insertions(+), 113 deletions(-) > > Hi Peter, > > Were you aiming at solving any specific bug with this series? I'm seeing > a bug on master (361d5397355) with the > /x86_64/migration/postcopy/preempt/recovery/plain test around the areas > that this series touches. > > It happens very rarely and I'm still investigating, but in case you have > any thoughts: > > ==== > It seems there's a race condition between postcopy resume and the return > path cleanup. > > It is possible for open_return_path_on_source() to setup the new > QEMUFile *before* the cleanup path at source_return_path_thread() has > had a chance to run, so we end up calling migration_release_dst_files() > on the new file and ms->rp_state.from_dst_file gets set to NULL again, > leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.
I did some more digging and this is indeed what happens. When we pause on the incoming side, the to_src_file is closed and the source return path sees an error (EBADFD) which leads to the cleanup (from_dst_file = NULL). This happens independently and without any synchronization with a potential concurrent resume operation. Is there a reason for not closing the return path thread and starting a new one for resume? The from_dst_file is the only thing being changed anyway. It would allow us to remove the retry logic along with the problematic cleanup path and not need another synchronization point between qmp_migrate() and the return path. Here's the race (important bit is open_return_path happening before migration_release_dst_files): migration | qmp | return path --------------------------+-----------------------------+--------------------------------- qmp_migrate_pause() shutdown(ms->to_dst_file) f->last_error = -EIO migrate_detect_error() postcopy_pause() set_state(PAUSED) wait(postcopy_pause_sem) qmp_migrate(resume) migrate_fd_connect() resume = state == PAUSED open_return_path <-- TOO SOON! set_state(RECOVER) post(postcopy_pause_sem) (incoming closes to_src_file) res = qemu_file_get_error(rp) migration_release_dst_files() ms->rp_state.from_dst_file = NULL post(postcopy_pause_rp_sem) postcopy_pause_return_path_thread() wait(postcopy_pause_rp_sem) rp = ms->rp_state.from_dst_file goto retry qemu_file_get_error(rp) SIGSEGV -------------------------------------------------------------------------------------------