Hi, Fabiano, Sorry to be late on responding.
On Tue, Jul 25, 2023 at 03:24:26PM -0300, Fabiano Rosas wrote: > Fabiano Rosas <faro...@suse.de> writes: > > > Peter Xu <pet...@redhat.com> writes: > > > >> v2: > >> - Patch "migration: Provide explicit error message for file shutdowns" > >> - Touched up qapi doc [Fabiano] > >> - Added Bugzilla link to commit which I didn't even notice that I was > >> fixing a bug.. but rightfully pointed out by Laszlo. > >> - Moved it to the 1st patch because it fixes a bug, please consider > >> review and merge it earlier. > >> > >> This is a small series that reworks error handling of postcopy return path > >> threads. > >> > >> We used to contain a bunch of error_report(), converting them into > >> error_setg() properly and deliver any of those errors to migration generic > >> error reports (via migrate_set_error()). Then these errors can also be > >> observed in query-migrate after postcopy is paused. > >> > >> Dropped the return-path specific error reporting: mark_source_rp_bad(), > >> because it's a duplication if we can always use migrate_set_error(). > >> > >> Please have a look, thanks. > >> > >> Peter Xu (7): > >> migration: Display error in query-migrate irrelevant of status > >> migration: Let migrate_set_error() take ownership > >> migration: Introduce migrate_has_error() > >> migration: Refactor error handling in source return path > >> migration: Deliver return path file error to migrate state too > >> qemufile: Always return a verbose error > >> migration: Provide explicit error message for file shutdowns > >> > >> qapi/migration.json | 5 +- > >> migration/migration.h | 8 +- > >> migration/ram.h | 5 +- > >> migration/channel.c | 1 - > >> migration/migration.c | 168 +++++++++++++++++++++++---------------- > >> migration/multifd.c | 10 +-- > >> migration/postcopy-ram.c | 1 - > >> migration/qemu-file.c | 20 ++++- > >> migration/ram.c | 42 +++++----- > >> migration/trace-events | 2 +- > >> 10 files changed, 149 insertions(+), 113 deletions(-) > > > > Hi Peter, > > > > Were you aiming at solving any specific bug with this series? Nop. I simply noticed that the error in return path cannot be proxied to migration object and thought maybe we should do that. Thanks for looing into this problem. > > I'm seeing > > a bug on master (361d5397355) with the > > /x86_64/migration/postcopy/preempt/recovery/plain test around the areas > > that this series touches. > > > > It happens very rarely and I'm still investigating, but in case you have > > any thoughts: > > > > ==== > > It seems there's a race condition between postcopy resume and the return > > path cleanup. > > > > It is possible for open_return_path_on_source() to setup the new > > QEMUFile *before* the cleanup path at source_return_path_thread() has > > had a chance to run, so we end up calling migration_release_dst_files() > > on the new file and ms->rp_state.from_dst_file gets set to NULL again, > > leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL. > > I did some more digging and this is indeed what happens. When we pause > on the incoming side, the to_src_file is closed and the source return > path sees an error (EBADFD) which leads to the cleanup (from_dst_file = > NULL). This happens independently and without any synchronization with a > potential concurrent resume operation. > > Is there a reason for not closing the return path thread and starting a > new one for resume? Not anything special. When I initially worked on that (quite a few years ago) I thought it would be the simplest we keep hold as much things as possible, including the threads. But maybe it's not too hard either to just reinitiate the thread when resume indeed. > The from_dst_file is the only thing being changed > anyway. It would allow us to remove the retry logic along with the > problematic cleanup path and not need another synchronization point > between qmp_migrate() and the return path. I'm fine if you want to remove the return path thread when a pause happens, as long as we can cleanly do that; if you already drafted something and looks all good from your side, happy to read it. We may somewhere need another call to await_return_path_close_on_source() when pause from migration thread. The other way is the main thread can stupidly wait for the two files to be released, namely, from_dst_file and postcopy_qemufile_src. > > Here's the race (important bit is open_return_path happening before > migration_release_dst_files): > > migration | qmp | return path > --------------------------+-----------------------------+--------------------------------- > qmp_migrate_pause() > shutdown(ms->to_dst_file) > f->last_error = -EIO > migrate_detect_error() > postcopy_pause() > set_state(PAUSED) > wait(postcopy_pause_sem) > qmp_migrate(resume) > migrate_fd_connect() > resume = state == PAUSED > open_return_path <-- TOO SOON! > set_state(RECOVER) > post(postcopy_pause_sem) > (incoming closes > to_src_file) > res = > qemu_file_get_error(rp) > > migration_release_dst_files() > > ms->rp_state.from_dst_file = NULL > post(postcopy_pause_rp_sem) > > postcopy_pause_return_path_thread() > > wait(postcopy_pause_rp_sem) > rp = > ms->rp_state.from_dst_file > goto retry > > qemu_file_get_error(rp) > SIGSEGV > ------------------------------------------------------------------------------------------- > Thanks, -- Peter Xu