Peter Xu <pet...@redhat.com> writes:

> v2:
> - Patch "migration: Provide explicit error message for file shutdowns"
>   - Touched up qapi doc [Fabiano]
>   - Added Bugzilla link to commit which I didn't even notice that I was
>     fixing a bug.. but rightfully pointed out by Laszlo.
>   - Moved it to the 1st patch because it fixes a bug, please consider
>     review and merge it earlier.
>
> This is a small series that reworks error handling of postcopy return path
> threads.
>
> We used to contain a bunch of error_report(), converting them into
> error_setg() properly and deliver any of those errors to migration generic
> error reports (via migrate_set_error()).  Then these errors can also be
> observed in query-migrate after postcopy is paused.
>
> Dropped the return-path specific error reporting: mark_source_rp_bad(),
> because it's a duplication if we can always use migrate_set_error().
>
> Please have a look, thanks.
>
> Peter Xu (7):
>   migration: Display error in query-migrate irrelevant of status
>   migration: Let migrate_set_error() take ownership
>   migration: Introduce migrate_has_error()
>   migration: Refactor error handling in source return path
>   migration: Deliver return path file error to migrate state too
>   qemufile: Always return a verbose error
>   migration: Provide explicit error message for file shutdowns
>
>  qapi/migration.json      |   5 +-
>  migration/migration.h    |   8 +-
>  migration/ram.h          |   5 +-
>  migration/channel.c      |   1 -
>  migration/migration.c    | 168 +++++++++++++++++++++++----------------
>  migration/multifd.c      |  10 +--
>  migration/postcopy-ram.c |   1 -
>  migration/qemu-file.c    |  20 ++++-
>  migration/ram.c          |  42 +++++-----
>  migration/trace-events   |   2 +-
>  10 files changed, 149 insertions(+), 113 deletions(-)

Hi Peter,

Were you aiming at solving any specific bug with this series? I'm seeing
a bug on master (361d5397355) with the
/x86_64/migration/postcopy/preempt/recovery/plain test around the areas
that this series touches.

It happens very rarely and I'm still investigating, but in case you have
any thoughts:

====
It seems there's a race condition between postcopy resume and the return
path cleanup.

It is possible for open_return_path_on_source() to setup the new
QEMUFile *before* the cleanup path at source_return_path_thread() has
had a chance to run, so we end up calling migration_release_dst_files()
on the new file and ms->rp_state.from_dst_file gets set to NULL again,
leading to a SIGSEGV at qemu_file_get_error(rp) due to rp being NULL.

Here's a trace from when it works (both src and dst):

open_return_path_on_source
open_return_path_on_source_continue
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_fault_thread
postcopy_pause_return_path   <---
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
open_return_path_on_source   <--- OK
postcopy_pause_continued     <---
postcopy_pause_return_path_continued
postcopy_pause_incoming_continued
postcopy_pause_fault_thread_continued
postcopy_pause_fast_load_continued

versus when it fails:

open_return_path_on_source
open_return_path_on_source_continue
postcopy_pause_incoming
postcopy_pause_fast_load
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_fault_thread
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued
open_return_path_on_source   <--- NOK
postcopy_pause_continued
postcopy_pause_return_path   <---
postcopy_pause_return_path_continued <---
postcopy_pause_incoming
qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
postcopy_pause_incoming_continued

Reply via email to