On 15/01/2020 19:10, Laurent Vivier wrote:
> Hi,
>
> On 15/01/2020 18:48, Greg Kurz wrote:
>> Migration can potentially race with CAS reboot. If the migration thread
>> completes migration after CAS has set spapr->cas_reboot but before the
>> mainloop could pick up the reset request and reset the machine, the
>> guest is migrated unrebooted and the destination doesn't reboot it
>> either because it isn't aware a CAS reboot was needed (e.g., because a
>> device was added before CAS). This likely results in a broken or hung
>> guest.
>>
>> Even if it is small, the window between CAS and CAS reboot is enough to
>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>> subsection for that and always send it when a CAS reboot is pending.
>> This may cause migration to older QEMUs to fail but it is still better
>> than ending up with a broken guest.
>>
>> The destination cannot honour the CAS reboot request from a post load
>> handler because this must be done after the guest is fully restored.
>> It is thus done from a VM change state handler.
>>
>> Reported-by: Lukáš Doktor <ldok...@redhat.com>
>> Signed-off-by: Greg Kurz <gr...@kaod.org>
>> ---
>>
>
> I'm wondering if the problem could be related to the fact that
> main_loop_should_exit() could release qemu_global_mutex in
> pause_all_vcpus() in the reset case?
>
> 1602 static bool main_loop_should_exit(void)
> 1603 {
> ...
> 1633     request = qemu_reset_requested();
> 1634     if (request) {
> 1635         pause_all_vcpus();
> 1636         qemu_system_reset(request);
> 1637         resume_all_vcpus();
> 1638         if (!runstate_check(RUN_STATE_RUNNING) &&
> 1639             !runstate_check(RUN_STATE_INMIGRATE)) {
> 1640             runstate_set(RUN_STATE_PRELAUNCH);
> 1641         }
> 1642     }
> ...
>
> I already sent a patch for this kind of problem (in Juan's current pull
> request):
>
> "runstate: ignore finishmigrate -> prelaunch transition"
>
> but I don't know if it could fix this one.
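The fix described in the quoted patch can be modelled in a few lines of plain C. This is a minimal sketch, not QEMU code: `MachineModel`, `MigrationStream`, `cas_reboot_needed()`, `save_machine()`, `load_machine()`, `vm_change_state()` and `migrate_during_cas_window()` are hypothetical names. It assumes the optional subsection is sent only while a CAS reboot is pending, that the post-load step merely records the flag, and that the reboot is requested once the restored guest starts running. It does not model the failure on older destinations that do not know the subsection.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: these names are illustrative, not QEMU's real
 * types or functions. */
typedef struct {
    bool cas_reboot;      /* CAS decided the guest must reboot */
    bool reset_requested; /* stand-in for a queued reset request */
} MachineModel;

/* What goes over the wire: the optional subsection is either absent or
 * carries the cas_reboot flag. */
typedef struct {
    bool has_cas_subsection;
    bool cas_reboot;
} MigrationStream;

/* Plays the role of a subsection "needed" hook: only send it while a CAS
 * reboot is pending, so other migrations are unaffected. */
static bool cas_reboot_needed(const MachineModel *m)
{
    return m->cas_reboot;
}

static MigrationStream save_machine(const MachineModel *src)
{
    MigrationStream s = { false, false };
    if (cas_reboot_needed(src)) {
        s.has_cas_subsection = true;
        s.cas_reboot = src->cas_reboot;
    }
    return s;
}

/* Post-load: only record the flag. The reboot is not triggered here
 * because the rest of the guest state may not be restored yet. */
static void load_machine(MachineModel *dst, const MigrationStream *s)
{
    /* save_machine() never sends the subsection with the flag clear. */
    assert(!s->has_cas_subsection || s->cas_reboot);
    dst->cas_reboot = s->has_cas_subsection && s->cas_reboot;
    dst->reset_requested = false;
}

/* Run-state change handler: once the restored guest starts running,
 * honour a pending CAS reboot by queuing a reset. */
static void vm_change_state(MachineModel *dst, bool running)
{
    if (running && dst->cas_reboot) {
        dst->reset_requested = true;
    }
}

/* Simulate the race: CAS has set cas_reboot and queued a reset on the
 * source, but migration completes before the main loop processes it.
 * Returns whether the destination ends up with a reset queued. */
static bool migrate_during_cas_window(void)
{
    MachineModel src = { .cas_reboot = true, .reset_requested = true };
    MachineModel dst = { false, false };
    MigrationStream s = save_machine(&src);

    load_machine(&dst, &s);
    vm_change_state(&dst, true);
    return dst.reset_requested;
}
```

In this model `migrate_during_cas_window()` returns true: the destination queues the reset the source never got to perform. With `cas_reboot` clear on the source, no subsection is sent and the destination queues nothing, so only migrations inside the CAS window carry the extra subsection.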
I think it would be interesting to see the state transitions on the source and the destination when the problem occurs (with something like "-trace runstate_set").

Thanks,
Laurent