Hi,

On 15/01/2020 18:48, Greg Kurz wrote:
> Migration can potentially race with CAS reboot. If the migration thread
> completes migration after CAS has set spapr->cas_reboot but before the
> mainloop could pick up the reset request and reset the machine, the
> guest is migrated unrebooted and the destination doesn't reboot it
> either because it isn't aware a CAS reboot was needed (eg, because a
> device was added before CAS). This likely result in a broken or hung
> guest.
>
> Even if it is small, the window between CAS and CAS reboot is enough to
> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> subsection for that and always send it when a CAS reboot is pending.
> This may cause migration to older QEMUs to fail but it is still better
> than end up with a broken guest.
>
> The destination cannot honour the CAS reboot request from a post load
> handler because this must be done after the guest is fully restored.
> It is thus done from a VM change state handler.
>
> Reported-by: Lukáš Doktor <ldok...@redhat.com>
> Signed-off-by: Greg Kurz <gr...@kaod.org>
> ---
>
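For reference, the window described in the quoted commit message can be modelled roughly as below. This is only a toy sketch, not QEMU code: struct machine_model, migrate_model(), handle_pending_reboot() and guest_rebooted_after_race() are hypothetical names made up for illustration, and only the cas_reboot flag corresponds to something named in the patch (spapr->cas_reboot). It assumes the migration completes after CAS has set the flag but before the main loop has reset the machine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified machine state (not a QEMU structure). */
struct machine_model {
    bool cas_reboot;   /* CAS decided that a reboot is required */
    bool rebooted;     /* the pending reboot has been carried out */
};

/*
 * Model of what reaches the destination. If send_cas_reboot is false, the
 * flag is not part of the migration stream and is lost on the way.
 */
static struct machine_model migrate_model(const struct machine_model *src,
                                          bool send_cas_reboot)
{
    struct machine_model dst = { .cas_reboot = false,
                                 .rebooted = src->rebooted };

    if (send_cas_reboot) {
        dst.cas_reboot = src->cas_reboot;
    }
    return dst;
}

/* Model of the destination acting on a pending reboot once the VM runs. */
static void handle_pending_reboot(struct machine_model *m)
{
    assert(m != NULL);
    if (m->cas_reboot) {
        m->rebooted = true;
        m->cas_reboot = false;
    }
}

/* Returns true if the guest ends up rebooted on the destination. */
static bool guest_rebooted_after_race(bool send_cas_reboot)
{
    /* CAS has run, but the main loop has not reset the machine yet. */
    struct machine_model src = { .cas_reboot = true, .rebooted = false };

    /* The migration thread wins the race and completes here. */
    struct machine_model dst = migrate_model(&src, send_cas_reboot);

    handle_pending_reboot(&dst);
    return dst.rebooted;
}
```

In this model, guest_rebooted_after_race(false) corresponds to the current behaviour (the flag is not migrated and the guest is never rebooted), while guest_rebooted_after_race(true) corresponds to sending the flag in a subsection as the patch proposes.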
I'm wondering if the problem could be related to the fact that
main_loop_should_exit() could release qemu_global_mutex in
pause_all_vcpus() in the reset case?

static bool main_loop_should_exit(void)
{
...
    request = qemu_reset_requested();
    if (request) {
        pause_all_vcpus();
        qemu_system_reset(request);
        resume_all_vcpus();
        if (!runstate_check(RUN_STATE_RUNNING) &&
                !runstate_check(RUN_STATE_INMIGRATE)) {
            runstate_set(RUN_STATE_PRELAUNCH);
        }
    }
...

I already sent a patch for this kind of problem (in Juan's current pull
request):

"runstate: ignore finishmigrate -> prelaunch transition"

but I don't know if it could fix this one.

Thanks,
Laurent