On 1/21/20 4:41 AM, David Gibson wrote:
> On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
>> On 1/15/20 6:48 PM, Greg Kurz wrote:
>>> Migration can potentially race with CAS reboot. If the migration thread
>>> completes migration after CAS has set spapr->cas_reboot but before the
>>> mainloop could pick up the reset request and reset the machine, the
>>> guest is migrated unrebooted and the destination doesn't reboot it
>>> either because it isn't aware a CAS reboot was needed (eg, because a
>>> device was added before CAS). This likely result in a broken or hung
>>> guest.
>>>
>>> Even if it is small, the window between CAS and CAS reboot is enough to
>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
>>> subsection for that and always send it when a CAS reboot is pending.
>>> This may cause migration to older QEMUs to fail but it is still better
>>> than end up with a broken guest.
>>>
>>> The destination cannot honour the CAS reboot request from a post load
>>> handler because this must be done after the guest is fully restored.
>>> It is thus done from a VM change state handler.
>>>
>>> Reported-by: Lukáš Doktor <ldok...@redhat.com>
>>> Signed-off-by: Greg Kurz <gr...@kaod.org>
>>
>> Cédric Le Goater <c...@kaod.org>
>>
>> Nice work ! That was quite complex to catch !
>
> It is a very nice analysis. However, I'm disinclined to merge this
> for the time being.
>
> My preferred approach would be to just eliminate CAS reboots
> altogether, since that has other benefits. I'm feeling like this
> isn't super-urgent, since CAS reboots are extremely rare in practice,
> now that we've eliminated the one for the irq switchover.

Yes. The possibility of a migration in the window between CAS and CAS
reboot must be even more rare.

C.

> However, if it's not looking like we'll be ready to do that as the
> qemu-5.0 release approaches, then I'll be more than willing to
> reconsider this.