On Wed, Jan 15, 2020 at 07:10:47PM +0100, Cédric Le Goater wrote:
> On 1/15/20 6:48 PM, Greg Kurz wrote:
> > Migration can potentially race with CAS reboot. If the migration thread
> > completes migration after CAS has set spapr->cas_reboot but before the
> > mainloop could pick up the reset request and reset the machine, the
> > guest is migrated unrebooted and the destination doesn't reboot it
> > either because it isn't aware a CAS reboot was needed (e.g., because a
> > device was added before CAS). This likely results in a broken or hung
> > guest.
> >
> > Even if it is small, the window between CAS and CAS reboot is enough to
> > re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> > subsection for that and always send it when a CAS reboot is pending.
> > This may cause migration to older QEMUs to fail but it is still better
> > than ending up with a broken guest.
> >
> > The destination cannot honour the CAS reboot request from a post load
> > handler because this must be done after the guest is fully restored.
> > It is thus done from a VM change state handler.
> >
> > Reported-by: Lukáš Doktor <ldok...@redhat.com>
> > Signed-off-by: Greg Kurz <gr...@kaod.org>
>
> Cédric Le Goater <c...@kaod.org>
>
> Nice work! That was quite complex to catch!
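
For readers following the thread, the mechanism in the quoted patch description can be sketched as a small standalone toy model. This is not QEMU code and not the patch itself: only the field name cas_reboot comes from the description above, and every type and function here (ToyMachine, ToyStream, toy_save, toy_load, toy_vm_state_change) is hypothetical. It only illustrates the two ideas: an optional subsection that is sent only when a reboot is pending, and a reboot that is deferred until the destination VM actually starts running.

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for the machine state. Only the name cas_reboot
 * mirrors the patch description; the rest is made up. */
typedef struct {
    bool cas_reboot;  /* a CAS-triggered reboot is pending */
    bool running;     /* the VM is currently running */
    int reboots;      /* resets performed by this toy machine */
} ToyMachine;

/* Toy migration stream with one optional subsection. */
typedef struct {
    bool has_cas_reboot_subsection;
    bool cas_reboot;
} ToyStream;

/* "Needed" predicate: the subsection is sent only when a reboot is pending. */
static bool toy_cas_reboot_needed(const ToyMachine *m)
{
    return m->cas_reboot;
}

/* Source side: emit the subsection only if the predicate says so. */
static void toy_save(const ToyMachine *src, ToyStream *s)
{
    memset(s, 0, sizeof(*s));
    if (toy_cas_reboot_needed(src)) {
        s->has_cas_reboot_subsection = true;
        s->cas_reboot = src->cas_reboot;
    }
}

/* Destination side: restore the flag, but do NOT reboot here; in this
 * model the guest is not considered fully restored at load time. */
static void toy_load(ToyMachine *dst, const ToyStream *s)
{
    dst->cas_reboot = s->has_cas_reboot_subsection && s->cas_reboot;
}

/* Stand-in for a VM change state handler: honour the pending reboot
 * once the VM transitions to running. */
static void toy_vm_state_change(ToyMachine *m, bool running)
{
    m->running = running;
    if (running && m->cas_reboot) {
        m->cas_reboot = false;
        m->reboots++;
    }
}
```

In this model, saving a machine with cas_reboot set, loading it elsewhere and then signalling the running state performs exactly one reboot on the destination; a machine saved without a pending reboot sends no subsection and is never rebooted.
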

It is a very nice analysis. However, I'm disinclined to merge this for
the time being.

My preferred approach would be to just eliminate CAS reboots
altogether, since that has other benefits. I'm feeling like this
isn't super-urgent, since CAS reboots are extremely rare in practice,
now that we've eliminated the one for the irq switchover.

However, if it's not looking like we'll be ready to do that as the
qemu-5.0 release approaches, then I'll be more than willing to
reconsider this.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson