On Fri, 17 Jan 2020 13:10:35 +0100 Laurent Vivier <lviv...@redhat.com> wrote:
> On 17/01/2020 12:49, Greg Kurz wrote:
> > On Wed, 15 Jan 2020 19:26:18 +0100
> > Laurent Vivier <lviv...@redhat.com> wrote:
> > 
> >> On 15/01/2020 19:10, Laurent Vivier wrote:
> >>> Hi,
> >>> 
> >>> On 15/01/2020 18:48, Greg Kurz wrote:
> >>>> Migration can potentially race with CAS reboot. If the migration thread
> >>>> completes migration after CAS has set spapr->cas_reboot, but before the
> >>>> main loop could pick up the reset request and reset the machine, the
> >>>> guest is migrated unrebooted and the destination doesn't reboot it
> >>>> either, because it isn't aware a CAS reboot was needed (e.g., because a
> >>>> device was added before CAS). This likely results in a broken or hung
> >>>> guest.
> >>>> 
> >>>> Even if it is small, the window between CAS and CAS reboot is enough to
> >>>> re-qualify spapr->cas_reboot as state that we should migrate. Add a new
> >>>> subsection for that and always send it when a CAS reboot is pending.
> >>>> This may cause migration to older QEMUs to fail, but it is still better
> >>>> than ending up with a broken guest.
> >>>> 
> >>>> The destination cannot honour the CAS reboot request from a post-load
> >>>> handler, because this must be done after the guest is fully restored.
> >>>> It is thus done from a VM change state handler.
> >>>> 
> >>>> Reported-by: Lukáš Doktor <ldok...@redhat.com>
> >>>> Signed-off-by: Greg Kurz <gr...@kaod.org>
> >>>> ---
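The mechanism this commit message describes relies on two standard QEMU facilities: a migration subsection, whose .needed callback keeps it off the wire (and thus keeps older destinations working) unless a CAS reboot is actually pending, and a VM change state handler that fires once the restored guest transitions to running. A minimal sketch of what this could look like, following QEMU's usual VMState conventions; the function names, shutdown cause and registration details below are illustrative guesses, not necessarily the actual patch:

/* Sketch only, modelled on the existing subsections in hw/ppc/spapr.c. */

#include "qemu/osdep.h"
#include "hw/ppc/spapr.h"
#include "migration/vmstate.h"
#include "sysemu/runstate.h"

static bool spapr_cas_reboot_needed(void *opaque)
{
    SpaprMachineState *spapr = opaque;

    /* Only put the subsection on the wire while a CAS reboot is pending. */
    return spapr->cas_reboot;
}

static const VMStateDescription vmstate_spapr_cas_reboot = {
    .name = "spapr_cas_reboot",
    .version_id = 1,
    .minimum_version_id = 1,
    .needed = spapr_cas_reboot_needed,
    .fields = (VMStateField[]) {
        VMSTATE_BOOL(cas_reboot, SpaprMachineState),
        VMSTATE_END_OF_LIST()
    },
};

/*
 * The destination must not reset the machine from a post_load hook:
 * the guest is not fully restored at that point.  Deferring to a VM
 * change state handler means the reset request is only raised once
 * the incoming migration has completed and the VM is about to run.
 */
static void spapr_cas_reboot_state_change(void *opaque, int running,
                                          RunState state)
{
    SpaprMachineState *spapr = opaque;

    if (running && spapr->cas_reboot) {
        spapr->cas_reboot = false;  /* don't request the reset twice */
        qemu_system_reset_request(SHUTDOWN_CAUSE_SUBSYSTEM_RESET);
    }
}

/*
 * Wiring (at machine init, hypothetically):
 *   qemu_add_vm_change_state_handler(spapr_cas_reboot_state_change, spapr);
 * and vmstate_spapr_cas_reboot would be added to the machine's
 * top-level .subsections list.
 */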
> >>>> 
> >>> 
> >>> I'm wondering if the problem can be related to the fact that
> >>> main_loop_should_exit() could release qemu_global_mutex in
> >>> pause_all_vcpus() in the reset case?
> >>> 
> >>> static bool main_loop_should_exit(void)
> >>> {
> >>> ...
> >>>     request = qemu_reset_requested();
> >>>     if (request) {
> >>>         pause_all_vcpus();
> >>>         qemu_system_reset(request);
> >>>         resume_all_vcpus();
> >>>         if (!runstate_check(RUN_STATE_RUNNING) &&
> >>>             !runstate_check(RUN_STATE_INMIGRATE)) {
> >>>             runstate_set(RUN_STATE_PRELAUNCH);
> >>>         }
> >>>     }
> >>> ...
> >>> 
> >>> I already sent a patch for this kind of problem (in the current pull
> >>> request from Juan):
> >>> 
> >>> "runstate: ignore finishmigrate -> prelaunch transition"
> >>> 
> >>> but I don't know if it could fix this one.
> >> 
> >> I think it would be interesting to have the state transitions on the
> >> source and the destination when the problem occurs (with something
> >> like "-trace runstate_set").
> >> 
> > 
> > With "-serial mon:stdio -trace runstate_set -trace guest_cpu_reset":
> > 
> > OF stdout device is: /vdevice/vty@71000000
> > Preparing to boot Linux version 4.18.0-80.el8.ppc64le
> > (mockbu...@ppc-061.build.eng.bos.redhat.com) (gcc version 8.2.1 20180905
> > (Red Hat 8.2.1-3) (GCC)) #1 SMP Wed Mar 13 11:26:21 UTC 2019
> > Detected machine type: 0000000000000101
> > command line: BOOT_IMAGE=/boot/vmlinuz-4.18.0-80.el8.ppc64le
> > root=UUID=012b83a5-2594-48ac-b936-12fec7cdbb9a ro console=ttyS0
> > console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=auto
> > Max number of cores passed to firmware: 2048 (NR_CPUS = 2048)
> > Calling ibm,client-architecture-support.
> > 
> > Migration starts here.
> > 
> > ..qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable:
> > IRQ_XIVE capability must be present for KVM
> > Falling back to kernel-irqchip=off
> > 
> > This ^^ indicates that CAS was called and switched to XIVE, for which
> > we lack proper KVM support on GA boston machines.
> > 
> > 23348@1579260982.315795:runstate_set current_run_state 9 (running)
> > new_state 7 (finish-migrate)
> > 23348@1579260982.360821:runstate_set current_run_state 7 (finish-migrate)
> > new_state 5 (postmigrate)
> > 
> > The migration thread is holding the global QEMU mutex at this point. It
> > has stopped all CPUs. It now streams the full state to the destination
> > before releasing the mutex.
> > 
> > 23340@1579260982.797279:guest_cpu_reset cpu=0xf9dbb48a5e0
> > 23340@1579260982.797319:guest_cpu_reset cpu=0xf9dbb4d56a0
> > 
> > The main loop regained control and could finally process the CAS reboot
> > request, but it is too late...
> > 
> > 23340@1579260982.866071:runstate_set current_run_state 5 (postmigrate)
> > new_state 6 (prelaunch)
> 
> Thank you, Greg.
> 
> So I think the best we can do is to migrate cas_reboot.
> 
> Delaying the H_CAS call would be cleaner, but I don't know whether
> H_BUSY is a valid return status, and it would force us to update SLOF
> too.
> 

Since SLOF is currently the only user of H_CAS, I guess we have some
flexibility with the valid return codes... but anyway, David doesn't
like the idea :)

> Reviewed-by: Laurent Vivier <lviv...@redhat.com>
> 
> Thanks,
> Laurent
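For context, the delayed-H_CAS alternative Laurent mentions would look roughly like the sketch below. It is purely hypothetical: whether the guest side (SLOF) tolerates H_BUSY from this hcall is exactly the open question above, and the migration_is_idle() check is just one plausible way to detect an in-flight migration.

/*
 * Hypothetical sketch of the rejected alternative: make the hcall
 * itself back off while a migration is in flight, so that CAS (and
 * hence the CAS reboot) cannot race with migration.  SLOF would have
 * to be taught to retry the hcall on H_BUSY.
 */
static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
                                                  SpaprMachineState *spapr,
                                                  target_ulong opcode,
                                                  target_ulong *args)
{
    if (!migration_is_idle()) {
        return H_BUSY;  /* ask the guest to try again later */
    }

    /*
     * ... the existing CAS negotiation, which may set
     * spapr->cas_reboot and request the system reset ...
     */
    return H_SUCCESS;
}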