Eduardo Habkost <ehabk...@redhat.com> writes:
> On Wed, Jul 08, 2020 at 09:11:55PM +0100, Peter Maydell wrote: >> On Wed, 8 Jul 2020 at 18:36, Eduardo Habkost <ehabk...@redhat.com> wrote: >> > >> > On Wed, Jul 08, 2020 at 06:09:49PM +0100, Peter Maydell wrote: >> > > Exactly. It appears that there's a bug in our mechanisms, >> > > which is why I'm suggesting that the right thing is >> > > to fix that bug rather than marking the CPU as halted >> > > earlier in the reset process so that the KVM_RUN happens >> > > to do nothing... >> > >> > I agree this is necessary, but it doesn't seem sufficient. >> > >> > Having cpu_reset() set halted=0 on spapr (and probably other >> > machines) is also a bug, as it could still trigger unwanted >> > KVM_RUN when cpu_reset() returns (and before machine code sets >> > halted=1). >> >> The Arm handling of starting-halted sets halted=1 within cpu_reset, >> based on whether the CPU object was created with a >> "start-powered-off" property. > > Making this mechanism generic sounds like a good idea. I'll take a stab at doing that and using it for the spapr machine. >> I'm not sure in practice that anything can get in asynchronously >> and cause a KVM_RUN in between spapr_reset_vcpu() calling >> cpu_reset() and it setting cs->halted (and the other stuff), >> though. This function ought to be called with the iothread >> lock held, so KVM_RUN will only happen if it calls some >> other function which incorrectly lets the CPU run. > > Yeah, maybe it won't happen in practice. It just seems fragile. > The same way ppc_cpu_reset() kicked the CPU by accident, code > outside cpu_reset() might one day kick the CPU by accident before > setting halted=1. I'm seeing the vcpu being KVM_RUN'd too early twice during hotplug. Both of them are before cpu_reset() and ppc_cpu_reset(). Here's the backtrace for the first of them (redacted for clarity): #0 in cpu_resume () #1 in cpu_common_realizefn () #2 in ppc_cpu_realize () #3 in device_set_realized () #4 in property_set_bool () #5 in object_property_set () #6 in object_property_set_qobject () #7 in object_property_set_bool () #8 in qdev_realize () #9 in spapr_realize_vcpu () #10 in spapr_cpu_core_realize () #11 in device_set_realized () #12 in property_set_bool () #13 in object_property_set () #14 in object_property_set_qobject () #15 in object_property_set_bool () #16 in qdev_realize () #17 in qdev_device_add () #18 in qmp_device_add () Here's the second: #0 in qemu_cpu_kick_thread () #1 in qemu_cpu_kick () #2 in queue_work_on_cpu () #3 in async_run_on_cpu () #4 in tlb_flush_by_mmuidx () #5 in tlb_flush () #6 in ppc_tlb_invalidate_all () #7 in ppc_cpu_reset () #8 in device_transitional_reset () #9 in resettable_phase_hold () #10 in resettable_assert_reset () #11 in device_set_realized () #12 in property_set_bool () #13 in object_property_set () #14 in object_property_set_qobject () #15 in object_property_set_bool () #16 in qdev_realize () #17 in spapr_realize_vcpu () #18 in spapr_cpu_core_realize () #19 in device_set_realized () #20 in property_set_bool () #21 in object_property_set () #22 in object_property_set_qobject () #23 in object_property_set_bool () #24 in qdev_realize () #25 in qdev_device_add () #26 in qmp_device_add () Looking closely, both of them ultimately stem from the qdev_realize(DEVICE(cpu), ...) call in spapr_realize_vcpu(). Is there something wrong with that? I don't know anything about the QEMU device model to be able to tell. One other way I found to avoid the spurious KVM_RUN calls is to remove the cpu_resume() call in cpu_common_realizefn(), which to me seems to be placed way too early in the CPU hotplug path. Simply removing it makes CPU hotplug stop working though. :-) I still have to see if I can find a better place for it... -- Thiago Jung Bauermann IBM Linux Technology Center