Alex Bennée <alex.ben...@linaro.org> writes:
> Thiago Jung Bauermann <bauer...@linux.ibm.com> writes: > >> Eduardo Habkost <ehabk...@redhat.com> writes: >> >>> On Wed, Jul 08, 2020 at 09:11:55PM +0100, Peter Maydell wrote: >>>> On Wed, 8 Jul 2020 at 18:36, Eduardo Habkost <ehabk...@redhat.com> wrote: >>>> > >>>> > On Wed, Jul 08, 2020 at 06:09:49PM +0100, Peter Maydell wrote: >>>> > > Exactly. It appears that there's a bug in our mechanisms, >>>> > > which is why I'm suggesting that the right thing is >>>> > > to fix that bug rather than marking the CPU as halted >>>> > > earlier in the reset process so that the KVM_RUN happens >>>> > > to do nothing... >>>> > >>>> > I agree this is necessary, but it doesn't seem sufficient. >>>> > >>>> > Having cpu_reset() set halted=0 on spapr (and probably other >>>> > machines) is also a bug, as it could still trigger unwanted >>>> > KVM_RUN when cpu_reset() returns (and before machine code sets >>>> > halted=1). >>>> >>>> The Arm handling of starting-halted sets halted=1 within cpu_reset, >>>> based on whether the CPU object was created with a >>>> "start-powered-off" property. >>> >>> Making this mechanism generic sounds like a good idea. >> >> I'll take a stab at doing that and using it for the spapr machine. >> >>>> I'm not sure in practice that anything can get in asynchronously >>>> and cause a KVM_RUN in between spapr_reset_vcpu() calling >>>> cpu_reset() and it setting cs->halted (and the other stuff), >>>> though. This function ought to be called with the iothread >>>> lock held, so KVM_RUN will only happen if it calls some >>>> other function which incorrectly lets the CPU run. >>> >>> Yeah, maybe it won't happen in practice. It just seems fragile. >>> The same way ppc_cpu_reset() kicked the CPU by accident, code >>> outside cpu_reset() might one day kick the CPU by accident before >>> setting halted=1. >> >> I'm seeing the vcpu being KVM_RUN'd too early twice during hotplug. >> Both of them are before cpu_reset() and ppc_cpu_reset(). >> >> Here's the backtrace for the first of them (redacted for clarity): >> >> #0 in cpu_resume () >> #1 in cpu_common_realizefn () >> #2 in ppc_cpu_realize () >> #3 in device_set_realized () >> #4 in property_set_bool () >> #5 in object_property_set () >> #6 in object_property_set_qobject () >> #7 in object_property_set_bool () >> #8 in qdev_realize () > <snip> >> #18 in qmp_device_add () > > Is this a hotplug event? Yes, the way I reproduce the problem is starting a pseries guest with `-smp 2,maxcpus=32,sockets=1,cores=16,threads=2` and then use qmp-shell to send the command: device_add id=device-2 driver=host-spapr-cpu-core core-id=2 node-id=0 >> Here's the second: >> >> #0 in qemu_cpu_kick_thread () >> #1 in qemu_cpu_kick () >> #2 in queue_work_on_cpu () >> #3 in async_run_on_cpu () >> #4 in tlb_flush_by_mmuidx () >> #5 in tlb_flush () >> #6 in ppc_tlb_invalidate_all () > > FWIW tcg_flush_softmmu_tlb handles a tlb_flush in the common reset code. Ok, maybe KVM should be doing that too? Or maybe it does but pseries isn't relying on it. I'll dig further. -- Thiago Jung Bauermann IBM Linux Technology Center