On 10/03/20 10:14, Longpeng(Mike) wrote: > From: Longpeng <longpe...@huawei.com> > > We find an issue when repeat reboot in guest during migration, it cause the > migration thread never be waken up again. > > <main loop> |<migration_thread> > | > LOCK BQL | > ... | > main_loop_should_exit | > pause_all_vcpus | > 1. set all cpus ->stop=true | > and then kick | > 2. return if all cpus is paused | > (by '->stopped == true'), else| > 3. qemu_cond_wait [BQL UNLOCK] | > |LOCK BQL > |... > |do_vm_stop > | pause_all_vcpus > | (A)set all cpus ->stop=true > | and then kick > | (B)return if all cpus is paused > | (by '->stopped == true'), else > | (C)qemu_cond_wait [BQL UNLOCK] > 4. be waken up and LOCK BQL | (D)be waken up BUT wait for BQL > 5. goto 2. | > (BQL is still LOCKed) | > resume_all_vcpus | > 1. set all cpus ->stop=false | > and ->stopped=false | > ... | > BQL UNLOCK | (E)LOCK BQL > | (F)goto B. [but stopped is false now!] > |Finally, sleep at step 3 forever. > > > Note: This patch is just for discuss this issue, I'm looking forward to > your suggestions, thanks!
Thanks Mike, the above sketch is really helpful. I think the problem is not that pause_all_vcpus() is not pausing hard enough; the problem is rather than resume_all_vcpus(), when used outside vm_start(), should know about the race and do nothing if it happens. Fortunately resume_all_vcpus does not release the BQL so it should be enough to test once; translated to code, this would be the patch to fix it: diff --git a/cpus.c b/cpus.c index b4f8b84b61..1eb7533a91 100644 --- a/cpus.c +++ b/cpus.c @@ -1899,6 +1899,10 @@ void resume_all_vcpus(void) { CPUState *cpu; + if (!runstate_is_running()) { + return; + } + qemu_clock_enable(QEMU_CLOCK_VIRTUAL, true); CPU_FOREACH(cpu) { cpu_resume(cpu); Thanks, Paolo