On Mon, 30 Sep 2019 at 14:17, Doug Gale <doug...@gmail.com> wrote:
>
> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>
> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
> /usr/share/ovmf/OVMF.fd
>
> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
> there is no hard drive, just the OVMF firmware locks it up. (I
> narrowed it down to this from a much larger repro)
> Peter Maydell helped me bisect it in IRC.
> Works fine at commit 1e8a98b53867f61
> Fails at commit 9458a9a1df1a4c7
>
> MTTCG works fine all the way up to master.

Thanks for this bug report. I've reproduced it and think I have
figured out what is going on here.

Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
TCG-vs-dirty-bitmap race by having the thread which is doing a memory
access wait for the vCPU thread to finish its current TB, using a
no-op run_on_cpu() operation.

In the case of the hang, the thread doing the memory access is the
iothread, like this:

#14 0x000055c150c0a98c in run_on_cpu (cpu=0x55c153801c60, func=0x55c150bbb542 <do_nothing>, data=...)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
#15 0x000055c150bbb58c in tcg_log_global_after_sync (listener=0x55c1538410c8)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
#16 0x000055c150c1fe18 in memory_global_after_dirty_log_sync ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
#17 0x000055c150c1e6b8 in memory_region_snapshot_and_clear_dirty (mr=0x55c1541e82b0, addr=0, size=1920000, client=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
#18 0x000055c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
#19 0x000055c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
#20 0x000055c151052a83 in graphic_hw_update (con=0x55c1536dfaa0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
#21 0x000055c151091490 in gd_refresh (dcl=0x55c1549af090)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
#22 0x000055c151056571 in dpy_refresh (s=0x55c1542f9d90)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
#23 0x000055c1510527f0 in gui_update (opaque=0x55c1542f9d90)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
#24 0x000055c1511ee67c in timerlist_run_timers (timer_list=0x55c15370c280)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
#25 0x000055c1511ee726 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
#26 0x000055c1511ee9e5 in qemu_clock_run_all_timers ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
#27 0x000055c1511ef181 in main_loop_wait (nonblocking=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
#28 0x000055c150db03fe in main_loop ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806

run_on_cpu() adds the do_nothing function to a cpu queued-work list.
However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
has the basic structure:

    while (1) {
        for (each vCPU) {
            run this vCPU for a timeslice;
        }
        qemu_tcg_rr_wait_io_event();
    }

and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().

This is a problem, because the thing which causes us to stop running
one vCPU when its timeslice ends and move on to the next is the
tcg_kick_vcpu_timer -- and the iothread will never process that timer
and kick the vCPU, because it is currently blocked in run_on_cpu()!

Not sure currently what the best fix is here.

Side note -- this use of run_on_cpu() means that we now drop the
iothread lock within memory_region_snapshot_and_clear_dirty(), which
we didn't before.
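
To make the shape of the problem concrete, here's a rough standalone
pthread model of the two loops (illustrative only, not QEMU code: the
mutex, the flags and the spin loop are just stand-ins for the iothread
lock, the run_on_cpu() work list and the kick/exit-request machinery):

/* Rough model, not QEMU code: build with "gcc -pthread deadlock.c".
 * It hangs by construction, in the same way the real hang does. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* "iothread lock" */
static pthread_cond_t work_done = PTHREAD_COND_INITIALIZER;
static bool work_pending;   /* a queued run_on_cpu() item */
static bool work_completed;
static bool exit_request;   /* would be set by the kick timer */

/* Model of qemu_tcg_rr_cpu_thread_fn(): run TBs until kicked, and only
 * then come back round and look at the queued work. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* "run this vCPU for a timeslice": in the real code the thing
         * that ends the timeslice is the kick timer. */
        while (!__atomic_load_n(&exit_request, __ATOMIC_ACQUIRE)) {
            /* execute guest code ... */
        }
        /* Model of qemu_tcg_rr_wait_io_event(): drain queued work. */
        pthread_mutex_lock(&lock);
        if (work_pending) {
            work_pending = false;
            work_completed = true;
            pthread_cond_broadcast(&work_done);
        }
        __atomic_store_n(&exit_request, false, __ATOMIC_RELEASE);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t vcpu;
    pthread_create(&vcpu, NULL, vcpu_thread, NULL);

    /* Model of the iothread in the backtrace above: queue the no-op
     * work item and wait for the vCPU thread to complete it.  Note
     * that pthread_cond_wait() releases 'lock' while it waits. */
    pthread_mutex_lock(&lock);
    work_pending = true;
    while (!work_completed) {
        pthread_cond_wait(&work_done, &lock);   /* blocks forever */
    }
    pthread_mutex_unlock(&lock);

    /* The kick that would have set exit_request lives further down
     * this same main loop, so it is never reached: */
    __atomic_store_n(&exit_request, true, __ATOMIC_RELEASE);

    printf("never reached\n");
    pthread_join(vcpu, NULL);
    return 0;
}

The pthread_cond_wait() there is the analogue of the iothread stuck in
run_on_cpu(): the kick that would let the vCPU thread stop and drain
its work queue sits further down the blocked thread's own loop, so it
never happens; and because the wait releases the mutex, anything else
that can take the lock gets a window it didn't have before.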
This means that a vCPU thread can now get in and execute an access to
the device registers of hw/display/vga.c, updating its state in a way
that I suspect the device model code is not expecting...

So maybe the right answer here is to come up with a fix for the race
that 9458a9a1 addresses which doesn't require us to drop the iothread
lock in memory_region_snapshot_and_clear_dirty()? Alternatively, we
need to audit the callers and flag up in the documentation that this
function has the unexpected side effect of briefly dropping the
iothread lock.

thanks
-- PMM