On Mon, 30 Sep 2019 at 14:17, Doug Gale <doug...@gmail.com> wrote:
>
> I found a lockup in single threaded TCG, with OVMF bios, 16 CPUs.
>
> qemu-system-x86_64 --accel tcg,thread=single -smp cpus=16 -bios
> /usr/share/ovmf/OVMF.fd
>
> Using Ubuntu 18.04 LTS, default gnome desktop. There is no guest OS,
> there is no hard drive, just the OVMF firmware locks it up. (I
> narrowed it down to this from a much larger repro)
> Peter Maydell helped me bisect it in IRC.
> Works fine at commit 1e8a98b53867f61
> Fails at commit 9458a9a1df1a4c7
>
> MTTCG works fine all the way up to master.

Thanks for this bug report. I've reproduced it and think I have
figured out what is going on here.

Commit 9458a9a1df1a4c719e245 is Paolo's commit that fixes the
TCG-vs-dirty-bitmap race by having the thread which is doing a memory
access wait for the vCPU thread to finish its current TB, using a
no-op run_on_cpu() operation.

In the case of the hang, the thread doing the memory access is the
iothread, like this:

#14 0x000055c150c0a98c in run_on_cpu (cpu=0x55c153801c60, func=0x55c150bbb542 <do_nothing>, data=...)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/cpus.c:1205
#15 0x000055c150bbb58c in tcg_log_global_after_sync (listener=0x55c1538410c8)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/exec.c:2963
#16 0x000055c150c1fe18 in memory_global_after_dirty_log_sync ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2598
#17 0x000055c150c1e6b8 in memory_region_snapshot_and_clear_dirty (mr=0x55c1541e82b0, addr=0, size=1920000, client=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/memory.c:2106
#18 0x000055c150c76c05 in vga_draw_graphic (s=0x55c1541e82a0, full_update=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1661
#19 0x000055c150c771c4 in vga_update_display (opaque=0x55c1541e82a0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/hw/display/vga.c:1785
#20 0x000055c151052a83 in graphic_hw_update (con=0x55c1536dfaa0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:268
#21 0x000055c151091490 in gd_refresh (dcl=0x55c1549af090)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/gtk.c:484
#22 0x000055c151056571 in dpy_refresh (s=0x55c1542f9d90)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:1629
#23 0x000055c1510527f0 in gui_update (opaque=0x55c1542f9d90)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/ui/console.c:206
#24 0x000055c1511ee67c in timerlist_run_timers (timer_list=0x55c15370c280)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:592
#25 0x000055c1511ee726 in qemu_clock_run_timers (type=QEMU_CLOCK_REALTIME)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:606
#26 0x000055c1511ee9e5 in qemu_clock_run_all_timers ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/qemu-timer.c:692
#27 0x000055c1511ef181 in main_loop_wait (nonblocking=0)
    at /home/petmay01/linaro/qemu-from-laptop/qemu/util/main-loop.c:524
#28 0x000055c150db03fe in main_loop ()
    at /home/petmay01/linaro/qemu-from-laptop/qemu/vl.c:1806

run_on_cpu() adds the do_nothing function to a cpu queued-work list.
However, the single-threaded TCG runloop qemu_tcg_rr_cpu_thread_fn()
has the basic structure:

    while (1) {
        for (each vCPU) {
            run this vCPU for a timeslice;
        }
        qemu_tcg_rr_wait_io_event();
    }

and it processes cpu work queues only in qemu_tcg_rr_wait_io_event().

This is a problem, because the thing which causes us to stop running
one vCPU when its timeslice ends and move on to the next is the
tcg_kick_vcpu_timer -- and the iothread will never process that timer
and kick the vCPU, because it is currently blocked in run_on_cpu()!

Not sure currently what the best fix is here.

Side note -- this use of run_on_cpu() means that we now drop the
iothread lock within memory_region_snapshot_and_clear_dirty(), which
we didn't before.
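
To make the shape of the problem concrete, here's a rough standalone
pthread model of the two loops (illustrative only, not QEMU code: the
mutex, the flags and the spin loop are just stand-ins for the iothread
lock, the run_on_cpu() work list and the kick/exit-request machinery):

/* Rough model, not QEMU code: build with "gcc -pthread deadlock.c".
 * It hangs by construction, in the same way the real hang does. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* "iothread lock" */
static pthread_cond_t work_done = PTHREAD_COND_INITIALIZER;
static bool work_pending;   /* a queued run_on_cpu() item */
static bool work_completed;
static bool exit_request;   /* would be set by the kick timer */

/* Model of qemu_tcg_rr_cpu_thread_fn(): run TBs until kicked, and only
 * then come back round and look at the queued work. */
static void *vcpu_thread(void *arg)
{
    (void)arg;
    for (;;) {
        /* "run this vCPU for a timeslice": in the real code the thing
         * that ends the timeslice is the kick timer. */
        while (!__atomic_load_n(&exit_request, __ATOMIC_ACQUIRE)) {
            /* execute guest code ... */
        }
        /* Model of qemu_tcg_rr_wait_io_event(): drain queued work. */
        pthread_mutex_lock(&lock);
        if (work_pending) {
            work_pending = false;
            work_completed = true;
            pthread_cond_broadcast(&work_done);
        }
        __atomic_store_n(&exit_request, false, __ATOMIC_RELEASE);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t vcpu;
    pthread_create(&vcpu, NULL, vcpu_thread, NULL);

    /* Model of the iothread in the backtrace above: queue the no-op
     * work item and wait for the vCPU thread to complete it.  Note
     * that pthread_cond_wait() releases 'lock' while it waits. */
    pthread_mutex_lock(&lock);
    work_pending = true;
    while (!work_completed) {
        pthread_cond_wait(&work_done, &lock);   /* blocks forever */
    }
    pthread_mutex_unlock(&lock);

    /* The kick that would have set exit_request lives further down
     * this same main loop, so it is never reached: */
    __atomic_store_n(&exit_request, true, __ATOMIC_RELEASE);

    printf("never reached\n");
    pthread_join(vcpu, NULL);
    return 0;
}

The pthread_cond_wait() there is the analogue of the iothread stuck in
run_on_cpu(): the kick that would let the vCPU thread stop and drain
its work queue sits further down the blocked thread's own loop, so it
never happens; and because the wait releases the mutex, anything else
that can take the lock gets a window it didn't have before.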
This means that a vCPU thread can now get in and execute an access to
the device registers of hw/display/vga.c, updating its state in a way
that I suspect the device model code is not expecting...

So maybe the right answer here is to come up with a fix for the race
that 9458a9a1 addresses which doesn't require us to drop the iothread
lock in memory_region_snapshot_and_clear_dirty()? Alternatively, we
need to audit the callers and flag up in the documentation that this
function has the unexpected side effect of briefly dropping the
iothread lock.

thanks
-- PMM