Hi all,

There's a bug in block layer which leads to block graph deadlock.
Notably, it takes place when blockdev IO is processed within a separate
iothread.

This was initially caught by our tests, and I was able to reduce it to a
relatively simple reproducer.  Such deadlocks are probably supposed to
be covered in iotests/graph-changes-while-io, but this deadlock isn't.

Basically what the reproducer does is launches QEMU with a drive having
'iothread' option set, creates a chain of 2 snapshots, launches
block-commit job for a snapshot and then dismisses the job, starting
from the lower snapshot.  If the guest is issuing IO at the same time,
there's a race in acquiring block graph lock and a potential deadlock.

Here's how it can be reproduced:

1. Run QEMU:
> SRCDIR=/path/to/srcdir                                                        
>                                                                               
>    
>                                                                               
>   
> $SRCDIR/build/qemu-system-x86_64 -enable-kvm \                                
>   
>   -machine q35 -cpu Nehalem \                                                 
>   
>   -name guest=alma8-vm,debug-threads=on \                                     
>   
>   -m 2g -smp 2 \                                                              
>   
>   -nographic -nodefaults \                                                    
>   
>   -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \                      
>      
>   -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \                
>      
>   -object iothread,id=iothread0 \                                             
>   
>   -blockdev 
> node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2
>  \
>   -device virtio-blk-pci,drive=disk,iothread=iothread0

2. Launch IO (random reads) from within the guest:
> nc -U /var/run/alma8-serial.sock
> ...
> [root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k 
> --size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting 
> --rw=randread --iodepth=1 --filename=/testfile

3. Run snapshots creation & removal of lower snapshot operation in a
loop (script attached):
> while /bin/true ; do ./remove_lower_snap.sh ; done

And then it occasionally hangs.

Note: I've tried bisecting this, and looks like deadlock occurs starting
from the following commit:

(BAD)  5bdbaebcce virtio: Re-enable notifications after drain
(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll

On the latest v10.0.0 it does hang as well.


Here's backtrace of the main thread:

> #0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, 
> timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43
> #1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, 
> timeout=-1) at ../util/qemu-timer.c:329
> #2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, 
> ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
> #3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at 
> ../util/aio-posix.c:730
> #4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, 
> parent=0x0, poll=true) at ../block/io.c:378
> #5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at 
> ../block/io.c:391
> #6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7682
> #7  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7608
> #8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7668
> #9  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7608
> #10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7668
> #11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7608
> #12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../blockjob.c:157
> #13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7592
> #14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7661
> #15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx
>     (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = 
> {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
> #16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7592
> #17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, 
> ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, 
> errp=0x0)
>     at ../block.c:7661
> #18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, 
> ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715
> #19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at 
> ../block.c:3317
> #20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at 
> ../blockjob.c:209
> #21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at 
> ../blockjob.c:82
> #22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at 
> ../job.c:474
> #23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at 
> ../job.c:771
> #24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, 
> errp=0x7ffd94b4f488) at ../job.c:783
> --Type <RET> for more, q to quit, c to continue without paging--
> #25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", 
> errp=0x7ffd94b4f488) at ../job-qmp.c:138
> #26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, 
> ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
> #27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at 
> ../qapi/qmp-dispatch.c:128
> #28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at 
> ../util/async.c:172
> #29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at 
> ../util/async.c:219
> #30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at 
> ../util/aio-posix.c:436
> #31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, 
> callback=0x0, user_data=0x0) at ../util/async.c:361
> #32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at 
> ../glib/gmain.c:3364
> #33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079
> #34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287
> #35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at 
> ../util/main-loop.c:310
> #36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at 
> ../util/main-loop.c:589
> #37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
> #38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at 
> ../system/main.c:50
> #39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at 
> ../system/main.c:80


And here's coroutine trying to acquire read lock:

> (gdb) qemu coroutine reader_queue->entries.sqh_first 
> #0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, 
> to_=0x7fc537fff508, action=COROUTINE_YIELD) at 
> ../util/coroutine-ucontext.c:321
> #1  0x0000557eb47d4d4a in qemu_coroutine_yield () at 
> ../util/qemu-coroutine.c:339
> #2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 
> <reader_queue>, lock=0x7fc53c57de50, flags=0) at 
> ../util/qemu-coroutine-lock.c:60
> #3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231
> #4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at 
> /home/root/src/qemu/master/include/block/graph-lock.h:213
> #5  0x0000557eb460fa41 in blk_co_do_preadv_part
>     (blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, 
> qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339
> #6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at 
> ../block/block-backend.c:1619
> #7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at 
> ../util/coroutine-ucontext.c:175
> #8  0x00007fc547c2a360 in __start_context () at 
> ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
> #9  0x00007ffd94b4ea40 in  ()
> #10 0x0000000000000000 in  ()


So it looks like main thread is processing job-dismiss request and is
holding write lock taken in block_job_remove_all_bdrv() (frame #20
above).  At the same time iothread spawns a coroutine which performs IO
request.  Before the coroutine is spawned, blk_aio_prwv() increases
'in_flight' counter for Blk.  Then blk_co_do_preadv_part() (frame #5) is
trying to acquire the read lock.  But main thread isn't releasing the
lock as blk_root_drained_poll() returns true since blk->in_flight > 0.
Here's the deadlock.

Any comments and suggestions on the subject are welcomed.  Thanks!

Andrey

Attachment: remove_lower_snap.sh
Description: application/shellscript

Reply via email to