On 5/2/25 19:34, Kevin Wolf wrote:
On 24.04.2025 at 19:32, Andrey Drobyshev wrote:
Hi all,

There's a bug in the block layer which leads to a block graph deadlock.
Notably, it takes place when blockdev IO is processed within a separate
iothread.

This was initially caught by our tests, and I was able to reduce it to a
relatively simple reproducer.  Such deadlocks are probably supposed to
be covered by iotests/graph-changes-while-io, but this one isn't.

Basically, the reproducer launches QEMU with a drive that has the
'iothread' option set, creates a chain of 2 snapshots, launches a
block-commit job for the lower snapshot and then dismisses the job.  If
the guest is issuing IO at the same time, there's a race in acquiring
the block graph lock and a potential deadlock.

Here's how it can be reproduced:

1. Run QEMU:
SRCDIR=/path/to/srcdir
$SRCDIR/build/qemu-system-x86_64 -enable-kvm \
   -machine q35 -cpu Nehalem \
   -name guest=alma8-vm,debug-threads=on \
   -m 2g -smp 2 \
   -nographic -nodefaults \
   -qmp unix:/var/run/alma8-qmp.sock,server=on,wait=off \
   -serial unix:/var/run/alma8-serial.sock,server=on,wait=off \
   -object iothread,id=iothread0 \
   -blockdev node-name=disk,driver=qcow2,file.driver=file,file.filename=/path/to/img/alma8.qcow2 \
   -device virtio-blk-pci,drive=disk,iothread=iothread0
2. Launch IO (random reads) from within the guest:
nc -U /var/run/alma8-serial.sock
...
[root@alma8-vm ~]# fio --name=randread --ioengine=libaio --direct=1 --bs=4k \
    --size=1G --numjobs=1 --time_based=1 --runtime=300 --group_reporting \
    --rw=randread --iodepth=1 --filename=/testfile
3. Run snapshot creation & removal of the lower snapshot in a loop
(script attached):
while /bin/true ; do ./remove_lower_snap.sh ; done
And then it occasionally hangs.

Note: I've tried bisecting this, and it looks like the deadlock occurs
starting from the following commit:

(BAD)  5bdbaebcce virtio: Re-enable notifications after drain
(GOOD) c42c3833e0 virtio-scsi: Attach event vq notifier with no_poll

It hangs on the latest v10.0.0 as well.


Here's the backtrace of the main thread:

#0  0x00007fc547d427ce in __ppoll (fds=0x557eb79657b0, nfds=1, timeout=<optimized out>, sigmask=0x0) at ../sysdeps/unix/sysv/linux/ppoll.c:43
#1  0x0000557eb47d955c in qemu_poll_ns (fds=0x557eb79657b0, nfds=1, timeout=-1) at ../util/qemu-timer.c:329
#2  0x0000557eb47b2204 in fdmon_poll_wait (ctx=0x557eb76c5f20, ready_list=0x7ffd94b4edd8, timeout=-1) at ../util/fdmon-poll.c:79
#3  0x0000557eb47b1c45 in aio_poll (ctx=0x557eb76c5f20, blocking=true) at ../util/aio-posix.c:730
#4  0x0000557eb4621edd in bdrv_do_drained_begin (bs=0x557eb795e950, parent=0x0, poll=true) at ../block/io.c:378
#5  0x0000557eb4621f7b in bdrv_drained_begin (bs=0x557eb795e950) at ../block/io.c:391
#6  0x0000557eb45ec125 in bdrv_change_aio_context (bs=0x557eb795e950, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7682
#7  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7964250, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7608
#8  0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb79575e0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7668
#9  0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb7e59110, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7608
#10 0x0000557eb45ec0c4 in bdrv_change_aio_context (bs=0x557eb7e51960, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7668
#11 0x0000557eb45ebf2b in bdrv_child_change_aio_context (c=0x557eb814ed80, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7608
#12 0x0000557eb45ee8e4 in child_job_change_aio_ctx (c=0x557eb7c9d3f0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../blockjob.c:157
#13 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb7c9d3f0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7592
#14 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb7d74310, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7661
#15 0x0000557eb45dcd7e in bdrv_child_cb_change_aio_ctx (child=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:1234
#16 0x0000557eb45ebe2d in bdrv_parent_change_aio_context (c=0x557eb8565af0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7592
#17 0x0000557eb45ec06b in bdrv_change_aio_context (bs=0x557eb79575e0, ctx=0x557eb76c5f20, visited=0x557eb7e06b60 = {...}, tran=0x557eb7a87160, errp=0x0) at ../block.c:7661
#18 0x0000557eb45ec1f3 in bdrv_try_change_aio_context (bs=0x557eb79575e0, ctx=0x557eb76c5f20, ignore_child=0x0, errp=0x0) at ../block.c:7715
#19 0x0000557eb45e1b15 in bdrv_root_unref_child (child=0x557eb7966f30) at ../block.c:3317
#20 0x0000557eb45eeaa8 in block_job_remove_all_bdrv (job=0x557eb7952800) at ../blockjob.c:209
#21 0x0000557eb45ee641 in block_job_free (job=0x557eb7952800) at ../blockjob.c:82
#22 0x0000557eb45f17af in job_unref_locked (job=0x557eb7952800) at ../job.c:474
#23 0x0000557eb45f257d in job_do_dismiss_locked (job=0x557eb7952800) at ../job.c:771
#24 0x0000557eb45f25fe in job_dismiss_locked (jobptr=0x7ffd94b4f400, errp=0x7ffd94b4f488) at ../job.c:783
#25 0x0000557eb45d8e84 in qmp_job_dismiss (id=0x557eb7aa42b0 "commit-snap1", errp=0x7ffd94b4f488) at ../job-qmp.c:138
#26 0x0000557eb472f6a3 in qmp_marshal_job_dismiss (args=0x7fc52c00a3b0, ret=0x7fc53c880da8, errp=0x7fc53c880da0) at qapi/qapi-commands-job.c:221
#27 0x0000557eb47a35f3 in do_qmp_dispatch_bh (opaque=0x7fc53c880e40) at ../qapi/qmp-dispatch.c:128
#28 0x0000557eb47d1cd2 in aio_bh_call (bh=0x557eb79568f0) at ../util/async.c:172
#29 0x0000557eb47d1df5 in aio_bh_poll (ctx=0x557eb76c0200) at ../util/async.c:219
#30 0x0000557eb47b12f3 in aio_dispatch (ctx=0x557eb76c0200) at ../util/aio-posix.c:436
#31 0x0000557eb47d2266 in aio_ctx_dispatch (source=0x557eb76c0200, callback=0x0, user_data=0x0) at ../util/async.c:361
#32 0x00007fc549232f4f in g_main_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:3364
#33 g_main_context_dispatch (context=0x557eb76c6430) at ../glib/gmain.c:4079
#34 0x0000557eb47d3ab1 in glib_pollfds_poll () at ../util/main-loop.c:287
#35 0x0000557eb47d3b38 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:310
#36 0x0000557eb47d3c58 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:589
#37 0x0000557eb4218b01 in qemu_main_loop () at ../system/runstate.c:835
#38 0x0000557eb46df166 in qemu_default_main (opaque=0x0) at ../system/main.c:50
#39 0x0000557eb46df215 in main (argc=24, argv=0x7ffd94b4f8d8) at ../system/main.c:80

And here's the coroutine trying to acquire the read lock:

(gdb) qemu coroutine reader_queue->entries.sqh_first
#0  0x0000557eb47d7068 in qemu_coroutine_switch (from_=0x557eb7aa48b0, to_=0x7fc537fff508, action=COROUTINE_YIELD) at ../util/coroutine-ucontext.c:321
#1  0x0000557eb47d4d4a in qemu_coroutine_yield () at ../util/qemu-coroutine.c:339
#2  0x0000557eb47d56c8 in qemu_co_queue_wait_impl (queue=0x557eb59954c0 <reader_queue>, lock=0x7fc53c57de50, flags=0) at ../util/qemu-coroutine-lock.c:60
#3  0x0000557eb461fea7 in bdrv_graph_co_rdlock () at ../block/graph-lock.c:231
#4  0x0000557eb460c81a in graph_lockable_auto_lock (x=0x7fc53c57dee3) at /home/root/src/qemu/master/include/block/graph-lock.h:213
#5  0x0000557eb460fa41 in blk_co_do_preadv_part (blk=0x557eb84c0810, offset=6890553344, bytes=4096, qiov=0x7fc530006988, qiov_offset=0, flags=BDRV_REQ_REGISTERED_BUF) at ../block/block-backend.c:1339
#6  0x0000557eb46104d7 in blk_aio_read_entry (opaque=0x7fc530003240) at ../block/block-backend.c:1619
#7  0x0000557eb47d6c40 in coroutine_trampoline (i0=-1213577040, i1=21886) at ../util/coroutine-ucontext.c:175
#8  0x00007fc547c2a360 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
#9  0x00007ffd94b4ea40 in  ()
#10 0x0000000000000000 in  ()

So it looks like the main thread is processing a job-dismiss request and
is holding the write lock taken in block_job_remove_all_bdrv() (frame #20
above).  At the same time the iothread spawns a coroutine which performs
an IO request.  Before the coroutine is spawned, blk_aio_prwv() increases
the 'in_flight' counter of the BlockBackend.  Then blk_co_do_preadv_part()
(frame #5) tries to acquire the read lock.  But the main thread never
releases the write lock, as blk_root_drained_poll() keeps returning true
while blk->in_flight > 0.  Here's the deadlock.
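
To illustrate, here's the ordering as I understand it, as a simplified
sketch (paraphrased from block/block-backend.c, not verbatim; details
differ between QEMU versions):

/* Paraphrased sketch, not verbatim QEMU code */

static BlockAIOCB *blk_aio_prwv(BlockBackend *blk, ...)
{
    blk_inc_in_flight(blk);  /* from here on, a drain of this blk keeps
                              * polling blk_root_drained_poll() until
                              * in_flight drops back to 0 */
    /* ... create and enter the coroutine (blk_aio_read_entry) ... */
}

static int coroutine_fn blk_co_do_preadv_part(BlockBackend *blk, ...)
{
    blk_wait_while_drained(blk);  /* no drain in progress yet, so this
                                   * returns immediately */

    /* <-- the main thread takes the graph write lock (job dismiss)
     *     and calls bdrv_drained_begin() right here */

    GRAPH_RDLOCK_GUARD();  /* the writer is already in, so the coroutine
                            * yields into reader_queue; in_flight never
                            * drops, and the drain poll never finishes */
    /* ... perform the actual read ... */
}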

Any comments and suggestions on the subject are welcome.  Thanks!

I think this is what the blk_wait_while_drained() call was supposed to
address in blk_co_do_preadv_part(). However, with the use of multiple
I/O threads, this is racy.

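For reference, the window I mean, sketched from memory (paraphrased
from block/block-backend.c, exact code differs between versions):

/* Paraphrased sketch of blk_wait_while_drained(), not verbatim */
static void coroutine_fn blk_wait_while_drained(BlockBackend *blk)
{
    if (blk->quiesce_counter && !blk->disable_request_queuing) {
        /* a drain is in progress: drop in_flight and queue this
         * request until the drain ends */
        blk_dec_in_flight(blk);
        qemu_co_queue_wait(&blk->queued_requests, ...);
        blk_inc_in_flight(blk);
    }
    /* if quiesce_counter was still 0 here, we return with
     * in_flight > 0, and nothing stops a drain from starting before
     * the caller reaches GRAPH_RDLOCK_GUARD() */
}
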
Do you think that in your case we hit the small race window between the
checks in blk_wait_while_drained() and GRAPH_RDLOCK_GUARD()? Or is there
another reason why blk_wait_while_drained() didn't do its job?

Kevin

In my opinion there is a pretty big race window.  The main thread has
taken the graph write lock.  After that, another coroutine gets stuck
inside GRAPH_RDLOCK_GUARD(), since there is no drain in progress at that
moment, and only after that does the main thread start the drain.

That is why Fiona's idea looks workable.  Though this would mean that
normally we should always drain at the point where we acquire the write
lock, maybe even inside that function; see the sketch below.
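
A sketch of that ordering (hypothetical, not an actual patch; the
bdrv_graph_wrlock() and drain function signatures vary between QEMU
versions, and whether this belongs inside the locking function itself
is exactly the question):

/* Hypothetical ordering, not an actual patch */
bdrv_drain_all_begin();   /* quiesce I/O first: no request can now sit
                           * between its blk_wait_while_drained() check
                           * and taking the graph read lock */
bdrv_graph_wrlock();
/* ... modify the block graph ... */
bdrv_graph_wrunlock();
bdrv_drain_all_end();

Den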
