On 4/24/25 8:32 PM, Andrey Drobyshev wrote:
> Hi all,
>
> There's a bug in block layer which leads to block graph deadlock.
> Notably, it takes place when blockdev IO is processed within a separate
> iothread.
>
> This was initially caught by our tests, and I was able to reduce it to a
> relatively simple reproducer.  Such deadlocks are probably supposed to
> be covered in iotests/graph-changes-while-io, but this deadlock isn't.
>
> Basically what the reproducer does is launches QEMU with a drive having
> 'iothread' option set, creates a chain of 2 snapshots, launches
> block-commit job for a snapshot and then dismisses the job, starting
> from the lower snapshot.  If the guest is issuing IO at the same time,
> there's a race in acquiring block graph lock and a potential deadlock.
>
> Here's how it can be reproduced:
>
> [...]
>

I took a closer look at iotests/graph-changes-while-io, and have managed
to reproduce the same deadlock in a much simpler setup, without a guest.

1. Run QSD:

> ./build/storage-daemon/qemu-storage-daemon --object iothread,id=iothread0 \
>     --blockdev null-co,node-name=node0,read-zeroes=true \
>     --nbd-server addr.type=unix,addr.path=/var/run/qsd_nbd.sock \
>     --export nbd,id=exp0,node-name=node0,iothread=iothread0,fixed-iothread=true,writable=true \
>     --chardev socket,id=qmp-sock,path=/var/run/qsd_qmp.sock,server=on,wait=off \
>     --monitor chardev=qmp-sock

2. Launch IO:

> qemu-img bench -f raw -c 2000000 'nbd+unix:///node0?socket=/var/run/qsd_nbd.sock'

3. Add 2 snapshots and remove the lower one (script attached):

> while /bin/true ; do ./rls_qsd.sh ; done

And then it hangs.

I'll also send a patch with a corresponding test case added directly to
iotests.

This reproducer seems to hang starting from Fiona's commit 67446e605dc
("blockjob: drop AioContext lock before calling bdrv_graph_wrlock()").
AioContext locks were dropped entirely later on in Stefan's commit
b49f4755c7 ("block: remove AioContext locking"), but the problem remains.

Andrey

rls_qsd.sh
Description: application/shellscript
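
The attached rls_qsd.sh is not reproduced in this message. As a rough, hypothetical sketch of the kind of QMP sequence step 3 describes (add two snapshots on top of node0, then commit and remove the lower one), something like the following could be used. This is not the attached script: the node names snap1/snap2, the job id commit0, the function names and the image paths are all assumptions made for illustration, and the overlay images are assumed to be pre-created qcow2 files. The sketch only prints the commands; it does not talk to the daemon.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the attached rls_qsd.sh.
# Prints the QMP commands for: add two qcow2 overlays above node0,
# then commit the lower overlay into node0 and delete it.
# snap1/snap2, commit0 and the image paths are illustrative assumptions.

add_snapshot() {
    # $1 = new overlay node name, $2 = qcow2 image path, $3 = node to snapshot
    # "backing": null keeps blockdev-add from opening a backing file;
    # blockdev-snapshot then installs $3 as the backing node of $1.
    printf '{"execute": "blockdev-add", "arguments": {"driver": "qcow2", "node-name": "%s", "backing": null, "file": {"driver": "file", "filename": "%s"}}}\n' "$1" "$2"
    printf '{"execute": "blockdev-snapshot", "arguments": {"node": "%s", "overlay": "%s"}}\n' "$3" "$1"
}

emit_qmp_sequence() {
    # $1 = image for the lower snapshot, $2 = image for the upper snapshot
    # QMP requires capability negotiation before any other command.
    echo '{"execute": "qmp_capabilities"}'
    add_snapshot snap1 "$1" node0
    add_snapshot snap2 "$2" snap1
    # Commit the lower snapshot into the base; keep the job around so it
    # has to be dismissed explicitly.
    echo '{"execute": "block-commit", "arguments": {"job-id": "commit0", "device": "snap2", "top-node": "snap1", "base-node": "node0", "auto-dismiss": false}}'
    echo '{"execute": "job-dismiss", "arguments": {"id": "commit0"}}'
    echo '{"execute": "blockdev-del", "arguments": {"node-name": "snap1"}}'
}
```

For example, `emit_qmp_sequence /tmp/snap1.qcow2 /tmp/snap2.qcow2` prints the eight commands in order. A real script would send them to the QMP socket from step 1 one at a time and wait for the commit job to reach the concluded state before issuing job-dismiss; piping the whole sequence in at once would race with the job.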