This series is an attempt to fix a deadlock issue reported by Andrey here [0].
bdrv_drained_begin() polls and is not allowed to be called with the block graph lock held. Mark the function as GRAPH_UNLOCKED. This does not catch the issue reported by Andrey, because there is a bdrv_graph_rdunlock_main_loop() before bdrv_drained_begin() in the function bdrv_change_aio_context(). That unlock is of course ineffective if the exclusive lock is held, but it seems to prevent TSA from finding the issue. Thus the bdrv_drained_begin() call from inside bdrv_change_aio_context() needs to be moved up the call stack before acquiring the locks. This required the most changes, patch 09/11. Granular draining is not trivially possible, because many of the affected functions can recursively call themselves. In place where bdrv_drained_begin() calls were removed, assertions are added, checking the quiesced_counter to ensure that the nodes already got drained further up in the call stack. In bdrv_graph_wrlock() there is a comment that it uses bdrv_drain_all_begin_nopoll() to make sure that constantly arriving new I/O doesn't cause starvation. The changes from this series are at odds with that, but there doesn't seem to be any (new) test failures. [0]: https://lore.kernel.org/qemu-devel/73839c04-7616-407e-b057-80ca69e63...@virtuozzo.com/ I did not CC the maintainers for blkverify, vmdk, etc. because this is just an RFC and the changes there are mechanical, for adapting some callers of block layer functions. Andrey Drobyshev (1): iotests/graph-changes-while-io: add test case with removal of lower snapshot Fiona Ebner (10): block: remove outdated comments about AioContext locking block: move drain outside of read-locked bdrv_reopen_queue_child() block/snapshot: move drain outside of read-locked bdrv_snapshot_delete() block: drain while unlocked in bdrv_reopen_parse_file_or_backing() block: move drain outside of read-locked bdrv_inactivate_recurse() blockdev: drain while unlocked in internal_snapshot_action() blockdev: drain while unlocked in external_snapshot_action() block: mark bdrv_drained_begin() as GRAPH_UNLOCKED block: move drain out of bdrv_change_aio_context() block/graph-lock: add drain flag to bdrv_graph_wr{,un}lock block.c | 170 +++++++++++------- block/backup.c | 4 +- block/blklogwrites.c | 8 +- block/blkverify.c | 4 +- block/block-backend.c | 8 +- block/commit.c | 16 +- block/graph-lock.c | 22 ++- block/mirror.c | 22 +-- block/qcow2.c | 4 +- block/quorum.c | 8 +- block/replication.c | 14 +- block/snapshot.c | 22 ++- block/stream.c | 18 +- block/vmdk.c | 20 +-- blockdev.c | 59 ++++-- blockjob.c | 12 +- include/block/block-global-state.h | 8 +- include/block/block-io.h | 3 +- include/block/graph-lock.h | 8 +- qemu-img.c | 2 + scripts/block-coroutine-wrapper.py | 4 +- .../qemu-iotests/tests/graph-changes-while-io | 101 ++++++++++- .../tests/graph-changes-while-io.out | 4 +- tests/unit/test-bdrv-drain.c | 40 ++--- tests/unit/test-bdrv-graph-mod.c | 20 +-- 25 files changed, 384 insertions(+), 217 deletions(-) -- 2.39.5