bdrv_next() has no guarantee that its caller has stopped all block graph operations; for example, bdrv_flush_all() does not.
The latter can actually provoke such operations, because its bdrv_flush()
call, which runs a coroutine (bdrv_co_flush()), may run this coroutine in
a different AioContext than the main one, and then, when this coroutine is
done and invokes aio_wait_kick(), the monitor may get a chance to run and
start executing some graph-modifying QMP command.

One example of this is when the VM encounters an I/O error on a block
device and stops, triggering a bdrv_flush_all(), while a blockdev-mirror
is started simultaneously on a block node in an I/O thread.  When
bdrv_flush_all() comes to that node[1] and flushes it, the aio_wait_kick()
at the end of bdrv_co_flush_entry() may cause the monitor to process the
mirror request, and mirror_start_job() will then replace the node with a
mirror filter node before bdrv_flush_all() resumes and can invoke
bdrv_next() again to continue iterating.

[1] Say there is a BlockBackend on top of the node in question, so
bdrv_next() finds that BB and returns the node as the BB's blk_bs().
bdrv_next() will bdrv_ref() the node so that it remains valid through
bdrv_flush_all()'s iteration, and drop that reference when it is called
the next time.

The problem is that bdrv_next() does not remember which BDS it retains a
strong reference to when the BDS is a BB's child, so on the subsequent
call it will just invoke blk_bs() again and bdrv_unref() the returned
node -- but as the example above shows, this node might be a different
one than the one that was bdrv_ref()-ed before.  This can lead to a
use-after-free (for the mirror filter node in our example), because this
negligent bdrv_unref() would steal a strong reference from someone else.

We can solve this problem by always storing the returned (and strongly
referenced) BDS in BdrvNextIterator.bs.  When we want to drop the strong
reference to a BDS we previously returned, always drop
BdrvNextIterator.bs instead of trying to figure out in some other way
which BDS we returned last time.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2058457
Signed-off-by: Hanna Reitz <[email protected]>
---
Sadly, I didn't find a nice way to test this, primarily because I believe
reproducing it requires the bdrv_flush_all() to come from the VM itself
(without a QMP command).  The only way I know this can happen is when the
VM encounters an I/O error and responds by stopping the guest.

It's also anything but easily reproducible, and I can't think of a way to
improve on that, because it really seems to be a matter of chance whether
the aio_wait_kick() wakes up the monitor and has it process an incoming
QMP command.
---
 block/block-backend.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 4ff6b4d785..c822f257dc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -575,7 +575,7 @@ BlockBackend *blk_next(BlockBackend *blk)
  * the monitor or attached to a BlockBackend */
 BlockDriverState *bdrv_next(BdrvNextIterator *it)
 {
-    BlockDriverState *bs, *old_bs;
+    BlockDriverState *bs, *old_bs = it->bs;
 
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
@@ -586,8 +586,6 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
     if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
         BlockBackend *old_blk = it->blk;
 
-        old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
         do {
             it->blk = blk_all_next(it->blk);
             bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -601,11 +599,12 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
         if (bs) {
             bdrv_ref(bs);
             bdrv_unref(old_bs);
+            it->bs = bs;
             return bs;
         }
         it->phase = BDRV_NEXT_MONITOR_OWNED;
-    } else {
-        old_bs = it->bs;
+        /* Start from the first monitor-owned BDS */
+        it->bs = NULL;
     }
 
     /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.34.1
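
As background, here is a rough sketch of how a bdrv_next() caller such as
bdrv_flush_all() walks the graph.  It is illustrative only and not part of
the patch: flush_all_sketch() is a made-up name, the real bdrv_flush_all()
additionally handles AioContext locking and other details, and the exact
headers may differ.

#include "qemu/osdep.h"
#include "block/block.h"  /* bdrv_first(), bdrv_next(), bdrv_flush() */

/* Illustrative sketch only; not part of this patch. */
static int flush_all_sketch(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs;
    int result = 0;

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        /*
         * bdrv_next() has taken a strong reference to @bs and only drops
         * it on the next call.  bdrv_flush() can reach aio_wait_kick(),
         * which may let the monitor run a graph-modifying QMP command;
         * if that replaces @bs (e.g. with a mirror filter node), the
         * next bdrv_next() call must still unref the node it referenced
         * here, not whatever blk_bs() happens to return at that point.
         */
        int ret = bdrv_flush(bs);
        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}

The point is that the strong reference taken in one iteration is only
dropped inside the next bdrv_next() call, so bdrv_next() itself must
remember exactly which node it referenced.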
