bdrv_next() has no guarantee that its caller has stopped all block graph operations; for example, bdrv_flush_all() does not.
The latter can actually provoke such operations, because its bdrv_flush()
call, which runs a coroutine (bdrv_co_flush()), may run this coroutine in
a different AioContext than the main one, and then, when this coroutine is
done and invokes aio_wait_kick(), the monitor may get a chance to run and
start executing some graph-modifying QMP command.

One example of this is when the VM encounters an I/O error on a block
device and stops, triggering a bdrv_flush_all(), while a blockdev-mirror
is started simultaneously on a block node in an I/O thread.  When
bdrv_flush_all() comes to that node[1] and flushes it, the aio_wait_kick()
at the end of bdrv_co_flush_entry() may cause the monitor to process the
mirror request, and mirror_start_job() will then replace the node with a
mirror filter node before bdrv_flush_all() resumes and can invoke
bdrv_next() again to continue iterating.

[1] Say there is a BlockBackend on top of the node in question, so
bdrv_next() finds that BB and returns the node as the BB's blk_bs().
bdrv_next() will bdrv_ref() the node so that it remains valid through
bdrv_flush_all()'s iteration, and drop that reference when it is called
the next time.

The problem is that bdrv_next() does not remember which BDS it retains a
strong reference to when the BDS is a BB's child, so on the subsequent
call it will just invoke blk_bs() again and bdrv_unref() the returned
node -- but as the example above shows, this node might be a different
one than the one that was bdrv_ref()-ed before.  This can lead to a
use-after-free (for the mirror filter node in our example), because this
negligent bdrv_unref() would steal a strong reference from someone else.

We can solve this problem by always storing the returned (and strongly
referenced) BDS in BdrvNextIterator.bs.  When we want to drop the strong
reference to a BDS we previously returned, always drop
BdrvNextIterator.bs instead of trying to figure out in some other way
which BDS we returned last time.

Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2058457
Signed-off-by: Hanna Reitz <[email protected]>
---
Sadly, I didn't find a nice way to test this, primarily because I believe
reproducing it requires the bdrv_flush_all() to come from the VM itself
(without a QMP command).  The only way I know this can happen is when the
VM encounters an I/O error and responds by stopping the guest.

It's also anything but easily reproducible, and I can't think of a way to
improve on that, because it really seems to be a matter of chance whether
the aio_wait_kick() wakes up the monitor and has it process an incoming
QMP command.
---
 block/block-backend.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/block/block-backend.c b/block/block-backend.c
index 4ff6b4d785..c822f257dc 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -575,7 +575,7 @@ BlockBackend *blk_next(BlockBackend *blk)
  * the monitor or attached to a BlockBackend */
 BlockDriverState *bdrv_next(BdrvNextIterator *it)
 {
-    BlockDriverState *bs, *old_bs;
+    BlockDriverState *bs, *old_bs = it->bs;
 
     /* Must be called from the main loop */
     assert(qemu_get_current_aio_context() == qemu_get_aio_context());
@@ -586,8 +586,6 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
     if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
         BlockBackend *old_blk = it->blk;
 
-        old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
         do {
             it->blk = blk_all_next(it->blk);
             bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -601,11 +599,12 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
         if (bs) {
             bdrv_ref(bs);
             bdrv_unref(old_bs);
+            it->bs = bs;
             return bs;
         }
         it->phase = BDRV_NEXT_MONITOR_OWNED;
-    } else {
-        old_bs = it->bs;
+        /* Start from the first monitor-owned BDS */
+        it->bs = NULL;
     }
 
     /* Then return the monitor-owned BDSes without a BB attached. Ignore all
-- 
2.34.1
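
As background, here is a rough sketch of how a bdrv_next() caller such as
bdrv_flush_all() walks the graph.  It is illustrative only and not part of
the patch: flush_all_sketch() is a made-up name, the real bdrv_flush_all()
additionally handles AioContext locking and other details, and the exact
headers may differ.

#include "qemu/osdep.h"
#include "block/block.h"  /* bdrv_first(), bdrv_next(), bdrv_flush() */

/* Illustrative sketch only; not part of this patch. */
static int flush_all_sketch(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs;
    int result = 0;

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        /*
         * bdrv_next() has taken a strong reference to @bs and only drops
         * it on the next call.  bdrv_flush() can reach aio_wait_kick(),
         * which may let the monitor run a graph-modifying QMP command;
         * if that replaces @bs (e.g. with a mirror filter node), the
         * next bdrv_next() call must still unref the node it referenced
         * here, not whatever blk_bs() happens to return at that point.
         */
        int ret = bdrv_flush(bs);
        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}

The point is that the strong reference taken in one iteration is only
dropped inside the next bdrv_next() call, so bdrv_next() itself must
remember exactly which node it referenced.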
