On 06/18/2018 11:44 AM, Kevin Wolf wrote:
From: Greg Kurz <gr...@kaod.org>
Removing a drive with drive_del while it is being used to run an I/O
intensive workload can cause QEMU to crash.
...
The problem is that we should avoid making block driver graph
changes while we have in-flight requests. Let's drain all I/O
for this BB before calling bdrv_root_unref_child().
Signed-off-by: Greg Kurz <gr...@kaod.org>
Signed-off-by: Kevin Wolf <kw...@redhat.com>
---
block/block-backend.c | 5 +++++
1 file changed, 5 insertions(+)
Interestingly enough, git bisect is reliably pointing to this commit as
the culprit in the changed output of iotests './check -nbd 83':
083 8s ... - output mismatch (see 083.out.bad)
--- /home/eblake/qemu/tests/qemu-iotests/083.out	2018-02-26 11:40:31.605792220 -0600
+++ /home/eblake/qemu/tests/qemu-iotests/083.out.bad	2018-08-06 16:57:03.411861660 -0500
@@ -41,7 +41,6 @@
 === Check disconnect after neg2 ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect 8 neg2 ===
@@ -54,7 +53,6 @@
 === Check disconnect before request ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect after request ===
@@ -116,7 +114,6 @@
 === Check disconnect after neg-classic ===
 
-Connection closed
 read failed: Input/output error
 
 === Check disconnect before neg1 ===
Failures: 083
Failed 1 of 1 tests
diff --git a/block/block-backend.c b/block/block-backend.c
index 2d1a3463e8..6b75bca317 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -767,6 +767,11 @@ void blk_remove_bs(BlockBackend *blk)
 
     blk_update_root_state(blk);
 
+    /* bdrv_root_unref_child() will cause blk->root to become stale and may
+     * switch to a completion coroutine later on. Let's drain all I/O here
+     * to avoid that and a potential QEMU crash.
+     */
+    blk_drain(blk);
     bdrv_root_unref_child(blk->root);
     blk->root = NULL;
 }
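To make sure I follow, the crash this patch prevents looks roughly like
this, as I read the commit message (a simplified sketch, not an actual
backtrace; the interleaving is my guess):

    main loop                           completion coroutine
    ---------                           --------------------
    blk_remove_bs()
      bdrv_root_unref_child(blk->root)
        (may free the BdrvChild)
      blk->root = NULL
                                        in-flight request completes,
                                        dereferences the stale
                                        blk->root -> use-after-free

With the added blk_drain(), nothing can still be in flight by the time
bdrv_root_unref_child() runs, so the right-hand column goes away.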
Test 083 sets up a server that intentionally disconnects at critical
points in the NBD protocol exchange, to ensure that the client reacts
sanely. I suspect that somewhere in the NBD code, the client detects
the disconnect and ends up calling into blk_remove_bs() (although I
could not quickly find the backtrace), and that prior to this patch, the
'Connection closed' message resulted from other NBD coroutines getting a
shot at the (now-closed) connection, while after this patch, the
additional blk_drain() somehow prevents those coroutines from printing
the message. If so, then the change in the 083 reference output is
probably intentional, and we should update it.
But I'm having a hard time convincing myself that this is the case,
particularly since I'm not even sure how to easily debug the assumptions
I made above.
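One low-tech way I could imagine testing the first assumption (an
untested sketch; it assumes a glibc build, and the binary needs
-rdynamic for readable symbol names) would be to instrument
blk_remove_bs() and re-run the test:

    #include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */
    #include <stdio.h>

    void blk_remove_bs(BlockBackend *blk)
    {
        void *frames[32];
        int n = backtrace(frames, 32);

        /* Dump the call chain to stderr, so the test output shows
         * who tears down the BlockBackend after the disconnect. */
        fprintf(stderr, "blk_remove_bs() called from:\n");
        backtrace_symbols_fd(frames, n, fileno(stderr));

        /* ... existing function body unchanged ... */
    }

That would at least confirm whether the NBD client path reaches
blk_remove_bs() at the points where 'Connection closed' used to be
printed.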
I'm admittedly very weak on the whole notion of what blk_drain() vs.
blk_remove_bs() is really supposed to be doing, so I could easily be
persuaded that the change in output is a regression instead of a fix.
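For what it's worth, my rough mental model of the contract is something
like this (a simplified sketch, not the actual implementation in
block/block-backend.c):

    /* After blk_drain() returns, no request submitted through this
     * BlockBackend is still in flight, so the caller may safely
     * change the graph, e.g. detach blk->root.  Roughly: */
    void blk_drain(BlockBackend *blk)
    {
        BlockDriverState *bs = blk_bs(blk);

        if (bs) {
            bdrv_drained_begin(bs);  /* block new requests and poll
                                      * until in-flight ones finish */
            bdrv_drained_end(bs);    /* resume normal operation */
        }
    }

whereas blk_remove_bs() is the graph change itself: it detaches the
medium (the BDS) from the BlockBackend. Is that close?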
Thoughts?
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org