From: Zhu Yangyang <zhuyangyan...@huawei.com>

In the disable branch of qmp_block_set_io_throttle(), we call
bdrv_drained_begin(). bdrv_drained_begin() is a blocking interface that
waits for all submitted I/O operations to complete, i.e., until
bs->in_flight becomes zero.

In theory, once we stop submitting I/O operations, bs->in_flight should
quickly drop to zero (regardless of whether the requests succeed or
fail). However, if we are using network storage and a network link
failure occurs at that moment, the situation changes. The underlying
storage driver retries the operation, which may take 1 to 2 minutes
before it reports an I/O error to QEMU. As a result, bdrv_drained_begin()
blocks for 1 to 2 minutes, leading to two issues:

1. qmp_block_set_io_throttle(), which is an external interface, gets
   blocked, resulting in a poor user experience.
2. More seriously, qmp_block_set_io_throttle() holds the BQL during its
   execution, which can block the vCPU thread and in turn cause guest OS
   soft lockups and potentially a panic.
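
To illustrate the general idea of bounding such a wait, here is a minimal,
self-contained sketch. It is not the code from this series: FakeDevice,
PollFn, drain_with_timeout() and the two helper callbacks are hypothetical
names invented for the example, and the loop only models "wait until
in_flight is zero or a deadline passes".

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a block device; not QEMU's BlockDriverState. */
typedef struct {
    unsigned in_flight;   /* requests submitted but not yet completed */
} FakeDevice;

/* Hypothetical callback that gives pending completions a chance to run. */
typedef void (*PollFn)(FakeDevice *dev);

/* Current monotonic time in milliseconds. */
static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Wait until dev->in_flight reaches zero, giving up after timeout_ms.
 * Returns 0 once drained, or -ETIMEDOUT if requests are still outstanding.
 */
static int drain_with_timeout(FakeDevice *dev, PollFn poll_once,
                              int64_t timeout_ms)
{
    int64_t deadline = now_ms() + timeout_ms;

    while (dev->in_flight > 0) {
        if (now_ms() >= deadline) {
            return -ETIMEDOUT;   /* caller can report an error instead of hanging */
        }
        poll_once(dev);
    }
    return 0;
}

/* Example callbacks: one completes a request per call, one never does. */
static void complete_one(FakeDevice *dev)
{
    if (dev->in_flight > 0) {
        dev->in_flight--;
    }
}

static void stalled(FakeDevice *dev)
{
    (void)dev;   /* models storage that is not responding */
}
```

With complete_one as the callback the function returns 0 once the counter
reaches zero; with stalled it returns -ETIMEDOUT after the deadline and
leaves in_flight untouched, so the caller can fail the command rather than
block indefinitely.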

The stack while qmp_block_set_io_throttle() is blocked is shown below.
At this point, there are no I/O submission or completion events, and
aio_poll() will remain blocked until the network storage reports an EIO
to QEMU.

#0  0x00007f54877fc39e in ppoll () from target:/usr/lib64/libc.so.6
#1  0x0000556dced07b7d in ppoll (__ss=0x0, __timeout=0x0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:81
#2  0x0000556dcecee599 in fdmon_poll_wait (ctx=0x556de95f8e40, ready_list=0x7ffcc6378b18, timeout=-1) at ../util/fdmon-poll.c:79
#3  0x0000556dceceda9e in aio_poll (ctx=0x556de95f8e40, blocking=blocking@entry=true) at ../util/aio-posix.c:671
#4  0x0000556dcebe654a in bdrv_do_drained_begin (poll=<optimized out>, parent=0x0, bs=0x556de9896420) at ../block/io.c:378
#5  bdrv_do_drained_begin (bs=0x556de9896420, parent=0x0, poll=<optimized out>) at ../block/io.c:347
#6  0x0000556dcebdc5a1 in blk_io_limits_disable (blk=0x556dea739470) at ../block/block-backend.c:2701
#7  0x0000556dce917584 in qmp_block_set_io_throttle (arg=arg@entry=0x7ffcc6378d30, errp=errp@entry=0x7ffcc6378d18) at ../block/qapi-system.c:505
#8  0x0000556dcec5a8d2 in qmp_marshal_block_set_io_throttle (args=<optimized out>, ret=<optimized out>, errp=0x7f5486395ea0) at qapi/qapi-commands-block.c:368
#9  0x0000556dcece3ed9 in do_qmp_dispatch_bh (opaque=0x7f5486395eb0) at ../qapi/qmp-dispatch.c:128
#10 0x0000556dced03c55 in aio_bh_poll (ctx=ctx@entry=0x556de95e07c0) at ../util/async.c:219
#11 0x0000556dceced93e in aio_dispatch (ctx=0x556de95e07c0) at ../util/aio-posix.c:424
#12 0x0000556dced038fe in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at ../util/async.c:361
#13 0x00007f5487ec606e in g_main_context_dispatch () from target:/usr/lib64/libglib-2.0.so.0
#14 0x0000556dced04fd8 in glib_pollfds_poll () at ../util/main-loop.c:287
#15 os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:310
#16 main_loop_wait (nonblocking=nonblocking@entry=0) at ../util/main-loop.c:589
#17 0x0000556dce925431 in qemu_main_loop () at ../system/runstate.c:835
#18 0x0000556dcec565c7 in qemu_default_main (opaque=opaque@entry=0x0) at ../system/main.c:48
#19 0x0000556dce6f7848 in main (argc=<optimized out>, argv=<optimized out>) at ../system/main.c:76
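
Frame #2 shows the wait entered with timeout=-1, i.e. with no upper bound.
The standalone POSIX snippet below (unrelated to QEMU's own code;
poll_pipe() is a hypothetical helper written for this illustration) shows
the difference a finite timeout makes: poll() on a file descriptor with no
pending events returns 0 when the timeout expires, whereas a timeout of -1
would block until an event arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/*
 * Poll the read end of a fresh pipe for at most timeout_ms milliseconds.
 * If write_byte is non-zero, one byte is written first so that the read
 * end is immediately readable.
 *
 * Returns poll()'s result: 0 if the timeout expired with no events,
 * 1 if the descriptor became ready, -1 on error.
 */
static int poll_pipe(int timeout_ms, int write_byte)
{
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }

    if (write_byte) {
        char c = 'x';
        if (write(fds[1], &c, 1) != 1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }

    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);   /* -1 here would wait forever */

    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

Calling poll_pipe(50, 0) models the stalled case: nothing ever arrives, and
the call returns after roughly 50 ms instead of hanging. poll_pipe(50, 1)
returns as soon as the byte is readable.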

Zhu Yangyang (2):
  io/block: Refactoring the bdrv_drained_begin() function and implement
    a timeout mechanism.
  qapi: Fix qmp_block_set_io_throttle blocked for too long

 block/block-backend.c                       | 14 ++++-
 block/io.c                                  | 55 +++++++++++++++----
 block/qapi-system.c                         |  7 ++-
 include/block/aio-wait.h                    | 58 +++++++++++++++++++++
 include/block/block-io.h                    |  7 +++
 include/system/block-backend-global-state.h |  1 +
 util/aio-wait.c                             |  7 +++
 7 files changed, 137 insertions(+), 12 deletions(-)

-- 
2.33.0

