On 04/04/2016 13:57, Stefan Hajnoczi wrote:
> On Fri, Apr 01, 2016 at 09:57:38PM +0800, Fam Zheng wrote:
>> Using the nested aio_poll() in coroutine is a bad idea. This patch
>> replaces the aio_poll loop in bdrv_drain with a BH, if called in
>> coroutine.
>>
>> For example, the bdrv_drain() in mirror.c can hang when a guest issued
>> request is pending on it in qemu_co_mutex_lock().
>>
>> Mirror coroutine in this case has just finished a request, and the block
>> job is about to complete. It calls bdrv_drain() which waits for the
>> other coroutine to complete. The other coroutine is a scsi-disk request.
>> The deadlock happens when the latter is in turn pending on the former to
>> yield/terminate, in qemu_co_mutex_lock(). The state flow is as below
>> (assuming a qcow2 image):
>>
>>         mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>                                    tracked request begin
>>                                    qcow2:qemu_co_mutex_lock.enter
>>     qcow2:qemu_co_mutex_unlock()
>>   bdrv_drain
>>     while (has tracked request)
>>       aio_poll()
>>
>> In the scsi-disk coroutine, the qemu_co_mutex_lock() will never return
>> because the mirror coroutine is blocked in the aio_poll(blocking=true).
>>
>> With this patch, the added qemu_coroutine_yield() allows the scsi-disk
>> coroutine to make progress as expected:
>>
>>         mirror coroutine               scsi-disk coroutine
>>   -------------------------------------------------------------
>>   do last write
>>     qcow2:qemu_co_mutex_lock()
>>     ...
>>                                  scsi disk read
>>                                    tracked request begin
>>                                    qcow2:qemu_co_mutex_lock.enter
>>     qcow2:qemu_co_mutex_unlock()
>>   bdrv_drain.enter
>>>   schedule BH
>>>   qemu_coroutine_yield()
>>>                                  qcow2:qemu_co_mutex_lock.return
>>>                                  ...
>>                                  tracked request end
>>     ...
>>   (resumed from BH callback)
>>   bdrv_drain.return
>>   ...
>>
>> Reported-by: Laurent Vivier <lviv...@redhat.com>
>> Suggested-by: Paolo Bonzini <pbonz...@redhat.com>
>> Signed-off-by: Fam Zheng <f...@redhat.com>
>>
>> ---
>>
>> v2: Call qemu_bh_delete() in BH callback. [Paolo]
>>     Change the loop to an assertion. [Paolo]
>>     Elaborate a bit about the fix in commit log. [Paolo]
>> ---
>>  block/io.c | 41 +++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 41 insertions(+)
>>
>> diff --git a/block/io.c b/block/io.c
>> index c4869b9..ddcfb4c 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -253,6 +253,43 @@ static void bdrv_drain_recurse(BlockDriverState *bs)
>>      }
>>  }
>>
>> +typedef struct {
>> +    Coroutine *co;
>> +    BlockDriverState *bs;
>> +    QEMUBH *bh;
>> +    bool done;
>> +} BdrvCoDrainData;
>> +
>> +static void bdrv_co_drain_bh_cb(void *opaque)
>> +{
>> +    BdrvCoDrainData *data = opaque;
>> +    Coroutine *co = data->co;
>> +
>> +    qemu_bh_delete(data->bh);
>> +    bdrv_drain(data->bs);
>> +    data->done = true;
>> +    qemu_coroutine_enter(co, NULL);
>> +}
>> +
>> +static void coroutine_fn bdrv_co_drain(BlockDriverState *bs)
>> +{
>
> Please document why a BH is needed:
>
>     /* Calling bdrv_drain() from a BH ensures the
>      * current coroutine yields and other coroutines run if they were
>      * queued from qemu_co_queue_run_restart().
>      */
>
>> +    BdrvCoDrainData data;
>> +
>> +    assert(qemu_in_coroutine());
>> +    data = (BdrvCoDrainData) {
>> +        .co = qemu_coroutine_self(),
>> +        .bs = bs,
>> +        .done = false,
>> +        .bh = aio_bh_new(bdrv_get_aio_context(bs), bdrv_co_drain_bh_cb,
>> +                         &data),
>> +    };
>> +    qemu_bh_schedule(data.bh);
>> +
>> +    qemu_coroutine_yield();
>> +    /* If we are resumed from some other event (such as an aio completion
>> +     * or a timer callback), it is a bug in the caller that should be
>> +     * fixed. */
>> +    assert(data.done);
>> +}
>> +
>>  /*
>>   * Wait for pending requests to complete on a single BlockDriverState
>>   * subtree, and suspend block driver's internal I/O until next request
>>   * arrives.
>> @@ -269,6 +306,10 @@ void bdrv_drain(BlockDriverState *bs)
>>      bool busy = true;
>>
>>      bdrv_drain_recurse(bs);
>> +    if (qemu_in_coroutine()) {
>> +        bdrv_co_drain(bs);
>> +        return;
>> +    }
>>      while (busy) {
>>          /* Keep iterating */
>>          bdrv_flush_io_queue(bs);
>> --
>> 2.7.4
>
> block/mirror.c should call bdrv_co_drain() explicitly and bdrv_drain()
> should assert(!qemu_in_coroutine()).
>
> The reason why the existing bdrv_read() and friends detect coroutine
> context at runtime is that they are meant for legacy code that runs in
> both coroutine and non-coroutine contexts.
>
> Modern coroutine code should call the coroutine interfaces explicitly
> instead.
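[Editor's note: a minimal sketch of what Stefan's suggestion could look like, assuming `bdrv_co_drain` is exported from block/io.c. This is a hypothetical fragment, not the actual committed patch; the elided bodies are marked with comments.]

```c
/* block/mirror.c runs in coroutine context, so it would call the
 * coroutine variant directly (hypothetical sketch): */
static void coroutine_fn mirror_run(void *opaque)
{
    /* ... last write completed ... */
    bdrv_co_drain(bs);              /* instead of bdrv_drain(bs) */
    /* ... complete the block job ... */
}

/* block/io.c: bdrv_drain() then rejects coroutine callers outright
 * rather than detecting the context at runtime. */
void bdrv_drain(BlockDriverState *bs)
{
    assert(!qemu_in_coroutine());
    /* ... synchronous drain loop as before ... */
}
```

The runtime `qemu_in_coroutine()` dispatch would then remain only in the legacy entry points that genuinely serve both contexts.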
For what it's worth, I suspect Fam's patch removes the need for
http://article.gmane.org/gmane.comp.emulators.qemu/401375. That's a nice
bonus. :)

Paolo