On 01.03.2019 at 14:47, Sergio Lopez wrote:
> >> Otherwise, we can simply add an extra condition at
> >> child_job_drained_poll(), before the drv->drained_poll(), to return
> >> true if the job isn't yet paused.
> >
> > Yes, I think something like this is the right fix.
>
> Fixing this has uncovered another issue, also triggered by issuing
> 'block_commit' and 'device_del' consecutively. At the end, mirror_run()
> calls bdrv_drained_begin(), which is scheduled for later (via
> bdrv_co_yield_to_drain()) because the mirror job is running in a
> coroutine.
>
> At the same time, the guest requests the device to be unplugged, which
> leads to blk_unref() -> blk_drain() -> bdrv_do_drained_begin(). When
> the latter reaches BDRV_POLL_WHILE, the bdrv_drained_begin() scheduled
> above is run, which also runs BDRV_POLL_WHILE, leaving the thread stuck
> in aio_poll().
>
> Is it really safe to schedule a bdrv_drained_begin() with poll == true?
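For reference, the extra condition from the first quoted point could look
roughly like this, assuming the current shape of child_job_drained_poll()
in blockjob.c (an untested sketch: the !job->paused check is the proposed
addition, the rest follows the existing function):

static bool child_job_drained_poll(BdrvChild *c)
{
    BlockJob *bjob = c->opaque;
    Job *job = &bjob->job;
    const BlockJobDriver *drv = block_job_driver(bjob);

    /* An inactive or completed job has no pending requests */
    if (!job->busy || job_is_completed(job)) {
        return false;
    }

    /* Proposed addition: a job that hasn't reached its pause point yet
     * can still issue requests, so the drain must keep polling until
     * the job has actually paused. */
    if (!job->paused) {
        return true;
    }

    /* Otherwise, let the job driver refine the default assumption */
    if (drv->drained_poll) {
        return drv->drained_poll(bjob);
    } else {
        return true;
    }
}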
I don't see what the problem would be with it in theory. Once the node
becomes idle, both the inner and the outer BDRV_POLL_WHILE() should
return.

The question with such hangs is usually: what is the condition that made
bdrv_drain_poll() return true, and why aren't we making progress so that
it would become false?

With iothreads, it could also be that the condition has actually already
changed, but aio_wait_kick() wasn't called, so aio_poll() isn't woken up.
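To illustrate that last point with a minimal sketch (the condition and
callback names are made up; only aio_wait_kick() and the headers are
real): whoever clears a condition that an AIO_WAIT_WHILE() or
BDRV_POLL_WHILE() caller is blocked on must kick the waiter afterwards:

#include "qemu/osdep.h"
#include "block/aio-wait.h"

static bool request_in_flight; /* hypothetical condition being polled */

/* Completion callback running in the iothread. After flipping the
 * condition that a BDRV_POLL_WHILE() caller in the main thread is
 * polling on, it must call aio_wait_kick() so that the waiter's
 * aio_poll() wakes up and re-evaluates the condition. */
static void hypothetical_request_cb(void *opaque, int ret)
{
    request_in_flight = false;
    aio_wait_kick(); /* without this, the waiter can hang in aio_poll() */
}

Kevin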