On 13/04/2017 09:11, Jeff Cody wrote:
>> It didn't make it into 2.9-rc4 because of limited time. :(
>>
>> Looks like there is no -rc5, we'll have to document this as a known issue.
>> Users should "block-job-complete/cancel" as soon as possible to avoid such a
>> hang.
>
> I'd argue for including a fix for 2.9, since this is both a regression, and
> a hard lock without possible recovery short of restarting the QEMU process.

It is a bit of a corner case (and jobs on an I/O thread are relatively rare too), so maybe it's not worth delaying 2.9. It has been delayed quite a bit already.

Another reason I prefer to wait is to ensure that we have an entry in qemu-iotests to avoid a future regression.

Fam explained to me what happens, and the root cause is that bdrv_drain never does a release/acquire pair in this case, so the I/O thread remains stuck in a callback that tries to acquire the AioContext. Ironically, reintroducing RFifoLock would probably fix this (not 100% sure). Oops.

His solution is a bit hacky, but we will hopefully be able to revert it in 2.10, or whenever aio_context_acquire/release goes away.

Thanks,

Paolo