On Thu, Apr 13, 2017 at 6:45 AM, Paolo Bonzini <pbonz...@redhat.com> wrote:
> On 13/04/2017 09:11, Jeff Cody wrote:
>>> It didn't make it into 2.9-rc4 because of limited time. :(
>>>
>>> Looks like there is no -rc5, we'll have to document this as a known issue.
>>> Users should "block-job-complete/cancel" as soon as possible to avoid such a
>>> hang.
>>
>> I'd argue for including a fix for 2.9, since this is both a regression, and
>> a hard lock without possible recovery short of restarting the QEMU process.
>
> It is a bit of a corner case (and jobs on I/O thread are relatively rare
> too), so maybe it's not worth delaying 2.9. It has been delayed already
> quite a bit. Another reason I think I prefer to wait is to ensure that
> we have an entry in qemu-iotests to avoid the future regression.
>
> Fam explained to me what happens, and the root cause is that bdrv_drain
> never does a release/acquire pair in this case, so the I/O thread run
> remains stuck in a callback that tries to acquire. Ironically
> reintroducing RFifoLock would probably fix this (not 100% sure). Oops.
>
> His solution is a bit hacky, but we will hopefully be able to revert it
> in 2.10 or whenever aio_context_acquire/release will go away.

Fam, many of us will be offline Friday and Monday due to public holidays.
Can you work on a patch that addresses Kevin's concerns with
"[PATCH for-2.9 4/5] block: Drain BH in bdrv_drained_begin"?

I'll be officially offline too but am willing to review the patch.

Thanks,
Stefan