On Tue, 02/09 13:47, Stefan Hajnoczi wrote:
> On Mon, Feb 08, 2016 at 03:17:23PM +0000, Dr. David Alan Gilbert wrote:
> > Does this make sense to everyone else, or does anyone have any better
> > suggestions?
>
> As a concrete example, any monitor command that calls bdrv_drain_all()
> can hang forever with the QEMU global mutex held if I/O requests are
> stuck (e.g. NFS mount is unreachable).
>
> bdrv_aio_cancel() can also hang but is mostly exposed to device
> emulation, not the monitor.
>
> One solution for these block layer functions is to add a timeout
> argument and let them return an error. This way the monitor and device
> emulation do not hang forever.

Yes, there are a few places in the block layer that invoke aio_poll() in a
loop waiting for certain events, and a disconnected network link could make
QEMU hang. In these cases a timeout is a huge improvement. Maybe we can mark
the BDS as "hanging" (-EIO is returned for all further requests) and let
bdrv_drain_all() return.

> The benefit of the timeout is that both monitor and device emulation
> hangs are tackled. It also doesn't require monitor changes.
>
> I'm not sure who chooses the timeout value and which value makes sense
> (policy vs mechanism separation)...

Default to 30 seconds like Linux, and make it tunable through command line
options as well as QMP?

Fam
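
A minimal sketch of the timeout-plus-"hanging" idea discussed above, in plain C. Every name here (FakeBDS, drain_with_timeout, submit_request, the poll callbacks) is hypothetical and only stands in for the real block layer; none of it is QEMU API, and a real aio_poll() loop would look different.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a block device state (not QEMU's BDS). */
typedef struct {
    int in_flight;   /* number of outstanding requests */
    bool hanging;    /* once set, new requests fail with -EIO */
} FakeBDS;

/* One polling step; a stand-in for a single aio_poll()-style iteration. */
typedef void (*poll_fn)(FakeBDS *bs);

static int64_t now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Poll until all requests complete or the deadline passes.
 * Returns 0 when drained; on timeout marks the device as hanging
 * and returns -ETIMEDOUT instead of looping forever.
 */
static int drain_with_timeout(FakeBDS *bs, poll_fn poll_once,
                              int64_t timeout_ms)
{
    assert(timeout_ms >= 0);
    int64_t deadline = now_ms() + timeout_ms;

    while (bs->in_flight > 0) {
        if (now_ms() >= deadline) {
            bs->hanging = true;
            return -ETIMEDOUT;
        }
        poll_once(bs);
    }
    return 0;
}

/* New requests are refused with -EIO once the device is marked hanging. */
static int submit_request(FakeBDS *bs)
{
    if (bs->hanging) {
        return -EIO;
    }
    bs->in_flight++;
    return 0;
}

/* Demo poll step: completes one outstanding request per call. */
static void poll_completes_one(FakeBDS *bs)
{
    if (bs->in_flight > 0) {
        bs->in_flight--;
    }
}

/* Demo poll step: simulates stuck I/O (e.g. unreachable NFS server). */
static void poll_stuck(FakeBDS *bs)
{
    (void)bs;
    struct timespec pause = { 0, 1000000 };  /* sleep 1 ms, no progress */
    nanosleep(&pause, NULL);
}
```

With poll_completes_one the drain returns 0 right away; with poll_stuck and a short timeout it returns -ETIMEDOUT, sets hanging, and later submit_request() calls get -EIO. The timeout is a plain argument here, so the policy question (who picks the value, 30 seconds or otherwise) stays with the caller.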