On 16/12/2015 13:54, Christian Borntraeger wrote: > Just some quick remarks before I leave into vacation: > > Performance seems to be better than the initial version. I have some > hangs from time to time when shutting down (also with your previous > version)
Yes, I've seen them too. To reproduce, just start some "dd" and repeatedly stop/cont. It hangs after 3-4 tries typically. It's like this: I/O thread main thread ----------------------------------------------------- aio_context_acquire bdrv_drain bh->scheduled = 0 call bottom half aio_context_acquire <hang> The bottom half is hung in the I/O thread, and will never be processed by the main thread because bh->scheduled = 0. My current idea is that *all* callers of aio_poll have to call it without the AioContext acquired, including bdrv_drain. It's not a problem now that Fam has fixed almost all bdrv_drain users to use bdrv_drained_begin/end instead. The nastiness here is two-fold: 1) doing aio_context_release aio_poll aio_context_acquire in bdrv_drain is not very safe because the mutex is recursive. (Have I ever mentioned I don't like recursive mutexes?...) 2) there is still a race in bdrv_drain bdrv_flush_io_queue(bs); if (bdrv_requests_pending(bs)) { busy = true; aio_context_release(aio_context); aio_poll(aio_context, busy); aio_context_acquire(aio_context); if we release before aio_poll, the last request can be completed before aio_poll is invoked, and aio_poll will block forever. In Linux terms, bdrv_request_pending would be prepare_to_wait() and aio_poll would be schedule(). But, we don't have prepare_to_wait(). I have a hackish solution here, but I don't like it really. Paolo