Currently, the blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes.
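To make the class of hole concrete, here is a minimal single-threaded userspace sketch, not the kernel code, of a timeout check that samples a request, gets delayed, and then acts on a recycled instance of it. It also shows how tagging each instance with a generation number lets the late check notice the recycling. All names here (struct req, req_start(), timeout_snapshot(), timeout_confirm()) are hypothetical and exist only for illustration; the actual mechanism is described in the patches themselves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in blk-mq. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

struct req {
	uint64_t gen;		/* bumped every time the request is (re)issued */
	enum rq_state state;
	uint64_t deadline;	/* absolute expiry time, arbitrary units */
};

/* Issue (or recycle) a request: new generation, new deadline. */
static void req_start(struct req *rq, uint64_t now, uint64_t timeout)
{
	rq->gen++;
	rq->state = RQ_IN_FLIGHT;
	rq->deadline = now + timeout;
}

static void req_complete(struct req *rq)
{
	rq->state = RQ_COMPLETE;
}

/* Step 1 of the timeout check: record which instance looked expired. */
static bool timeout_snapshot(const struct req *rq, uint64_t now, uint64_t *gen)
{
	if (rq->state != RQ_IN_FLIGHT || now < rq->deadline)
		return false;
	*gen = rq->gen;
	return true;
}

/* Step 2, possibly much later: act only if it is still the same instance. */
static bool timeout_confirm(const struct req *rq, uint64_t gen)
{
	return rq->state == RQ_IN_FLIGHT && rq->gen == gen;
}

/* Replays the recycled-request scenario; asserts the stale check is rejected. */
void demo_stale_timeout(void)
{
	struct req rq = { 0 };
	uint64_t seen = 0;

	req_start(&rq, 0, 2);				/* instance 1, expires at t=2 */
	assert(timeout_snapshot(&rq, 3, &seen));	/* t=3: looks expired */

	/* The checker is delayed here; meanwhile the request is recycled. */
	req_complete(&rq);				/* instance 1 finishes */
	req_start(&rq, 4, 2);				/* reissued as instance 2 */

	/* A state-only test would terminate instance 2 at this point. */
	assert(rq.state == RQ_IN_FLIGHT);
	/* The generation comparison catches the mismatch. */
	assert(!timeout_confirm(&rq, seen));
}
```

Calling demo_stale_timeout() from a test main() replays the scenario. The sketch ignores concurrency entirely; in the kernel the snapshot and the confirmation race with the issue and completion paths, which is where RCU comes in.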
It's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce a 5 sec delay before the time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycled instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either.

This patchset replaces the broken synchronization mechanism with an RCU and generation number based one. Please read the patch description of the second patch for more details.

Oleg, Peter, I'd really appreciate it if you guys could go over the reported breakages and the new implementation.

This patchset contains the following six patches.

 0001-blk-mq-protect-completion-path-with-RCU.patch
 0002-blk-mq-replace-timeout-synchronization-with-a-RCU-an.patch
 0003-blk-mq-use-blk_mq_rq_state-instead-of-testing-REQ_AT.patch
 0004-blk-mq-make-blk_abort_request-trigger-timeout-path.patch
 0005-blk-mq-remove-REQ_ATOM_COMPLETE-usages-from-blk-mq.patch
 0006-blk-mq-remove-REQ_ATOM_STARTED.patch

and is available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-mq-timeout

diffstat follows. Thanks.

 block/blk-core.c       |    2
 block/blk-mq-debugfs.c |    4
 block/blk-mq.c         |  246 +++++++++++++++++++++++++++---------------------
 block/blk-mq.h         |   48 +++++++++
 block/blk-timeout.c    |    9 -
 block/blk.h            |    7 -
 include/linux/blk-mq.h |    1
 include/linux/blkdev.h |   23 ++++
 8 files changed, 218 insertions(+), 122 deletions(-)

--
tejun