On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab 2 refcounts for queuing one
> request, one is for submission, another is for completion.
>
> If the request isn't queued from the plug code path, the refcount grabbed
> in generic_make_request() serves for submission. In theory, this
> refcount should have been released after the submission (async run queue)
> is done. blk_freeze_queue() works with blk_sync_queue() together
> for avoiding race between cleanup queue and IO submission, given async
> run queue activities are canceled because hctx->run_work is scheduled with
> the refcount held, so it is fine to not hold the refcount when
> running the run queue work function for dispatching IO.
>
> However, if the request is staged into the plug list, and finally queued
> from the plug code path, the refcount on the submission side is actually missed.
> And we may start to run the queue after the queue is removed because the queue's
> kobject refcount isn't guaranteed to be grabbed in flushing plug list
> context, then a kernel oops is triggered, see the following race:
>
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>         blk_mq_run_hw_queue
>
> Because of a concurrent run queue, all requests inserted above may be
> completed before calling the above blk_mq_run_hw_queue. Then the queue can
> be freed during the above blk_mq_run_hw_queue().
>
> Fix the issue by grabbing .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
> safe because the queue is definitely alive before inserting the request.
>
> Cc: Dongli Zhang <dongli.zh...@oracle.com>
> Cc: James Smart <james.sm...@broadcom.com>
> Cc: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.peter...@oracle.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: James E . J . Bottomley <j...@linux.vnet.ibm.com>,
> Reviewed-by: Bart Van Assche <bvanass...@acm.org>
> Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Tested-by: James Smart <james.sm...@broadcom.com>
> Signed-off-by: Ming Lei <ming....@redhat.com>
I added my "Reviewed-by" to a previous version of this patch but not to this
version. Several of the "Reviewed-by" tags above should probably be removed.

> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index aa6bc5c02643..dfe83e7935d6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
>  {
>          struct elevator_queue *e;
>  
> +        /*
> +         * blk_mq_sched_insert_requests() is called from flush plug
> +         * context only, and hold one usage counter to prevent queue
> +         * from being released.
> +         */
> +        percpu_ref_get(&hctx->queue->q_usage_counter);
> +
>          e = hctx->queue->elevator;
>          if (e && e->type->ops.insert_requests)
>                  e->type->ops.insert_requests(hctx, list, false);
> @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
>          }
>  
>          blk_mq_run_hw_queue(hctx, run_queue_async);
> +
> +        percpu_ref_put(&hctx->queue->q_usage_counter);
>  }

I think that 'hctx' can disappear if all requests queued by this function
finish just before blk_mq_run_hw_queue() returns and if the number of hardware
queues is changed from another thread. Shouldn't the request queue pointer be
stored in a local variable instead of reading hctx->queue twice?
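Something like the following is roughly what I have in mind -- an untested
sketch against the hunk above, not an actual patch; only the local variable
'q' is new, everything else comes from the quoted code:

        void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
                                          struct blk_mq_ctx *ctx,
                                          struct list_head *list,
                                          bool run_queue_async)
        {
                struct elevator_queue *e;
                struct request_queue *q = hctx->queue;

                /*
                 * Take a queue usage reference before inserting the requests
                 * so that the request queue cannot be released until the
                 * reference is dropped below.
                 */
                percpu_ref_get(&q->q_usage_counter);

                /* ... insert requests into the elevator or software queues,
                 * unchanged from the current code ... */

                blk_mq_run_hw_queue(hctx, run_queue_async);

                /*
                 * Do not dereference hctx here: it may already be gone if the
                 * number of hardware queues was changed. 'q' is still valid
                 * because of the reference taken above.
                 */
                percpu_ref_put(&q->q_usage_counter);
        }

Bart.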