On Sun, 2019-04-28 at 16:14 +0800, Ming Lei wrote:
> Just like aio/io_uring, we need to grab 2 refcount for queuing one
> request, one is for submission, another is for completion.
> 
> If the request isn't queued from plug code path, the refcount grabbed
> in generic_make_request() serves for submission. In theroy, this
> refcount should have been released after the sumission(async run queue)
> is done. blk_freeze_queue() works with blk_sync_queue() together
> for avoiding race between cleanup queue and IO submission, given async
> run queue activities are canceled because hctx->run_work is scheduled with
> the refcount held, so it is fine to not hold the refcount when
> running the run queue work function for dispatch IO.
> 
> However, if request is staggered into plug list, and finally queued
> from plug code path, the refcount in submission side is actually missed.
> And we may start to run queue after queue is removed because the queue's
> kobject refcount isn't guaranteed to be grabbed in flushing plug list
> context, then kernel oops is triggered, see the following race:
> 
> blk_mq_flush_plug_list():
>         blk_mq_sched_insert_requests()
>                 insert requests to sw queue or scheduler queue
>                 blk_mq_run_hw_queue
> 
> Because of concurrent run queue, all requests inserted above may be
> completed before calling the above blk_mq_run_hw_queue. Then queue can
> be freed during the above blk_mq_run_hw_queue().
> 
> Fixes the issue by grab .q_usage_counter before calling
> blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
> safe because the queue is absolutely alive before inserting request.
> 
> Cc: Dongli Zhang <dongli.zh...@oracle.com>
> Cc: James Smart <james.sm...@broadcom.com>
> Cc: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: linux-scsi@vger.kernel.org,
> Cc: Martin K . Petersen <martin.peter...@oracle.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: James E . J . Bottomley <j...@linux.vnet.ibm.com>,
> Reviewed-by: Bart Van Assche <bvanass...@acm.org>
> Reviewed-by: Johannes Thumshirn <jthumsh...@suse.de>
> Reviewed-by: Hannes Reinecke <h...@suse.com>
> Tested-by: James Smart <james.sm...@broadcom.com>
> Signed-off-by: Ming Lei <ming....@redhat.com>

I added my "Reviewed-by" to a previous version of this patch but not
to this version of this patch. Several "Reviewed-by" tags probably
should be removed.

> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index aa6bc5c02643..dfe83e7935d6 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -414,6 +414,13 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx 
> *hctx,
>  {
>         struct elevator_queue *e;
>  
> +       /*
> +        * blk_mq_sched_insert_requests() is called from flush plug
> +        * context only, and hold one usage counter to prevent queue
> +        * from being released.
> +        */
> +       percpu_ref_get(&hctx->queue->q_usage_counter);
> +
>         e = hctx->queue->elevator;
>         if (e && e->type->ops.insert_requests)
>                 e->type->ops.insert_requests(hctx, list, false);
> @@ -432,6 +439,8 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx 
> *hctx,
>         }
>  
>         blk_mq_run_hw_queue(hctx, run_queue_async);
> +
> +       percpu_ref_put(&hctx->queue->q_usage_counter);
>  }

I think that 'hctx' can disappear if all requests queued by this function
finish just before blk_mq_run_hw_queue() returns and if the number of hardware
queues is changed from another thread. Shouldn't the request queue pointer be
stored in a local variable instead of reading hctx->queue twice?

Bart.

Reply via email to