On 1/25/19 9:33 AM, Ming Lei wrote:
> On Fri, Jan 25, 2019 at 8:45 AM Florian Stecker <m...@florianstecker.de>
> wrote:
>>
>>
>>
>> On 1/22/19 4:22 AM, Ming Lei wrote:
>>> On Tue, Jan 22, 2019 at 5:13 AM Florian Stecker <m...@florianstecker.de>
>>> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> on my laptop, I am experiencing occasional hangs of applications during
>>>> fsync(), which sometimes last up to 30 seconds. I'm using a BTRFS
>>>> filesystem which spans two partitions on the same SSD (one of them used
>>>> to contain a Windows installation, but I removed it and added the
>>>> partition to the BTRFS volume instead). Also, the problem only occurs
>>>> when an I/O scheduler (mq-deadline) is in use. I'm running kernel
>>>> version 4.20.3.
>>>>
>>>> From what I understand so far, what happens is that a sync request
>>>> fails in the SCSI/ATA layer, in ata_std_qc_defer(), because it is a
>>>> "Non-NCQ command" and cannot be queued together with other commands.
>>>> This propagates up into blk_mq_dispatch_rq_list(), where the call
>>>>
>>>> ret = q->mq_ops->queue_rq(hctx, &bd);
>>>>
>>>> returns BLK_STS_DEV_RESOURCE. Later in blk_mq_dispatch_rq_list(), there
>>>> is the piece of code
>>>>
>>>> needs_restart = blk_mq_sched_needs_restart(hctx);
>>>> if (!needs_restart ||
>>>>     (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
>>>>         blk_mq_run_hw_queue(hctx, true);
>>>> else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>>         blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
>>>>
>>>> which restarts the queue after a delay if BLK_STS_RESOURCE was returned,
>>>> but somehow not for BLK_STS_DEV_RESOURCE. Instead, nothing happens and
>>>> fsync() seems to hang until some other process wants to do I/O.
>>>>
>>>> So if I do
>>>>
>>>> - else if (needs_restart && (ret == BLK_STS_RESOURCE))
>>>> + else if (needs_restart && (ret == BLK_STS_RESOURCE ||
>>>> +          ret == BLK_STS_DEV_RESOURCE))
>>>>
>>>> it fixes my problem. But was there a reason why BLK_STS_DEV_RESOURCE
>>>> was treated differently than BLK_STS_RESOURCE here?
>>>
>>> Please see the comment:
>>>
>>> /*
>>> * BLK_STS_DEV_RESOURCE is returned from the driver to the block layer if
>>> * device related resources are unavailable, but the driver can guarantee
>>> * that the queue will be rerun in the future once resources become
>>> * available again. This is typically the case for device specific
>>> * resources that are consumed for IO. If the driver fails allocating these
>>> * resources, we know that inflight (or pending) IO will free these
>>> * resource upon completion.
>>> *
>>> * This is different from BLK_STS_RESOURCE in that it explicitly references
>>> * a device specific resource. For resources of wider scope, allocation
>>> * failure can happen without having pending IO. This means that we can't
>>> * rely on request completions freeing these resources, as IO may not be in
>>> * flight. Examples of that are kernel memory allocations, DMA mappings, or
>>> * any other system wide resources.
>>> */
>>> #define BLK_STS_DEV_RESOURCE ((__force blk_status_t)13)
>>>
>>>>
>>>> In any case, it seems wrong to me that ret is used here at all, as it
>>>> just contains the return value of the last request in the list;
>>>> whether we rerun the queue should probably not depend only on the
>>>> last request?
>>>>
>>>> Can anyone of the experts tell me whether this makes sense or I got
>>>> something completely wrong?
>>>
>>> Sounds like a bug in the SCSI or ATA driver.
>>>
>>> I remember there is a hole in SCSI wrt. returning BLK_STS_DEV_RESOURCE,
>>> but I never got lucky enough to reproduce it.
>>>
>>> scsi_queue_rq():
>>> ......
>>>     case BLK_STS_RESOURCE:
>>>         if (atomic_read(&sdev->device_busy) ||
>>>             scsi_device_blocked(sdev))
>>>             ret = BLK_STS_DEV_RESOURCE;
>>>
>>
>> OK, please tell me if I understand this correctly now. What is supposed
>> to happen is:
>>
>> - request 1 is being processed by the device
>> - request 2 (sync) fails with STS_DEV_RESOURCE because request 1 is
>> still being processed
>> - request 2 is put into hctx->dispatch inside blk_mq_dispatch_rq_list *
>> - queue does not get rerun because of STS_DEV_RESOURCE
>> - request 1 finishes
>> - scsi_end_request calls blk_mq_run_hw_queues
>> - blk_mq_run_hw_queue finds request 2 in hctx->dispatch **
>> - runs request 2
>
> Right, that is a typical case when the hw queue depth is 2.
>
>>
>> However, what happens for me is that request 1 finishes earlier, so that
>> ** is executed before *, and finds an empty hctx->dispatch list.
>>
>> At least this is consistent with what I see.
>>
>> > All in-flight request may complete between reading 'sdev->device_busy'
>> > and setting ret as 'BLK_STS_DEV_RESOURCE', then this IO hang may
>> > be triggered.
>> >
>>
>> More accurate would be: between reading sdev->device_busy and doing
>> list_splice_init(list,&hctx->dispatch) inside blk_mq_dispatch_rq_list.
>
> Exactly.
It sounds like it is not so easy to trigger. The interleaving would have to
be:

blk_mq_dispatch_rq_list
  scsi_queue_rq
    if (atomic_read(&sdev->device_busy) ||
        scsi_device_blocked(sdev))
      ret = BLK_STS_DEV_RESOURCE;
                                        scsi_end_request
                                          __blk_mq_end_request
                                            blk_mq_sched_restart // clear RESTART
                                              blk_mq_run_hw_queue
                                          blk_mq_run_hw_queues
  list_splice_init(list, &hctx->dispatch)
  needs_restart = blk_mq_sched_needs_restart(hctx)

The 'needs_restart' will be false, so the queue would be rerun.
Thanks
Jianchao
>
>>
>> I'm not sure how the SCSI driver can do anything about this except
>> returning BLK_STS_RESOURCE instead (which I guess would not be a great
>> solution), as basically everything else happens inside blk-mq. Or what
>> do you have in mind?
>
> I don't have a better idea yet, :-(
>
> thanks,
> Ming Lei
>