Re: fsync hangs after scsi rejected a request

jianchao.wang Fri, 25 Jan 2019 17:17:13 -0800

On 1/26/19 6:10 AM, Florian Stecker wrote:
> 
> 
> On 1/25/19 10:05 AM, jianchao.wang wrote:
>>
>>
>> On 1/25/19 4:56 PM, Florian Stecker wrote:
>>> On 1/25/19 4:49 AM, jianchao.wang wrote:
>>>>
>>>> It sounds like not so easy to trigger.
>>>>
>>>> blk_mq_dispatch_rq_list
>>>>     scsi_queue_rq
>>>>        if (atomic_read(&sdev->device_busy) ||
>>>>          scsi_device_blocked(sdev))
>>>>          ret = BLK_STS_DEV_RESOURCE;             scsi_end_request
>>>>                                                    __blk_mq_end_request
>>>>                                                      blk_mq_sched_restart 
>>>> // clear RESTART
>>>>                                                        blk_mq_run_hw_queue
>>>>                                                    blk_mq_run_hw_queues
>>>>       list_splice_init(list, &hctx->dispatch)
>>>>     needs_restart = blk_mq_sched_needs_restart(hctx)
>>>>
>>>> The 'needs_restart' will be false, so the queue would be rerun.
>>>>
>>>> Thanks
>>>> Jianchao
>>>
>>> Good point. So the RESTART flag is supposed to protect against this? Now I 
>>> see, this is also sort of what the lengthy comment in 
>>> blk_mq_dispatch_rq_list is saying.
>>>
>>> May I complain that this is very unintuitive (the queue gets rerun when the 
>>> RESTART flag is _not_ set) and also unreliable, as not every caller of 
>>> blk_mq_dispatch_rq_list seems to set the flag, and also it does not always 
>>> get cleared in __blk_mq_end_request?
>>>
>>> __blk_mq_end_request does the following:
>>>
>>>      if (rq->end_io) {
>>>          rq_qos_done(rq->q, rq);
>>>          rq->end_io(rq, error);
>>>      } else {
>>>          if (unlikely(blk_bidi_rq(rq)))
>>>              blk_mq_free_request(rq->next_rq);
>>>          blk_mq_free_request(rq);
>>>      }
>>>
>>> and blk_mq_free_request then calls blk_mq_sched_restart, which clears the 
>>> flag. But in my case, rq->end_io != 0, so blk_mq_free_request is never 
>>> called.
>>
>> So what is your rq->end_io ?
>> flush_end_io ? or mq_flush_data_end_io ? or other ?
>> In normal case, the blk_mq_end_request should be finally invoked.
>>
>> Did you ever try the bfq io scheduler instead of mq-deadline ?
>>
>> Can you share your dmesg and config file here ?
> 
> Sure.
> 
> dmesg: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__florianstecker.de_dmesg-2D2019-2D01-2D25.txt&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=t_udGGsH8ivtjXHin4oQ4LxELuPoXbl_ARP7oSxiQ6k&s=DC2fhRw2g4haGhhHxOYyX4zD6FWQQ5YoERdfptVRHM4&e=
> Note that no I/O scheduler is in use here, I deactivated them in a udev rule 
> because I still want to be able to use my laptop. When I test this stuff I 
> just reactivate mq-deadline manually.
> 
> Config is the default in Arch Linux: 
> https://urldefense.proofpoint.com/v2/url?u=https-3A__git.archlinux.org_svntogit_packages.git_tree_trunk_config-3Fh-3Dpackages_linux&d=DwIDaQ&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=7WdAxUBeiTUTCy8v-7zXyr4qk7sx26ATvfo6QSTvZyQ&m=t_udGGsH8ivtjXHin4oQ4LxELuPoXbl_ARP7oSxiQ6k&s=GjV_fyXaKZ8za8zdB_gqmKW7yBlbla2OHRaAT0iGZkc&e=
> 
> The problem also appears with BFQ.
> 
> And rq->end_io is set to mq_flush_data_end_io when this happens. The only 
> point I see where this function could invoke blk_mq_end_request is via 
> blk_flush_complete_seq, but it does so only if seq == REQ_FSEQ_DONE. However, 
> seq is REQ_FSEQ_POSTFLUSH for me (i.e. it just transitioned from 
> REQ_FSEQ_DATA to REQ_FSEQ_POSTFLUSH).


Could it be this scenario ?

data + post flush

blk_flush_complete_seq          a flush
  case REQ_FSEQ_DATA
  blk_flush_queue_rq
  
  finally issued to driver      blk_mq_dispatch_rq_list
                                try to issue a flush req
                                failed due to NON-NCQ command
                                due to RESTART is set, do nothing
  req complete
  req->end_io（） // doesn't check RESTART
  mq_flush_data_end_io
　  blk_flush_complete_seq
     case REQ_FSEQ_POSTFLUSH
     blk_kick_flush
       do nothing because previous flush
       has not been completed, so 
       flush_pending_idx != flush_running_idx

Thanks
Jianchao
  


>>
>
Re: fsync hangs after scsi rejected a request

Reply via email to