Hi,

>> Root Cause
>> - A block layer timeout happens after powering off the UAS USB device that 
>> is being accessed (the reproduce step). During timeout error handling, the 
>> scsi host state becomes SHOST_CANCEL_RECOVERY, which causes IO to hang, so 
>> the lock is never released. In the end, the usb subsystem hangs.
>> The function call path is as follows:
>> blk_mq_timeout_work 
>>   …->scsi_times_out  (… means some functions are not listed before this 
>> function.)
>>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>>       … -> scsi_error_handler
>>         …-> uas_eh_device_reset_handler
>>             -> usb_lock_device_for_reset  <- take lock
>>               -> usb_reset_device
>>                 …-> rebind = uas_post_reset (return 1 since ENODEV) 
>>                 …-> usb_unbind_and_rebind_marked_interfaces (rebind=1)
>>                    …-> uas_disconnect  (scsi_host_set_state to 
>> SHOST_CANCEL_RECOVERY)
>>                         … -> scsi_queue_rq
>
>How does scsi_queue_rq get called here?  As far as I can see, this shouldn't 
>happen.

We confirmed the function call path on Linux 4.9, the kernel we are working 
on, when this problem occurred. In Linux 4.9, the last function is 
scsi_request_fn instead of scsi_queue_rq. In staging.git, we think 
scsi_queue_rq is called through the following path:
uas_disconnect
|- scsi_remove_host
 |- scsi_forget_host
  |- __scsi_remove_device
   |- device_del
    |- bus_remove_device
     |- device_release_driver
      |- device_release_driver_internal
       |- __device_release_driver
        |- drv->remove(dev) (sd_remove)  
         |- sd_shutdown
          |- sd_sync_cache
           |- scsi_execute
            |- __scsi_execute
             |- blk_execute_rq
              |- blk_execute_rq_nowait
               |- blk_mq_sched_insert_request
                |- blk_mq_run_hw_queue
                 |- __blk_mq_delay_run_hw_queue
                  |- __blk_mq_run_hw_queue
                   |- blk_mq_sched_dispatch_requests
                    |- blk_mq_dispatch_rq_list
                     |- q->mq_ops->queue_rq (scsi_queue_rq)
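
To illustrate why a request that reaches this point cannot make progress, here 
is a toy user-space model (not kernel code). All toy_ names are hypothetical, 
and the rule that dispatch is held back while the host is in a recovery state 
is our assumption about the behaviour, based on the hang we observe.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the host states named in this thread; not the kernel enum. */
enum toy_shost_state {
	TOY_SHOST_RUNNING,
	TOY_SHOST_RECOVERY,
	TOY_SHOST_CANCEL_RECOVERY,
};

/* Assumption: both recovery states count as "in recovery". */
static bool toy_host_in_recovery(enum toy_shost_state s)
{
	return s == TOY_SHOST_RECOVERY || s == TOY_SHOST_CANCEL_RECOVERY;
}

/* True if a request arriving at the queue_rq step could be dispatched. */
static bool toy_queue_rq_can_dispatch(enum toy_shost_state s)
{
	return !toy_host_in_recovery(s);
}
```

In this model, the request issued by sd_sync_cache arrives while the state is 
SHOST_CANCEL_RECOVERY, so it is never dispatched, and the caller that waits 
for it is the error handler itself.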

>> Countermeasure
>> - Make uas_post_reset not return 1 when uas_configure_endpoints returns 
>> ENODEV, since usb_unbind_and_rebind_marked_interfaces does not need to do 
>> the unbind/rebind operations in this situation.
>> blk_mq_timeout_work
>>   …->scsi_times_out  (… means some functions are not listed before this 
>> function.)
>>     …-> scsi_eh_scmd_add(scsi_host_set_state to SHOST_RECOVERY) 
>>       … -> scsi_error_handler
>>        …-> uas_eh_device_reset_handler (*1)
>>            -> usb_lock_device_for_reset  <- take lock
>>              -> usb_reset_device
>>                -> usb_reset_and_verify_device (return ENODEV and FAILED will 
>> be reported to *1)
>>                -> uas_post_reset returns 0 when ENODEV => rebind=0 
>>                -> usb_unbind_and_rebind_marked_interfaces (rebind=0)
>
>The difference is that uas_disconnect wasn't called here.  But that routine 
>should not cause any problems -- you're always supposed to be able to unbind a 
>driver from a device.  So it looks like this is not the right way to solve the 
>problem.

We confirmed that usb_driver_release_interface calls usb_unbind_interface 
when this problem occurs, and usb_unbind_interface in turn calls the driver's 
disconnect callback (uas_disconnect).
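
For reference, the difference between the current behaviour and our proposed 
countermeasure can be sketched as below. This is only a toy user-space model 
under our assumptions, not the actual driver code: the toy_ functions are 
hypothetical and take the result of uas_configure_endpoints as a plain int.

```c
#include <assert.h>
#include <errno.h>

/* Current behaviour as we understand it: any error requests a rebind. */
static int toy_post_reset_current(int configure_err)
{
	return configure_err ? 1 : 0;
}

/* Proposed: -ENODEV (device already gone) does not request a rebind. */
static int toy_post_reset_proposed(int configure_err)
{
	if (configure_err == -ENODEV)
		return 0;
	return configure_err ? 1 : 0;
}

/* Toy stand-in for the rebind decision: nonzero means disconnect runs. */
static int toy_disconnect_called(int rebind)
{
	return rebind != 0;
}
```

With -ENODEV as input, the current variant leads to the disconnect callback 
being called from inside the error handler, while the proposed variant does 
not.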

Regards,
Kento Kobayashi
