On Wed, 19 Mar 2014, Andreas Reis wrote:

> I've uploaded a dmesg with the new debugging patch to bugzilla:
> https://bugzilla.kernel.org/attachment.cgi?id=130041

Thanks.  I have now managed to reproduce many of the features of this
problem on my own computer.

James, I will need your help (or help from somebody who understands the 
SCSI error handler) to figure out how this problem should be fixed.

Basically, usb-storage deadlocks when the SCSI error handler invokes
the eh_device_reset_handler callback while a command is running.  The
command has timed out and will never complete normally, because the
device's firmware has crashed.  But usb-storage's device-reset routine
waits for the current command to finish, which brings everything to a
standstill.

Is this design wrong?  That is, should the device-reset routine wait 
for currently executing commands to finish, or should it abort them, or 
what?

Or should the SCSI error handler abort the running command before 
invoking the eh_device_reset_handler callback?
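
To make the deadlock and the alternatives asked about above concrete, here is
a minimal user-space sketch in C.  It is a toy model, not usb-storage or SCSI
midlayer code: struct toy_device, enum reset_policy and
toy_reset_can_proceed() are hypothetical names invented for illustration, and
the real driver's locking is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policies, one per option raised in the questions above. */
enum reset_policy {
    RESET_WAITS_FOR_COMMAND,  /* current behaviour: reset waits for the command */
    RESET_ABORTS_COMMAND,     /* reset routine cancels the running command itself */
    EH_ABORTS_BEFORE_RESET    /* error handler cancels it before calling reset */
};

/* Hypothetical stand-in for the device state; not a kernel structure. */
struct toy_device {
    bool command_in_flight;   /* a command has been started and not completed */
    bool firmware_crashed;    /* if set, an in-flight command never completes */
};

/*
 * Returns true if the device reset can go ahead, false if it would wait
 * forever (the deadlock described above).
 */
bool toy_reset_can_proceed(struct toy_device *dev, enum reset_policy policy)
{
    assert(dev != NULL);

    /* Under either abort policy, the running command is cancelled first. */
    if (policy != RESET_WAITS_FOR_COMMAND)
        dev->command_in_flight = false;

    /* Waiting on a command that can never complete is the deadlock. */
    if (dev->command_in_flight && dev->firmware_crashed)
        return false;

    /* Otherwise the command finishes (or none was running) and reset runs. */
    dev->command_in_flight = false;
    return true;
}
```

With a crashed device and a command in flight, only RESET_WAITS_FOR_COMMAND
makes toy_reset_can_proceed() return false.  The sketch says nothing about
which of the other two options is the right one for the real code.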

For the record, and in case anyone is curious, here's the detailed
sequence of events during my test:

        sd issues a READ(10) command.  For whatever reason, the device
        goes nuts and the command times out.

        scsi_times_out() calls scsi_abort_command(), which queues an
        abort request.

        scmd_eh_abort_handler() calls scsi_try_to_abort_cmd(), which
        succeeds in aborting the READ.

        The READ command is retried (I didn't trace through the details
        of this).  The retry fails with a Unit Attention (SK=6,
        ASC=0x29, Power On, Reset, or Bus Device Reset Occurred).

        The READ command is retried a second time, and it times out 
        again.

        This time around, the call from scsi_times_out() to
        scsi_abort_command() fails, because the
        SCSI_EH_ABORT_SCHEDULED flag is still set.

        As a result, scsi_error_handler() calls scsi_unjam_host(), 
        which calls scsi_eh_get_sense().

        That routine calls scsi_request_sense(), which goes into
        scsi_send_eh_cmnd().

        The calls to shost->hostt->queuecommand() all fail, because the
        READ command is still running and usb-storage has a queue
        depth of 1.  The error messages produced by these failures are
        disconcerting but not dangerous.

        Since the REQUEST SENSE command was never issued, 
        scsi_eh_get_sense() returns 0.

        scsi_unjam_host() goes on to call scsi_eh_abort_cmds(), which
        does essentially nothing because the SCSI_EH_CANCEL_CMD flag
        for the only command on work_q is clear.  
        scsi_eh_test_devices() returns 0 because check_list is empty
        and work_q isn't.

        scsi_unjam_host() then calls scsi_eh_ready_devs().  This
        routine ends up calling scsi_eh_bus_device_reset(), at which 
        point usb-storage deadlocks as described above.
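
Two steps in the sequence above carry most of the weight: the second timeout
cannot schedule another abort because the flag from the first one is still
set, and the REQUEST SENSE cannot be queued because the one available slot is
occupied by the stuck READ.  A minimal sketch of just those two checks, as a
toy model in C, follows.  All names here (toy_cmd, toy_host,
TOY_EH_ABORT_SCHEDULED, toy_abort_command, toy_queuecommand) are hypothetical
stand-ins, and the in-flight counter is an assumption made for the sketch, not
a description of how usb-storage actually tracks its current command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for SCSI_EH_ABORT_SCHEDULED; the value is arbitrary. */
#define TOY_EH_ABORT_SCHEDULED 0x1u

struct toy_cmd {
    unsigned int eh_eflags;   /* error-handling flags for this command */
};

struct toy_host {
    int queue_depth;          /* 1 for usb-storage, per the text above */
    int in_flight;            /* commands currently owned by the driver */
};

/*
 * Timeout path: try to schedule an asynchronous abort.  Returns false when
 * an abort was already scheduled for this command, in which case the caller
 * falls back to the full error handler.
 */
bool toy_abort_command(struct toy_cmd *cmd)
{
    assert(cmd != NULL);
    if (cmd->eh_eflags & TOY_EH_ABORT_SCHEDULED)
        return false;
    cmd->eh_eflags |= TOY_EH_ABORT_SCHEDULED;
    return true;
}

/* Queue one command; refuse it when the host is already at queue depth. */
bool toy_queuecommand(struct toy_host *host)
{
    assert(host != NULL);
    if (host->in_flight >= host->queue_depth)
        return false;
    host->in_flight++;
    return true;
}
```

In this model, queuing the READ succeeds, the first toy_abort_command() call
succeeds, the second returns false, and a further toy_queuecommand() for the
REQUEST SENSE is refused while the READ still occupies the only slot.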

(On Andreas's system it is the first READ retry that times out, rather
than the second as on my computer.  I doubt this makes any difference.)

I can't tell if this is all working as intended or if it went off the 
tracks somewhere.

Thanks for any guidance.

Alan Stern
