Hi, list

Background: As we know, when a SCSI disk is in a failed state, for
example when bad sectors appear on the disk or the disk is
surprise-removed, the SCSI mid-layer will retry the failed I/O requests
endlessly unless it gets a notification from the LLDD to stop retrying.
Unfortunately, *not all LLDDs* currently notify the mid-layer to stop
retrying when a SCSI disk is in a failed state.

Here is what we saw when testing this in our lab:
1 We installed kernel 2.6.11.2 on a Tiger4 platform.
2 There are two SCSI disks: sda holds the root fs, and sdb is mounted on
/mnt.
3 We executed "cp -r /usr/src/linux-2.6.11.2 /mnt".
4 During the copy, we surprise-removed sdb.
5 The system then became very busy and froze; users could not even log
in to the system, either locally or remotely.
6 The error output on the screen showed that the SCSI mid-layer was
endlessly retrying the failed I/O requests.

To overcome this morbid (or weird) behavior, I propose adding a new
sysfs attribute to the SCSI device.
Attribute name: stop_retry_threshold
Description: the user sets a threshold value through this interface, so
that after the SCSI mid-layer has retried "threshold" times, it
automatically stops further retries, letting the system calm down and
remain usable for other users.
Usage example: the user executes "echo 100 >
/sys/block/sdb/device/stop_retry_threshold" to tell the SCSI mid-layer
to automatically stop further retries after it has retried 100 times.
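
To make the intended semantics concrete, here is a minimal userspace
sketch of the threshold logic. It is not the patch and not kernel code:
struct retry_state, threshold_store() and may_retry() are hypothetical
names used only for illustration. Two points are my assumptions rather
than part of the proposal: a value of 0 means "no limit" (today's
behavior), and retries are counted per device rather than per request.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-device state; not an existing kernel structure. */
struct retry_state {
    unsigned long stop_retry_threshold; /* assumed: 0 means no limit */
    unsigned long retries;              /* retries performed so far */
};

/*
 * Parse the text written to the attribute, e.g. "100\n" from echo.
 * Returns 0 on success or -EINVAL if the input is not a plain
 * non-negative decimal number.
 */
static int threshold_store(struct retry_state *st, const char *buf)
{
    char *end;
    unsigned long val;

    /* Reject empty input, signs and leading whitespace up front. */
    if (buf[0] < '0' || buf[0] > '9')
        return -EINVAL;

    errno = 0;
    val = strtoul(buf, &end, 10);
    if (errno != 0)
        return -EINVAL;

    /* Allow the single trailing newline that echo appends. */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    st->stop_retry_threshold = val;
    return 0;
}

/*
 * Decide whether a failed request may be retried once more.
 * Counts the retry when it is allowed.
 */
static bool may_retry(struct retry_state *st)
{
    if (st->stop_retry_threshold != 0 &&
        st->retries >= st->stop_retry_threshold)
        return false;

    st->retries++;
    return true;
}
```

With the threshold set to 100, may_retry() in this sketch allows exactly
100 retries and then refuses further ones, at which point the failed
request would be completed with an error instead of being requeued.
Whether the counter should be reset after a successful I/O is left open
here.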

What are your comments on this proposal? If there are no objections,
I'll send out the patch soon.

Thanks,
Forrest
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
