Thanks for all comments!
OK, I'd like to sort our situation:
################
$ Here are 2 features: - iochk_clear/read() interface for error "detection" by Seto ... me :-) - callback, thread, and event notification for error "recovery" by Linas ... expert in PPC64
$ What will "detection" interface provides? - allow drivers to get error information - device/bus was isolated/going-reset/re-enabled/etc. - error status which hardware and PCI subsystem provides - allow drivers to do "simple retry" easily - major soft errors(*1) would be recovered by a simple retry - in cases that device/bus was re-enabled but a retry is required
$ What will "recovery" infrastructure provides? - allow drivers to help OS's recovery - usually OS cannot re-enable affected devices by itself - allow drivers to respond asynchronous error event - allow drivers to implement "device specific recovery"
$ Difference of stance - "detection" - Assume that the number of soft error is far more than that of hard error. (PCI-Express has ECC, but traditional PCI does not.) - Assume that it isn't too late that attempt of device isolation and/or recovery comes after a simple retry(*2), and that a retry would be required even if the recovery had go well. - It isn't matter whether device isolation is actually possible or not for the arch. The fundamental intention of this interface is prevent user applications from data pollution. - Currently DMA and asynchronous I/O is not target. - "recovery" - (I'd appreciate it if Linas could fill here by his suitable words.) - (Maybe,) it is based on assuming that erroneous device should be isolated immediately irrespective of type of the error. - (I guess that) once a device was isolated, it become harder to re-enable it. It seems like a kind of hotplug feature. - Currently there are few platform which can isolate devices and attempt to recover from the I/O error.
$ How to use - "detection" ... easy. - clip I/Os by iochk_clear() and iochk_read() - if iochk_read() returns non-0, retry once and/or notify the error to user application. - "recovery" ... rather hard. - (I'd appreciate it if Linas could fill here by his suitable words.) - write callback function for each event(*3)
-----
*1: Traditionally, there are 2 types of error: - soft error: data was broken (ex. due to low voltage, natural radiation etc.) temporary error - hard error: device or bus was physically broken (i.e. uncorrectable) permanent error
*2: it's difficult to distinguish hard errors from soft errors, without any retry.
*3: Linas, how many stages/events would you prefer to be there? is 3 enough?
ex. IMHO:
IOERR_DETECTED - An error was detected, so error logging or device isolation would be major request. On PPC64, isolation would be already done by hardware. IOERR_PREPARE_RECOVERY - Require preparation before attempting error recovery by OS. IOERR_DO_RECOVERY - Require device specific recovery and result of the recovery. OS will gather all results and will decide recovered or not. IOERR_RECOVERED - OS recovery was succeeded. IOERR_DEAD - OS recovery was failed.
And as Ben said and as you already proposed, I also think only one callback is enough and better, like: int pci_emergency_callback(pci_dev *dev, err_event event, void *extra)
It allows us to add new event if desired.
################
Thanks, H.Seto
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/