Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Benjamin Herrenschmidt
On Fri, 2005-03-04 at 23:57 +0100, Pavel Machek wrote: > What prevents driver from being run on another CPU, maybe just doing > mdelay() between hardware accesses? Almost all drivers that I know have some sort of locking. Nothing nasty about it. Besides, you can't expect everything to be as simp

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Benjamin Herrenschmidt
On Fri, 2005-03-04 at 14:54 +0100, Pavel Machek wrote: > Hi! > > > > If there's no ->error method, at least call ->remove so one device only > > > takes itself down. > > > > > > Does this make sense? > > > > This was my thought too last time we had this discussion. A completely > > asynchronous

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Benjamin Herrenschmidt
On Sat, 2005-03-05 at 00:18 +0100, Pavel Machek wrote: > On So 05-03-05 10:03:37, Benjamin Herrenschmidt wrote: > > On Fri, 2005-03-04 at 23:57 +0100, Pavel Machek wrote: > > > > > What prevents driver from being run on another CPU, maybe just doing > > > mdelay() between hardware accesses? > >

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Pavel Machek
On So 05-03-05 10:03:37, Benjamin Herrenschmidt wrote: > On Fri, 2005-03-04 at 23:57 +0100, Pavel Machek wrote: > > > What prevents driver from being run on another CPU, maybe just doing > > mdelay() between hardware accesses? > > Almost all drivers that I know have some sort of locking. Nothing

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Pavel Machek
Hi! > > Hmm, before we go async way (nasty locking, no?) could driver simply > > ask "did something bad happen while I was sleeping?" at beginning of each > > function? > > > > For DMA problems, driver probably has its own, timer-based, > > "something is wrong" timer, anyway, no? > > No, there is

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Jesse Barnes
On Friday, March 4, 2005 5:54 am, Pavel Machek wrote: > Hi! > > > > If there's no ->error method, at least call ->remove so one device only > > > takes itself down. > > > > > > Does this make sense? > > > > This was my thought too last time we had this discussion. A completely > > asynchronous call

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Linas Vepstas
On Fri, Mar 04, 2005 at 11:03:29AM +0900, Hidetoshi Seto was heard to remark: > >p.s. I would like to have iochk_read() take struct pci_dev * as an > >argument. (I could store a pointer to pci_dev in the "cookie" but > >that seems odd). > > I'd like to store the pointer and handle all only with t

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Pavel Machek
Hi! > > If there's no ->error method, at least call ->remove so one device only > > takes itself down. > > > > Does this make sense? > > This was my thought too last time we had this discussion. A completely > asynchronous call is probably needed in addition to Hidetoshi's proposed API, > since

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-04 Thread Hidetoshi Seto
Thanks for all comments! OK, I'd like to sort our situation: $ Here are 2 features: - iochk_clear/read() interface for error "detection" by Seto ... me :-) - callback, thread, and event notification for error "recovery" by Linas ... expert in PPC64 $ What will "dete
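For readers skimming the index, the "detection" half summarized above can be sketched roughly as follows. This is a user-space mock, not the code from the patch: `struct mock_dev`, `mock_readl()`, the `iocookie` layout, and both function signatures are assumptions made up for illustration (the thread only says the cookie type may be arch-specific). Only the names `iochk_clear`, `iochk_read`, `iocookie`, and `sample_read_with_iochk` come from the messages listed here.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_REGS 4

/* Hypothetical stand-in for a device; this is not the kernel's struct pci_dev. */
struct mock_dev {
    uint32_t regs[MOCK_REGS];
    unsigned int error_count;  /* bumped by the mock "platform" on each I/O error */
    int fail_at;               /* register index that faults, or -1 for none */
};

/* Hypothetical cookie layout; the real type is not shown in this index. */
typedef struct {
    struct mock_dev *dev;
    unsigned int seen;         /* error_count at the time of iochk_clear() */
} iocookie;

/* Arm the checker. This resets the cookie, not the device's error state. */
static void iochk_clear(iocookie *cookie, struct mock_dev *dev)
{
    cookie->dev = dev;
    cookie->seen = dev->error_count;
}

/* Return nonzero if an error was recorded since iochk_clear(). */
static int iochk_read(const iocookie *cookie)
{
    return cookie->dev->error_count != cookie->seen;
}

/* Mock register read: a faulting read returns all ones and records an error. */
static uint32_t mock_readl(struct mock_dev *dev, int idx)
{
    if (idx == dev->fail_at) {
        dev->error_count++;
        return 0xffffffffu;
    }
    return dev->regs[idx];
}

/* Sandwich a burst of reads between clear and read: 0 on success, -1 on error. */
static int sample_read_with_iochk(struct mock_dev *dev, uint32_t *buf, int words)
{
    iocookie cookie;
    int i;

    assert(words >= 0 && words <= MOCK_REGS);
    iochk_clear(&cookie, dev);
    for (i = 0; i < words; i++)
        buf[i] = mock_readl(dev, i);
    return iochk_read(&cookie) ? -1 : 0;
}
```

The mock compares an error counter against a snapshot so that arming the checker never wipes the device's own error record, which matches the objection raised in the thread that "clear" resets the checker rather than the error. How a real platform records the error is exactly the arch-specific part this sketch leaves out.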

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-03 Thread Hidetoshi Seto
Linas Vepstas wrote: Below is some "pseudocode" version (mentally substitute "pci error event" for every occurrence of "eeh"). It's got some ppc64-specific crud in there that we have to fix to make it truly generic (I just cut and pasted from current code). Would a cleaned up version of this code

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-03 Thread Hidetoshi Seto
Linas Vepstas wrote: If their defaults are no-ops, device maintainers who develops their driver on not-implemented arch should be more careful. Why? People who write device drivers already know if/when they need to disable interrupts, and so they already disable if they need it. OK, I'll remake

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Jesse Barnes
On Wednesday, March 2, 2005 3:30 pm, Linas Vepstas wrote: > Put it another way: a device driver author should have the opportunity > to poll the pci bus status if they so desire. Polling for bus status > on ppc64 is real easy. Given what Jesse Barnes was saying, it sounded > like a simple (option

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Thu, Mar 03, 2005 at 09:41:43AM +1100, Benjamin Herrenschmidt was heard to remark: > On Wed, 2005-03-02 at 12:22 -0600, Linas Vepstas wrote: > > On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds was heard to > > remark: > > > > > > The new API is what _allows_ a driver to care. It does

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Thu, Mar 03, 2005 at 09:46:12AM +1100, Benjamin Herrenschmidt was heard to remark: > On Wed, 2005-03-02 at 14:02 -0600, Linas Vepstas wrote: > > On Wed, Mar 02, 2005 at 09:27:27AM +1100, Benjamin Herrenschmidt was heard > > to remark: > > > That's a style issue. Propose an API, I'll code it.

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Benjamin Herrenschmidt
On Wed, 2005-03-02 at 13:03 -0500, linux-os wrote: > > event->dev = dev; > > event->reset_state = rets[0]; > > event->time_unavail = rets[2]; > > > > /* We may be called in an interrupt context */ > > spin_lock_irqsave(&eeh_eventlist_lock, flags); > ^^ > > list_add

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Benjamin Herrenschmidt
On Wed, 2005-03-02 at 12:22 -0600, Linas Vepstas wrote: > On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds was heard to remark: > > > > The new API is what _allows_ a driver to care. It doesn't handle DMA, but > > I think that's because nobody knows how to handle it (ie it's probably > > h

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Benjamin Herrenschmidt
> One issue with that is how to notify drivers that they need to make this > call. > In many cases, DMA completion will be signalled by an interrupt, but if the > DMA failed, that interrupt may never happen, which means the call to > pci_unmap or the above function from the interrupt handler m

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Benjamin Herrenschmidt
On Wed, 2005-03-02 at 14:02 -0600, Linas Vepstas wrote: > On Wed, Mar 02, 2005 at 09:27:27AM +1100, Benjamin Herrenschmidt was heard to > remark: > That's a style issue. Propose an API, I'll code it. We can have > the master recovery thread be a state machine, and so every device > driver gets

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Wed, Mar 02, 2005 at 09:27:27AM +1100, Benjamin Herrenschmidt was heard to remark: > On Tue, 2005-03-01 at 12:33 -0600, Linas Vepstas wrote: > > > The current proposal (and prototype) has a "master recovery thread" > > to handle the coordinated reset of the pci controller. This master > > rec

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Wed, Mar 02, 2005 at 10:41:06AM -0800, Jesse Barnes was heard to remark: > On Wednesday, March 2, 2005 10:22 am, Linas Vepstas wrote: > > On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds was heard to > remark: > > > The new API is what _allows_ a driver to care. It doesn't handle DMA, b

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Wed, Mar 02, 2005 at 03:13:05PM +0900, Hidetoshi Seto was heard to remark: [ .. iochk_clear() and iochk_read() ...] > And then, I don't think it need to have "pci" ... limitation of this > API's target. It would not be match if there are a recoverable device > over some PCI to XXX bridge, or i

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Jesse Barnes
On Wednesday, March 2, 2005 10:22 am, Linas Vepstas wrote: > On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds was heard to remark: > > The new API is what _allows_ a driver to care. It doesn't handle DMA, but > > I think that's because nobody knows how to handle it (ie it's probably > > hw

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds was heard to remark: > > The new API is what _allows_ a driver to care. It doesn't handle DMA, but > I think that's because nobody knows how to handle it (ie it's probably > hw-dependent and all existing implementations would thus be > drive

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread linux-os
On Wed, 2 Mar 2005, Linas Vepstas wrote: On Wed, Mar 02, 2005 at 11:28:01AM +0900, Hidetoshi Seto was heard to remark: Note that here is a difficulty: the MCA handler on some arch would run on special context - MCA environment. In other words, since some MCA handler [SNIPPED...] /** * queue up a pc

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-02 Thread Linas Vepstas
On Wed, Mar 02, 2005 at 11:28:01AM +0900, Hidetoshi Seto was heard to remark: > > Note that here is a difficulty: the MCA handler on some arch would run on > special context - MCA environment. In other words, since some MCA handler > would be called by non-maskable interrupt(e.g. NMI), so it's dif

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Hidetoshi Seto
Linas Vepstas wrote: >> I'd prefer to see it as ioerr_clear(), ioerr_read() ... > > I'd prefer pci_io_start() and pci_io_check_err() > > The names should have "pci" in them. > > I don't like "ioerr_clear" because it implies we are clearing the io error; we are not; we are clearing the checker for

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Hidetoshi Seto
Jesse Barnes wrote: This was my thought too last time we had this discussion. A completely asynchronous call is probably needed in addition to Hidetoshi's proposed API, since as you point out, the driver may not be running when an error occurs (e.g. in the case of a DMA error or more general bu

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Hidetoshi Seto
Matthew Wilcox wrote: I think what Jeff meant was "this new API handles none of this". And that's true, it doesn't handle DMA errors. But I think that's just something that hasn't been written/designed yet. Yes, this API just supports drivers wanting to be more RAS-aware. It would be happy if how

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
> In fact, I'd argue that even a driver that _uses_ the interface should not > necessarily shut itself down on error. Obviously, it should always log the > error, but outside of that it might be good if the operator can decide and > set a flag whether it should try to re-try (which may not always

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
On Tue, 2005-03-01 at 12:33 -0600, Linas Vepstas wrote: > The current proposal (and prototype) has a "master recovery thread" > to handle the coordinated reset of the pci controller. This master > recovery thread makes three calls in struct pci_driver: > >void (*frozen) (struct pci_dev *);

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
On Tue, 2005-03-01 at 18:19 +0100, Andi Kleen wrote: > Hidetoshi Seto <[EMAIL PROTECTED]> writes: > > > > > int sample_read_with_iochk(struct pci_dev *dev, u32 *buf, int words) > > { > > unsigned long ofs = pci_resource_start(dev, 0) + DATA_OFFSET; > > int i; > > > > /* Create magical

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
On Tue, 2005-03-01 at 09:10 -0800, Jesse Barnes wrote: > On Tuesday, March 1, 2005 8:59 am, Matthew Wilcox wrote: > > The MCA handler has to go and figure out what the hell just happened > > (was it a DIMM error, PCI bus error, etc). OK, fine, it finds that it > > was an error on PCI bus 73. At t

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
On Tue, 2005-03-01 at 08:49 -0800, Linus Torvalds wrote: > > On Tue, 1 Mar 2005, Jeff Garzik wrote: > > > > A new API handles none of this. > > Ehh? > > The new API is what _allows_ a driver to care. It doesn't handle DMA, but > I think that's because nobody knows how to handle it (ie it's pro

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Benjamin Herrenschmidt
> I have been thinking about PCI system and parity errors, and how to > handle them. I do not think this is the correct approach. > > A simple retry is... too simple. If you are having a massive problem on > your PCI bus, more action should be taken than a retry. It goes beyond that, see bel

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linus Torvalds
On Tue, 1 Mar 2005, Linas Vepstas wrote: > > > > - Additionally adds special token - abstract "iocookie" structure > > > to control/identifies/manage I/Os, by passing it to OS. > > > Actual type of "iocookie" could be arch-specific. Device drivers > > > could use the iocookie structure wit

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linas Vepstas
On Tue, Mar 01, 2005 at 02:42:11PM +, Matthew Wilcox was heard to remark: > On Tue, Mar 01, 2005 at 05:33:48PM +0900, Hidetoshi Seto wrote: > > Today's patch is 3rd one - iochk_clear/read() interface. > > - This also adds pair-interface, but not to sandwich only readX(). > > Depends on platfo

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linas Vepstas
On Tue, Mar 01, 2005 at 11:37:24AM -0500, Jeff Garzik was heard to remark: > > A new API handles none of this. Seto is proposing an API that solves a different problem than what you are thinking about. In my case, the hardware (pci controller) will shut down a pci slot(s) in the case of a pci err

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linas Vepstas
On Tue, Mar 01, 2005 at 10:08:48AM -0800, Linus Torvalds was heard to remark: > > On Tue, 1 Mar 2005, Andi Kleen wrote: > > > > But what would the default handling be? It would be nice if there > > was a simple way for a driver to say "just shut me down on an error" > > without adding iochk_* to

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Andi Kleen
On Tue, Mar 01, 2005 at 10:08:48AM -0800, Linus Torvalds wrote: > The thing is, IO errors just will be very architecture-dependent. Some > might have exceptions happening, without the exception handler really > having much of an idea of who caused it, unless that driver had prepared > it some wa

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linas Vepstas
On Tue, Mar 01, 2005 at 09:10:29AM -0800, Jesse Barnes was heard to remark: > On Tuesday, March 1, 2005 8:59 am, Matthew Wilcox wrote: > > The MCA handler has to go and figure out what the hell just happened > > (was it a DIMM error, PCI bus error, etc). I assume "MCA" stands for machine check a

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linus Torvalds
On Tue, 1 Mar 2005, Andi Kleen wrote: > > But what would the default handling be? It would be nice if there > was a simple way for a driver to say "just shut me down on an error" > without adding iochk_* to each function. Ideally this would be just > a standard callback that knows how to clean u

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Andi Kleen
Hidetoshi Seto <[EMAIL PROTECTED]> writes: > > int sample_read_with_iochk(struct pci_dev *dev, u32 *buf, int words) > { > unsigned long ofs = pci_resource_start(dev, 0) + DATA_OFFSET; > int i; > > /* Create magical cookie on the stack */ > iocookie cookie; > > /* Crit

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Jesse Barnes
On Tuesday, March 1, 2005 8:59 am, Matthew Wilcox wrote: > The MCA handler has to go and figure out what the hell just happened > (was it a DIMM error, PCI bus error, etc). OK, fine, it finds that it > was an error on PCI bus 73. At this point, I think the architecture > error handler needs to ca

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Matthew Wilcox
On Tue, Mar 01, 2005 at 08:49:45AM -0800, Linus Torvalds wrote: > On Tue, 1 Mar 2005, Jeff Garzik wrote: > > A new API handles none of this. > > Ehh? I think what Jeff meant was "this new API handles none of this". And that's true, it doesn't handle DMA errors. But I think that's just something

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Linus Torvalds
On Tue, 1 Mar 2005, Jeff Garzik wrote: > > A new API handles none of this. Ehh? The new API is what _allows_ a driver to care. It doesn't handle DMA, but I think that's because nobody knows how to handle it (ie it's probably hw-dependent and all existing implementations would thus be driver-s

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Jeff Garzik
Hidetoshi Seto wrote: Hi, long time no see :-) Currently, I/O error is not a leading cause of system failure. However, since Linux nowadays is making great progress on its scalability, and ever larger number of PCI devices are being connected to a single high-performance server, the risk of the I/O

Re: [PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Matthew Wilcox
On Tue, Mar 01, 2005 at 05:33:48PM +0900, Hidetoshi Seto wrote: > Today's patch is 3rd one - iochk_clear/read() interface. > - This also adds pair-interface, but not to sandwich only readX(). > Depends on platform, starting with ioreadX(), inX(), writeX() > if possible... and so on could be tar

[PATCH/RFC] I/O-check interface for driver's error handling

2005-03-01 Thread Hidetoshi Seto
Hi, long time no see :-) Currently, I/O error is not a leading cause of system failure. However, since Linux nowadays is making great progress on its scalability, and ever larger number of PCI devices are being connected to a single high-performance server, the risk of the I/O error is increasing d