On Tue, Oct 13, 2015 at 01:48:54PM +1100, Daniel Axtens wrote: >Gavin Shan <gws...@linux.vnet.ibm.com> writes: > >> Danienl, The issue is tracked by IBM's bugzilla 127612 reported from Nvida >> private GPU drivers. I tried to find the source code from upstream kernel, >> but failed. > >OK. So I've read the internal bug, and I'm going to do my best to summarise >without including confidential info. > > 1) A PHB with 2 devices is fenced via error injection. > > 2) The error_detected() callback is run on both devices. One returns > CAN_RECOVER, the other returns NONE. > >We then fall through to partial-hotplug handling. (BTW this isn't >documented in Documentation/PCI/pci-error-recovery.txt, so at some point >this should be fixed!) >
No hotplug is triggered when EEH core receives CAN_RECOVER. It seems the bug brought confusion instead of helping to explain the situation as I intended to. I was intended to say: there has driver which implements part of the EEH callbacks to collect diag-data. >Partial hotplug is detected by the presence of an err_handler, not by >storing the result of error_detected. Would it be better to store the >result from eeh_report_error in the eeh_dev structure, rather than by >looking at more elements of the err_handler structure? > I don't see the benefit to do that. In eeh_report_error(), the specific error handlers still need to be checked and the result (from the check) is temporary, and not worthy to store that in eeh_dev. The current code looks good. >More generally, drivers using error_detect and then returning NONE as a >way to get data and then not participate in EEH is a hack, and it's not >surprising it's breaking in odd ways, especially with partial hotplug. > I think you're talking about the situation reported from the bug. It's CAN_RECOVER instead of NONE returned from error_detected(). With the CAN_RECOVER, the driver hopes the EEH core to enable the IO path so that it can collect diag-data from IO space at late point. >Partial hotplug is pretty hacky to begin with, and a driver being able >to opt out of EEH selectively is a useful feature, so we probably want >to redesign the state machine to handle them both better. That would be >a long term project. > Thanks, Gavin >>>> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com> >>>> --- >>>> arch/powerpc/kernel/eeh_driver.c | 5 ++++- >>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/arch/powerpc/kernel/eeh_driver.c >>>> b/arch/powerpc/kernel/eeh_driver.c >>>> index 3a626ed..32178a4 100644 >>>> --- a/arch/powerpc/kernel/eeh_driver.c >>>> +++ b/arch/powerpc/kernel/eeh_driver.c >>>> @@ -416,7 +416,10 @@ static void *eeh_rmv_device(void *data, void >>>> *userdata) >>>> driver = eeh_pcid_get(dev); >>>> if (driver) { >>>> eeh_pcid_put(dev); >>>> - if (driver->err_handler) >>>> + if (driver->err_handler && >>>> + driver->err_handler->error_detected && >>>> + driver->err_handler->slot_reset && >>>> + driver->err_handler->resume) >>>> return NULL; >>>> } >>>> >>>> -- >>>> 2.1.0 >>>> >>>> _______________________________________________ >>>> Linuxppc-dev mailing list >>>> Linuxppc-dev@lists.ozlabs.org >>>> https://lists.ozlabs.org/listinfo/linuxppc-dev _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev