Gavin Shan <gws...@linux.vnet.ibm.com> writes:

> Daniel, the issue is tracked by IBM's bugzilla 127612, reported from Nvidia
> private GPU drivers. I tried to find the source code in the upstream kernel,
> but failed.
OK. So I've read the internal bug, and I'm going to do my best to summarise it without including confidential info.

1) A PHB with 2 devices is fenced via error injection.
2) The error_detected() callback is run on both devices. One returns CAN_RECOVER, the other returns NONE. We then fall through to partial-hotplug handling. (BTW this isn't documented in Documentation/PCI/pci-error-recovery.txt, so at some point this should be fixed!)

Partial hotplug is detected by the presence of an err_handler, not by storing the result of error_detected(). Would it be better to store the result from eeh_report_error() in the eeh_dev structure, rather than looking at more elements of the err_handler structure?

More generally, a driver implementing error_detected() and then returning NONE, as a way to get data without participating in EEH, is a hack. It's not surprising that it breaks in odd ways, especially with partial hotplug. Partial hotplug is pretty hacky to begin with, and a driver being able to opt out of EEH selectively is a useful feature, so we probably want to redesign the state machine to handle them both better. That would be a long-term project.
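To make the comparison concrete, here is a rough userspace sketch of the three ways the "keep this device or unplug it" decision could be made: the current check, the check in the patch, and the stored-result idea above. Everything prefixed mock_ or MOCK_ is a made-up stand-in, not the real kernel types, and the `reported` field is hypothetical; it is not something eeh_dev has today. I am also assuming the opting-out driver supplies only error_detected(), which I can't confirm from public sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for pci_ers_result_t; values are illustrative only. */
enum mock_ers_result { MOCK_ERS_NONE, MOCK_ERS_CAN_RECOVER, MOCK_ERS_RECOVERED };

struct mock_dev;

/* Stand-in for the three callbacks the patch checks for. */
struct mock_err_handler {
	enum mock_ers_result (*error_detected)(struct mock_dev *dev);
	enum mock_ers_result (*slot_reset)(struct mock_dev *dev);
	void (*resume)(struct mock_dev *dev);
};

struct mock_dev {
	const struct mock_err_handler *err_handler; /* NULL: no EEH support */
	enum mock_ers_result reported; /* hypothetical: result saved at report time */
};

/* Report phase: call error_detected() and remember what the driver said. */
static enum mock_ers_result mock_report_error(struct mock_dev *dev)
{
	assert(dev != NULL);
	dev->reported = MOCK_ERS_NONE;
	if (dev->err_handler && dev->err_handler->error_detected)
		dev->reported = dev->err_handler->error_detected(dev);
	return dev->reported;
}

/* Current behaviour: any err_handler at all keeps the device in place. */
static bool keep_if_handler_present(const struct mock_dev *dev)
{
	return dev->err_handler != NULL;
}

/* The patch: keep the device only if all three callbacks are provided. */
static bool keep_if_handler_complete(const struct mock_dev *dev)
{
	const struct mock_err_handler *h = dev->err_handler;

	return h && h->error_detected && h->slot_reset && h->resume;
}

/* Stored-result idea: keep the device only if its driver did not opt out. */
static bool keep_if_driver_participated(const struct mock_dev *dev)
{
	return dev->err_handler != NULL && dev->reported != MOCK_ERS_NONE;
}

/* Two toy drivers: one takes part in recovery, one opts out with NONE. */
static enum mock_ers_result detect_can_recover(struct mock_dev *dev)
{
	(void)dev;
	return MOCK_ERS_CAN_RECOVER;
}

static enum mock_ers_result detect_opt_out(struct mock_dev *dev)
{
	(void)dev;
	return MOCK_ERS_NONE;
}

static enum mock_ers_result reset_ok(struct mock_dev *dev)
{
	(void)dev;
	return MOCK_ERS_RECOVERED;
}

static void resume_noop(struct mock_dev *dev)
{
	(void)dev;
}

static const struct mock_err_handler full_handler = {
	.error_detected = detect_can_recover,
	.slot_reset = reset_ok,
	.resume = resume_noop,
};

/* Assumption: the opting-out driver fills in error_detected() only. */
static const struct mock_err_handler opt_out_handler = {
	.error_detected = detect_opt_out,
};
```

With one device on full_handler and one on opt_out_handler, the current check keeps both devices, while the other two checks unplug the one that opted out. The patch and the stored-result check only differ for a driver that fills in all three callbacks and still returns NONE: the patch would keep that device, the stored result would not.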
Regards,
Daniel

> Thanks,
> Gavin
>
>>>
>>> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/kernel/eeh_driver.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/eeh_driver.c
>>> b/arch/powerpc/kernel/eeh_driver.c
>>> index 3a626ed..32178a4 100644
>>> --- a/arch/powerpc/kernel/eeh_driver.c
>>> +++ b/arch/powerpc/kernel/eeh_driver.c
>>> @@ -416,7 +416,10 @@ static void *eeh_rmv_device(void *data, void *userdata)
>>>          driver = eeh_pcid_get(dev);
>>>          if (driver) {
>>>                  eeh_pcid_put(dev);
>>> -                if (driver->err_handler)
>>> +                if (driver->err_handler &&
>>> +                    driver->err_handler->error_detected &&
>>> +                    driver->err_handler->slot_reset &&
>>> +                    driver->err_handler->resume)
>>>                          return NULL;
>>>          }
>>>
>>> --
>>> 2.1.0
>>>
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev