Gavin Shan <gws...@linux.vnet.ibm.com> writes:

> Daniel, the issue is tracked in IBM's bugzilla 127612, reported against
> Nvidia's private GPU drivers. I tried to find the source code in the
> upstream kernel, but failed.

OK. So I've read the internal bug, and I'm going to do my best to summarise
without including confidential info.

 1) A PHB with 2 devices is fenced via error injection.

 2) The error_detected() callback is run on both devices. One returns
    CAN_RECOVER, the other returns NONE.

We then fall through to partial-hotplug handling. (BTW partial hotplug
isn't documented in Documentation/PCI/pci-error-recovery.txt, so at some
point that document should be updated!)

Which devices partial hotplug leaves in place is decided by whether the
driver has an err_handler, not by what error_detected() returned. Would it
be better to store the result from eeh_report_error in the eeh_dev
structure, rather than looking at more fields of the err_handler structure?
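
To sketch what storing the result could look like: this is an untested
userspace model, not kernel code. struct model_eeh_dev, its "reported"
field and all the helper names are made up for illustration, and treating
a NONE result as "hotplug this device" is my assumption about the policy
we'd want.

```c
#include <stdbool.h>

/* Simplified stand-ins for two of the kernel's pci_ers_result values. */
enum ers_result {
	ERS_RESULT_NONE,        /* driver opts out of recovery */
	ERS_RESULT_CAN_RECOVER, /* driver can recover without a reset */
};

/*
 * Hypothetical cut-down model of struct eeh_dev. "reported" is the
 * proposed new field; it does not exist in the real structure.
 */
struct model_eeh_dev {
	bool has_err_handler;     /* driver->err_handler != NULL */
	enum ers_result reported; /* what error_detected() returned */
};

/* Stand-in for the bookkeeping eeh_report_error() would have to do. */
static inline void model_report_error(struct model_eeh_dev *edev,
				      enum ers_result rc)
{
	edev->reported = rc;
}

/* Current policy: any err_handler keeps the device in place. */
static inline bool keep_by_handler(const struct model_eeh_dev *edev)
{
	return edev->has_err_handler;
}

/* Sketched policy: also require that the driver did not answer NONE. */
static inline bool keep_by_result(const struct model_eeh_dev *edev)
{
	return edev->has_err_handler && edev->reported != ERS_RESULT_NONE;
}
```

With the two devices from the bug, keep_by_handler() keeps both, while
keep_by_result() keeps only the one that returned CAN_RECOVER.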

More generally, a driver implementing error_detected() and then returning
NONE, as a way to get data without participating in EEH, is a hack, and
it's not surprising it breaks in odd ways, especially with partial hotplug.
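
For anyone who hasn't seen the pattern, it boils down to something like
the following. Again this is a userspace model with made-up names, not
code from any real driver.

```c
/* Simplified stand-ins for two of the kernel's pci_ers_result values. */
enum model_ers_result {
	MODEL_ERS_NONE,        /* driver opts out of recovery */
	MODEL_ERS_CAN_RECOVER, /* driver can recover without a reset */
};

/* Hypothetical per-device driver state. */
struct model_drv_state {
	unsigned int errors_seen;
};

/*
 * Hypothetical error_detected()-style callback: it records that an
 * error happened, then returns NONE so that the driver takes no
 * further part in recovery.
 */
static inline enum model_ers_result
model_error_detected(struct model_drv_state *st)
{
	st->errors_seen++;
	return MODEL_ERS_NONE;
}
```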

Partial hotplug is pretty hacky to begin with, and a driver being able
to opt out of EEH selectively is a useful feature, so we probably want
to redesign the state machine to handle them both better. That would be
a long-term project.

Regards,
Daniel

> Thanks,
> Gavin
>
>>>
>>> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>>> ---
>>>  arch/powerpc/kernel/eeh_driver.c | 5 ++++-
>>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>>> index 3a626ed..32178a4 100644
>>> --- a/arch/powerpc/kernel/eeh_driver.c
>>> +++ b/arch/powerpc/kernel/eeh_driver.c
>>> @@ -416,7 +416,10 @@ static void *eeh_rmv_device(void *data, void *userdata)
>>>     driver = eeh_pcid_get(dev);
>>>     if (driver) {
>>>             eeh_pcid_put(dev);
>>> -           if (driver->err_handler)
>>> +           if (driver->err_handler &&
>>> +               driver->err_handler->error_detected &&
>>> +               driver->err_handler->slot_reset &&
>>> +               driver->err_handler->resume)
>>>                     return NULL;
>>>     }
>>>  
>>> -- 
>>> 2.1.0
>>>
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev
