On Fri, Jan 06, 2017 at 10:46:21AM +1100, Russell Currey wrote:
>On Fri, 2017-01-06 at 10:39 +1100, Gavin Shan wrote:
>> We give up recovery on permanent error, simply shutdown the affected
>> devices and remove them. If the devices can't be put into quiet state,
>> they spew more traffic that is likely to cause another unexpected EEH
>> error. This was observed on "p8dtu2u" machine:
>>
>>    0002:00:00.0 PCI bridge: IBM Device 03dc
>>    0002:01:00.0 Ethernet controller: Intel Corporation \
>>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>>    0002:01:00.1 Ethernet controller: Intel Corporation \
>>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>>    0002:01:00.2 Ethernet controller: Intel Corporation \
>>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>>    0002:01:00.3 Ethernet controller: Intel Corporation \
>>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>>
>> On P8 PowerNV platform, the IO path is frozen when shutdowning the
>> devices, meaning the memory registers are inaccessible. It is why
>> the devices can't be put into quiet state before removing them.
>> This fixes the issue by enabling IO path prior to putting the devices
>> into quiet state.
>>
>> Link: https://github.com/open-power/supermicro-openpower/issues/419
>
>FYI this link isn't publicly accessible.
>

Yeah, I know. I put the link there because the issue has more details
that you or I can refer to.

>> Reported-by: Pridhiviraj Paidipeddi <ppaid...@linux.vnet.ibm.com>
>> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
>> ---
>>  arch/powerpc/kernel/eeh.c | 10 +++++++++-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index 8180bfd..9de7f79 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -298,9 +298,17 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
>>  	 *
>>  	 * For pHyp, we have to enable IO for log retrieval. Otherwise,
>>  	 * 0xFF's is always returned from PCI config space.
>> +	 *
>> +	 * When the @severity is EEH_LOG_PERM, the PE is going to be
>> +	 * removed. Prior to that, the drivers for devices included in
>> +	 * the PE will be closed. The drivers rely on working IO path
>> +	 * to bring the devices to quiet state. Otherwise, PCI traffic
>> +	 * from those devices after they are removed is like to cause
>> +	 * another unexpected EEH error.
>>  	 */
>>  	if (!(pe->type & EEH_PE_PHB)) {
>> -		if (eeh_has_flag(EEH_ENABLE_IO_FOR_LOG))
>> +		if (eeh_has_flag(EEH_ENABLE_IO_FOR_LOG) ||
>> +		    severity == EEH_LOG_PERM)
>>  			eeh_pci_enable(pe, EEH_OPT_THAW_MMIO);
>>
>>  		/*
>
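
For anyone reading along, the condition changed by the hunk above can be
modelled in user space roughly as below. This is only a sketch under
assumptions: the constant values, struct eeh_pe_model, eeh_flags_model,
eeh_has_flag_model() and should_thaw_mmio() are stand-ins made up for
illustration and are not the kernel's definitions. Only the shape of the
test (non-PHB PE, and either the log flag or EEH_LOG_PERM) is taken from
the patch.

```c
#include <stdbool.h>

/* Placeholder values for illustration only; the kernel's own
 * definitions may differ. */
#define EEH_PE_PHB             (1 << 1)
#define EEH_LOG_TEMP           1
#define EEH_LOG_PERM           2
#define EEH_ENABLE_IO_FOR_LOG  0x20

/* Hypothetical, stripped-down stand-in for struct eeh_pe. */
struct eeh_pe_model {
	int type;
};

/* Hypothetical stand-in for the global EEH flag word. */
static int eeh_flags_model;

static bool eeh_has_flag_model(int flag)
{
	return (eeh_flags_model & flag) != 0;
}

/*
 * Mirrors the patched condition: for a non-PHB PE, MMIO is thawed
 * either when the platform needs IO enabled for log retrieval, or
 * when the error is permanent and the drivers still need a working
 * IO path to quiesce their devices before removal.
 */
static bool should_thaw_mmio(const struct eeh_pe_model *pe, int severity)
{
	if (pe->type & EEH_PE_PHB)
		return false;

	return eeh_has_flag_model(EEH_ENABLE_IO_FOR_LOG) ||
	       severity == EEH_LOG_PERM;
}
```

With the flag clear, a device PE is thawed for EEH_LOG_PERM but not for
EEH_LOG_TEMP, and a PHB PE is never thawed by this path, which matches
the behaviour the commit message describes.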