On Fri, 2017-01-06 at 10:39 +1100, Gavin Shan wrote:
> We give up recovery on permanent error, simply shutdown the affected
> devices and remove them. If the devices can't be put into quiet state,
> they spew more traffic that is likely to cause another unexpected EEH
> error. This was observed on "p8dtu2u" machine:
>
>    0002:00:00.0 PCI bridge: IBM Device 03dc
>    0002:01:00.0 Ethernet controller: Intel Corporation \
>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>    0002:01:00.1 Ethernet controller: Intel Corporation \
>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>    0002:01:00.2 Ethernet controller: Intel Corporation \
>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>    0002:01:00.3 Ethernet controller: Intel Corporation \
>                 Ethernet Controller X710/X557-AT 10GBASE-T (rev 02)
>
> On P8 PowerNV platform, the IO path is frozen when shutdowning the
> devices, meaning the memory registers are inaccessible. It is why
> the devices can't be put into quiet state before removing them.
> This fixes the issue by enabling IO path prior to putting the devices
> into quiet state.
>
> Link: https://github.com/open-power/supermicro-openpower/issues/419

FYI this link isn't publicly accessible.

> Reported-by: Pridhiviraj Paidipeddi <ppaid...@linux.vnet.ibm.com>
> Signed-off-by: Gavin Shan <gws...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/eeh.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 8180bfd..9de7f79 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -298,9 +298,17 @@ void eeh_slot_error_detail(struct eeh_pe *pe, int severity)
>           *
>           * For pHyp, we have to enable IO for log retrieval. Otherwise,
>           * 0xFF's is always returned from PCI config space.
> +         *
> +         * When the @severity is EEH_LOG_PERM, the PE is going to be
> +         * removed. Prior to that, the drivers for devices included in
> +         * the PE will be closed. The drivers rely on working IO path
> +         * to bring the devices to quiet state. Otherwise, PCI traffic
> +         * from those devices after they are removed is like to cause
> +         * another unexpected EEH error.
>           */
>          if (!(pe->type & EEH_PE_PHB)) {
> -                if (eeh_has_flag(EEH_ENABLE_IO_FOR_LOG))
> +                if (eeh_has_flag(EEH_ENABLE_IO_FOR_LOG) ||
> +                    severity == EEH_LOG_PERM)
>                          eeh_pci_enable(pe, EEH_OPT_THAW_MMIO);
>
>                  /*