On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff <sbobr...@linux.ibm.com> wrote:
>
> On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
> > Detecting a frozen EEH PE usually occurs when an MMIO load returns a 0xFFs
> > response. When performing EEH testing using the EEH error injection feature
> > available on some platforms there is no simple way to kick off the kernel's
> > recovery process since any accesses from userspace (usually /dev/mem) will
> > bypass the MMIO helpers in the kernel which check if a 0xFF response is due
> > to an EEH freeze or not.
> >
> > If a device contains a 0xFF byte in its config space it's possible to
> > trigger the recovery process via config space read from userspace, but this
> > is not a reliable method. If a driver is bound to the device and in use it
> > will frequently trigger the MMIO check, but this is also inconsistent.
> >
> > To solve these problems this patch adds a debugfs file called
> > "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs
> > eeh_dev_check_failure() on it. This is the same check that's done when the
> > kernel gets a 0xFF result from a config or MMIO read with the added
> > benefit that it can be reliably triggered from userspace.
> >
> > Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
>
> Looks good, and I tested it with the next patch and it seems to work.
>
> But I think you should make it clear that this does not work with
> the hardware "EEH error injection" facility accessible via debugfs in
> err_injct (that doesn't seem clear to me from the commit message).

It's not intended to be a separate mechanism in the long term. I'm planning on converting this interface to make use of the platform-defined error injection mechanism once I can work out how to use the PAPR ones reliably. The idea is to use this as a generic "cause an EEH to happen on this device" interface for userspace, which we can use in test scripts and the like.
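
For illustration, a test script could drive the file roughly as sketched below. This is only a sketch under assumptions: the debugfs path (/sys/kernel/debug/powerpc/eeh_dev_check) is assumed, since the thread only gives the file name, and the helper names format_bdf() and request_eeh_check() are hypothetical. It builds the <domain>:<bus>:<dev>.<fn> string and writes it to the file, which needs root and a mounted debugfs.

```python
import re
from pathlib import Path

# Assumed location of the debugfs file; the thread only names it "eeh_dev_check".
EEH_DEV_CHECK = Path("/sys/kernel/debug/powerpc/eeh_dev_check")

# <domain>:<bus>:<dev>.<fn>, all fields in hex, as in lspci -D output.
_BDF_RE = re.compile(
    r"^[0-9a-fA-F]{1,4}:[0-9a-fA-F]{1,2}:[0-9a-fA-F]{1,2}\.[0-7]$"
)


def format_bdf(domain, bus, dev, fn):
    """Build a '<domain>:<bus>:<dev>.<fn>' string from numeric fields."""
    # Ranges assumed here: 16-bit domain, 8-bit bus, 5-bit device, 3-bit function.
    if not (0 <= domain <= 0xFFFF and 0 <= bus <= 0xFF
            and 0 <= dev <= 0x1F and 0 <= fn <= 7):
        raise ValueError("PCI address field out of range")
    return f"{domain:04x}:{bus:02x}:{dev:02x}.{fn:x}"


def request_eeh_check(bdf, path=EEH_DEV_CHECK):
    """Write a PCI address to the eeh_dev_check file (hypothetical helper)."""
    if not _BDF_RE.match(bdf):
        raise ValueError(f"not a <domain>:<bus>:<dev>.<fn> string: {bdf!r}")
    with open(path, "w") as f:
        f.write(bdf)
```

A script would then call something like request_eeh_check(format_bdf(0, 1, 0, 0)) after injecting the error; the path argument is there so the helper can be pointed at a scratch file when testing the script itself.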