On Fri, May 7, 2021 at 3:43 AM Mahesh Salgaonkar <mah...@linux.ibm.com> wrote:
>
> When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> state as temporarily unavailable. In this case, per PAPR, rtas call
> ibm,read-slot-reset-state2 returns a PE state as temporarily unavailable(5)
> and OS has to wait until that recovery is complete. During this state the
> slot presence check 'get-sensor-state(dr-entity-sense)' returns as DR
> connector empty which leads to assumption that the device has been
> hot-removed. This results into no EEH recovery on this device and it stays
> in failed state forever.
>
> This patch fixes this issue by skipping slot presence check only if device
> PE state is temporarily unavailable(5).
>
> Signed-off-by: Mahesh Salgaonkar <mah...@linux.ibm.com>
> ---
> * snip*
>
>         /*
>          * It should be corner case that the parent PE has been
> diff --git a/arch/powerpc/kernel/eeh_driver.c 
> b/arch/powerpc/kernel/eeh_driver.c
> index 3eff6a4888e79..a0913768f33de 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -851,6 +851,17 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>                 return;
>         }
>
> +       /*
> +        * When PE's state is temporarily unavailable, the slot
> +        * presence check returns as DR connector empty.

That sounds like a bug in either RTAS or the hotplug slot driver (or
both). The presence check is there largely to filter out events that
we can guarantee are not recoverable (i.e. surprise hot-unplug). In
every other case (especially if we can't determine the state) we
should be going down the recovery path. If the hotplug slot driver is
incorrectly reporting the card has been removed then you should be
fixing the slot driver.

> +        * to assumption that the device is hot-removed and causes EEH
> +        * recovery to stop leaving the device in failed state forever.
> +        * Hence skip the slot presence check if PE's state is
> +        * temporarily unavailable and go down EEH recovery path.
> +        */
> +       if (pe->state & EEH_PE_TEMP_UNAVAIL)
> +               goto skip_slot_presence_check;

There's a time-of-check-vs-time-of-use error here. You're setting this
flag at the point of detection, but there can be a significant lag
time between when an EEH is initially detected and when it's handled
by the recovery thread (usually due to other events being recovered).
Transitioning the PE into and out of PE_TEMP_UNAVAILABLE is handled
autonomously by the hypervisor so by the time we get to recovery the
PE may be back into a normal state where the slot check works fine.

Reply via email to