Michael Neuling <mi...@neuling.org> writes:

> On an unrecoverable HMI or MCE only generate a checkstop (via
> PLATFORM ERROR opal reboot call) when panic_on_oops is set.
>
> We currently generate a checkstop as an attempt for the FSP to grab a
> dump and then reboot us. Unfortunately this never works and no one

Never? WT#.

> I've talked to has ever seen a resulting dump, let alone got useful
> information from it.
>
> Even worse, the checkstop gets in the way of debugging real
> problems. If we hit a software bug that results in this, we get no
> opportunity to debug it live. Similarly if the bug is due to hardware
> that is not in the dump (say PCI or NVLINK GPU), we get no information
> in the dump about that hardware.
>
> So let's remove it unless someone sets panic_on_oops.

Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
make sure you're not stepping on each other.

> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c b/arch/powerpc/platforms/powernv/opal-hmi.c
> index c9e1a4ff29..23780970d0 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
>                       print_hmi_event_info(hmi_evt);
>               }
>  
> +             if (!panic_on_oops) {
> +                     die("Unrecoverable HMI exception", NULL, SIGBUS);
> +                     return;

I don't think we should return.

Otherwise we risk persisting corrupt data to disk and so on.
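
To make the control-flow concern concrete, here is a minimal userspace sketch of one possible reading of "don't return": report the error through die() when panic_on_oops is clear, but still fall through to the platform error reboot rather than carrying on. Every model_* name below is a hypothetical stand-in, not a kernel function, and the fall-through to the reboot request is an assumption about the intended fix, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Userspace stand-ins only; none of these are the kernel's real symbols. */
static bool model_panic_on_oops;  /* models the panic_on_oops setting */
static int model_die_calls;       /* how many times the die() stand-in ran */
static int model_reboot_calls;    /* how many times the reboot stand-in ran */

/* Stand-in for die(): logs the message and counts the call. */
static void model_die(const char *msg)
{
	assert(msg != NULL);
	model_die_calls++;
	fprintf(stderr, "die: %s\n", msg);
}

/* Stand-in for the PLATFORM ERROR reboot request (the checkstop path). */
static void model_platform_error_reboot(void)
{
	model_reboot_calls++;
}

/*
 * Assumed flow: with panic_on_oops clear, report through the die()
 * stand-in first, but do not return early afterwards.
 */
static void model_unrecoverable_hmi(void)
{
	if (!model_panic_on_oops)
		model_die("Unrecoverable HMI exception");

	/* No early return: the reboot request is always reached. */
	model_platform_error_reboot();
}
```

In this model, calling model_unrecoverable_hmi() reaches the reboot stand-in whether or not model_panic_on_oops is set, so nothing keeps running after the error is reported. Whether the real handler should checkstop, panic, or do something else at that point is left open here.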

If we're getting unrecoverable HMI/MCEs that are not actually indicative
of something bad happening then we need to filter those out somewhere.

cheers
