Michael Neuling <mi...@neuling.org> writes:
> On an unrecoverable HMI or MCE only generate an checkstop (via
> PLATFORM ERROR opal reboot call) when panic_on_oops is set.
>
> We currently generate an checkstop as an attempt for the FSP to grab a
> dump and then reboot us. Unfortunately this never works and no one

Never? WT#.

> I've talked to has ever seen a resulting dump, let alone got useful
> information from it.
>
> Even worse, the checkstop gets in the way of debugging real
> problems. If we hit a software bug that results in this, we get no
> opportunity to debug it live. Similarly if the bug is due to hardware
> that is not in the dump (say PCI or NVLINK GPU), we get no information
> in the dump about that hardware.
>
> So let's remove it unless someone sets panic_on_oops.

Nick just rewrote pnv_platform_error_reboot(), so please talk to him to
make sure you're not stepping on each other.

> diff --git a/arch/powerpc/platforms/powernv/opal-hmi.c
> b/arch/powerpc/platforms/powernv/opal-hmi.c
> index c9e1a4ff29..23780970d0 100644
> --- a/arch/powerpc/platforms/powernv/opal-hmi.c
> +++ b/arch/powerpc/platforms/powernv/opal-hmi.c
> @@ -284,6 +285,11 @@ static void hmi_event_handler(struct work_struct *work)
>  			print_hmi_event_info(hmi_evt);
>  		}
>
> +		if (!panic_on_oops) {
> +			die("Unrecoverable HMI exception", NULL, SIGBUS);
> +			return;

I don't think we should return. Otherwise we risk persisting corrupt
data to disk and so on.

If we're getting unrecoverable HMI/MCEs that are not actually indicative
of something bad happening then we need to filter those out somewhere.

cheers