Printing of machine check severity

Michael Ellerman Mon, 04 Mar 2019 22:58:16 -0800

Hi all,

RE: https://github.com/linuxppc/issues/issues/230


> Host dmesg throws lot of below SLB [Multihit] HMI's
> 
> [295216.837358] Severe Machine check interrupt [Recovered]
> [295216.837365] Harmless Hypervisor Maintenance interrupt [Recovered]
> [295216.837374]   Guest NIP: c00000000024a7dc
> [295216.837378]  Error detail: Processor Recovery done
> [295216.837381]         HMER: 2040000000000000
> [295216.837388]   Initiator: CPU
> [295216.837406]   Error type: SLB [Multihit]
> [295216.837415]     Effective address: d00000000316c400


Paul points out that these aren't severe errors from the host's point of
view, and possibly not even for the guest.

I think the key problem here is that we print "Severe" for most types of
MCEs, even though some really aren't.

That comes from the severity being set to `MCE_SEV_ERROR_SYNC` in the
ierror/derror tables.

All the enum values are prefixed with `MCE_SEV_`, so the value is
actually `ERROR_SYNC`, which I think means "synchronous error". That is
correct. But I don't think it's correct that all synchronous errors are
"severe".

We also have some errors in `mce_ierror_table` that are marked
`MCE_SEV_FATAL` and then have a comment saying `/* ASYNC is fatal */`.

So I feel like we have severity and sync/async conflated in the severity
value, i.e. we should split out sync/async into its own field and then
have a separate severity field.

We need to be careful because a few places check for `MCE_SEV_ERROR_SYNC`;
it's not *only* used for the severity string.
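
A rough sketch of what that split could look like, written as plain
userspace C rather than kernel code. Every name here (`sketch_severity`,
`sketch_mce_entry`, `sketch_table`, `sketch_is_sync`) is made up for
illustration and is not the kernel's actual definition; the table entries
are placeholders, not real error classifications:

```c
#include <stdbool.h>

/* Hypothetical severity scale: says only how bad the error is. */
enum sketch_severity {
	SKETCH_SEV_NO_ERROR,
	SKETCH_SEV_WARNING,
	SKETCH_SEV_SEVERE,
	SKETCH_SEV_FATAL,
};

/* Hypothetical table entry: severity and sync/async are separate fields. */
struct sketch_mce_entry {
	const char *name;
	enum sketch_severity severity;	/* how bad it is */
	bool sync;			/* true if the error is synchronous */
};

/* Placeholder entries, only to show the two fields varying independently. */
static const struct sketch_mce_entry sketch_table[] = {
	{ "example synchronous error",  SKETCH_SEV_SEVERE, true  },
	{ "example asynchronous error", SKETCH_SEV_FATAL,  false },
};

/*
 * Callers that used to compare the severity against a "sync" value would
 * ask this question instead, so changing an entry's severity cannot
 * silently change their behaviour.
 */
static bool sketch_is_sync(const struct sketch_mce_entry *e)
{
	return e->sync;
}
```

The point of the helper is that the existing checks keep working off the
sync flag, while the severity field is free to change per error type.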

We could then mark e.g. SLB multi-hits as a warning rather than severe.

Additionally we probably want to use the `in_guest` flag to modulate the
severity or the message, or both.
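
One possible shape for that, again as a userspace sketch. The names
(`sketch_sev`, `sketch_sev_string`) and the policy itself (downgrade a
non-fatal guest error by one step when the host prints it) are
assumptions for the sake of discussion, not existing kernel behaviour:

```c
#include <stdbool.h>

/* Hypothetical severity scale, independent of sync/async. */
enum sketch_sev {
	SKETCH_WARNING,
	SKETCH_SEVERE,
	SKETCH_FATAL,
};

/*
 * Pick the severity word for the log line.
 * Assumed policy: a "severe" error that happened in a guest is only a
 * warning from the host's point of view; fatal errors are never downgraded.
 */
static const char *sketch_sev_string(enum sketch_sev sev, bool in_guest)
{
	if (in_guest && sev == SKETCH_SEVERE)
		sev = SKETCH_WARNING;

	switch (sev) {
	case SKETCH_WARNING:
		return "Warning";
	case SKETCH_SEVERE:
		return "Severe";
	case SKETCH_FATAL:
		return "Fatal";
	}
	return "Unknown";
}
```

Whether the downgrade should apply to the stored severity or only to the
printed string is exactly the open question; this sketch only changes
the string.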

cheers
