Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>
> Print more information about an mce error, i.e. whether it is a hardware
> or software error.
>
> Some of the mce errors can be easily categorized as hardware or software
> errors, e.g. UEs are due to hardware error, whereas an error triggered by
> invalid usage of tlbie is a pure software bug. But not all mce errors can
> be easily categorized as either software or hardware. There are errors
> like multihit errors which are usually the result of a software bug, but in
> some rare cases a hardware failure can cause a multihit error. In the past,
> we have seen a case where, after replacing a faulty chip, multihit errors
> stopped occurring. The same goes for parity errors, which are usually due to
> faulty hardware, but there are cases where a multihit can also cause a
> parity error. For such errors it is difficult to determine what really
> caused them. Hence this patch classifies mce errors into the following four
> categories:
>       1. Hardware error:
>               UE and Link timeout failure errors.
>       2. Hardware error, small probability of software cause:
>               SLB/ERAT/TLB Parity errors.
>       3. Software error:
>               Invalid tlbie form.
>       4. Software error, small probability of hardware failure:
>               SLB/ERAT/TLB Multihit errors.

I like the idea, but I think the phrasing is a little confusing.

> Sample o/p:
>
> [ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 00007fff9a59dc60 DAR: 000001003d740320 [Recovered]
> [ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
> [ 1259.331345] MCE: CPU40: Software error, small probability of hardware failure

"Software error, small probability of hardware failure"

That can be read as "there has been a software error, *and now* there is
a small probability of a hardware failure".

I also think "probability" suggests we actually know the mathematical
probability of it being a hardware failure, which is not true.

Instead maybe we use:

        "Hardware error",
        "Probable hardware error (some chance of software cause)",
        "Software error",
        "Probable software error (some chance of hardware cause)",

??
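
To make the suggestion concrete, here is a rough, untested sketch of how
the four strings could be kept in one lookup table. The enum, the array and
the helper name are all made up for illustration and are not taken from the
patch; only the four strings come from the suggestion above, and the
"Unknown" fallback is my own assumption.

```c
#include <stddef.h>

/* Hypothetical error classes; these names are illustrative only. */
enum mce_err_class {
	MCE_CLASS_HARDWARE,
	MCE_CLASS_HARD_PROBABLE,
	MCE_CLASS_SOFTWARE,
	MCE_CLASS_SOFT_PROBABLE,
	MCE_CLASS_COUNT		/* number of classes, not a class itself */
};

/* One message per class, indexed by the enum above. */
static const char *const mce_class_strings[MCE_CLASS_COUNT] = {
	[MCE_CLASS_HARDWARE]      = "Hardware error",
	[MCE_CLASS_HARD_PROBABLE] =
		"Probable hardware error (some chance of software cause)",
	[MCE_CLASS_SOFTWARE]      = "Software error",
	[MCE_CLASS_SOFT_PROBABLE] =
		"Probable software error (some chance of hardware cause)",
};

/* Return the message for a class; "Unknown" is an assumed fallback. */
static const char *mce_class_str(enum mce_err_class c)
{
	if ((unsigned int)c >= MCE_CLASS_COUNT)
		return "Unknown";
	return mce_class_strings[c];
}
```

With something like that, the third line of the sample output would just
print mce_class_str() for whatever class the event was given, so the
wording lives in one place.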

cheers
