Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>
> Print more information about an mce error: whether it is a hardware or
> software error.
>
> Some of the mce errors can be easily categorized as hardware or software
> errors, e.g. UEs are due to hardware error, whereas an error triggered by
> invalid usage of tlbie is a pure software bug. But not all the mce errors
> can be easily categorized as either software or hardware. There are errors
> like multihit errors which are usually the result of a software bug, but in
> some rare cases a hardware failure can cause a multihit error. In the past,
> we have seen cases where, after replacing a faulty chip, multihit errors
> stopped occurring. Same with parity errors, which are usually due to faulty
> hardware, but there are chances where a multihit can also cause a parity
> error. For such errors it is difficult to determine the real cause. Hence
> this patch classifies mce errors into the following four categories:
> 1. Hardware error:
>    UE and Link timeout failure errors.
> 2. Hardware error, small probability of software cause:
>    SLB/ERAT/TLB Parity errors.
> 3. Software error:
>    Invalid tlbie form.
> 4. Software error, small probability of hardware failure:
>    SLB/ERAT/TLB Multihit errors.

I like the idea, but I think the phrasing is a little confusing.

> Sample o/p:
>
> [ 1259.331319] MCE: CPU40: (Warning) Guest SLB Multihit at 00007fff9a59dc60
> DAR: 000001003d740320 [Recovered]
> [ 1259.331324] MCE: CPU40: PID: 24051 Comm: qemu-system-ppc
> [ 1259.331345] MCE: CPU40: Software error, small probability of hardware
> failure

"Software error, small probability of hardware failure"

That can be read as "there has been a software error, *and now* there is
a small probability of a hardware failure".

I also think "probability" suggests we actually know the mathematical
probability of it being a hardware failure, which is not true.

Instead maybe we use:

  "Hardware error",
  "Probable hardware error (some chance of software cause)",
  "Software error",
  "Probable software error (some chance of hardware cause)",

??

cheers
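
To make the suggestion concrete, the four strings could live in a small
lookup table indexed by an error class. This is only a rough sketch and not
taken from the patch: the enum, the table and the helper name below are all
hypothetical, chosen purely for illustration.

```c
/*
 * Hypothetical error classes, one per category in the patch description.
 * None of these identifiers come from the actual patch.
 */
enum mce_error_class {
	MCE_ECLASS_HARDWARE,		/* UE, link timeout */
	MCE_ECLASS_PROBABLE_HARDWARE,	/* SLB/ERAT/TLB parity */
	MCE_ECLASS_SOFTWARE,		/* invalid tlbie form */
	MCE_ECLASS_PROBABLE_SOFTWARE,	/* SLB/ERAT/TLB multihit */
	MCE_ECLASS_COUNT
};

/* Strings as proposed above, indexed by class. */
static const char *const mce_error_class_str[MCE_ECLASS_COUNT] = {
	[MCE_ECLASS_HARDWARE] =
		"Hardware error",
	[MCE_ECLASS_PROBABLE_HARDWARE] =
		"Probable hardware error (some chance of software cause)",
	[MCE_ECLASS_SOFTWARE] =
		"Software error",
	[MCE_ECLASS_PROBABLE_SOFTWARE] =
		"Probable software error (some chance of hardware cause)",
};

/* Return the string for a class, or "Unknown" for out-of-range values. */
static const char *mce_error_class_to_str(enum mce_error_class c)
{
	if ((unsigned int)c >= MCE_ECLASS_COUNT)
		return "Unknown";
	return mce_error_class_str[c];
}
```

The third line of the sample output would then print whatever
mce_error_class_to_str() returns for the event's class, so the wording
only has to be changed in one place.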