>> No information besides that it is a machine check. This happens in two cases:
>> 1) The CPU logs the error with the MCi_STATUS.EN bit set to zero, and Linux
>>    ignores EN=0 entries (as it should).

> Well, I guess we shouldn't anymore. Apparently hw forgets to set the
> bit when raising an MCE so then we should ignore it too in mce-severity
> and delete that piece or grade it as higher severity based on, I dunno,
> b0rked hardware family/model/stepping or whatever bit we set...
>
>        MCESEV(
>                NO, "Not enabled",
>                BITCLR(MCI_STATUS_EN)
>                ),

The SDM has this to say about EN=0 (in section 15.10.4.1 of volume 3B):

   When the EN flag is zero but the VAL and UC flags are one in
   the IA32_MCi_STATUS register, the reported uncorrected error
   in this bank is not enabled. As uncorrected errors with the
   EN flag = 0 are not the source of machine check exceptions,
   the MCE handler should log and clear non-enabled errors when
   the S bit is set and should continue searching for enabled
   errors from the other IA32_MCi_STATUS registers. Note that
   when IA32_MCG_CAP [24] is 0, any uncorrected error condition
   (VAL =1 and UC=1) including the one with the EN flag cleared
   are fatal and the handler must signal the operating system to
   reset the system. For the errors that do not generate machine
   check exceptions, the EN flag has no meaning.

Note the "should log and clear".  We just clear ... just need to shuffle
some code in mce.c to add the logging.
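
Roughly, the per-bank decision that SDM paragraph describes could be
sketched as below. This is untested illustration, not the code in mce.c:
classify_bank() and enum bank_action are made-up names, and the EN=0/S=0
case is my assumption since the quoted text doesn't cover it. The bit
positions are the architectural ones (VAL=63, UC=61, EN=60, S=56, and
MCG_CAP bit 24 for software error recovery support).

```c
#include <stdint.h>

/* Architectural bit positions in IA32_MCi_STATUS and IA32_MCG_CAP. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register contents valid */
#define MCI_STATUS_UC  (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN  (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled */
#define MCG_SER_P      (1ULL << 24)  /* IA32_MCG_CAP[24] */

/* Hypothetical result type, for illustration only. */
enum bank_action {
	BANK_SKIP,           /* nothing to do, look at the next bank */
	BANK_LOG_AND_CLEAR,  /* log it, clear it, keep searching */
	BANK_FATAL,          /* must reset the system */
	BANK_GRADE           /* hand to the normal severity grading */
};

static enum bank_action classify_bank(uint64_t mcg_cap, uint64_t status)
{
	if (!(status & MCI_STATUS_VAL))
		return BANK_SKIP;

	/* Errors that don't raise an MCE: EN has no meaning. */
	if (!(status & MCI_STATUS_UC))
		return BANK_GRADE;

	/* MCG_CAP[24] == 0: any VAL=1, UC=1 error is fatal, even with EN=0. */
	if (!(mcg_cap & MCG_SER_P))
		return BANK_FATAL;

	if (status & MCI_STATUS_EN)
		return BANK_GRADE;

	/* EN=0, UC=1: not the source of the MCE. Log and clear if S is set. */
	if (status & MCI_STATUS_S)
		return BANK_LOG_AND_CLEAR;

	/* Assumption: the quoted SDM text is silent on EN=0 with S=0. */
	return BANK_SKIP;
}
```

The point is only that the EN=0 case ends in "log and clear, keep
looking" rather than "silently clear".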

But we still need something like Rui's patch - calling mcelog() doesn't
ensure that we see something on the console about possible cause of the
problem.

-Tony
