On Wed, May 21, 2014 at 4:05 PM, Luck, Tony <tony.l...@intel.com> wrote:
> On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
>> But if we get a new MCE in here, it will be an MCE from kernel context
>> and it's fatal. So, yes, we'll clobber the stack, but we'll never
>> return (unless tolerant is set to something insane), so who cares?
>
> Remember that machine checks are broadcast. So some other cpu
> can hit a recoverable machine check in user mode ... but that int#18
> goes everywhere. Other cpus are innocent bystanders ... they will
> see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
> in any of their machine check banks.
>
> But if we are still finishing off processing the previous machine check,
> this will be a nested one - and BOOM, we are dead.

Oh. Well, crap.

FWIW, this means that there really is a problem if one of these #MC
errors hits an innocent bystander who just happens to be handling an
NMI, at least if we delete the nested NMI code. But I think my
simplified proposal gets this right.

>
> -Tony
>
> [If you peer closely at the latest edition of the SDM - you'll see the
> bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
> MCG_STATUS .... but currently shipping silicon doesn't use that]

--
Andy Lutomirski
AMA Capital Management, LLC
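
For reference, the bystander condition Tony describes (RIPV=1, EIPV=0, nothing in the banks) and the nested case can be sketched as a pair of checks on an MCG_STATUS value. This is only an illustration, not the kernel's handler: the helper names and the `banks_with_errors` parameter are made up for this sketch, and the bit positions are the ones the SDM gives for IA32_MCG_STATUS.

```c
#include <stdint.h>

/* Bit positions in IA32_MCG_STATUS, per the Intel SDM. */
#define MCG_STATUS_RIPV   (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_EIPV   (1ULL << 1)  /* error IP valid */
#define MCG_STATUS_MCIP   (1ULL << 2)  /* machine check in progress */
#define MCG_STATUS_LMCE_S (1ULL << 3)  /* local MCE signaled (non-broadcast model) */

enum mce_role { MCE_BYSTANDER, MCE_INVOLVED };

/*
 * Hypothetical helper: a CPU that received the broadcast #MC but sees
 * RIPV=1, EIPV=0 and no errors logged in its banks is a bystander.
 * banks_with_errors is an assumed input: the count of banks on this
 * CPU that hold a valid error.
 */
static enum mce_role classify_cpu(uint64_t mcg_status, int banks_with_errors)
{
    int ripv = (mcg_status & MCG_STATUS_RIPV) != 0;
    int eipv = (mcg_status & MCG_STATUS_EIPV) != 0;

    if (ripv && !eipv && banks_with_errors == 0)
        return MCE_BYSTANDER;
    return MCE_INVOLVED;
}

/*
 * Hypothetical helper: MCIP stays set until the handler clears it.
 * A second #MC arriving in that window is the nested case from the
 * thread, which the hardware treats as fatal.
 */
static int second_mce_would_be_fatal(uint64_t mcg_status)
{
    return (mcg_status & MCG_STATUS_MCIP) != 0;
}
```

With an MCG_STATUS of 0x5 (RIPV and MCIP set) and no bank errors, `classify_cpu` reports a bystander, yet `second_mce_would_be_fatal` is still true: being a bystander does not protect a CPU from the nested case while it is finishing off the previous #MC.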