On Mon, May 19, 2014 at 05:59:23PM +0000, Luck, Tony wrote:
> -       atomic_inc(&mce_entry);
> -
>
> I have used this in the past (in conjunction with an external debugger) to
> diagnose problems (not all cpus showing up in the machine check handler).
>
> But I suppose these can also be diagnosed from the "Timeout synchronizing ..."
> message from mce_timed_out() [though with a bit less precision ... we know
> that some cpus didn't show up, but we don't have a count of how many did,
> or how many are missing.
>
> If we print the value of "mce_callin" somewhere in mce_timed_out() ...
> then I think we'd have equivalent functionality (in fact better - because
> we don't need the external debugger to peek at mce_entry).

Right, I was thinking about it and this is something maybe you guys
should decide: do we want to panic by default in mce_timed_out if some
cores didn't show up? I'm looking at this snippet:

        /* CHECKME: Make panic default for 1 too? */
        if (mca_cfg.tolerant < 1)
                mce_panic("Timeout synchronizing machine check over CPUs",
                          NULL, NULL);

and since we have .tolerant=1 by default...

I mean, does the machine even recover after some of the cores have gone
into the weeds in #MC? Provided, of course, we don't have a no-way-out
MCE and we can resume execution. Or is the box so hammered that there's
no turning back?

Concerning mce_entry, I don't care all that much - if it is really
useful, you might slap a comment saying so and keep it, for all I care.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
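
For illustration only, here is a minimal userspace sketch of the idea
discussed in this thread: report how many CPUs called in when the
rendezvous times out, next to the tolerant check from the snippet above.
This is not the kernel code. callin_count, cpu_call_in(),
format_timeout_report() and should_panic_on_timeout() are hypothetical
names, and C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Userspace stand-in for the mce_callin counter (hypothetical name). */
static atomic_int callin_count;

/* Hypothetical: each simulated CPU bumps the counter on handler entry. */
static void cpu_call_in(void)
{
        atomic_fetch_add(&callin_count, 1);
}

/*
 * Hypothetical: build the timeout message with the callin count included,
 * so no external debugger is needed to see how many CPUs showed up.
 * Returns the number of CPUs that did not call in.
 */
static int format_timeout_report(char *buf, size_t len, int expected_cpus)
{
        assert(buf != NULL && len > 0);

        int arrived = atomic_load(&callin_count);
        int missing = expected_cpus - arrived;

        snprintf(buf, len,
                 "Timeout synchronizing machine check over CPUs: "
                 "%d of %d called in, %d missing",
                 arrived, expected_cpus, missing);
        return missing;
}

/* Mirrors the check in the quoted snippet: panic only when tolerant < 1. */
static int should_panic_on_timeout(int tolerant)
{
        return tolerant < 1;
}
```

With tolerant set to 1, should_panic_on_timeout() returns 0, which is
exactly the case the CHECKME comment asks about: the timeout would be
reported but would not panic.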