Re: [PATCH RFC x86/mce] Make mce_timed_out() identify holdout CPUs

Paul E. McKenney Wed, 06 Jan 2021 11:18:50 -0800

On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate with other hardware-oriented error messages.  This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
> 
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered
> do_machine_check(),
> but some got hung up at various points beyond the entry point. My only theory
> was that they were trying to snoop caches from the dead core (or access some
> other resource held by the dead core) and so they hung too.
> 
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.


Understood.  The system is probably not in the best shape if this code
is ever executed, after all.  ;-)

So how about something like this?

        pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

Easy enough if so!
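
To make the bookkeeping concrete, here is a rough userspace sketch of the
idea from the commit log: record which CPUs have entered the handler, and
on timeout report the online CPUs that never showed up.  This is not the
patch itself.  The cpu_mask_t type, the 64-CPU limit, and all of the helper
names below are assumptions made up for illustration, and format_cpu_list()
only approximates the range-list output that the kernel's %*pbl format
produces for a real cpumask.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NR_CPUS 64	/* assumption: small fixed limit for this sketch */

/* Hypothetical stand-in for the kernel's cpumask_t: one bit per CPU. */
typedef uint64_t cpu_mask_t;

/* Build a mask with the first nr_online CPUs set. */
static cpu_mask_t cpus_online(unsigned int nr_online)
{
	if (nr_online >= NR_CPUS)
		return ~(cpu_mask_t)0;
	return ((cpu_mask_t)1 << nr_online) - 1;
}

/* Record that a CPU has entered the (simulated) exception handler. */
static void mark_entered(cpu_mask_t *entered, unsigned int cpu)
{
	*entered |= (cpu_mask_t)1 << cpu;
}

/* Holdouts are the online CPUs that never marked themselves as entered. */
static cpu_mask_t holdout_cpus(cpu_mask_t online, cpu_mask_t entered)
{
	return online & ~entered;
}

/*
 * Format the set bits as a range list such as "2-3,7".
 * Returns the number of characters written, stopping early if the
 * next range would not fit in the buffer.
 */
static size_t format_cpu_list(cpu_mask_t mask, char *buf, size_t len)
{
	size_t used = 0;
	unsigned int cpu = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	while (cpu < NR_CPUS) {
		unsigned int first, last;
		int n;

		if (!(mask & ((cpu_mask_t)1 << cpu))) {
			cpu++;
			continue;
		}

		/* Extend the range while the next CPU's bit is also set. */
		first = cpu;
		while (cpu + 1 < NR_CPUS && (mask & ((cpu_mask_t)1 << (cpu + 1))))
			cpu++;
		last = cpu++;

		if (first == last)
			n = snprintf(buf + used, len - used, "%s%u",
				     used ? "," : "", first);
		else
			n = snprintf(buf + used, len - used, "%s%u-%u",
				     used ? "," : "", first, last);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of room: stop here */
		used += (size_t)n;
	}
	return used;
}
```

With eight CPUs online and both threads of one core (say CPUs 2 and 3)
never calling mark_entered(), format_cpu_list() yields "2-3".  As Tony
notes above, in a real failure the list might also include CPUs that did
enter the handler but hung later, hence the "may include false positives"
wording.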

> If I get time in the next day or two, I'll run my old test against your code
> to see what happens.

Thank you very much in advance!

For my own testing, is this still the right thing to use?

        https://github.com/andikleen/mce-inject

                                                        Thanx, Paul
