On Wed, Jan 06, 2021 at 06:39:30PM +0000, Luck, Tony wrote:
> > The "Timeout: Not all CPUs entered broadcast exception handler" message
> > will appear from time to time given enough systems, but this message does
> > not identify which CPUs failed to enter the broadcast exception handler.
> > This information would be valuable if available, for example, in order to
> > correlate with other hardware-oriented error messages.  This commit
> > therefore maintains a cpumask_t of CPUs that have entered this handler,
> > and prints out which ones failed to enter in the event of a timeout.
>
> I tried doing this a while back, but found that in my test case where I forced
> an error that would cause both threads from one core to be "missing", the
> output was highly unpredictable. Some random number of extra CPUs were
> reported as missing. After I added some extra breadcrumbs it became clear
> that pretty much all the CPUs (except the missing pair) entered
> do_machine_check(), but some got hung up at various points beyond the
> entry point. My only theory was that they were trying to snoop caches from
> the dead core (or access some other resource held by the dead core) and so
> they hung too.
>
> Your code is much neater than mine ... and perhaps works in other cases, but
> maybe the message needs to allow for the fact that some of the cores that
> are reported missing may just be collateral damage from the initial problem.
Understood.  The system is probably not in the best shape if this code
is ever executed, after all.  ;-)

So how about like this?

	pr_info("%s: MCE holdout CPUs (may include false positives): %*pbl\n",

Easy enough if so!

> If I get time in the next day or two, I'll run my old test against your code
> to see what happens.

Thank you very much in advance!

For my own testing, is this still the right thing to use?

	https://github.com/andikleen/mce-inject

							Thanx, Paul
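
For anyone following along, here is a rough userspace sketch of the
bookkeeping discussed above.  It is not the patch itself: a 64-bit atomic
word stands in for the kernel's cpumask_t, and the names mark_entered(),
holdout_mask(), and format_cpu_list() are hypothetical helpers invented
for illustration.  The formatter only approximates the range-list output
that %*pbl produces in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define NCPUS_MAX 64	/* assumption: one 64-bit word covers all CPUs */

/* One bit per CPU; userspace stand-in for a kernel cpumask_t. */
static _Atomic uint64_t entered_mask;

/* Each CPU would call this on entry to the (hypothetical) handler. */
static void mark_entered(unsigned int cpu)
{
	assert(cpu < NCPUS_MAX);
	atomic_fetch_or(&entered_mask, UINT64_C(1) << cpu);
}

/* Holdouts are online CPUs whose bit was never set. */
static uint64_t holdout_mask(uint64_t online)
{
	return online & ~atomic_load(&entered_mask);
}

/*
 * Format a mask as a range list such as "2-3,7".  Returns the number of
 * characters written; an entry that does not fit is dropped whole.
 */
static size_t format_cpu_list(uint64_t mask, char *buf, size_t len)
{
	size_t off = 0;
	unsigned int cpu = 0;

	assert(len > 0);
	buf[0] = '\0';
	while (cpu < NCPUS_MAX) {
		unsigned int start;
		int n;

		if (!(mask & (UINT64_C(1) << cpu))) {
			cpu++;
			continue;
		}
		/* Extend the run while the next CPU's bit is also set. */
		start = cpu;
		while (cpu + 1 < NCPUS_MAX && (mask & (UINT64_C(1) << (cpu + 1))))
			cpu++;
		if (start == cpu)
			n = snprintf(buf + off, len - off, "%s%u",
				     off ? "," : "", start);
		else
			n = snprintf(buf + off, len - off, "%s%u-%u",
				     off ? "," : "", start, cpu);
		if (n < 0 || (size_t)n >= len - off) {
			buf[off] = '\0';	/* discard the partial entry */
			break;
		}
		off += (size_t)n;
		cpu++;
	}
	return off;
}
```

With eight CPUs online and only 0, 1, 4, 5, and 6 having called
mark_entered(), holdout_mask(0xff) leaves bits 2, 3, and 7 set.  As noted
above, a list built this way identifies only CPUs that never reached the
entry point, so anything printed from it is best read as a hint.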