> Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> account for that, no?
>
> And if those are offlined, they're very very unlikely to trigger an MCE
> as they're idle and not executing code.

Let's step back a few feet and look at the big picture.  There are three
main classes of machine check that we might see while trying to run
kdump - and remember that all machine checks are currently broadcast, so
all cpus, whether online or offline, will see them.

1) Fatal
We have to crash - lose the dump.  Having a new machine check handler
will make it a bit easier to see what happened, because we won't have any
"synchronization failed" messages from the offline cpus.

2) Execution path recoverable (SRAR in SDM parlance).
Also going to be fatal (kdump is all running in ring 0, and we can't
recover from errors in ring 0).  Cleaner messages as above.  Potentially
in the future we might be able to make the kdump machine check handler
actually recover by just skipping a page - if the location of the error
was in the old kernel image.

3) Non-execution path recoverable (SRAO in SDM)
We ought to be able to keep kdump running if this happens - the "AO"
stands for "action optional", so we are going to choose to not take an
action.  Wherever the error was, it won't affect correctness of
execution of the current context.
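
As a rough sketch of how the three cases above map onto what kdump can
do (the enum and function names here are made up for illustration; this
is not the kernel's actual severity code or an existing API):

```c
#include <stdbool.h>

/* Hypothetical labels for the three classes discussed above. */
enum mce_class {
	MCE_CLASS_FATAL,	/* 1) fatal */
	MCE_CLASS_SRAR,		/* 2) execution path recoverable */
	MCE_CLASS_SRAO,		/* 3) non-execution path recoverable */
};

/*
 * Returns true if kdump can keep running after a machine check of the
 * given class.  Only SRAO qualifies: the action is optional, so we
 * simply take none.  SRAR is treated as fatal because kdump runs
 * entirely in ring 0.
 */
static bool kdump_can_continue(enum mce_class c)
{
	switch (c) {
	case MCE_CLASS_SRAO:
		return true;
	case MCE_CLASS_FATAL:
	case MCE_CLASS_SRAR:
	default:
		return false;
	}
}
```

If the "skip a page" recovery for SRAR ever happens, that case would
need the error address as an extra input rather than a flat "false".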

-Tony
