On Wednesday 16 May 2007 22:29, Tim Hockin wrote:
> From: Tim Hockin <[EMAIL PROTECTED]>
>
> Background:
>  The MCE handler has several paths that it can take, depending on various
>  conditions of the MCE status and the value of the 'tolerant' knob.  The
>  exact semantics are not well defined and the code is a bit twisty.
>
> Description:
>  This patch makes the MCE handler's behavior more clear by documenting the
>  behavior for various 'tolerant' levels.  It also fixes or enhances
>  several small things in the handler.  Specifically:
>      * If RIPV is set it is not safe to restart, so set the 'no way out'
>        flag rather than the 'kill it' flag.

Why? It is not PCC. We cannot return of course, but killing isn't returning.

>      * Don't panic() on correctable MCEs.

The idea behind this was that if you get an exception it is always a bit risky
because there are a few potential deadlocks that cannot be avoided.
And normally non UC is just polled which will never cause a panic.
So I don't quite see the value of this change.

> This patch also calls nonseekable_open() in mce_open (as suggested by akpm).

That should be a separate patch

> +             0: always panic on uncorrected errors, log corrected errors
> +             1: panic or SIGBUS on uncorrected errors, log corrected errors
> +             2: SIGBUS or log uncorrected errors, log corrected errors

Just saying SIGBUS is misleading because it isn't a catchable
signal.



> +
> +     /*
> +      * If the error seems to be unrecoverable, something should be
> +      * done.  Try to kill as little as possible.  If we can kill just
> +      * one task, do that.  If the user has set the tolerance very
> +      * high, don't try to do anything at all.
> +      */
> +     if (kill_it && tolerant < 3) {
>               int user_space = 0;
>
> -             if (m.mcgstatus & MCG_STATUS_RIPV)
> +             /*
> +              * If the EIPV bit is set, it means the saved IP is the
> +              * instruction which caused the MCE.
> +              */
> +             if (m.mcgstatus & MCG_STATUS_EIPV)
>                       user_space = panicm.rip && (panicm.cs & 3);
> -
> -             /* When the machine was in user space and the CPU didn't get
> -                confused it's normally not necessary to panic, unless you
> -                are paranoid (tolerant == 0)
> -
> -                RED-PEN could be more tolerant for MCEs in idle,
> -                but most likely they occur at boot anyways, where
> -                it is best to just halt the machine. */
> -             if ((!user_space && (panic_on_oops || tolerant < 2)) ||
> -                 (unsigned)current->pid <= 1)
> -                     mce_panic("Uncorrected machine check", &panicm, 
> mcestart);
> -
> -             /* do_exit takes an awful lot of locks and has as
> -                slight risk of deadlocking. If you don't want that
> -                don't set tolerant >= 2 */
> -             if (tolerant < 3)
> +
> +             /*
> +              * If we know that the error was in user space, send a
> +              * SIGBUS.  Otherwise, panic if tolerance is low.
> +              *
> +              * do_exit() takes an awful lot of locks and has a slight
> +              * risk of deadlocking.
> +              */
> +             if (user_space) {
>                       do_exit(SIGBUS);
> +             } else if (panic_on_oops || tolerant < 2) {
> +                     mce_panic("Uncorrected machine check",
> +                             &panicm, mcestart);
> +             }

Why did you remove the idle special case?


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to