On Thu, Apr 18, 2019 at 5:26 PM Borislav Petkov <b...@alien8.de> wrote:
> Now, if any of that above still doesn't make it clear, please state what
> you're trying to achieve and I'll try to help.

Sorry for misleading you into believing that we don't even enable
CONFIG_X86_MCELOG_LEGACY. Here is what we have and
what we have tried:


2. We also have CONFIG_RAS=y and CONFIG_RAS_CEC=y

3. mcelog started as a daemon successfully, like before

4. Some real correctable memory errors happened, as logged in

5. mcelog couldn't receive any of them, reported 0 errors

6. Admins complained to us, as they believe this is a kernel bug

7. We dug into kernel source code and found out CONFIG_RAS
hijacks all these errors, by stopping there in the notification chain:

static int mce_first_notifier(struct notifier_block *nb, unsigned long val,
                              void *data)
{
        struct mce *m = (struct mce *)data;

        if (!m)
                return NOTIFY_DONE;

        if (cec_add_mce(m))
                return NOTIFY_STOP; // <=== Returns and stops here

        /* Emit the trace record: */
        trace_mce_record(m);

        set_bit(0, &mce_need_notify);

        mce_notify_irq(); // <=== This is where mcelog gets notified

        return NOTIFY_DONE;
}
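
To illustrate why returning NOTIFY_STOP from the first notifier starves
everything registered after it, here is a minimal userspace sketch of a
notifier chain walk. Only the NOTIFY_* values mirror include/linux/notifier.h;
call_chain(), the notifier_fn type and the three callbacks are hypothetical
stand-ins, not kernel code.

```c
#include <stddef.h>

/* These values mirror include/linux/notifier.h; the rest is a toy model. */
#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000
#define NOTIFY_STOP      (NOTIFY_OK | NOTIFY_STOP_MASK)

/* Hypothetical, simplified callback type (no notifier_block, no val). */
typedef int (*notifier_fn)(void *data);

/* Counts how often the notifier behind the first one actually runs. */
static int later_calls;

/* Stand-in for a first notifier whose cec_add_mce() claimed the error. */
static int first_notifier_consumes(void *data)
{
        (void)data;
        return NOTIFY_STOP;
}

/* Stand-in for a first notifier that did not claim the error. */
static int first_notifier_passes(void *data)
{
        (void)data;
        return NOTIFY_DONE;
}

/* Stand-in for any consumer registered later in the chain. */
static int later_notifier(void *data)
{
        (void)data;
        later_calls++;
        return NOTIFY_DONE;
}

/* Walk the chain in order; stop once a callback sets NOTIFY_STOP_MASK. */
static int call_chain(notifier_fn *chain, size_t n, void *data)
{
        int ret = NOTIFY_DONE;
        size_t i;

        for (i = 0; i < n; i++) {
                ret = chain[i](data);
                if (ret & NOTIFY_STOP_MASK)
                        break;
        }
        return ret;
}
```

With a chain of { first_notifier_consumes, later_notifier }, later_notifier
never runs; with first_notifier_passes in front, it runs once per event.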

8. I noticed rasdaemon, and tried to start it instead of mcelog.

9. I injected some memory errors and could successfully read them
via ras-mc-ctl.

To demonstrate what I think we should have, here is the PoC code
ONLY to show the idea (please don't judge it):

@@ -567,12 +567,12 @@ static int mce_first_notifier(struct notifier_block *nb, unsigned long val,
                              void *data)
        struct mce *m = (struct mce *)data;
+       bool consumed;

        if (!m)
                return NOTIFY_DONE;

-       if (cec_add_mce(m))
-               return NOTIFY_STOP;
+       consumed = cec_add_mce(m);

        /* Emit the trace record: */
@@ -581,7 +581,7 @@ static int mce_first_notifier(struct notifier_block *nb, unsigned long val,


-       return NOTIFY_DONE;
+       return consumed ? NOTIFY_STOP : NOTIFY_DONE;
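
The reordering in the PoC can also be modelled in userspace. The sketch
below is a toy under stated assumptions: struct toy_mce, toy_cec_add() and
the mcelog_wakeups counter are hypothetical stand-ins for struct mce,
cec_add_mce() and the set_bit() + mce_notify_irq() pair, and the real
cec_add_mce() is more selective about which errors it claims.

```c
#include <stdbool.h>
#include <stddef.h>

/* These values mirror include/linux/notifier.h. */
#define NOTIFY_DONE 0x0000
#define NOTIFY_STOP 0x8001

/* Hypothetical, stripped-down error record. */
struct toy_mce {
        bool correctable;
};

/* Stands in for set_bit(0, &mce_need_notify) + mce_notify_irq(). */
static int mcelog_wakeups;

/* Toy stand-in for cec_add_mce(): claims every correctable error. */
static bool toy_cec_add(const struct toy_mce *m)
{
        return m->correctable;
}

/* Current ordering: bail out before the mcelog wakeup. */
static int first_notifier_current(const struct toy_mce *m)
{
        if (!m)
                return NOTIFY_DONE;

        if (toy_cec_add(m))
                return NOTIFY_STOP;

        mcelog_wakeups++;
        return NOTIFY_DONE;
}

/* PoC ordering: remember the result, wake mcelog, then decide. */
static int first_notifier_poc(const struct toy_mce *m)
{
        bool consumed;

        if (!m)
                return NOTIFY_DONE;

        consumed = toy_cec_add(m);

        mcelog_wakeups++;
        return consumed ? NOTIFY_STOP : NOTIFY_DONE;
}
```

For a correctable error, both variants still return NOTIFY_STOP to the rest
of the chain, but only the PoC variant bumps mcelog_wakeups.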

With this change, although not even compiled, mcelog should still
receive correctable memory errors like before, even when we have

Does this make any sense to you?

