On Tue, Sep 12, 2017 at 3:03 PM, Nicholas Piggin <npig...@gmail.com> wrote:
> Hi Balbir,
>
> Very cool. How are you testing it? Is it failing memory pages
> and poisoning them out properly?
>

Yep, I tested it and it seems to work correctly so far. I am testing this
on a simulator with injected MCE UE errors for both the data and
instruction side.

> Looks like you have a printk in the machine_check_early path,
> which you shouldn't. I guess because we don't mark that context
> as an NMI. Which we could... but I think you want to put as
> little as possible in that path, so avoiding the print would
> be preferable. Perhaps you could mark the mce event somehow that
> the failure can be reported during processing it?
>

Good point. I did see that printk handles this case via
printk_nmi_enter/exit, but it's best avoided. Will spin v2.

> Firmware logging is a good question, I could not really see
> where this all gets plumbed through. If this is expected to be
> a common problem for some types of attached memory, then we
> really need to build up a log of these errors that can be used
> to exclude the memory after a reboot too. Do we have anything
> like this capability in firmware?

It's to be built. We should log these errors to NVRAM and revisit the log
at every boot to isolate the affected pages.

Balbir Singh.
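
For illustration, the "mark the event and report it during processing"
idea from the thread can be sketched in user-space C. This is only a
minimal sketch under stated assumptions, not the code from the patch:
struct ue_event, early_record() and process_pending() are hypothetical
names, and a fixed-size array stands in for whatever queue the real MCE
path uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 8

/* Hypothetical event record; NOT the kernel's machine check event type. */
struct ue_event {
    uint64_t paddr;      /* physical address that took the UE */
    bool poison_failed;  /* set in the early path, reported later */
};

static struct ue_event pending[MAX_PENDING];
static int npending;

/*
 * Early path: only record the outcome, never print.
 * Returns false if the queue is full and the event was dropped.
 */
static bool early_record(uint64_t paddr, bool poison_failed)
{
    if (npending >= MAX_PENDING)
        return false;
    pending[npending].paddr = paddr;
    pending[npending].poison_failed = poison_failed;
    npending++;
    return true;
}

/*
 * Later path: a context where printing is safe.
 * Reports each marked failure, empties the queue, and returns
 * the number of failures reported.
 */
static int process_pending(void)
{
    int failed = 0;

    assert(npending >= 0 && npending <= MAX_PENDING);
    for (int i = 0; i < npending; i++) {
        if (pending[i].poison_failed) {
            printf("UE at 0x%llx: page could not be poisoned\n",
                   (unsigned long long)pending[i].paddr);
            failed++;
        }
    }
    npending = 0;
    return failed;
}
```

In this sketch the early handler would call early_record() and return,
and the message is emitted only when process_pending() runs later.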