RE: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-11 Thread Luck, Tony
> If you know that it is in the nvdimm range, you can grade the error with > lower severity... Grading the severity isn't the main issue. > Or do you mean that without the exception table we'll return back to the > insn causing the error and loop indefinitely this way? Yes. We need to NOT return

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-11 Thread Borislav Petkov
On Wed, Nov 11, 2015 at 01:48:04PM -0800, Luck, Tony wrote: > No flag. We can search MCi_ADDR across the ranges to see whether this > was a normal RAM error or non-volatile. But that doesn't make this patch > moot. We still need to change the return address to go to the fixup code > instead of back

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-11 Thread Luck, Tony
On Wed, Nov 11, 2015 at 09:41:58PM +0100, Borislav Petkov wrote: > On Tue, Nov 10, 2015 at 01:55:46PM -0800, Luck, Tony wrote: > > I need to add more to the motivation part of this. The people who want > > this are playing with NVDIMMs as storage. So think of many GBytes of > > non-volatile memory

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-11 Thread Borislav Petkov
On Tue, Nov 10, 2015 at 01:55:46PM -0800, Luck, Tony wrote: > I need to add more to the motivation part of this. The people who want > this are playing with NVDIMMs as storage. So think of many GBytes of > non-volatile memory on the source end of the memcpy(). People are used > to disk errors just

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-10 Thread Luck, Tony
On Tue, Nov 10, 2015 at 12:21:01PM +0100, Borislav Petkov wrote: > Just a general, why-do-we-do-this, question: on big systems, the memory > occupied by the kernel is a very small percentage compared to whole RAM, > right? And yet we want to recover from there too? Not, say, kexec... I need to add

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-10 Thread Borislav Petkov
On Mon, Nov 09, 2015 at 10:26:08AM -0800, Tony Luck wrote: > This is a first draft to show the direction I'm taking to > make it possible for the kernel to recover from machine > checks taken while kernel code is executing. Just a general, why-do-we-do-this, question: on big systems, the memory oc

Re: [RFC PATCH 0/3] Machine check recovery when kernel accesses poison

2015-11-09 Thread Tony Luck
On Mon, Nov 9, 2015 at 10:26 AM, Tony Luck wrote: > This is a first draft to show the direction I'm taking to > make it possible for the kernel to recover from machine > checks taken while kernel code is executing. Simple test case to show it actually works. You need a Xeon E7 class system and t