On Thu, 2017-04-13 at 13:31 +0200, Borislav Petkov wrote:
> On Thu, Apr 13, 2017 at 12:29:25AM +0200, Borislav Petkov wrote:
> > On Wed, Apr 12, 2017 at 03:26:19PM -0700, Luck, Tony wrote:
> > > We can futz with that and have them specify which chain (or both)
> > > that they want to be added to.
> >
> > Well, I didn't want the atomic chain to be a notifier because we can
> > keep it simple and non-blocking. Only the process context one will
> > be.
> >
> > So the question is, do we even have a use case for outside consumers
> > hanging on the atomic chain? Because if not, we're good to go.
>
> Ok, new day, new patch.
>
> Below is what we could do: we don't call the notifier at all on the
> atomic path but only print the MCEs. We do log them and if the machine
> survives, we process them accordingly. This is only a fix for upstream
> so that the current issue at hand is addressed.
>
> For later, we'd need to split the paths in:
>
>     critical_print_mce()
>
> or somesuch which immediately dumps the MCE to dmesg, and
>
>     mce_log()
>
> which does the slow path of logging MCEs and calling the blocking
> notifier.
>
> Now, I'd want to have decoding of the MCE on the critical path too so
> I have to think about how to do that nicely. Maybe move the decoding
> bits which are the same between Intel and AMD in mce.c and have some
> vendor-specific, fast calls. We'll see. Btw, this is something Ingo has
> been mentioning for a while.
>
> Anyway, here's just the urgent fix for now.
>
> Thanks.
>
> ---
> From: Vishal Verma <vishal.l.ve...@intel.com>
> Date: Tue, 11 Apr 2017 16:44:57 -0600
> Subject: [PATCH] x86/mce: Make the MCE notifier a blocking one
>
> The NFIT MCE handler callback (for handling media errors on NVDIMMs)
> takes a mutex to add the location of a memory error to a list. But since
> the notifier call chain for machine checks (x86_mce_decoder_chain) is
> atomic, we get a lockdep splat like:
>
>   BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
>   in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0
>   [..]
>   Call Trace:
>     dump_stack
>     ___might_sleep
>     __might_sleep
>     mutex_lock_nested
>     ? __lock_acquire
>     nfit_handle_mce
>     notifier_call_chain
>     atomic_notifier_call_chain
>     ? atomic_notifier_call_chain
>     mce_gen_pool_process
>
> Convert the notifier to a blocking one which gets to run only in process
> context.
>
> Boris: remove the notifier call in atomic context in print_mce(). For
> now, let's print the MCE on the atomic path so that we can make sure it
> goes out. We still log it for process context later.
>
> Reported-by: Ross Zwisler <ross.zwis...@linux.intel.com>
> Signed-off-by: Vishal Verma <vishal.l.ve...@intel.com>
> Cc: Tony Luck <tony.l...@intel.com>
> Cc: Dan Williams <dan.j.willi...@intel.com>
> Cc: linux-edac <linux-e...@vger.kernel.org>
> Cc: x86-ml <x...@kernel.org>
> Cc: <sta...@vger.kernel.org>
> Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.com
> Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error")
> Signed-off-by: Borislav Petkov <b...@suse.de>
> ---
>  arch/x86/kernel/cpu/mcheck/mce-genpool.c  |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |  2 +-
>  arch/x86/kernel/cpu/mcheck/mce.c          | 18 ++++--------------
>  3 files changed, 6 insertions(+), 16 deletions(-)
>

I noticed this patch was picked up in tip (ras/urgent), but I didn't see
a pull request for 4.11. Was that intentional, or will it just go in
for 4.12?

-Vishal