+ if (!xchg(&reboot_notifier_registered, true))
+ register_reboot_notifier(&cmci_reboot_notifier);
This is super-safe ... but isn't the xchg() overkill? I thought we serialized
bringup of other cpus.
Has this "do it once" caught on elsewhere in the kernel ... I suppose it is
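For readers unfamiliar with the "do it once" idiom in the hunk above, here is a minimal user-space sketch using C11 atomics in place of the kernel's xchg(). The names fake_register_notifier(), register_once() and registration_count() are hypothetical stand-ins for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Static storage is zero-initialized, so the flag starts out false. */
static atomic_bool notifier_registered;

/* Counts how many times the stand-in registration actually ran. */
static int register_calls;

/* Hypothetical stand-in for register_reboot_notifier(). */
static void fake_register_notifier(void)
{
    register_calls++;
}

/*
 * atomic_exchange() stores true and returns the previous value, so only
 * the first caller observes false and performs the registration.
 */
static void register_once(void)
{
    if (!atomic_exchange(&notifier_registered, true))
        fake_register_notifier();
}

/* Accessor used to check how many registrations happened. */
static int registration_count(void)
{
    return register_calls;
}
```

If callers are already serialized, as the reply suggests may be the case for CPU bringup, a plain boolean test would behave the same; the atomic exchange only matters when two callers can race.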
if (!(flags & MCP_UC) &&
- (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC)))
+ (m.status & (mca_cfg.ser ? MCI_STATUS_S : MCI_STATUS_UC))) {
+ spin_unlock_irqrestore(&mce_banks[i].poll_spinlock,
+
+ if (!(no_way_out && cfg->tolerant < 3))
mce_clear_state(toclear);
Style - I think this is easier to grok:
if (!no_way_out || cfg->tolerant >= 3)
mce_clear_state(toclear);
but not too strongly if others like the !(a && b) form.
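The two spellings are equivalent by De Morgan's law. A small sketch that checks this over every combination of no_way_out and tolerant levels 0 to 3 (the helper names are made up for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Form used in the patch: !(a && b). */
static bool clear_patch_form(bool no_way_out, int tolerant)
{
    return !(no_way_out && tolerant < 3);
}

/* Form suggested in the review: !a || !b, with !(tolerant < 3) as >= 3. */
static bool clear_review_form(bool no_way_out, int tolerant)
{
    return !no_way_out || tolerant >= 3;
}

/* Returns the number of inputs on which the two forms disagree. */
static int count_mismatches(void)
{
    int mismatches = 0;

    for (int nwo = 0; nwo <= 1; nwo++)
        for (int tolerant = 0; tolerant <= 3; tolerant++)
            if (clear_patch_form(nwo, tolerant) !=
                clear_review_form(nwo, tolerant))
                mismatches++;

    return mismatches;
}
```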
I'm never sure how to trea
+ /* Ensure a CMCI interrupt can't preempt this. */
+ local_irq_save(flags);
if (mce_available(__this_cpu_ptr(&cpu_info))) {
machine_check_poll(MCP_TIMESTAMP,
&__get_cpu_var(mce_poll_banks));
Does this remove the problem that you
> to run with tolerant=3, but I kind of understood the logic to be that
> if we're going to keep running, we need to clear the banks, and if
> we're going to crash, we need to leave them intact
That makes sense ... fold some text like that into the commit
description, and this part is:
Acked-by:
> I don't think we got the description right here. I think the real
> issue here was machine check polls happening on multiple CPUs with
> shared banks, all reporting the same MCEs. This is very reproducible
> when booting with mce=no_cmci, since all CPUs will handle all banks,
> and there's AFAICT
> Wait, do we really need a drivers/ras directory for one single driver?
> Why not put it in drivers/misc/ instead? A whole subdir at the top of
> drivers seems overkill and odd.
>
> As it's a memory driver, what about drivers/firmware/ or drivers/edac/
> or drivers/platform?
It isn't really a fi
> Currently, all the stuff we're doing is x86-only so arch/x86/ras/ might
> be a good place too, if other arches wanna do their own thing or if x86
> RAS facilities turn out to be PITA to make arch-independent.
Oops - not quite. We moved creation of the EDAC trace point into drivers/ras/ras.c
an
> Are there any objections against this series from the x86 and ia64
> maintainers?
I'll believe the claim of no functional changes for ia64 ... so no objections
from me.
-Tony
> + select ACPI_LEGACY_TABLES_LOOKUP if ACPI
> This shouldn't actually be set on IA64, should it? IA64 doesn't have
> BIOS, either, it has EFI/UEFI, like ARM64...
Which ACPI tables are in the "LEGACY" category affected by this option?
-Tony
> Hohum, __raw_spin_lock_irqsave does preempt_disable(). And
> machine_check_poll should be running in irq context so why would the
> original issue happen?
>
>> kernel: [7.341085] BUG: using __this_cpu_write() in preemptible
>> [] code: modprobe/546
>
> Unfortunately, I have only one
The following changes since commit b098d6726bbfb94c06d6e1097466187afddae61f:
Linux 3.14-rc8 (2014-03-24 19:31:17 -0700)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
tags/please-pull-cmci-storm
for you to fetch changes up to 27f6c573e0f77
> Looks good to me, thank you.
> Reviewed-by: Naoya Horiguchi
Thanks for your time reviewing this
> and I think this is worth going into stable trees.
Good point. I should dig in the git history and make one of those
fancy "Fixes: sha1 title" tags too.
-Tony
--
To unsubscribe from this list:
>> mce_regin, which is only called by monarch CPU, can be used for system
>> panics as quickly as possible if there is a truly data corrupting error.
>> But Monarch CPU don't have to help all other CPU to clean mces_clean.
>> One advantage of Per-CPU is the isolation of errors propagation, being
>>
> But sending signals from #MC context is definitely a bad idea. I think
> we had addressed this with irq_work at some point but my memory is very
> hazy.
We added code for recoverable errors to get out of the MC context
before trying to lookup the page and send the signal. Bottom of
do_machine_c
>> That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in
>> mce_notify_process().
>
> Why is this necessary?
The recovery path has to do more than just send a signal - it needs to walk
processes and "mm"s to see which have mapped the physical address that the
h/w told us has
go
>> The recovery path has to do more than just send a signal - it needs to walk
>> processes and "mm"s to see which have mapped the physical address that the
>> h/w told us has gone bad.
>
> I still feel like I'm missing something. If we interrupted user space
> code, then the context we're in
On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
> But if we get a new MCE in here, it will be an MCE from kernel context
> and it's fatal. So, yes, we'll clobber the stack, but we'll never
> return (unless tolerant is set to something insane), so who cares?
Remember that machine c
> FWIW, this means that there really is a problem if one of these #MC
> errors hits an innocent bystander who just happens to be handling an
> NMI, at least if we delete the nested NMI code. But I think my
> simplified proposal gets this right.
Yes. Bystander broadcast machine checks can and will
> MCE is frankly misdesigned. It's a piece of shit, and any of the
> hardware designers that claim that what they do is for system
> stability are out to lunch. This is a prime example of what *NOT* to
> do, and how you can actually spread what was potentially a localized
> and recoverable error, a
>> So I think we can reduce it to just the one rwsem (with recursion) if we
>> shoot CPU_POST_DEAD in the head.
>
> Here's the first bullet. Stressing my box here with Steve's hotplug
> script seems to work fine.
>
> Tony, any objections?
what was this comment referring to:
/* intentionally i
>> When we suspend a laptop we offline all but one processor. But
>> the mce code registers on a notify chain so it can clean up
>> some sysfs entries. Part of that code calls device_unregister()
>> which will fire kobject_uevent() which might wake up some user
>> code that is watching for such thi
>> I think the comment is still not explaining the big part of what the
>> discussion was about -- i.e. if it was in kernel context, we always
>> panic.
>
> I thought the pointer to mce_severity was enough? People should open an
> editor and look at the function and at its gory insanity. :-P
It is
> And this tolerant check looks fishy to me:
>
> if (s->sev >= MCE_UC_SEVERITY && ctx == IN_KERNEL) {
> 	if (panic_on_oops || tolerant < 1)
> 		return MCE_PANIC_SEVERITY;
> }
>
> since we set it to 1 by default. But I'
> A possible alternative would be to soft-offline the page. This is
> currently done in APEI code when corrected memory error thresholds are
> exceeded and reported by UEFI via a generic hardware error source
> (GHES).
+1
This is what the existing mcelog(8) daemon does when it sees an excessive
+err_device_create:
+ /*
+* mce_device_remove behave properly if mce_device_create was not
+* called on that device.
+*/
+ for_each_possible_cpu(i)
+ mce_device_remove(i);
grammar comment "s/behave/behaves/"
Though perhaps this is better:
> Switch over to the new interface. No functional change.
ia64 parts:
Tested-by: Tony Luck
> This patchset is the summary of recent discussion about memory error handling
> on multithread application. Patch 1 and 2 is for action required errors, and
> patch 3 is for action optional errors.
Naoya,
You suggested early in the discussion (when there were just two patches) that
they deserve
>> For memory error location, I will utilize type offset to save one
>> more byte, furthermore, I want to drop requestor_id, responder_id
>> and target_id. 1) They are very rare (I've never seen them by now)
>
> My concern is, are we sure we're never going to need them at all? Tony,
> what's your t
>> All of this stuff only applies to server systems - so quibbling over
>> a handful of *bytes* in an error record on a system that has tens,
>> hundreds or even thousands of *gigabytes* of memory seems
>> a bit pointless.
>
> But there's still only a limited number of bytes in the ring buffer no
>
> I'm not sure that "[PATCH 3/3] mm/memory-failure.c: support dedicated
> thread to handle SIGBUS(BUS_MCEERR_AO)" is a -stable thing? That's a
> feature addition more than a bugfix?
No - the old behavior was crazy - someone with a multithreaded process might
well expect that if they call prctl(PF
> I am CC'ing IA-64 guys.
The *_unmap() functions are no-op on ia64 - because we have mappings for
everything all the time - the *_map() functions just need to compute the
proper address to use to get the right attributes (so we don't mix and match
cacheable and uncachable access to the same
ad
> I added this series to my "next" branch for v3.14. Tony, let me know
> if you see any ia64 issues.
It showed up in next-20140204 - and doesn't seem to have caused any build or
boot problems on my test machines.
-Tony
+ if (severity != CPER_SEV_FATAL)
>>>
>>> Shouldn't this just be (severity == CPER_SEV_CORRECTED)?
>> IMO, only fatal error can't be handlered gracefully in current
>> kernel plus H/W. Once it can be recovered by H/W and OS, we
>> can call it recovered.
> Sure, but we don't recover in all
> But yes, this is possible and it would make it all even cleaner
> and simpler by simply not needing the reg/dereg interfaces for
> mce_ext_err_print but adding it to the chain.
So this is on top of the 9 patch series (using the V4 that Chen Gong
posted for part 4/9 and V3 for all the others). O
Kumar
Tested-by: Tony Luck
-Tony
Ingo,
Ultimate plan is to use these enhanced error logs to feed a
perf/trace event ... but we are still discussing the exact
format of that, and also how it should interact/complement/replace
the existing EDAC trace event. Meanwhile all this precursor work
has been reviewed and agreed on by Mauro
Replacement for yesterday's pull request - fixes a build bug when CONFIG_SMP=n
found by Fengguang's zero-day auto-build robot army. If you pulled (and pushed)
that one before finding this in your mailbox - then I can send the one-line
patch to be applied on top of yesterday's version.
-Tony
The
From: "Chen, Gong"
CPER (Common Platform Error Record - See UEFI spec, appendix N) support
is implemented via cper.c, which is under drivers/acpi/apei. But it is
not APEI specific, nor even ACPI specific. So move it to lib/ as a function
library.
Signed-off-by: Chen, Gong
Signed-off-by: Tony Lu
> No, it hasn't. But I explicitly checked the relevant EFI=n and EFI=y
> cases.
Jan,
I pushed your patch into my "next" tree - the robots will notice soon and send
us e-mail if they find any issues.
-Tony
Ingo,
Please queue in x86/ras branch for next merge window.
Thanks
-Tony
The following changes since commit 319e2e3f63c348a9b66db4667efa73178e18b17d:
Linux 3.13-rc4 (2013-12-15 12:31:33 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.g
> This is v2 of the patch series. Changes from version 1:
>
> o Added acks. arm, ia64, and sh are only ones without acks.
ia64 bits look OK
Acked-by: Tony Luck
> ERROR: "boot_cpu_physical_apicid" [drivers/acpi/acpi_extlog.ko] undefined!
>
> The symbol needs to be exported for it to be available.
Good - but I wonder how many more useless layers there are to this onion :-(
First I had to add a "#include "
Then add the dependency on CONFIG_X86_LOCAL_APIC
N
,.. then
Acked-by: Tony Luck
-Tony
> I haven't double checked, but I'm assuming the hot plug locks are held
> while you are doing this.
I dug into that when looking at v2 - the whole thing is under "stop_machine()"
so locking was not an issue.
Reviewed-by: Tony Luck
-Tony
> The general idea of preemptively poisoning pages which contain deferred
> errors is fine though.
Agreed. I used to think that it wasn't likely to be very useful because in many
cases the UCNA errors are just a trail of breadcrumbs set by different units
on the chip as the poison passed through o
>> +int mce_severity(struct mce *m, int tolerant, char **msg, bool is_excp)
>
> You're adding a function argument which is carrying redundant info which
> is already present in *m...
>
>> {
>> +enum exception excp = (is_excp ? EXCP_CONTEXT : NO_EXCP);
>
> ... and so this should be:
>
> e
> Basically, this check is being done only for machine check exceptions
> only.
But you proposed setting excp by looking at mcg_status:
> excp = ((m->mcg_status & MCG_STATUS_MCIP) ? EXCP_CONTEXT : NO_EXCP);
Which makes the code rather self referential. If we actually did arrive in MCE
handler
w
+ m->mcgstatus |= (MCG_STATUS_MCIP|MCG_STATUS_RIPV);
+ severity = mce_severity(m, mca_cfg.tolerant, NULL);
This seems a big hack to make mce_severity() work when called from
CMCI context (when MCG_STATUS register is not set). It would also
be confusing as the subsequent logged entries
> I'm under the assumption that at all times, when we get a MCE, MCIP will
> be set. For example, mce_gather_info() reads MCG_STATUS before we call
> mce_severity() in do_machine_check().
>
> Or am I missing something?
Architecturally it is true that MCIP will be set when machine check is signaled
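To make the point under discussion concrete, here is a minimal sketch of deriving the context from the MCIP bit of a saved MCG_STATUS value. The bit positions follow the architectural layout of IA32_MCG_STATUS (RIPV is bit 0, MCIP is bit 2); the struct mce_sketch type and the context_from_mcip() helper are hypothetical and carry only what this example needs.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions in IA32_MCG_STATUS. */
#define MCG_STATUS_RIPV (1ULL << 0)  /* restart IP valid */
#define MCG_STATUS_MCIP (1ULL << 2)  /* machine check in progress */

/* Enumerator names as used in the quoted patch. */
enum context_kind { NO_EXCP, EXCP_CONTEXT };

/* Hypothetical cut-down record holding only the saved MCG_STATUS value. */
struct mce_sketch {
    uint64_t mcgstatus;
};

/* MCIP set means the record was captured while a machine check was signaled. */
static enum context_kind context_from_mcip(const struct mce_sketch *m)
{
    return (m->mcgstatus & MCG_STATUS_MCIP) ? EXCP_CONTEXT : NO_EXCP;
}
```

A record gathered by polling or from a CMCI handler would normally have MCIP clear, which is why setting the bit by hand before calling the severity code was questioned in the thread.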
> In 7d375bff, NUM_CHANNELS was changed to 8 and the channel space was
> renumerated to handle EN, EP, and EX configurations.
>
> The *_mci_bind_devs functions, except for sbridge_mci_bind_devs(), got a
> new device presence check in the form of saw_chan_mask. However,
> sbridge_mci_bind_devs() st
> It appears this should be backported into -stable kernels, yes? Do you
> know which kernel versions need the fix?
For my setup the problem is first seen after:
commit bdee237c0343 "x86: mm: Use 2GB memory block size on large memory x86-64 systems"
which appeared in v3.19 and forced a 2GB me
> I'm running kernels configured that way (i.e. using libata PATA
> drivers) for years on my hp workstation zx6000 (zx1 chipset) without
> apparent problems.
Do you have PATA drives? My zx6000 just has SCSI:
scsi host0: ioc0: LSI53C1030 C0, FwRev=01032341h, Ports=1, MaxQ=255, IRQ=57
scsi 0:0
> Tony, should I take over the pstore tree? Do you have any testing
> procedures I could use? My testing is rather manual at the moment.
Kees,
Sure - I seem to be bad about keeping track of stuff here.
I don't have any good tests ... just manually crash a machine and
make sure things show up in
> Replace smp_call_function_single() with a direct call to
> ia64_mca_cmc_vector_adjust(). The function itself handles disabling and
> enabling interrupts, therefore the smp_call_function_single() calling
> convention is not preserved.
Applied. Thanks.
-Tony
> ia64/PCI: Fix incorrect PCI resource end address
> ia64/PCI: Remove unused 'addr' and fix build warning
> ia64: Reduce stack usage by iterating over nodemask
> ia64/traps: Silence GCC warning about uninitialised variable
> ia64/unaligned: Silence another GCC warning about an uninitialized va
> I've verified that the 'ce_count' is correctly incrementing with bad dimms.
Did you re-test on at least one of the previous 3 generations of CPUs supported
by this driver? All would be nice, but the bulk of the opportunities for
cut&paste errors seem to be in code that looks like:
if
> I verified that at least the memory sizes, ie the 'size_mb' files
> are correct on the old h/w. I don't have bad dimms atm to test
> the old h/w error paths though. That said this driver does get a
> lot indirect testing here (just from being loaded), - so I would
> likely find out if there were
>> (3) Also we may not want to count at every sched_in and sched_out
>> because the MSR reads involve quite a bit of overhead.
>
> Every single other PMU driver just does this; why are you special?
They just have to read a register. We have to write the IA32_EM_EVT_SEL MSR
and then read fro
> [ 55.677523] EDAC sbridge: ECC is disabled. Aborting
Works on my HSW-EX. Maybe it depends on memory configuration or some BIOS
settings?
The EDAC driver is looking at the MCMTR register to determine whether ECC is
enabled (and this change in the code shouldn't really affect that).
What doe
> Tony / Borislav, do we have tests for the machine check code that could
> have caught this?
If I had built one of my recovery test programs as a 32-bit binary instead of
native 64-bit I might have noticed (I only print the lsb field ... which would
have been garbage on the stack, maybe I'd ha
On Wed, Nov 11, 2015 at 10:16:45AM -0500, Chen, Gong wrote:
> UCNA errors share the same handler with CMCI. But it doesn't
> need extra operation to save error record in genpool. Remove
> these useless codes.
I'd have emphasised that this same mce is being added to the genpool
*twice* (once here, a
On Wed, Nov 11, 2015 at 09:41:58PM +0100, Borislav Petkov wrote:
> On Tue, Nov 10, 2015 at 01:55:46PM -0800, Luck, Tony wrote:
> > I need to add more to the motivation part of this. The people who want
> > this are playing with NVDIMMs as storage. So think of many GBytes of
> >
> If you know that it is in the nvdimm range, you can grade the error with
> lower severity...
Grading the severity isn't the main issue.
> Or do you mean that without the exception table we'll return back to the
> insn causing the error and loop indefinitely this way?
Yes. We need to NOT return
>>> module_init(efivars_pstore_init);
>>
>> Looks OK to me. Kees, are you picking this up?
>
> I can, though usually it goes through Tony.
Can I count that as "Acked-by" from both of you?
-Tony
On Wed, Nov 11, 2015 at 08:14:56PM -0800, Andy Lutomirski wrote:
> On 11/06/2015 12:57 PM, Tony Luck wrote:
> >Copy the existing page fault fixup mechanisms to create a new table
> >to be used when fixing machine checks. Note:
> >1) At this time we only provide a macro to annotate assembly code
> >
On Wed, Nov 11, 2015 at 08:19:35PM -0800, Andy Lutomirski wrote:
> >@@ -1132,9 +1133,15 @@ void do_machine_check(struct pt_regs *regs, long
> >error_code)
> > if (no_way_out)
> > mce_panic("Fatal machine check on current CPU", &m,
> > msg);
> > if (wors
On Thu, Nov 12, 2015 at 08:53:13AM +0100, Ingo Molnar wrote:
> > +extern phys_addr_t mcsafe_memcpy(void *dst, const void __user *src,
> > + unsigned size);
>
> So what's the longer term purpose, where will mcsafe_memcpy() be used?
The initial plan is to use this for file
On Thu, Nov 12, 2015 at 12:04:36PM -0800, Andy Lutomirski wrote:
> > We already have code to recover from machine checks encountered
> > while the processor is executing ring3 code.
>
> I meant failures during copy_from_user, copy_to_user, etc.
Yes. copy_from_user() will be pretty interesting fr
> Frankly, I'm not sure at all and very very wary of adding *any* code
> which runs on an offlined CPU. Because *no one* does that and it hasn't
> been tested at all. So who knows what happens.
>
> What we should be doing is execute the *minimal* amount of code possible
> and get out. No counting, n
> Suggested-by: Steven Rostedt
> Signed-off-by: Li Bin
Sure.
Acked-by: Tony Luck
[assuming that Steven is going to apply this whole series]
-Tony
> I don't mean that - I mean the stuff we do before we call
> cpu_is_offline() like ist_enter, this_cpu_inc(mce_exception_count),
> etc. Then we do a whole another bunch of stuff at the "out:" label like
> printk and whatnot which shouldn't run on an offlined CPU.
ist_enter() is black magic to me.
> Whether it is kosher or not is beside the point. Why should an offlined
> CPU even noodle through all that code if it doesn't need/have to? It can
> return immediately instead.
Ashok wants to move in stage 2 to having the offline cpu scan banks and report
any errors seen there. To do that we'll
> With that hunk here you want to clear MSR_IA32_MCG_STATUS in the
> !cfg->banks case, right?
I can't imagine how we'd get into do_machine_check without any banks.
Would indeed be a separate patch ... but value seems limited.
-Tony
4.4 isn't going smoothly on my 4 socket Xeon servers (18 core per
socket if that is important). User space is RHEL 7.2. Kernel config
is the RHEL one (with whatever mods happen running "make oldconfig"
and hitting Enter to every question.)
1) there was a problem in drm_calc_timestamping_constants(),
th
> Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast
> exception handler
Is that what we printed in this case? ... boy is that a misleading message ...
we got *extra* cpus (the offline ones), not "Not all".
Good job we have a fix :-)
-Tony
> And that is incorrect too, because the MCE (at least the one I'm
> injecting) gets broadcasted to the CPUs on the *node* and not to the
> whole system.
Which system? What kind of machine check? On Intel we expect machine checks
to be broadcast to all logical cpus on all nodes (unless local mac
On Mon, Dec 07, 2015 at 11:34:27PM +0100, Borislav Petkov wrote:
> BIOS is doing funny cores enumeration:
>
> node #0, CPUs 0-7
> node #1, CPUs 8-15
> node #2, CPUs 16-23
> node #3, CPUs 24-31
>
> and then starts from node 0 again:
>
> node #0, CPUs:#32 #33 #34 #35 #36 #37 #38
>> +/* Fault was in recoverable area of the kernel */
>> +if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
>> +if (!fixup_mcexception(regs, m.addr))
>> +mce_panic("Failed kernel mode recovery", &m, NULL);
> ^^^
> Looks generally good.
>
> Reviewed-by: Andy Lutomirski
You say that to part 1/3 ... what happens when you get to part 3/3 and you
read my attempts at writing x86 assembly code?
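The `(m.cs & 3) != 3` test in the quoted hunk relies on the low two bits of the saved CS selector giving the privilege level the CPU was running at when the exception hit. A minimal sketch of that check (the helper names are hypothetical; 0x10 and 0x33 are the usual 64-bit Linux kernel and user code selectors, used here only as sample inputs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The low two bits of a saved CS selector give the privilege level (0..3). */
static unsigned int saved_privilege_level(uint64_t cs)
{
    return (unsigned int)(cs & 3);
}

/* Ring 3 is user mode. */
static bool fault_in_user_mode(uint64_t cs)
{
    return saved_privilege_level(cs) == 3;
}

/* Anything other than ring 3 is treated as kernel context here. */
static bool fault_in_kernel_mode(uint64_t cs)
{
    return saved_privilege_level(cs) != 3;
}
```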
>> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
>> +int fixup_mcexception(struct pt_regs *regs)
>> +{
>> + const struct e
>>> As Tony requested, we may need a knob to stop a fallback in
>>> "movable->normal", later.
>>>
>>
>> If the mirrored memory is small and the other is large,
>> I think we can both enable "non-mirrored -> normal" and "normal ->
>> non-mirrored".
>
> Size of mirrored memory can be configured by
>Hmm...like this ?
> sysctl.vm.fallback_mirror_memory = 0 // never fallback # default.
> sysctl.vm.fallback_mirror_memory = 1 // the user memory may be
> allocated from mirrored zone.
> sysctl.vm.fallback_mirror_memory = 2 // usually kernel allocates
> memory from mirrored z
> Applied, thanks.
Did you test it (note the "UNTESTED" in the subject!)? My usual system for
this is getting upgrades and being flaky at the moment.
> Btw, looking at that mce.usable_addr, it doesn't make a whole lotta
> sense to me and we can use mce_usable_address() directly instead and use
> For patch 2 and 3 I'd need an ack from Mauro/Tony. CCed.
parts 2 & 3 are OK
Acked-by: Tony Luck
part4 (the actual KNL piece) seems not to break earlier (Broadwell) system ...
but that doesn't qualify enough for Ack/Review/Tested -by.
-Tony
> It already has your Reviewed-by. Is it still valid?
So it does ... that was a long time ago ... but not so long that anything
important changed. Yes, still valid.
-Tony
On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote:
> > /* deal with it */
> >
> > That way the magic is isolated to the function that needs the magic.
>
> Seconded - this is the usual pattern we use in all assembly functions.
Ok - you want me to write some x86 assembly code (you ma
On Sat, Dec 12, 2015 at 11:11:42AM +0100, Borislav Petkov wrote:
> > +config MCE_KERNEL_RECOVERY
> > + depends on X86_MCE && X86_64
> > + def_bool y
>
> Shouldn't that depend on NVDIMM or whatnot? Looks too generic now.
Not sure what the "whatnot" would be though. Making it depend on
X86_MCE
>> ... and the non-temporal version is the optimal one even though we're
>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
>> CPUs...?
My current generation cpu has a bit of an issue with recovering from a
machine check in a "rep mov" ... so I'm working with a version of m
>> Ok ... applied those two on top of my "UNTESTED" patch and injected an error
>> to force a UCNA log.
>
> Ok, what error type is that in EINJ nomenclature? I had only
>
> /sys/kernel/debug/apei/einj/available_error_type:0x0002 Processor
> Uncorrectable non-fatal
> /sys/kernel/debug/apei
> No, the system did panic in both times. The "strange" observation is
> that the MCE gets reported only on the cores on node 0. Or at least only
> the printks from mce_panic() on the cores on node0 reach the serial
> console.
You only see messages and logs from node0, because the cpus there are
t
> Is that an "Acked-by"? I'd like to pull this plus Vishal's
> gendisk-badblocks patches into a unified libnvdimm-error-handling
> branch. We're looking to have v4.5 able to avoid or survive nvdimm
> media errors through the pmem driver and DAX paths.
I'm making a V2 that fixes some build errors
> How about adding a comment: if mirrored memory is too small, then the
> normal zone is small, so it may OOM.
> The mirrored memory is at least 1/64 of whole memory, because struct
> pages usually take 64 bytes per page.
1/64th is the absolute lower bound (for the page structures as you say).
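The 1/64 figure is simple arithmetic: one 64-byte page structure per page. A small worked sketch, assuming 4096-byte pages (the page size and the helper name are assumptions for illustration; the 64-byte structure size is the figure quoted above):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096ULL  /* assumed base page size */
#define STRUCT_PAGE_BYTES 64ULL    /* per-page structure size quoted above */

/* Bytes of page structures needed to describe total_bytes of memory. */
static uint64_t page_struct_bytes(uint64_t total_bytes)
{
    return (total_bytes / PAGE_SIZE_BYTES) * STRUCT_PAGE_BYTES;
}
```

With these numbers, 64 GiB of memory needs 1 GiB of page structures, i.e. 64/4096 = 1/64 of the total, before anything else the kernel wants to keep in mirrored memory.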
> asm-generic/barrier.h defines a nop() macro.
> To be able to use this header on ia64, we shouldn't
> call local functions/variables nop().
>
> There's one instance where this breaks on ia64:
> rename the function to iosapic_nop to avoid the conflict.
Acked-by: Tony Luck
> On ia64 smp_rmb, smp_wmb, read_barrier_depends, smp_read_barrier_depends
> and smp_store_mb() match the asm-generic variants exactly. Drop the
> local definitions and pull in asm-generic/barrier.h instead.
>
> This is in preparation to refactoring this code area.
Acked-by: Tony Luck
> So you're touching those again in patch 2. Why not add those defines to
> patch 1 directly and diminish the churn?
To preserve authorship. Andy did patch 1 (the clever part). Patch 2 is just
syntactic sugar on top of it.
-Tony
> May I humbly ask why the [Finnish] you don't use the equivalent of the
> x86 _ASM_EXTABLE() macro? In fact, why don't we make that one generic, too?
I'm messing with that right now (with help from Andy Lutomirski and Boris) to
add different classes of exception table (so I can tag some instruct
On Mon, Jan 04, 2016 at 08:28:52PM +0100, Ard Biesheuvel wrote:
> On 4 January 2016 at 20:21, H. Peter Anvin wrote:
> > I suspect that means we will also need to go back to arch-specific
> > sorting for x86.
> >
>
> AFAICT, Tony's patches are not incompatible with mine. The fixup
> address is off
The following changes since commit 18558cae0272f8fd9647e69d3fec1565a7949865:
Linux 4.5-rc4 (2016-02-14 13:05:20 -0800)
are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
tags/please-pull-mcsafev11
for you to fetch changes up to 2e5bfb23c89800a
> Tony, are you able to pull these?
I've been distracted ... I need to dig into the pile of pending pstore patches.
Was there a consensus on the device tree ones? I saw a "you shouldn't do that",
and a "but it's really convenient and doesn't hurt anyone else" exchange.
-Tony
> > > I think the whole notion of mcsafe here is 'wrong'. This copy variant
> > > simply
> > > reports the kind of trap that happened (#PF or #MC) and could arguably be
> > > extended to include more types if the hardware were to generate more.
> >
> > What would a better name be? memcpy_ret()