On Tue, Sep 10, 2019 at 08:20:07AM +, Tony W Wang-oc wrote:
> Zhaoxin newer CPUs support LMCE that compatible with Intel's
> "Machine-Check Architecture", so add support for Zhaoxin LMCE
> in mce/core.c.
Your mailer included a header:
Content-Language: zh-CN
which seems to have made
> Looks ok to me at a quick glance, ACK.
Me too. Also ACK.
-Tony
On Wed, Sep 11, 2019 at 12:01:42PM +, Tony W Wang-oc wrote:
> + /* Checks after this one are Intel/Zhaoxin-specific: */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
> + boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
Is it time to have a big cleanup on how we handle
On Mon, Sep 16, 2019 at 11:37:18AM +, Tony W Wang-oc wrote:
> Zhaoxin newer CPUs support LMCE that compatible with Intel's
> "Machine-Check Architecture", so add support for Zhaoxin LMCE
> in mce/core.c.
>
> Signed-off-by: Tony W Wang-oc
> ---
> arch/x86/kernel/cpu/mce/core.c | 35 ++
On Tue, Sep 17, 2019 at 06:54:05AM +, Tony W Wang-oc wrote:
> But have a question about below codes:
> if (mcgstatus & MCG_STATUS_RIPV) {
> mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
> return true;
> }
> These seems require all #MC exception errors set MCG_STATU
On Tue, Sep 17, 2019 at 08:29:28AM +, David Laight wrote:
> From: Tony Luck
> > Sent: 16 September 2019 23:40
> > From: Fenghua Yu
> >
> > The x86_capability array in cpuinfo_x86 is defined as u32 and thus is
> > naturally aligned to 4 bytes. But, set_bit() and clear_bit() require
> > the arr
> I have been investigating a regression in our environment where pstore
> (efi-pstore specifically but I suspect this would affect all
> implementations) no longer works after upgrading from a 4.4 to 5.0
> kernel when running under xen. (This is an Ubuntu kernel but I don't
> think there are
>> +#define INTEL_FAM6_ATOM_AIRMONT_NP 0x75 /* Lightning Mountain */
>
> What's _NP ?
Network Processor. But that is too narrow a descriptor. This is going to be
used in other areas besides networking.
I’m contemplating calling it AIRMONT2
-Tony
> Author: Peter Zijlstra
> Date: Tue Aug 7 10:17:27 2018 -0700
>
>x86/cpu: Sanitize FAM6_ATOM naming
>
>
> What 2 or 3 or other number means?
In this case I want it to mean “This is an Airmont derived core. Mostly like
original Airmont, so you might see some places where we have the s
On Thu, Apr 30, 2020 at 11:42:20AM -0700, Andy Lutomirski wrote:
> I suppose there could be a consistent naming like this:
>
> copy_from_user()
> copy_to_user()
>
> copy_from_unchecked_kernel_address() [what probe_kernel_read() is]
> copy_to_unchecked_kernel_address() [what probe_kernel_write() i
On Thu, Apr 30, 2020 at 12:50:40PM -0700, Linus Torvalds wrote:
I see your point about the naming being important. I think Dan's
case is indeed "copy from pmem to user" where the only options for faulting
are #MC on the source addresses, and #PF on the destination.
> The only *fundamental* access wo
> Now maybe copy_to_user() should *always* work this way, but I’m not convinced.
> Certainly put_user() shouldn’t — the result wouldn’t even be well defined.
> And I’m unconvinced that it makes much sense for the majority of copy_to_user()
> callers that are also directly accessing the sour
repeated "rtc-efi"
at the start of the line is redundant).
Acked-by: Tony Luck
-Tony
> If fd release cleans up then how should there be something in flight at
> the final mmdrop?
ENQCMD from the user is only synchronous in that it lets the user know their
request has been added to a queue (or not). Execution of the request may happen
later (if the device is busy working on reques
>> So the driver needs to use flush/drain operations to make sure all
>> the in-flight work has completed before releasing/re-using the PASID.
>>
> Are you suggesting we should let driver also hold a reference of the
> PASID?
The sequence for bare metal is:
process is queuing requests to
> There are two users of a PASID, mm and device driver(FD). If
> either one is not done with the PASID, it cannot be reclaimed. As you
> mentioned, it could take a long time for the driver to abort. If the
> abort ends *after* mmdrop, we are in trouble.
> If driver drops reference after abort/drain
On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote:
> This description is already *begging* for this delay value to be
> automatically set by the kernel. Putting yet another knob in front of
> the user who doesn't have a clue most of the time shows one more time
> that we haven't done
>> That all sounds like the printk should be downgraded too, it is not a
>> KERN_CRIT warning. It is more a notification that we're getting warm.
>
> Right, and I think we should take Benjamin's patch after all - perhaps
> even tag it for stable if that message is annoying people too much - and
> S
> If that's not going to happen, then we just bury the whole thing and put it
> on hold until a sane implementation of that functionality surfaces in
> silicon some day in the not so foreseeable future.
We will drop the patches to flip the MSR bits to enable checking.
But we can fix the split loc
> * we throttle the machine from within the kernel - whatever that may mean
> * if that doesn't help, we stop scheduling !root tasks
> * if that doesn't help, we halt
The silicon will do that "halt" step all by itself if the temperature
continues to rise and hits the highest of the temperature thr
> As said in commit f2c2cbcc35d4 ("powerpc: Use pr_warn instead of
> pr_warning"), removing pr_warning so all logging messages use a
> consistent _warn style. Let's do it.
Acked-by: Tony Luck
On Fri, Oct 18, 2019 at 03:23:09PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> > Server/desktops generally rely on the embedded controller for FAN
> > control, which kernel have no control. For them this warning helps to
> > either bring i
On Fri, Oct 18, 2019 at 09:45:03PM +0200, Borislav Petkov wrote:
> On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> > So what should we do next?
>
> I was simply keying off this statement of yours:
>
> "Depending on what we end up with from Srinivas ... w
On Thu, Feb 07, 2019 at 03:01:31PM +0100, Peter Zijlstra wrote:
> On Thu, Feb 07, 2019 at 11:50:52AM +, Linus Torvalds wrote:
> > If you re-generate the canonical address in __cpa_addr(), now we'll
> > actually have the real virtual address around for a lot of code-paths
> > (pte lookup etc), w
On Thu, Feb 07, 2019 at 06:57:20PM +0100, Peter Zijlstra wrote:
> Something like so then? AFAICT CLFLUSH will also #GP if feed it crap.
Correct. CLFLUSH will also #GP on a non-canonical address.
> - __flush_tlb_one_kernel(__cpa_addr(cpa, i));
> + __flush_tlb_one_kernel(fix_
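The exchange above turns on whether an address is canonical. As a rough user-space sketch of that idea (not the kernel's code), assuming 48 implemented virtual-address bits, an address is canonical when bits 63..47 are all copies of bit 47. The helper names and the VADDR_BITS value here are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual-address bits (4-level paging). */
#define VADDR_BITS 48

/*
 * Sign-extend bit (VADDR_BITS - 1) into the upper bits.
 * Relies on arithmetic right shift of signed values, which gcc and
 * clang provide (it is implementation-defined in ISO C).
 */
static inline uint64_t canonical_address(uint64_t vaddr)
{
	const unsigned int shift = 64 - VADDR_BITS;

	return (uint64_t)((int64_t)(vaddr << shift) >> shift);
}

/* An address is canonical if sign extension leaves it unchanged. */
static inline bool is_canonical(uint64_t vaddr)
{
	return canonical_address(vaddr) == vaddr;
}
```

With these assumptions, 0x00007fffffffffff and 0xffff800000000000 are canonical, while 0x0000800000000000 is not and would be rewritten to 0xffff800000000000 by canonical_address().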
On Thu, Feb 07, 2019 at 10:07:28AM -0800, Andy Lutomirski wrote:
> Joining this thread late...
>
> This is all IMO rather crazy. How about we fiddle with CR0 to turn off
> the cache, then fiddle with page tables, then turn caching on? Or, heck,
> see if there’s some chicken bit we can set to imp
On Wed, Jan 30, 2019 at 12:08:45AM +0100, Borislav Petkov wrote:
> On Tue, Jan 29, 2019 at 05:52:18PM -0500, Johannes Weiner wrote:
> > config X86_RESCTRL
> > - bool "Resource Control support"
> > + bool "x86 cache control support"
>
> Except that it is not only cache but memory (bandwidth) c
On Fri, Feb 01, 2019 at 10:55:53AM +0100, Borislav Petkov wrote:
> On Thu, Jan 31, 2019 at 04:33:41PM -0800, Tony Luck wrote:
> > if (mce_severity(m, mca_cfg.tolerant, &tmp, true) >=
> > MCE_PANIC_SEVERITY) {
> > + m->bank = i;
>
> So conceptually this write belongs
>> What about
>>
>> s/X86_RESCTRL/X86_CPU_RESCTRL/g
>
> Good idea.
>
> Tony, Babu, that look okay to you guys as well?
For now. But very soon we will also have ARM_CPU_RESCTRL, and some of
this code will become generic. Will we need an arch-independent name for the
bits of code shared by arm an
On Wed, May 08, 2019 at 11:42:01PM +0100, Colin King wrote:
> From: Colin Ian King
>
> The variable tad_base is being set to a value that is never read
> and is being over-written on the next iteration of a for-loop.
> This assignment is therefore redundant and can be removed.
>
> Addresses-Cove
On Tue, May 21, 2019 at 10:29:02PM +0200, Borislav Petkov wrote:
>
> Can we do instead:
>
> -static DEFINE_PER_CPU_READ_MOSTLY(struct mce_bank *, mce_banks_array);
> +static DEFINE_PER_CPU_READ_MOSTLY(struct mce_bank,
> mce_banks_array[MAX_NR_BANKS]);
>
> which should be something like 9*32 = 2
On Fri, May 17, 2019 at 09:34:31PM +0200, Borislav Petkov wrote:
> On Fri, May 17, 2019 at 11:06:07AM -0700, Luck, Tony wrote:
> > and thus end up with that extra level on indent for the rest
> > of the function.
>
> Ok:
>
> @@ -1569,7 +1575,13 @@ static void __mchec
On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> Which reminds me, Tony, I think all those debugging files "pfn"
> and "array" and the one you add now, should all be under a
> CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> Mind adding that too pls?
Pat
On Fri, Apr 19, 2019 at 02:29:11AM +0200, Borislav Petkov wrote:
> On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> > On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > > Which reminds me, Tony, I think all those debugging files "pfn"
>
> Err, this all sounds to me like the storm detection code should
> *automatically* disable the CEC in such cases, I'd say.
Sounds good. But we should distinguish storms that have many different
addresses from storms that just ping a few addresses. CEC will see counts
hit the threshold in the lat
> Now, if you still want to know how many errors and where they happened
> and when they happened and yadda yadda, you *disable* the CEC.
Rebooting isn't popular in many end user situations. Many CSP (cloud
service providers) vehemently hate the idea of rebooting.
-Tony
>> Rebooting isn't popular in many end user situations. Many CSP (cloud
>> service providers) vehemently hate the idea of rebooting.
>
> I meant disable in Kconfig - not build it in at all.
If rebooting is bad, then re-compiling and rebooting is 100x worse. :-)
-Tony
> I think we're talking past each other here: I mean disable the CEC
> *forever* and *never* use it. Use only a userspace agent and log errors
> with it.
>
> Makes sense?
Not really. We want pretty much everyone to enable and use CEC. That way
people don't bother us about the occasional neutron s
On Mon, Apr 22, 2019 at 07:15:32PM +0200, Borislav Petkov wrote:
> On Mon, Apr 22, 2019 at 03:59:16PM +0000, Luck, Tony wrote:
> > > Err, this all sounds to me like the storm detection code should
> > > *automatically* disable the CEC in such cases, I'd say.
> >
> Drop the RELEVANT_IFLAG() macro which hasn't been used for over a
> decade.
>
> Cc: Tony Luck
> Cc: Fenghua Yu
> Signed-off-by: Johan Hovold
> ---
> arch/ia64/hp/sim/simserial.c | 2 --
> 1 file changed, 2 deletions(-)
Acked-by: Tony Luck
> ia64 has a such a huge number of memory model choices. Maybe we
> need to cut it down to a small set that actually work.
SGI systems had extremely discontiguous memory (they used some high
order physical address bits in the tens/hundreds of terabyte range for the
node number ... so there would
From: Randy Dunlap [mailto:rdun...@infradead.org]
>>ERROR: "paddr_to_nid" [drivers/block/brd.ko] undefined!
>>ERROR: "paddr_to_nid" [crypto/ccm.ko] undefined!
>>
>
> ---
> Exporting paddr_to_nid() in arch/ia64/mm/numa.c fixes all of these build
> errors.
> Is there a problem with doing t
>> Exporting paddr_to_nid() in arch/ia64/mm/numa.c fixes all of these build
>> errors.
>> Is there a problem with doing that?
>
> I don't see a problem with exporting it.
But I also don't see these build errors. I'm using the same HEAD commit. I
think the same .config (derived from arch/ia64/co
> arch/ia64/sn/pci/tioce_provider.c | 4 ++--
Thanks for the patch, but Christoph is working on a patch series that deletes
all of arch/ia64/sn/
-Tony
> this little series fixes various warnings I see in ia64 builds.
Applied. Thanks.
[I assume you are using some up-to-date version of gcc that generates these
warnings ... I'm not seeing them, but I'm still using a compiler from the stone
age]
-Tony
- base = ioremap((resource_size_t)addr, 0x1);
+ base = ioremap((resource_size_t)addr, 0x8000);
Changing one magic value for another. :-(
Do different BIOS do different things? I don't recall seeing this error
(but perhaps I missed it, or perhaps the kernel has ad
On Fri, Aug 09, 2019 at 02:18:02PM +, Stephen Douthit wrote:
> Depending on how BIOS has marked the reserved region containing the 32KB
> MCHBAR you can get warnings like:
>
> resource sanity check: requesting [mem 0xfed1-0xfed1], which spans
> more than reserved [mem 0xfed1-0xfed
On Thu, Aug 01, 2019 at 12:03:41PM +0200, Borislav Petkov wrote:
> On Wed, Jul 31, 2019 at 02:05:54AM +0300, Kirill A. Shutemov wrote:
> > Several upcoming patchsets will make use of the helper.
>
> ... so why aren't you sending it together with its first user?
Just to get another of the non-cont
On Thu, Aug 01, 2019 at 10:43:48PM +0300, Alexey Dobriyan wrote:
> > +static inline void movdir64b(void *dst, const void *src)
> > +{
> > + /* movdir64b [rdx], rax */
> > + asm volatile(".byte 0x66, 0x0f, 0x38, 0xf8, 0x02"
> > + : "=m" (*(char *)dst)
>
> I think Tony's in the right direction. We already do dst "sizing" like
> that for the compiler in clwb().
The clwb case does look like what we want for movdir64b().
But is it right for clwb() ... that doesn't modify anything, just pushes
things from cache to memory. So why is it using "+m"?
-T
ub/scm/linux/kernel/git/bp/bp.git for-next
> > -T: git git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac.git linux_next
> > +T: git git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git edac-for-next
> > S: Supported
> > F: Documentation/admin-guide/ras.rst
> > F: Documentation/driver-api/edac.rst
> > --
>
> Acked-by: Borislav Petkov
>
Acked-by: Tony Luck
-Tony
Some processors may mispredict an array bounds check and
speculatively access memory that they should not. With
a user supplied array index we like to play things safe
by masking the value with the array size before it is
used as an index.
Signed-off-by: Tony Luck
---
V2: Mask the index *AFTER
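A minimal user-space sketch of the index-masking idea described in this patch, assuming the array size is at most LONG_MAX. It builds a branch-free mask that is all ones for an in-range index and zero otherwise, in the spirit of the kernel's array_index_nospec() helper. The function names below are hypothetical and this is not the patch's actual code.

```c
#include <stddef.h>

/*
 * Returns ~0UL when index < size, 0 otherwise, without a branch.
 * Assumes size <= LONG_MAX and arithmetic right shift of signed
 * values (true for gcc and clang).
 */
static inline unsigned long index_mask_nospec(unsigned long index,
					      unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >>
	       (sizeof(long) * 8 - 1);
}

/* Clamp an out-of-range index to 0 so that it cannot reach past the array. */
static inline unsigned long clamp_index_nospec(unsigned long index,
					       unsigned long size)
{
	return index & index_mask_nospec(index, size);
}

/* Illustrative use: architectural bounds check first, then the mask. */
static inline int lookup(const int *table, unsigned long size,
			 unsigned long idx)
{
	if (idx >= size)
		return -1;
	idx = clamp_index_nospec(idx, size);
	return table[idx];
}
```

The bounds check still decides the architectural result. The mask only ensures that a mispredicted branch cannot be followed by a load from an attacker-chosen offset.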
> Here, set mod->arch.init_unw_table = NULL after remove the unwind
> table to avoid double free.
Applied. Thanks.
-Tony
I like the idea ... and it sure gets rid of a lot of code.
> A git tree is also available at:
>
>git://git.infradead.org/users/hch/misc.git ia64-remove-machvecs
I grabbed this tree and ran through my build scripts. I found that
vmlinux.gz doesn't get built. Which is odd, because I don't see
> Even if I explicitly run:
>
> $ make compressed
>
> It still doesn't build it. Weird.
Ugh! The rule to do the compression was in arch/ia64/hp/sim/boot/Makefile
which went away as part of the deletion of hpsim.
-Tony
On Wed, Aug 07, 2019 at 01:26:17PM -0700, Luck, Tony wrote:
> Ugh! The rule to do the compression was in arch/ia64/hp/sim/boot/Makefile
> which went away as part of the deletion of hpsim.
This fixes it ... should fold into the patch that dropped the
arch/ia64/hp/sim/boot/Makefile
I ju
On Thu, Aug 08, 2019 at 08:51:23AM +0200, 'Christoph Hellwig' wrote:
> On Wed, Aug 07, 2019 at 04:07:37PM -0700, Luck, Tony wrote:
> > On Wed, Aug 07, 2019 at 01:26:17PM -0700, Luck, Tony wrote:
> > > Ugh! The rule to do the compression was in arch/ia64/hp/sim/boot/Ma
On Sun, Jun 09, 2019 at 05:16:13PM +0200, Marco Elver wrote:
Marco,
Thanks for the patch. One comment below.
> - {
> - PCI_VEND_DEV(INTEL, IE31200_HB_1), PCI_ANY_ID, PCI_ANY_ID, 0, 0,
> - IE31200},
> - {
> - PCI_VEND_DEV(INTEL, IE31200_HB_2), PCI_ANY_I
> Reformat device table after Coffee Lake additions to be more readable.
I like that you put the reformat second ... if some old version needs a backport
to get Coffee Lake support they can just take part 1 to get the functionality
and then decide whether or not to take part 2.
Both parts:
Acked
On Thu, Jun 27, 2019 at 06:11:18PM +0100, James Morse wrote:
> Hello,
>
> (CC: +Tony Luck.
> Original Patch: lore.kernel.org/r/20190626054011.30044-1-de...@etsukata.com )
Heh: My mail agent "helpfully" made that clickable, but as a "mailto:"
URL rather than an https: one!
>
> On 26/06/2019 06:
The following changes since commit 9e0babf2c06c73cda2c0cd37a1653d823adb40ec:
Linux 5.2-rc5 (2019-06-16 08:49:45 -1000)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git
tags/please-pull-for_5.3
for you to fetch changes up to d8655e7630dafa88b
On Mon, Mar 11, 2019 at 08:25:53PM +, Ghannam, Yazen wrote:
> > + if (!(m.status & MCI_STATUS_PCC) && !(m.status & MCI_STATUS_S))
> > + goto log_it;
> > +
>
> Can you please include a vendor check with this? MCi_STATUS[56] is
> not defined the same way on AMD system
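One way the requested vendor check could be combined with the status test, as a hedged sketch only. The bit positions follow the kernel's MCI_STATUS_PCC (bit 57) and MCI_STATUS_S (bit 56) definitions. The enum and the function name are hypothetical stand-ins, and the real patch context (the log_it label) is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

#define MCI_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */
#define MCI_STATUS_S	(1ULL << 56)	/* Intel: signaled via machine check */

/* Hypothetical stand-in for boot_cpu_data.x86_vendor. */
enum vendor { VENDOR_INTEL, VENDOR_AMD };

/*
 * True when the error may go straight to logging: only on Intel,
 * where bit 56 has the "S" meaning, and only when neither PCC nor S
 * is set in the bank's status.
 */
static inline bool can_log_directly(enum vendor v, uint64_t status)
{
	if (v != VENDOR_INTEL)
		return false;

	return !(status & MCI_STATUS_PCC) && !(status & MCI_STATUS_S);
}
```

Gating on the vendor first keeps the bit-56 test from being applied on systems where that bit is defined differently.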
> I think the last time this came up, it was said that those people still
> running Linux on Itanium were running old distro kernels, not upstream.
>
> So yeah, we could probably do whatever and nobody would ever notice,
> except maybe Al, who is rumoured to still have an ia64 :-)
I haven't heard
From: Qiuxu Zhuo
Kbuild failed on the kernel configurations below:
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=m
CONFIG_EDAC_I10NM=y
or
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=m
Failed log:
...
CC [M] drivers/edac/skx
On Fri, Mar 22, 2019 at 03:00:25PM +0100, Arnd Bergmann wrote:
> Sorry, this was my mistake, my email was garbled. The patch was
> correct though: the idea here is not to change the Kconfig symbols
> but to change the Makefile to do the right thing even when Kconfig
> is set wrong.
Well this does
Code refactoring to share some source code with a new
EDAC driver resulted in renaming one file (skx_edac.c
became skx_base.c) and adding a new file (skx_common.c).
Update the file pattern in MAINTAINERS to take account of
this change.
Reported-by: Joe Perches
Fixes: 98f2fc829e3b ("EDAC, skx_eda
Code restructuring renamed arch/x86/kernel/cpu/mcheck/ to
be arch/x86/kernel/cpu/mce/
Update the MAINTAINERS file pattern to account for this change.
Fixes: 21afaf181362 ("x86/mce: Streamline MCE subsystem's naming")
Reported-by: Joe Perches
Signed-off-by: Tony Luck
---
MAINTAINERS | 2 +-
1 f
We forgot to update the MAINTAINERS file when adding this
new driver.
Fixes: d4dc89d069aa ("EDAC, i10nm: Add a driver for Intel 10nm server processors")
Signed-off-by: Tony Luck
---
MAINTAINERS | 6 ++
1 file changed, 6 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index e5f3230d3f1
On Mon, Mar 04, 2019 at 01:16:47PM +, Steven Price wrote:
> On 01/03/2019 21:57, Kirill A. Shutemov wrote:
> > On Wed, Feb 27, 2019 at 05:05:42PM +, Steven Price wrote:
> >> walk_page_range() is going to be allowed to walk page tables other than
> >> those of user space. For this it needs t
From: Qiuxu Zhuo
Kbuild failed on the kernel configurations below:
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=m
CONFIG_EDAC_I10NM=y
or
CONFIG_ACPI_NFIT=y
CONFIG_EDAC_DEBUG=y
CONFIG_EDAC_SKX=y
CONFIG_EDAC_I10NM=m
Failed log:
...
CC [M] drivers/edac/skx_c
ree. If you feel it
> should go through your arch tree let me know. All of the prerequisites
> should have been merged several releases ago.
Sure. Merge away.
Acked-by: Tony Luck
-Tony
On Tue, Sep 25, 2018 at 05:26:59PM +0200, Borislav Petkov wrote:
> On Tue, Sep 25, 2018 at 09:34:49AM -0500, Justin Ernst wrote:
> > We observe an oops in the skx_edac module during boot.
> > Examining /var/log/messages:
> > [ 3401.985757] EDAC MC0: Giving out device to module skx_edac controller
Nobody(*) uses them. Dropping this will allow us to make the total
number of memory controllers configurable (as we won't have to
worry about duplicated device names under this directory).
(*) https://marc.info/?l=linux-edac&m=153809709903987&w=2
Signed-off-by: Tony Luck
---
Boris: Apply this,
The trick with flipping bit 63 to avoid loading the address of the
1:1 mapping of the poisoned page while we update the 1:1 map used
to work when we wanted to unmap the page. But it falls down horribly
when we try to directly set the page as uncacheable.
The problem is that when we change the cach
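For readers unfamiliar with the bit-63 trick mentioned above, here is a small sketch of the arithmetic only. The direct-map base and page shift are assumptions (the real base depends on kernel configuration) and the helper names are hypothetical. Flipping bit 63 of a 1:1-map address gives a value that differs from the real address in exactly one bit but is non-canonical, so it cannot be loaded through.

```c
#include <stdint.h>

/* Assumption: base of the kernel 1:1 (direct) map; varies by config. */
#define DIRECT_MAP_BASE	0xffff888000000000ULL
#define PAGE_SHIFT	12
#define BIT63		(1ULL << 63)

/* Address of a page frame in the 1:1 map. */
static inline uint64_t direct_map_addr(uint64_t pfn)
{
	return DIRECT_MAP_BASE + (pfn << PAGE_SHIFT);
}

/* Same address with bit 63 flipped: looks alike, but is non-canonical. */
static inline uint64_t decoy_addr(uint64_t pfn)
{
	return direct_map_addr(pfn) ^ BIT63;
}
```

Under these assumptions, pfn 0x1234 maps to 0xffff888001234000 and its decoy is 0x7fff888001234000.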
On Fri, May 25, 2018 at 02:42:09PM -0700, Tony Luck wrote:
> Currently we just check the "CAPID0" register to see whether the CPU
> can recover from machine checks.
>
> But there are also some special SKUs which do not have all advanced
> RAS features, but do enable machine check recovery for use
On Thu, Jun 07, 2018 at 10:24:46PM +0200, Borislav Petkov wrote:
> On Thu, Jun 07, 2018 at 01:18:31PM -0700, Dan Williams wrote:
> > I'm making an effort to get all persistent memory error handling holes
> > covered this cycle, so I think it makes sense for this to go through
> > the nvdimm tree. T
On Mon, May 21, 2018 at 05:31:52PM +0530, Jeffrin Thalakkottoor wrote:
> > Ok, but please do not top-post.
>
> Ok
>
> > Looks like mcelog has trouble decoding this. Have you updated mcelog to
> > the latest version in your distro?
> .
> mcelog 153+dfsg-1
So this is
I guess I didn't explain that very clearly. I need all the lines
in between.
How about this:
$ sudo dmesg -r | grep -C 30 Bank
-Tony
On Tue, May 22, 2018 at 02:43:37AM +0530, Jeffrin Thalakkottoor wrote:
> mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: ee40110b
> mce: [Hardware Error]: TSC 0 ADDR 16080 MISC 5040008086
> mce: [Hardware Error]: PROCESSOR 0:306d4 TIME 1526932210 SOCKET 0 APIC
> 0 microcode 2a
T
v4.16 boots cleanly. But with the first bunch of merges
(Linus HEAD = 46e0d28bdb8e6d00e27a0fe9e1d15df6098f0ffb)
I see a bunch of:
ia64_handle_unaligned: 4863 callbacks suppressed
kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
kernel unaligned access to 0xe0033bdffbcc, ip=
> kernel unaligned access to 0xe0031660fd74, ip=0xa001000f23e0
> kernel unaligned access to 0xe0033bdffbcc, ip=0xa001000f2370
Here's the disassembly of dequeue_task_fair() in case it would help to see
which two instructions are getting all the faults:
a001000f21c0 :
a001000
> I was asking for requirements, not a design proposal. In order to make a
> design you need a requirements specification.
Here's what I came up with ... not a fully baked list, but should allow for
some useful discussion on whether any of these are not really needed, or if
there is a glaring ho
On Mon, Jan 23, 2017 at 03:50:56PM +0100, Borislav Petkov wrote:
> On Mon, Jan 23, 2017 at 09:35:53PM +0800, Xunlei Pang wrote:
> > One possible timing sequence would be:
> > 1st kernel running on multiple cpus panicked
> > then the crash dump code starts
> > the crash dump code stops the others cp
On Mon, Jan 23, 2017 at 06:51:30PM +0100, Borislav Petkov wrote:
> Hey Tony,
>
> a "welcome back" is in order? :-)
Yes - first day back today. Lots of catching up to do.
> And apparently crash knows about poisoned pages and handles them:
>
> static int __init crash_save_vmcoreinfo_init(void)
>
On Wed, Jul 13, 2016 at 02:47:30PM +0200, Thomas Gleixner wrote:
> On Tue, 12 Jul 2016, Fenghua Yu wrote:
> > +3. Hierarchy in rscctrl
> > +===
>
> What means rscctrl?
>
> You were not able to find a more cryptic acronym?
rscctrl == resource control
Intel marketing would (pr
On Thu, Jul 14, 2016 at 08:53:17AM +0200, Thomas Gleixner wrote:
> > Happy to take suggestions for something in between those
> > extremes :-)
>
> I'd suggest "resctrl" and the abbreviation dictionaries tell me that the most
> common ones for resource are: R, RESORC, RES
OK. "resctrl" it is.
> A
On Mon, Jul 25, 2016 at 11:31:24AM -0500, Nilay Vaish wrote:
> I was thinking more about this software caching of CLOSids. How
> likely do you think these CLOSids would be found cached? I think the
> software cache would be very infrequently accessed, so it seems you
> are likely to miss these in
the 80% when they are on their
own socket and the spare 20% if they wander off to the other socket.
Sent from my iPhone
> On Jul 25, 2016, at 19:13, Marcelo Tosatti wrote:
>
>> On Fri, Jul 22, 2016 at 02:43:23PM -0700, Luck, Tony wrote:
>>> On Fri, Jul 22, 2016 at 04:1
On Fri, Jul 01, 2016 at 12:21:43PM +0200, Borislav Petkov wrote:
> On Wed, Jun 29, 2016 at 06:56:10PM -0700, Fenghua Yu wrote:
> > From: Fenghua Yu
> >
> > Each cache node is described by cacheinfo and is a unique node across
>
> What is a cache node?
Clearly not a good name for the concept we
> Basically all cache indices carry the APIC ID of the core, so L1D on
> CPU0 has ID 0 and then L1I has ID 0 too and then L2 has also the same
> ID.
>
> How does that look on a CAT system? Do all the different cache levels
> get different IDs?
For CAT we only need the IDs to be unique at each lev
> Another straightforward replacement of magic numbers.
It would be if I hadn't forgotten that INTEL_FAM6_MODEL_BROADWELL_XEON_D had
a separate model number from the other Broadwell Xeons when I switched the
driver from PCI device lookup to cpu model number.
This needs to add an entry for BDX-D
> This needs to add an entry for BDX-DE (use the same table initializer).
> Probably as a separate patch before/after this.
Oops ... a bit worse than that. I assumed that index into the array matches the
enum ... (with a comment!) ... having two entries for the same "type" would
break that. I'
>> -m.bank = 1;
>> +m.bank = mca_cfg.banks;
>
> There's struct cper_sec_mem_err.bank. Why aren't we copying that?
Because that is DDR3/DDR4 "bank" (internal DIMM detail) as opposed to machine
check "bank" (CPU microarchitecture detail). We need the latter here.
-Tony
> Btw, would it have any benefit of writing a "magic" value in m.bank
> to denote the error comes from APEI instead of number of banks which
> differs between generations?
>
> Something like
>
> m.bank = -1;
>
> or so?
That might be a bit more obvious than my subtle "one more than possible
o
> So, get rid of all that and simply log an MCE with a TSC value always.
> Simplifies the code a bit too.
I'm not necessarily opposed to this ... but there was once some logic behind
when we logged TSC, and when we didn't. Essentially we wanted the TSC when we
were logging from #CMCI or #MC be
> One other possibility would be to use ->time and write ->tsc *only*
> when exact - i.e., in the handler - and this is then enough info about
> timing.
>
> ->time will give you somewhere around where it happened and ->tsc - only
> if set - will give you exact, well, *timestamp* :)
>
> This sounds
On Mon, Nov 07, 2016 at 07:45:32PM +0100, Borislav Petkov wrote:
> On Thu, Nov 03, 2016 at 03:50:18PM +0100, Sebastian Andrzej Siewior wrote:
> > Part of the init (memory allocation and so on) is done
> > in mcheck_cpu_init(). While moving the allocation to
> > mcheck_init_device() (where the h
> This still preserves the precise TSC timestamp in intel_threshold_interrupt().
Yup - this looks right.
Acked-by: Tony Luck
-Tony
> That's why the hotplug callback mce_disable_cpu() doesn't fiddle with
> CR4 - it only clears the bits in MCi_CTL. And I think we should remain
> that way.
N.B. See vendor_disable_error_reporting() ... on Intel we don't clear MCi_CTL.
-Tony
From: Tony Luck
Intel Xeons from Ivy Bridge onwards support a processor identification
number. Kernels v4.9 and higher include it in the "mce" record.
Signed-off-by: Tony Luck
---
mcelog.c | 3 +++
mcelog.h | 3 +++
2 files changed, 6 insertions(+)
diff --git a/mcelog.c b/mcelog.c
index 7214a