[PATCH] irqchip/gicv3: silence noisy DEBUG_PER_CPU_MAPS warning
gic_raise_softirq() walks the list of cpus using for_each_cpu(); for each
cluster it calls gic_compute_target_list(), which advances the iterator by
the number of CPUs in the cluster.

If gic_compute_target_list() reaches the last CPU it leaves the iterator
pointing at the last CPU. This means the next time round the for_each_cpu()
loop, cpumask_next() will be called with an invalid CPU.

This triggers a warning when built with CONFIG_DEBUG_PER_CPU_MAPS:

[3.077738] GICv3: CPU1: found redistributor 1 region 0:0x2f12
[3.077943] CPU1: Booted secondary processor [410fd0f0]
[3.078542] [ cut here ]
[3.078746] WARNING: CPU: 1 PID: 0 at ../include/linux/cpumask.h:121 gic_raise_softirq+0x12c/0x170
[3.078812] Modules linked in:
[3.078869]
[3.078930] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc5+ #5188
[3.078994] Hardware name: Foundation-v8A (DT)
[3.079059] task: 80087a1a0080 task.stack: 80087a19c000
[3.079145] PC is at gic_raise_softirq+0x12c/0x170
[3.079226] LR is at gic_raise_softirq+0xa4/0x170
[3.079296] pc : [] lr : [] pstate: 21c9
[3.081139] Call trace:
[3.081202] Exception stack(0x80087a19fbe0 to 0x80087a19fd10)
[3.082269] [] gic_raise_softirq+0x12c/0x170
[3.082354] [] smp_send_reschedule+0x34/0x40
[3.082433] [] resched_curr+0x50/0x88
[3.082512] [] check_preempt_curr+0x60/0xd0
[3.082593] [] ttwu_do_wakeup+0x20/0xe8
[3.082672] [] ttwu_do_activate+0x90/0xc0
[3.082753] [] try_to_wake_up+0x224/0x370
[3.082836] [] default_wake_function+0x10/0x18
[3.082920] [] __wake_up_common+0x5c/0xa0
[3.083003] [] __wake_up_locked+0x14/0x20
[3.083086] [] complete+0x40/0x60
[3.083168] [] secondary_start_kernel+0x15c/0x1d0
[3.083240] [<808911a4>] 0x808911a4
[3.113401] Detected PIPT I-cache on CPU2

Avoid updating the iterator if the next call to cpumask_next() would cause
the for_each_cpu() loop to exit. There is no change to gic_raise_softirq()'s
behaviour (cpumask_next()'s eventual call to _find_next_bit() will return
early as start >= nbits); this patch just silences the warning.

Signed-off-by: James Morse
---
 drivers/irqchip/irq-gic-v3.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index ede5672ab34d..9e37c447cc2c 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -548,6 +548,7 @@ static int gic_starting_cpu(unsigned int cpu)
 static u16 gic_compute_target_list(int *base_cpu, const struct cpumask *mask,
				   unsigned long cluster_id)
 {
+	int next_cpu;
	int cpu = *base_cpu;
	unsigned long mpidr = cpu_logical_map(cpu);
	u16 tlist = 0;
@@ -562,9 +563,10 @@ static u16 gic_compute_target_list(int *base_cpu, const struct cpumask *mask,

		tlist |= 1 << (mpidr & 0xf);

-		cpu = cpumask_next(cpu, mask);
-		if (cpu >= nr_cpu_ids)
+		next_cpu = cpumask_next(cpu, mask);
+		if (next_cpu >= nr_cpu_ids)
			goto out;
+		cpu = next_cpu;

		mpidr = cpu_logical_map(cpu);
-- 
2.8.0.rc3
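For context, the caller's loop looked roughly like this in the v4.8-era
driver (a paraphrased sketch, not part of the mail): gic_raise_softirq()
and gic_compute_target_list() really do share the 'cpu' iterator, via the
&cpu pointer.

	/* Sketch of the shared-iterator structure described above */
	static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
	{
		int cpu;

		for_each_cpu(cpu, mask) {
			unsigned long cluster_id = cpu_logical_map(cpu) & ~0xffUL;
			u16 tlist;

			/* Advances 'cpu' over every CPU in this cluster */
			tlist = gic_compute_target_list(&cpu, mask, cluster_id);
			gic_send_sgi(cluster_id, tlist, irq);
		}
	}

With the fix above, 'cpu' is left pointing at the last valid CPU of the
cluster, so the loop's own cpumask_next() terminates without tripping the
DEBUG_PER_CPU_MAPS check.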
Re: [RFC] Arm64 boot fail with numa enable in BIOS
On 19/09/16 15:07, Mark Rutland wrote:
> On Mon, Sep 19, 2016 at 09:05:26PM +0800, Yisheng Xie wrote:
>> For the crash log, it seems caused by error number of cpumask.
>> Any ideas about it?
>
> Much earlier in your log, there was a (non-fatal) warning, as below. Do
> you see this without NUMA/SRAT enabled in your FW?
>
>> [0.297337] Detected PIPT I-cache on CPU1
>> [0.297347] GICv3: CPU1: found redistributor 10001 region 1:0x4d14
>> [0.297356] CPU1: Booted secondary processor [410fd082]
>> [0.297375] [ cut here ]
>> [0.320390] WARNING: CPU: 1 PID: 0 at ./include/linux/cpumask.h:121 gic_raise_softirq+0x128/0x17c
>> [0.329356] Modules linked in:
>> [0.332434]
>> [0.333932] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc4-00163-g803ea3a #21
>> [0.341581] Hardware name: Hisilicon Hi1616 Evaluation Board (DT)
>> [0.347735] task: 8013e9dd task.stack: 8013e9dcc000
>> [0.353714] PC is at gic_raise_softirq+0x128/0x17c
>> [0.358550] LR is at gic_raise_softirq+0xa0/0x17c

I've seen this first trace when built with DEBUG_PER_CPU_MAPS. My version
of this trace[0] was just noise due to gic_compute_target_list() and
gic_raise_softirq() sharing an iterator. This patch silenced it for me:
https://lkml.org/lkml/2016/9/19/623

Yours may be a different problem with the same symptom.

Thanks,

James

[0] gicv3 trace when built with DEBUG_PER_CPU_MAPS
[3.077738] GICv3: CPU1: found redistributor 1 region 0:0x2f12
[3.077943] CPU1: Booted secondary processor [410fd0f0]
[3.078542] [ cut here ]
[3.078746] WARNING: CPU: 1 PID: 0 at ../include/linux/cpumask.h:121 gic_raise_softirq+0x12c/0x170
[3.078812] Modules linked in:
[3.078869]
[3.078930] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.8.0-rc5+ #5188
[3.078994] Hardware name: Foundation-v8A (DT)
[3.079059] task: 80087a1a0080 task.stack: 80087a19c000
[3.079145] PC is at gic_raise_softirq+0x12c/0x170
[3.079226] LR is at gic_raise_softirq+0xa4/0x170
Re: [PATCH v3 2/3] hwmon: xgene: Add hwmon driver
Hi,

On 08/09/16 09:14, Arnd Bergmann wrote:
> On Wednesday, September 7, 2016 3:37:05 PM CEST Guenter Roeck wrote:
>> On Wed, Sep 07, 2016 at 11:41:44PM +0200, Arnd Bergmann wrote:
>>> On Thursday, July 21, 2016 1:55:56 PM CEST Hoan Tran wrote:
>>>> +	ctx->comm_base_addr = cppc_ss->base_address;
>>>> +	if (ctx->comm_base_addr) {
>>>> +		ctx->pcc_comm_addr =
>>>> +			acpi_os_ioremap(ctx->comm_base_addr,
>>>> +					cppc_ss->length);
>>>
>>> This causes the arm64 allmodconfig build to fail now, according to
>>> kernelci:
>>>
>>> 1 ERROR: "memblock_is_memory" [drivers/hwmon/xgene-hwmon.ko] undefined!
>>>
>>> Should this perhaps call ioremap() or memremap() instead?
>>>
>> Hmmm ... almost sounds to me like blaming the messenger. e7cd190385d1
>> ("arm64: mark reserved memblock regions explicitly in iomem") starts
>> using a function in acpi_os_ioremap() which is not exported. On top of
>> that, memblock_is_memory() is declared as __init_memblock, which makes
>> me really uncomfortable.
>> If acpi_os_ioremap() must not be used by modules, and possibly only
>> during early (?) initialization, maybe its declaration should state
>> those limitations ?
>
> Ah, I didn't notice that. I guess both patches were correct individually
> and got added to linux-next around the same time but caused allmodconfig
> to blow up when used together.
>
> Adding everyone who was involved in the memblock patch to Cc here, maybe
> one of them has an idea what the correct fix is. There are only two other
> drivers using acpi_os_ioremap() and one of them is x86-specific, so it's
> still likely that drivers are not actually supposed to use this symbol.
> Making acpi_os_ioremap() an exported function in arm64 would also work.

You could use acpi_os_map_iomem()/acpi_os_unmap_iomem() from acpi/acpi_io.h.
If there isn't an existing mapping these end up in acpi_os_ioremap(), and
are already EXPORT_SYMBOL_GPL().

(I'm still waiting for allmodconfig on linux-next to finish building)

Thanks,

James
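To make the suggestion concrete, here is a minimal sketch of the
substitution (assuming the ctx/cppc_ss names from the quoted patch; the
error handling is illustrative, not taken from the driver):

	#include <acpi/acpi_io.h>

	ctx->comm_base_addr = cppc_ss->base_address;
	if (ctx->comm_base_addr) {
		/*
		 * Reuses an existing ACPI mapping if there is one, otherwise
		 * falls back to acpi_os_ioremap() - and is EXPORT_SYMBOL_GPL(),
		 * so it is safe to call from a module like xgene-hwmon.
		 */
		ctx->pcc_comm_addr = acpi_os_map_iomem(ctx->comm_base_addr,
						       cppc_ss->length);
		if (!ctx->pcc_comm_addr)
			return -ENOMEM;
	}

	/* ... and on teardown: */
	acpi_os_unmap_iomem(ctx->pcc_comm_addr, cppc_ss->length);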
Re: [PATCH v3 2/3] hwmon: xgene: Add hwmon driver
Hi,

On 09/09/16 04:18, AKASHI Takahiro wrote:
> On Thu, Sep 08, 2016 at 11:47:59AM +0100, James Morse wrote:
>> On 08/09/16 09:14, Arnd Bergmann wrote:
>>> On Wednesday, September 7, 2016 3:37:05 PM CEST Guenter Roeck wrote:
>>>> On Wed, Sep 07, 2016 at 11:41:44PM +0200, Arnd Bergmann wrote:
>>>>> On Thursday, July 21, 2016 1:55:56 PM CEST Hoan Tran wrote:
>>>>>> +	ctx->comm_base_addr = cppc_ss->base_address;
>>>>>> +	if (ctx->comm_base_addr) {
>>>>>> +		ctx->pcc_comm_addr =
>>>>>> +			acpi_os_ioremap(ctx->comm_base_addr,
>>>>>> +					cppc_ss->length);
>>>>>
>>>>> This causes the arm64 allmodconfig build to fail now, according to
>>>>> kernelci:
>>>>>
>>>>> 1 ERROR: "memblock_is_memory" [drivers/hwmon/xgene-hwmon.ko] undefined!
>>>>>
>>>>> Should this perhaps call ioremap() or memremap() instead?
>>>>>
>>>> Hmmm ... almost sounds to me like blaming the messenger. e7cd190385d1
>>>> ("arm64: mark reserved memblock regions explicitly in iomem") starts
>>>> using a function in acpi_os_ioremap() which is not exported. On top of
>>>> that, memblock_is_memory() is declared as __init_memblock, which makes
>>>> me really uncomfortable.
>>>> If acpi_os_ioremap() must not be used by modules, and possibly only
>>>> during early (?) initialization, maybe its declaration should state
>>>> those limitations ?
>>>
>>> Ah, I didn't notice that. I guess both patches were correct individually
>>> and got added to linux-next around the same time but caused allmodconfig
>>> to blow up when used together.
>>>
>>> Adding everyone who was involved in the memblock patch to Cc here, maybe
>>> one of them has an idea what the correct fix is. There are only two other
>>> drivers using acpi_os_ioremap() and one of them is x86-specific, so it's
>>> still likely that drivers are not actually supposed to use this symbol.
>>> Making acpi_os_ioremap() an exported function in arm64 would also work.
>>
>> You could use acpi_os_map_iomem()/acpi_os_unmap_iomem() from
>> acpi/acpi_io.h. If there isn't an existing mapping these end up in
>> acpi_os_ioremap(), and are already EXPORT_SYMBOL_GPL().
>
> acpi_os_ioremap() is re-defined in arm64/include/asm/acpi.h.
>
> The problem is that, as memblock_is_memory() is declared as __init,

__init_memblock ...

... as is memblock_is_map_memory(), which we call from pfn_valid(), which
is EXPORT_SYMBOL()'d and used from modules (e.g. mac80211.ko). So something
fishy is going on...

From include/linux/memblock.h:
> #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
> #define __init_memblock __meminit
> #define __initdata_memblock __meminitdata
> #else
> #define __init_memblock
> #define __initdata_memblock
> #endif

arm64 doesn't define ARCH_DISCARD_MEMBLOCK, so we always keep these symbols.
If we didn't, pfn_valid() would break too.

Thanks,

James
Re: [PATCH v5 0/3] arm64: hibernate: Resume when hibernate image created on non-boot CPU
Hi Rafael,

On 14/09/16 02:07, Rafael J. Wysocki wrote:
> What's the status of this?

Will has queued it in his for-next/core branch.

Thanks,

James
Re: [PATCH v6 5/7] arm64: kvm: route synchronous external abort exceptions to el2
Hi gengdongjiu,

On 14/09/17 12:12, gengdongjiu wrote:
> On 2017/9/8 0:31, James Morse wrote:
>> KVM already handles external aborts from lower exception levels, no more
>> work needs doing for TEA.
>
> If it is firmware first solution, that is SCR_EL3.EA=1, all SError
> interrupt and synchronous External Abort exceptions are taken to EL3, so
> EL3 firmware will handle it, KVM no needs to handle it.
>
> HCR_EL3.TEA is only for EL3 to check its value to decide to jump to
> hypervisor or kernel.
>
>> What happens when a guest access the RAS-Error-Record registers?
>>
>> Before we can set HCR_EL2.TERR I think we need to add some minimal
>> emulation for the registers it traps. Most of them should be RAZ/WI, so
>> it should be straightforward. (I think KVMs default is to emulate an
>> undef for unknown traps).
>
> Today I added the support to do some minimal emulation for
> RAS-Error-Record registers, thanks for the good suggestion.

Where can I find this patch?

I'd like to repost it as part of the SError_rework/RAS/IESB series: this is
one of the bits KVM needs, but I didn't touch it as it looks like your
updated version of this patch should cover it.

Thanks,

James
[RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap
GHES is doing ioremap_page_range() in both NMI and irq context, neither is
safe as it may sleep to allocate intermediate levels of page table.

Replace the NMI/irq GHES_IOREMAP_PAGES with a fixmap entry each. After this
nothing uses ghes_ioremap_area or arch_apei_flush_tlb_one(), rip them out.

RFC as I've only build-tested this on x86. For arm64 I've tested it on a
software model. Any more testing would be welcome.

These patches are based on rc7.

Thanks,

James Morse (6):
  arm64: fixmap: Add GHES fixmap entries
  x86/mm/fixmap: Add GHES fixmap entries
  ACPI / APEI: Replace ioremap_page_range() with fixmap
  ACPI / APEI: Remove ghes_ioremap_area
  arm64: mm: Remove arch_apei_flush_tlb_one()
  ACPI / APEI: Remove arch_apei_flush_tlb_one()

 arch/arm64/include/asm/acpi.h   | 12 ------
 arch/arm64/include/asm/fixmap.h |  5 +++
 arch/arm64/mm/mmu.c             |  4 ++
 arch/x86/include/asm/fixmap.h   |  4 ++
 arch/x86/kernel/acpi/apei.c     |  5 ---
 drivers/acpi/apei/ghes.c        | 84 ++++++++++-----------------------
 include/acpi/apei.h             |  1 -
 7 files changed, 30 insertions(+), 85 deletions(-)

-- 
2.15.0.rc2
[RFC/RFT PATCH 1/6] arm64: fixmap: Add GHES fixmap entries
GHES is switching to use fixmap for its dynamic mapping of CPER records, to
avoid using ioremap_page_range() in IRQ/NMI context.

Signed-off-by: James Morse
---
 arch/arm64/include/asm/fixmap.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index caf86be815ba..4edcdcb01a68 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -51,6 +51,11 @@ enum fixed_addresses {
 	FIX_EARLYCON_MEM_BASE,
 	FIX_TEXT_POKE0,
+
+	/* Used for GHES mapping from assorted contexts */
+	FIX_APEI_GHES_IRQ,
+	FIX_APEI_GHES_NMI,
+
 	__end_of_permanent_fixed_addresses,
 
 	/*
-- 
2.15.0.rc2
[RFC/RFT PATCH 2/6] x86/mm/fixmap: Add GHES fixmap entries
GHES is switching to use fixmap for its dynamic mapping of CPER records, to
avoid using ioremap_page_range() in IRQ/NMI context.

Signed-off-by: James Morse
---
 arch/x86/include/asm/fixmap.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..be3cc32db7f0 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,6 +104,10 @@ enum fixed_addresses {
 	FIX_GDT_REMAP_BEGIN,
 	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	/* Used for GHES mapping from assorted contexts */
+	FIX_APEI_GHES_IRQ,
+	FIX_APEI_GHES_NMI,
+
 	__end_of_permanent_fixed_addresses,
 
 	/*
-- 
2.15.0.rc2
[RFC/RFT PATCH 6/6] ACPI / APEI: Remove arch_apei_flush_tlb_one()
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().

Remove arch_apei_flush_tlb_one().

Signed-off-by: James Morse
---
 arch/x86/kernel/acpi/apei.c | 5 -----
 include/acpi/apei.h         | 1 -
 2 files changed, 6 deletions(-)

diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index ea3046e0b0cf..bb8d300fecbd 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -52,8 +52,3 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
-	__flush_tlb_one(addr);
-}
diff --git a/include/acpi/apei.h b/include/acpi/apei.h
index c46694abea28..82c451698c98 100644
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -50,7 +50,6 @@ int erst_clear(u64 record_id);
 
 int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
 void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
 
 #endif
 #endif
-- 
2.15.0.rc2
[RFC/RFT PATCH 5/6] arm64: mm: Remove arch_apei_flush_tlb_one()
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_fixmap() to do the invalidation. Remove it.

Move the IPI-considered-harmful comment to __set_fixmap().

Signed-off-by: James Morse
---
 arch/arm64/include/asm/acpi.h | 12 ------------
 arch/arm64/mm/mmu.c           |  4 ++++
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 59cca1d6ec54..32f465a80e4e 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -126,18 +126,6 @@ static inline const char *acpi_get_enable_method(int cpu)
  */
 #define acpi_disable_cmcff 1
 pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr);
-
-/*
- * Despite its name, this function must still broadcast the TLB
- * invalidation in order to ensure other CPUs don't end up with junk
- * entries as a result of speculation. Unusually, its also called in
- * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
- * TLB broadcasting, then we're in trouble here.
- */
-static inline void arch_apei_flush_tlb_one(unsigned long addr)
-{
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-}
 #endif /* CONFIG_ACPI_APEI */
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..267d2b79d52d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -778,6 +778,10 @@ void __init early_fixmap_init(void)
 	}
 }
 
+/*
+ * Unusually, this is also called in IRQ context (ghes_iounmap_irq) so if we
+ * ever need to use IPIs for TLB broadcasting, then we're in trouble here.
+ */
 void __set_fixmap(enum fixed_addresses idx,
 			       phys_addr_t phys, pgprot_t flags)
 {
-- 
2.15.0.rc2
[RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap
Replace ghes_io{re,un}map_pfn_{nmi,irq}()s use of ioremap_page_range() with
__set_fixmap() as ioremap_page_range() may sleep to allocate a new level of
page-table, even if it's passed an existing final-address to use in the
mapping.

clear_fixmap() does the TLB invalidation in __set_fixmap() for arm64 and
__set_pte_vaddr() for x86. In each case it's the same as the respective
arch_apei_flush_tlb_one().

Reported-by: Fengguang Wu
Suggested-by: Linus Torvalds
Signed-off-by: James Morse
CC: Tyler Baicar
CC: Dongjiu Geng
CC: Xie XiuQi
---
CC'd people I've seen posting CPER log fragments, could you give this a
test on your platforms?

 drivers/acpi/apei/ghes.c | 45 +++++++++++++-----------------------------
 1 file changed, 15 insertions(+), 30 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3c3a37b8503b..f3269816b997 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -51,6 +51,7 @@
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
+#include <asm/fixmap.h>
 #include <asm/tlbflush.h>
 #include <ras/ras_event.h>
 
@@ -112,7 +113,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * Because the memory area used to transfer hardware error information
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
- * a special version of atomic ioremap is implemented for that.
+ * the fixmap is used instead.
  */
 
 /*
@@ -126,8 +127,8 @@ static DEFINE_MUTEX(ghes_list_mutex);
 /* virtual memory area for atomic ioremap */
 static struct vm_struct *ghes_ioremap_area;
 /*
- * These 2 spinlock is used to prevent atomic ioremap virtual memory
- * area from being mapped simultaneously.
+ * These 2 spinlocks are used to prevent the fixmap entries from being used
+ * simultaneously.
  */
 static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
 static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);
@@ -159,52 +160,36 @@ static void ghes_ioremap_exit(void)
 
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
-	unsigned long vaddr;
 	phys_addr_t paddr;
 	pgprot_t prot;
 
-	vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-
 	paddr = pfn << PAGE_SHIFT;
 	prot = arch_apei_get_mem_attribute(paddr);
-	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
+	__set_fixmap(FIX_APEI_GHES_NMI, paddr, prot);
 
-	return (void __iomem *)vaddr;
+	return (void __iomem *) fix_to_virt(FIX_APEI_GHES_NMI);
 }
 
 static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
 {
-	unsigned long vaddr, paddr;
+	unsigned long paddr;
 	pgprot_t prot;
 
-	vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
-
 	paddr = pfn << PAGE_SHIFT;
 	prot = arch_apei_get_mem_attribute(paddr);
+	__set_fixmap(FIX_APEI_GHES_IRQ, paddr, prot);
 
-	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
-
-	return (void __iomem *)vaddr;
+	return (void __iomem *) fix_to_virt(FIX_APEI_GHES_IRQ);
 }
 
-static void ghes_iounmap_nmi(void __iomem *vaddr_ptr)
+static void ghes_iounmap_nmi(void)
 {
-	unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-	void *base = ghes_ioremap_area->addr;
-
-	BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_NMI_PAGE(base));
-	unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-	arch_apei_flush_tlb_one(vaddr);
+	clear_fixmap(FIX_APEI_GHES_NMI);
 }
 
-static void ghes_iounmap_irq(void __iomem *vaddr_ptr)
+static void ghes_iounmap_irq(void)
 {
-	unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-	void *base = ghes_ioremap_area->addr;
-
-	BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_IRQ_PAGE(base));
-	unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-	arch_apei_flush_tlb_one(vaddr);
+	clear_fixmap(FIX_APEI_GHES_IRQ);
 }
 
 static int ghes_estatus_pool_init(void)
@@ -360,10 +345,10 @@ static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
 		paddr += trunk;
 		buffer += trunk;
 		if (in_nmi) {
-			ghes_iounmap_nmi(vaddr);
+			ghes_iounmap_nmi();
 			raw_spin_unlock(&ghes_ioremap_lock_nmi);
 		} else {
-			ghes_iounmap_irq(vaddr);
+			ghes_iounmap_irq();
 			spin_unlock_irqrestore(&ghes_ioremap_lock_irq, flags);
 		}
 	}
-- 
2.15.0.rc2
[RFC/RFT PATCH 4/6] ACPI / APEI: Remove ghes_ioremap_area
Now that nothing is using the ghes_ioremap_area pages, rip them out.

Signed-off-by: James Morse
---
 drivers/acpi/apei/ghes.c | 39 ++-------------------------------------
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f3269816b997..1a4f16b10052 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
  * the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES		2
-#define GHES_IOREMAP_IRQ_PAGE(base)	(base)
-#define GHES_IOREMAP_NMI_PAGE(base)	((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
  * These 2 spinlocks are used to prevent the fixmap entries from being used
  * simultaneously.
  */
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
 
 static int ghes_panic_timeout __read_mostly = 30;
 
-static int ghes_ioremap_init(void)
-{
-	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
-					  VM_IOREMAP, VMALLOC_START, VMALLOC_END);
-	if (!ghes_ioremap_area) {
-		pr_err(GHES_PFX "Failed to allocate virtual memory area for atomic ioremap.\n");
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
-	free_vm_area(ghes_ioremap_area);
-}
-
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
 	phys_addr_t paddr;
@@ -1270,13 +1241,9 @@ static int __init ghes_init(void)
 
 	ghes_nmi_init_cxt();
 
-	rc = ghes_ioremap_init();
-	if (rc)
-		goto err;
-
 	rc = ghes_estatus_pool_init();
 	if (rc)
-		goto err_ioremap_exit;
+		goto err;
 
 	rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
 				      GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1300,8 +1267,6 @@ static int __init ghes_init(void)
 	return 0;
 err_pool_exit:
 	ghes_estatus_pool_exit();
-err_ioremap_exit:
-	ghes_ioremap_exit();
 err:
 	return rc;
 }
-- 
2.15.0.rc2
Re: [PATCH v1 0/3] manually add Error Synchronization Barrier at exception handler entry and exit
Hi Dongjiu Geng,

On 01/11/17 19:14, Dongjiu Geng wrote:
> Some hardware platform can support RAS Extension, but not support IESB,
> such as Huawei's platform, so software need to insert Synchronization
> Barrier operations at exception handler entry.
>
> This series patches are based on James's series patches "SError rework +
> RAS&IESB for firmware first support". In Huawei's platform, we do not
> support IESB, so software needs to insert that.

Surely you don't implement it because your CPU doesn't need it. Can
unrecoverable errors really cross an exception without becoming an SError?

The ESB instruction does the barrier, but it also consumes any pending
SError. As it is, this series will silently consume-and-discard
uncontainable errors.

Thanks,

James
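For reference, a minimal sketch of the manually-inserted barrier being
discussed (my illustration, not code from the series): ESB is encoded as a
hint instruction, so it executes as a NOP on CPUs without the RAS extension.

	/*
	 * "esb" is HINT #16. With SError masked it defers any pending SError
	 * into DISR_EL1 instead of delivering it - which is exactly the
	 * consume-and-discard risk described above if nothing then checks
	 * DISR_EL1 for an uncontainable error.
	 */
	#define esb()	asm volatile("hint #16" : : : "memory")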
Re: [RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap
Hi gengdongjiu,

On 01/11/17 04:13, gengdongjiu wrote:
> On 2017/10/31 23:38, James Morse wrote:
>> CC'd people I've seen posting CPER log fragments, could you give this a
>> test on your platforms?
>
> Thanks for the fixing, not found obviously issue.

Can I take that as a 'Tested-by:'?

These tags also let us record who has a system that can test changes to
this driver.

Thanks,

James
Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap
Hi guys,

(+CC: Chen Gong and Huang Ying from the git log of [0])

On 31/10/17 15:38, James Morse wrote:
> RFC as I've only build-tested this on x86.

Does anyone have an x86 machine that does firmware-first using NOTIFY_NMI?

> Any more testing would be welcome.

('ls /sys/firmware/acpi/tables/', if you have a HEST and EINJ we should be
able to work something out)

Thanks,

James

[0] https://www.kernel.org/doc/Documentation/acpi/apei/einj.txt
Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap
Hi Linus,

On 31/10/17 15:52, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 8:38 AM, James Morse wrote:
>> 7 files changed, 30 insertions(+), 85 deletions(-)
>
> Lovely.
>
> I obviously can't test it, but it looks fine. I *would* suggest just
> making the "add fixmap entries" commits with the code that actually
> uses them. There's no real reason to have two commits that just add
> two entries that aren't used yet.
>
> If it was some meaningful helper function where a split of the commits
> makes each commit easier to read, that would be one thing. As it is,
> the split just makes it harder to look at the history of the code (ie
> "I wonder where this was introduced - let's use 'git blame'. Oh,
> that's not useful").

I will squash the first three patches together to fix this, and add
something about HAVE_ACPI_APEI to the commit message.

(I'm still holding out for someone to test this on an x86 system with
NOTIFY_NMI)

Thanks,

James
Re: [PATCH v8 0/7] Support RAS virtualization in KVM
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> This series patches mainly do below things:
>
> 1. Trap RAS ERR* registers Accesses to EL2 from Non-secure EL1,
>    KVM will will do a minimum simulation, there registers are simulated
>    to RAZ/WI in KVM.
> 2. Route synchronous External Abort exceptions from Non-secure EL0
>    and EL1 to EL2. When exception EL3 routing is enabled by firmware,
>    system will trap to EL3 firmware instead of EL2 KVM, then firmware
>    judges whether El2 routing is enabled, if enabled, jump to EL2 KVM,
>    otherwise jump to EL1 host kernel.
> 3. Enable APEI ARv8 SEI notification to parse the CPER records for SError
>    in the ACPI GHES driver, KVM will call handle_guest_sei() to let ACPI
>    driver to parse the CPER record for SError which happened in the guest
> 4. Although we can use APEI driver to handle the guest SError, but not all
>    system support SEI notification, such as kernel-first. So here KVM will
>    also classify the Error through Exception Syndrome Register and do
>    different approaches according to Asynchronous Error Type
> 5. If the guest SError error is not propagated and not consumed, then KVM
>    return recoverable error status to user-space, user-space will specify
>    the guest ESR

I thought we'd gone over this. There should be no RAS errors/notifications
in user space. Only the symptoms should be sent, using SIGBUS_MCEERR_A{O,R}
once the kernel has handled as much as it can.

This hides the actual mechanisms the kernel and firmware used. User-space
should not have to know how to handle RAS errors directly. This is a
service the operating system provides for it. This abstraction means the
same user-space code is portable between x86, arm64, powerpc etc.

What if the firmware uses another notification method? User space should
expect the kernel to hide things like this from it. If the kernel has no
information to interpret a notification, how is user space supposed to know?

I understand you are trying to work around your 'memory corruption at an
unknown address'[0] problem, but if the kernel can't know where this
corrupt memory is, it should really reboot. What stops this corrupt data
being swapped to disk?

Killing 'the thing' that was running at the time is not sufficient because
we don't know that this 'got' all the users of the corrupt memory. KSM can
merge pages between guests. This is the difference between the error
persisting forever killing off all the VMs one by one, and the corrupt page
being silently re-read from disk, clearing the error.

>    and inject a virtual SError. For other Asynchronous Error Type, KVM
>    directly injects virtual SError with IMPLEMENTATION DEFINED ESR or KVM
>    is panic if the error is fatal. In the RAS extension, guest virtual ESR
>    must be set, because all-zero means 'RAS error: Uncategorized' instead
>    of 'no valid ISS', so set this ESR to IMPLEMENTATION DEFINED by default
>    if user space does not specify it.

Thanks,

James

[0] https://www.spinics.net/lists/arm-kernel/msg605345.html
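To illustrate the portable contract being described (a user-space sketch of
mine, not code from this thread): whatever notification the firmware used,
the process only ever sees a SIGBUS with an MCEERR si_code.

	#include <signal.h>
	#include <stdio.h>

	/* Demo only: real handlers should stick to async-signal-safe calls. */
	static void mceerr_handler(int sig, siginfo_t *si, void *ctx)
	{
		if (si->si_code == BUS_MCEERR_AR)	/* action required */
			fprintf(stderr, "poisoned page at %p\n", si->si_addr);
		else if (si->si_code == BUS_MCEERR_AO)	/* action optional */
			fprintf(stderr, "async poison near %p\n", si->si_addr);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= mceerr_handler,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGBUS, &sa, NULL);
		/* ... run the workload ... */
		return 0;
	}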
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> If it is not RAS SError, directly inject virtual SError,
> which will keep the old way. If it is RAS SError, firstly
> let host ACPI module to handle it. For the ACPI handling,
> if the error address is invalid, APEI driver will not
> identify the address to hwpoison memory and can not notify
> guest to do the recovery.

The guest can't do any recovery either. There is no recovery you can do
without some information about what the error is.

This is your memory corruption at an unknown address? We should reboot.

(I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we
should try and fix this. It makes some sense for polled or irq
notifications, but not SEA/SEI).

> In order to safe, KVM continues
> categorizing errors and handle it separately.
> If the RAS error is not propagated, let host user space to
> handle it.

No. Host user space should not know anything about the kernel or platform
RAS support. Doing so creates an ABI link between EL3 firmware and Qemu.
This is totally unmaintainable. This thing needs to be portable.

The kernel should handle the error, and report any symptoms to user-space,
e.g. 'this memory is gone'. We shouldn't special-case KVM.

> The reason is that sometimes we can only kill the
> guest effected application instead of panic whose guest OS.
> Host user space specifies a valid ESR and inject virtual
> SError, guest can just kill the current application if the
> non-consumed error coming from guest application.
>
> Signed-off-by: Dongjiu Geng
> Signed-off-by: Quanming Wu

The last Signed-off-by should match the person posting the patch. It's a
chain of custody for GPL-signoff purposes, not a 'partially-written-by'.

If you want to credit Quanming Wu you can add CC and they can Ack/Review
your patch.

> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 7debb74..1afdc87 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct kvm_vcpu *vcpu)
> 	return arm_exit_handlers[hsr_ec];
>  }
>
> +/**
> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
> + * @vcpu:	the VCPU pointer
> + *
> + * For RAS SError interrupt, firstly let host kernel handle it.
> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
> + */
> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> +	unsigned int esr = kvm_vcpu_get_hsr(vcpu);
> +	bool impdef_syndrome = esr & ESR_ELx_ISV;	/* aka IDS */
> +	unsigned int aet = esr & ESR_ELx_AET;
> +
> +	/*
> +	 * This is not RAS SError
> +	 */
> +	if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
> +		kvm_inject_vabt(vcpu);
> +		return 1;
> +	}
> +	/* The host kernel may handle this abort. */
> +	handle_guest_sei();

This has to claim the SError as a notification. If APEI claims the error,
KVM doesn't need to do anything more. You ignore its return code.

> +
> +	/*
> +	 * In below two conditions, it will directly inject the
> +	 * virtual SError:
> +	 * 1. The Syndrome is IMPLEMENTATION DEFINED
> +	 * 2. It is Uncategorized SEI
> +	 */
> +	if (impdef_syndrome ||
> +	    ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
> +		kvm_inject_vabt(vcpu);
> +		return 1;
> +	}
> +
> +	switch (aet) {
> +	case ESR_ELx_AET_CE:	/* corrected error */
> +	case ESR_ELx_AET_UEO:	/* restartable error, not yet consumed */
> +		return 1;	/* continue processing the guest exit */
> +	case ESR_ELx_AET_UER:	/* The error has not been propagated */
> +		/*
> +		 * Userspace only handle the guest SError Interrupt(SEI) if the
> +		 * error has not been propagated
> +		 */
> +		run->exit_reason = KVM_EXIT_EXCEPTION;
> +		run->ex.exception = ESR_ELx_EC_SERROR;
> +		run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> +		return 0;

We should not pass RAS notifications to user space. The kernel either
handles them, or it panics(). User space shouldn't even know if the kernel
supports RAS until it gets an MCEERR signal.

You're making your firmware-first notification an EL3->EL0 signal,
bypassing the OS.

If we get a RAS SError and there are no CPER records or values in the ERR
nodes, we should panic as it looks like the CPU/firmware is broken.
(spurious RAS errors)

> +	default:
> +		/*
> +		 * Until now, the CPU supports RAS and SEI is fatal, or host
> +		 * does not support to handle the SError.
> +		 */
> +		panic("This Asynchronous SError interrupt is dangerous, panic");
> +	}
> +
> +	return 0;
> +}
> +
>  /*
>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>   * proper exit to userspace.
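A minimal sketch of the claim-the-notification flow being asked for here
(assuming handle_guest_sei() is reworked to return 0 when APEI claimed and
handled the error - that return convention is my assumption, not the
patch's):

	/*
	 * Let APEI claim the SError first; only fall back to classifying
	 * the syndrome if nothing claimed the notification.
	 */
	if (!handle_guest_sei())
		return 1;	/* handled, return to the guest */

	/* ... only now look at the ESR ... */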
get_online_cpus() from a preemptible() context (bug?)
Hi Thomas, Peter,

I'm trying to work out what stops a thread being pre-empted and migrated
between calling get_online_cpus() and put_online_cpus().

According to __percpu_down_read(), it's the pre-empt count:
> * Due to having preemption disabled the decrement happens on
> * the same CPU as the increment, avoiding the
> * increment-on-one-CPU-and-decrement-on-another problem.

So this:
> void cpus_read_lock(void)
> {
> 	percpu_down_read(&cpu_hotplug_lock);
> +
> +	/* Can we be migrated before we release this per-cpu lock? */
> +	WARN_ON(preemptible());
> }

should never fire? It does, some of the offenders:
> kmem_cache_create
> apply_workqueue_attrs
> stop_machine
> static_key_enable
> lru_add_drain_all
> __cpuhp_setup_state
> kmem_cache_shrink
> vmstat_shepherd
> __cpuhp_state_add_instance

Trying to leave preempt disabled between the down/up leads to
scheduling-while-atomic instead.

Can you point out what I've missed here?

Thanks,

James
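For reference, the fast paths in question look roughly like this (a
paraphrased sketch of include/linux/percpu-rwsem.h from around v4.14, not
the verbatim kernel code): preemption is only disabled across each
this_cpu operation, not across the whole read-side critical section.

	static inline void percpu_down_read(struct percpu_rw_semaphore *sem)
	{
		preempt_disable();
		if (likely(rcu_sync_is_idle(&sem->rss)))
			__this_cpu_inc(*sem->read_count);	/* on CPU A */
		else
			__percpu_down_read(sem, false);
		preempt_enable();
		/* <-- the reader can be preempted and migrated here */
	}

	static inline void percpu_up_read(struct percpu_rw_semaphore *sem)
	{
		preempt_disable();
		if (likely(rcu_sync_is_idle(&sem->rss)))
			__this_cpu_dec(*sem->read_count);	/* maybe CPU B */
		else
			__percpu_up_read(sem);
		preempt_enable();
	}

The increment and decrement can land on different CPUs' read_count; the
semaphore copes because the writer sums the counters over all CPUs - the
'same CPU' comment quoted above only covers the slow path inside
__percpu_down_read().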
Re: [RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap
Hi gengdongjiu,

On 02/11/17 12:01, gengdongjiu wrote:
> James Morse wrote:
>> Can I take that as a 'Tested-by:'?
>>
>> These tags also let us record who has a system that can test changes to
>> this driver.
>
> sure.
> Thanks for the fixing.
> Qiang Zheng who is my colleague have tested it.
>
> CC Qiang.
>
> Tested-by: Qiang Zheng

I can't do anything with this unless Qiang posts it. (I'll add to the CC
list in the next version)

From Documentation/process/5.Posting.rst:
> Be careful in the addition of tags to your patches: only Cc: is appropriate
> for addition without the explicit permission of the person named.

Thanks,

James
Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap
On 01/11/17 18:20, Kani, Toshimitsu wrote:
> On Wed, 2017-11-01 at 16:30 +0100, Borislav Petkov wrote:
>> On Wed, Nov 01, 2017 at 02:58:33PM +0000, James Morse wrote:
>>> Does anyone have an x86 machine that does firmware-first using
>>> NOTIFY_NMI?
>>
>> AFAIK, the only one who has access to a reportedly somewhat working GHES
>> implementation is Toshi. CCed.
>
> Thanks for the heads-up. My x86 system only supports GHES with SCI
> error sources. It uses MCE for synchronous error events.
>
> So, for x86 SCI error sources:
>
> Tested-by: Toshi Kani

Thanks Toshi!

> nit: I think ghes_ioremap_pfn_[nmi|irq] should be renamed since they no
> longer use ioremap.

I thought that would be too noisy... I have some more rework to do in here
before we can merge the other arm64 RAS stuff, I can take that into account
then.

Thanks,

James
[PATCH 0/4] Switch GHES ioremap_page_range() to use fixmap
GHES is doing ioremap_page_range() in both NMI and irq context, neither is
safe as it may sleep to allocate intermediate levels of page table.

Replace the NMI/irq GHES_IOREMAP_PAGES with a fixmap entry each. After this
nothing uses ghes_ioremap_area or arch_apei_flush_tlb_one(), rip them out.

This hasn't been tested on a system with x86's NOTIFY_NMI. Any more testing
would be welcome.

These patches are (still) based on rc7.

Changes since RFC:
 * Added #ifdefs around the entries in fixmap.h
 * Added a paragraph about HAVE_ACPI_APEI to the commit message
 * Merged the first three patches for improved history

I've tried to be clear with who-acked-what when merging the patches. For
reference, the arch-acks are here:
https://lkml.org/lkml/2017/11/2/254
https://lkml.org/lkml/2017/10/31/780

Thanks,

James Morse (4):
  ACPI / APEI: Replace ioremap_page_range() with fixmap
  ACPI / APEI: Remove ghes_ioremap_area
  arm64: mm: Remove arch_apei_flush_tlb_one()
  ACPI / APEI: Remove arch_apei_flush_tlb_one()

 arch/arm64/include/asm/acpi.h   | 12 ------
 arch/arm64/include/asm/fixmap.h |  7 +++
 arch/arm64/mm/mmu.c             |  4 ++
 arch/x86/include/asm/fixmap.h   |  6 +++
 arch/x86/kernel/acpi/apei.c     |  5 ---
 drivers/acpi/apei/ghes.c        | 84 ++++++++++-----------------------
 include/acpi/apei.h             |  1 -
 7 files changed, 34 insertions(+), 85 deletions(-)

-- 
2.15.0.rc2
[PATCH 2/4] ACPI / APEI: Remove ghes_ioremap_area
Now that nothing is using the ghes_ioremap_area pages, rip them out.

Signed-off-by: James Morse
Reviewed-by: Borislav Petkov
Tested-by: Tyler Baicar
---
 drivers/acpi/apei/ghes.c | 39 ++-------------------------------------
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f3269816b997..1a4f16b10052 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
  * the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES		2
-#define GHES_IOREMAP_IRQ_PAGE(base)	(base)
-#define GHES_IOREMAP_NMI_PAGE(base)	((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
  * These 2 spinlocks are used to prevent the fixmap entries from being used
  * simultaneously.
  */
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
 
 static int ghes_panic_timeout __read_mostly = 30;
 
-static int ghes_ioremap_init(void)
-{
-	ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
-					  VM_IOREMAP, VMALLOC_START, VMALLOC_END);
-	if (!ghes_ioremap_area) {
-		pr_err(GHES_PFX "Failed to allocate virtual memory area for atomic ioremap.\n");
-		return -ENOMEM;
-	}
-
-	return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
-	free_vm_area(ghes_ioremap_area);
-}
-
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
 	phys_addr_t paddr;
@@ -1270,13 +1241,9 @@ static int __init ghes_init(void)
 
 	ghes_nmi_init_cxt();
 
-	rc = ghes_ioremap_init();
-	if (rc)
-		goto err;
-
 	rc = ghes_estatus_pool_init();
 	if (rc)
-		goto err_ioremap_exit;
+		goto err;
 
 	rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
 				      GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1300,8 +1267,6 @@ static int __init ghes_init(void)
 	return 0;
 err_pool_exit:
 	ghes_estatus_pool_exit();
-err_ioremap_exit:
-	ghes_ioremap_exit();
 err:
 	return rc;
 }
-- 
2.15.0.rc2
[PATCH 4/4] ACPI / APEI: Remove arch_apei_flush_tlb_one()
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap().

Remove arch_apei_flush_tlb_one().

Signed-off-by: James Morse
Reviewed-by: Borislav Petkov
---
 arch/x86/kernel/acpi/apei.c | 5 -----
 include/acpi/apei.h         | 1 -
 2 files changed, 6 deletions(-)

diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index ea3046e0b0cf..bb8d300fecbd 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -52,8 +52,3 @@ void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
-	__flush_tlb_one(addr);
-}
diff --git a/include/acpi/apei.h b/include/acpi/apei.h
index c46694abea28..82c451698c98 100644
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -50,7 +50,6 @@ int erst_clear(u64 record_id);
 
 int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
 void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
 
 #endif
 #endif
-- 
2.15.0.rc2
[PATCH 1/4] ACPI / APEI: Replace ioremap_page_range() with fixmap
Replace ghes_io{re,un}map_pfn_{nmi,irq}()s use of ioremap_page_range() with
__set_fixmap() as ioremap_page_range() may sleep to allocate a new level of
page-table, even if it's passed an existing final-address to use in the
mapping.

The GHES driver can only be enabled for architectures that select
HAVE_ACPI_APEI: add fixmap entries to both x86 and arm64.

clear_fixmap() does the TLB invalidation in __set_fixmap() for arm64 and
__set_pte_vaddr() for x86. In each case it's the same as the respective
arch_apei_flush_tlb_one().

CC: Qiang Zheng
Reported-by: Fengguang Wu
Suggested-by: Linus Torvalds
Signed-off-by: James Morse
Reviewed-by: Borislav Petkov
Tested-by: Tyler Baicar
Tested-by: Toshi Kani
[ For the arm64 bits: ]
Acked-by: Will Deacon
[ For the x86 bits: ]
Acked-by: Ingo Molnar
---
Changes since RFC:
 * Added #ifdefs around the entries in fixmap.h
 * Added a paragraph about HAVE_ACPI_APEI to the commit message
 * Merged the first three patches for improved history

 arch/arm64/include/asm/fixmap.h |  7 ++++++
 arch/x86/include/asm/fixmap.h   |  6 +++++
 drivers/acpi/apei/ghes.c        | 45 ++++++++++++-----------------------
 3 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index caf86be815ba..4052ec39e8db 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -51,6 +51,13 @@ enum fixed_addresses {
 	FIX_EARLYCON_MEM_BASE,
 	FIX_TEXT_POKE0,
+
+#ifdef CONFIG_ACPI_APEI_GHES
+	/* Used for GHES mapping from assorted contexts */
+	FIX_APEI_GHES_IRQ,
+	FIX_APEI_GHES_NMI,
+#endif /* CONFIG_ACPI_APEI_GHES */
+
 	__end_of_permanent_fixed_addresses,
 
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..b0c505fe9a95 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,6 +104,12 @@ enum fixed_addresses {
 	FIX_GDT_REMAP_BEGIN,
 	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
 
+#ifdef CONFIG_ACPI_APEI_GHES
+	/* Used for GHES mapping from assorted contexts */
+	FIX_APEI_GHES_IRQ,
+	FIX_APEI_GHES_NMI,
+#endif
+
 	__end_of_permanent_fixed_addresses,
 
 	/*
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3c3a37b8503b..f3269816b997 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -51,6 +51,7 @@
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
 #include <acpi/apei.h>
+#include <asm/fixmap.h>
 #include <asm/tlbflush.h>
 #include <ras/ras_event.h>
 
@@ -112,7 +113,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * Because the memory area used to transfer hardware error information
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
- * a special version of atomic ioremap is implemented for that.
+ * the fixmap is used instead.
  */
 
 /*
@@ -126,8 +127,8 @@ static DEFINE_MUTEX(ghes_list_mutex);
 /* virtual memory area for atomic ioremap */
 static struct vm_struct *ghes_ioremap_area;
 /*
- * These 2 spinlock is used to prevent atomic ioremap virtual memory
- * area from being mapped simultaneously.
+ * These 2 spinlocks are used to prevent the fixmap entries from being used
+ * simultaneously.
  */
 static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
 static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);
@@ -159,52 +160,36 @@ static void ghes_ioremap_exit(void)
 
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
-	unsigned long vaddr;
 	phys_addr_t paddr;
 	pgprot_t prot;
 
-	vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-
 	paddr = pfn << PAGE_SHIFT;
 	prot = arch_apei_get_mem_attribute(paddr);
-	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
+	__set_fixmap(FIX_APEI_GHES_NMI, paddr, prot);
 
-	return (void __iomem *)vaddr;
+	return (void __iomem *) fix_to_virt(FIX_APEI_GHES_NMI);
 }
 
 static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
 {
-	unsigned long vaddr, paddr;
+	unsigned long paddr;
 	pgprot_t prot;
 
-	vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
-
 	paddr = pfn << PAGE_SHIFT;
 	prot = arch_apei_get_mem_attribute(paddr);
+	__set_fixmap(FIX_APEI_GHES_IRQ, paddr, prot);
 
-	ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
-
-	return (void __iomem *)vaddr;
+	return (void __iomem *) fix_to_virt(FIX_APEI_GHES_IRQ);
 }
 
-static void ghes_iounmap_nmi(void __iomem *vaddr_ptr)
+static void ghes_iounmap_nmi(void)
 {
-	unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-	void *base = ghes_ioremap_area->addr;
-
-	BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_NMI_PAGE(base));
-	unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-	arch_apei_flush_tlb_one(vaddr);
+	clear_fixmap(FIX_APEI_GHES_NMI);
 }
 
-static void ghes_iounmap_irq(void __iomem *vaddr_ptr)
+static void ghes_iounmap_irq(void)
 {
-	unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-	void *base = ghes_ioremap_area->addr;
-
-	BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_IRQ_PAGE(base));
-	unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-	arch_apei_flush_tlb_one(vaddr);
+	clear_fixmap(FIX_APEI_GHES_IRQ);
 }
 
 static int ghes_estatus_pool_init(void)
@@ -360,10 +345,10 @@ static void ghes_copy_tofrom_phys(void *buffer, u64 paddr, u32 len,
 		paddr += trunk;
 		buffer += trunk;
 		if (in_nmi) {
-			ghes_iounmap_nmi(vaddr);
+			ghes_iounmap_nmi();
 			raw_spin_unlock(&ghes_ioremap_lock_nmi);
 		} else {
-			ghes_iounmap_irq(vaddr);
+			ghes_iounmap_irq();
 			spin_unlock_irqrestore(&ghes_ioremap_lock_irq, flags);
 		}
 	}
-- 
2.15.0.rc2
[PATCH 3/4] arm64: mm: Remove arch_apei_flush_tlb_one()
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_fixmap() to do the invalidation. Remove it.

Move the IPI-considered-harmful comment to __set_fixmap().

Signed-off-by: James Morse
Acked-by: Will Deacon
Tested-by: Tyler Baicar
---
 arch/arm64/include/asm/acpi.h | 12 ------------
 arch/arm64/mm/mmu.c           |  4 ++++
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 59cca1d6ec54..32f465a80e4e 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -126,18 +126,6 @@ static inline const char *acpi_get_enable_method(int cpu)
  */
 #define acpi_disable_cmcff 1
 pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr);
-
-/*
- * Despite its name, this function must still broadcast the TLB
- * invalidation in order to ensure other CPUs don't end up with junk
- * entries as a result of speculation. Unusually, its also called in
- * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
- * TLB broadcasting, then we're in trouble here.
- */
-static inline void arch_apei_flush_tlb_one(unsigned long addr)
-{
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-}
 #endif /* CONFIG_ACPI_APEI */
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..267d2b79d52d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -778,6 +778,10 @@ void __init early_fixmap_init(void)
 	}
 }
 
+/*
+ * Unusually, this is also called in IRQ context (ghes_iounmap_irq) so if we
+ * ever need to use IPIs for TLB broadcasting, then we're in trouble here.
+ */
 void __set_fixmap(enum fixed_addresses idx,
 			       phys_addr_t phys, pgprot_t flags)
 {
-- 
2.15.0.rc2
Re: get_online_cpus() from a preemptible() context (bug?)
Hi Peter,

(combining your replies)

On 06/11/17 10:32, Peter Zijlstra wrote:
> On Fri, Nov 03, 2017 at 02:45:45PM +0000, James Morse wrote:
>> I'm trying to work out what stops a thread being pre-empted and migrated
>> between calling get_online_cpus() and put_online_cpus().
>
> Nothing; why would you think it would?

To stop the this_cpu_*() operations in down/up being applied on different
CPUs, affecting a different percpu:read_count.

> All those functions guarantee is
> that any CPU observed as being online stays online (and its converse,
> that a CPU observed as being offline stays offline, although less people
> care about that one).

>> According to __percpu_down_read(), it's the pre-empt count:
>>> * Due to having preemption disabled the decrement happens on
>>> * the same CPU as the increment, avoiding the
>>> * increment-on-one-CPU-and-decrement-on-another problem.
>>
>> So this:
>>> void cpus_read_lock(void)
>>> {
>>> 	percpu_down_read(&cpu_hotplug_lock);
>>> +
>>> +	/* Can we be migrated before we release this per-cpu lock? */
>>> +	WARN_ON(preemptible());
>>> }
>>
>> should never fire?
>
> It should.. You're reading a comment on __percpu_down_read() and using
> percpu_down_read(), _not_ the same function ;-)

Yes, sorry, I thought you did a better job of describing the case I'm
trying to work out.

> If you look at percpu_down_read(), you'll note it'll disable preemption
> before calling __percpu_down_read().

Yes, this is how __percpu_down_read() protects the combination of its
fast/slow paths. But next percpu_down_read() calls preempt_enable(), and I
can't see what stops us migrating before percpu_up_read() preempt_disable()s
to call __this_cpu_dec(), which now affects a different variable.

> And yes, that whole percpu-rwsem code is fairly magical :-)

I think I'll file this under magical. That rcu_sync_is_idle() must know
something I don't!

Thanks!

James
Re: [PATCH v5 2/2] acpi: apei: Add SEI notification type support for ARMv8
Hi Borislav!

On 18/10/17 10:25, Borislav Petkov wrote:
> On Wed, Oct 18, 2017 at 05:17:27PM +0800, gengdongjiu wrote:
>> Thanks Borislav, can I write it as asynchronous exception or
>> asynchronous abort?
>
> WTF?!

Yup.

> The thing is abbreviated as "SEI" and apparently means "System Error
> Interrupt". Nothing else.

ARM has 'external abort', which can be either synchronous or asynchronous;
both are delivered as different types of exception.

Asynchronous external abort is treated as a special kind of interrupt,
'SError Interrupt' (where SError stands for System Error, but it's rarely
written like that). 'SEI' is a relatively new abbreviation for SError
Interrupt.

What should we call this thing?

In the ACPI code I'd prefer 'SEI' as that is what the ACPI spec calls it.
Here we are talking about a GHES notification.

But in the arm64 arch code this should be called SError Interrupt, as this
is what the ARM-ARM calls it. This code cares about exception routing and
interrupt masking.

But, I don't really care.

Thanks,

James
Re: [PATCH v5 2/2] acpi: apei: Add SEI notification type support for ARMv8
Hi Dongjiu Geng,

On 17/10/17 09:02, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension, in
> this extension it adds SEI(SError Interrupt) notification
> type, this patch adds new GHES error source SEI handling
> functions.

This paragraph is merging two things that aren't related. The 'ARM v8.2
architecture extensions' have some RAS bits, which if your CPU implements
v8.2 are required. ACPIv6.1 added NOTIFY_SEI as a notification type for
ARMv8 systems.

This patch adds a GHES function for NOTIFY_SEI. Please leave the CPU RAS
extensions out of it.

> Because this error source parsing and handling
> methods are similar with the SEA. So share some SEA handling
> functions with the SEI
>
> Expose one API ghes_notify_abort() to external users. External
> modules can call this exposed API to parse and handle the
> SEA or SEI.

This series doesn't add a caller/user for this new API, so why do we need
to do this now?

(I still haven't had a usable answer for 'what does your firmware do when
SError is masked', but I'll go beat that drum on the other thread).

More important for the APEI code is: How do SEA and SEI interact?

As far as I can see they can both interrupt each other, which isn't
something the single in_nmi() path in APEI can handle. I think we should
fix this first. (I'll try and polish my RFC that had a stab at that...)

SEA gets away with a lot of things because it's synchronous. SEI isn't.
Xie XiuQi pointed to the memory_failure_queue() code. We can use this
directly from SEA, but not SEI. (what happens if an SError arrives while we
are queueing memory_failure work from an IRQ?)

The one that scares me is the trace-point reporting stuff. What happens if
an SError arrives while we are enabling a trace point? (these are
static-keys right?)

I don't think we can just plumb SEI in like this and be done with it.

(I'm looking at teasing out the estatus cache code from being x86:NMI only.
This way we solve the same 'can't do this from NMI context' problem with
the same code.)

Thanks,

James

boring nits below:

> Note: For the SEI(SError Interrupt), it is asynchronous external
> abort, the error address recorded by firmware may be not accurate.
> If not accurate, EL3 firmware needs to identify the address to a
> invalid value.

This paragraph keeps cropping up. Who expects an address with an SError? We
don't get one for IRQs, but that never needs stating.

> Cc: Borislav Petkov
> Cc: James Morse
> Signed-off-by: Dongjiu Geng
> Tested-by: Tyler Baicar
> Tested-by: Dongjiu Geng

(It's expected you test your own code)

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2509e4f..c98c1b3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> 	if (interrupts_enabled(regs))
> 		nmi_enter();
>
> -	ret = ghes_notify_sea();
> +	ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>
> 	if (interrupts_enabled(regs))
> 		nmi_exit();
> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
> 	int ret = -ENOENT;
>
> 	if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
> -		ret = ghes_notify_sea();
> +		ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>
> 	return ret;
>  }
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index de14d49..47fcb0c 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
> 	  option allows the OS to look for such hardware error record, and
> 	  take appropriate action.
>
> +config ACPI_APEI_SEI
> +	bool "APEI Asynchronous SError Interrupt logging/recovering support"
> +	depends on ARM64 && ACPI_APEI_GHES
> +	default y
> +	help
> +	  This option should be enabled if the system supports
> +	  firmware first handling of SEI (asynchronous SError interrupt).
> +
> +	  SEI happens with asynchronous external abort for errors on device
> +	  memory reads on ARMv8 systems. If a system supports firmware first
> +	  handling of SEI, the platform analyzes and handles hardware error
> +	  notifications from SEI, and it may then form a HW error record for
> +	  the OS to parse and handle. This option allows the OS to look for
> +	  such hardware error record, and take appropriate action.
> +
>  config ACPI_APEI_MEMORY_FAILURE
> 	bool "APEI memory error recovering support"
> 	depends on ACPI_APEI && MEMORY_FAILURE
Re: [PATCH v7 1/4] arm64: kvm: route synchronous external abort exceptions to EL2
Hi Dongjiu Geng, On 17/10/17 15:14, Dongjiu Geng wrote: > ARMv8.2 adds a new bit HCR_EL2.TEA which controls to > route synchronous external aborts to EL2, and adds a > trap control bit HCR_EL2.TERR which controls to > trap all Non-secure EL1&0 error record accesses to EL2. The bulk of this patch is about trap-and-emulating these ERR registers, but that's not reflected in the title: > KVM: arm64: Emulate RAS error registers and set HCR_EL2's TERR & TEA > This patch enables the two bits for the guest OS. > when an synchronous abort is generated in the guest OS, > it will trap to EL3 firmware, EL3 firmware will check the *buzz* This depends on SCR_EL3.EA, which this patch doesn't touch and the normal-world can't even know about. This is what your system does, the commit message should be about the change to Linux. (I've said this before) > HCR_EL2.TEA value to decide to jump to hypervisor or host > OS. Enabling HCR_EL2.TERR makes error record access > from guest trap to EL2. > > Add some minimal emulation for RAS-Error-Record registers. > In the emulation, ERRIDR_EL1 and ERRSELR_EL1 are zero. > Then, the others ERX* registers are RAZ/WI. > diff --git a/arch/arm64/include/asm/kvm_emulate.h > b/arch/arm64/include/asm/kvm_emulate.h > index fe39e68..47983db 100644 > --- a/arch/arm64/include/asm/kvm_emulate.h > +++ b/arch/arm64/include/asm/kvm_emulate.h > @@ -47,6 +47,13 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu) > vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS; > if (is_kernel_in_hyp_mode()) > vcpu->arch.hcr_el2 |= HCR_E2H; > + if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) { This ARM64_HAS_RAS_EXTN isn't in mainline, nor is it added by your series. I know where it comes from, but other reviewers may not. If you have dependencies on another series, please call them out in the cover letter. This is the first cpus_have_const_cap() user in this header file, it probably needs: #include > + /* route synchronous external abort exceptions to EL2 */ > + vcpu->arch.hcr_el2 |= HCR_TEA; > + /* trap error record accesses */ > + vcpu->arch.hcr_el2 |= HCR_TERR; > + } > + > if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features)) > vcpu->arch.hcr_el2 &= ~HCR_RW; > } > diff --git a/arch/arm64/include/asm/kvm_host.h > b/arch/arm64/include/asm/kvm_host.h > index d686300..af55b3bc 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -105,6 +105,8 @@ enum vcpu_sysreg { > TTBR1_EL1, /* Translation Table Base Register 1 */ > TCR_EL1,/* Translation Control Register */ > ESR_EL1,/* Exception Syndrome Register */ > + ERRIDR_EL1, /* Error Record ID Register */ Page 39 of [0]: 'ERRIDR_EL1 is a 64-bit read-only ...'. > + ERRSELR_EL1,/* Error Record Select Register */ We're emulating these as RAZ/WI, do we really need to allocate vcpu->arch.ctxt.sys_regs space for them? If we always return 0 for ERRIDR, then we don't need to keep ERRSELR as 'the value read back [..] is UNKNOWN'. I think we only need space for these once their value needs to be migrated, user-space doesn't need to know they exist until then. 
> AFSR0_EL1, /* Auxiliary Fault Status Register 0 */ > AFSR1_EL1, /* Auxiliary Fault Status Register 1 */ > FAR_EL1,/* Fault Address Register */ > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c > index 2e070d3..a74617b 100644 > --- a/arch/arm64/kvm/sys_regs.c > +++ b/arch/arm64/kvm/sys_regs.c > @@ -775,6 +775,36 @@ static bool access_pmovs(struct kvm_vcpu *vcpu, struct > sys_reg_params *p, > return true; > } > > +static bool access_error_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p, > + const struct sys_reg_desc *r) > +{ > + /* accessing ERRIDR_EL1 */ > + if (r->CRm == 3 && r->Op2 == 0) { > + if (p->is_write) > + vcpu_sys_reg(vcpu, ERRIDR_EL1) = 0; As above, this register is read-only. > + return trap_raz_wi(vcpu, p, r); If we can get rid of the vcpu storage, this just becomes trap_raz_wi(). > + } > + > + /* accessing ERRSELR_EL1 */ > + if (r->CRm == 3 && r->Op2 == 1) { > + if (p->is_write) > + vcpu_sys_reg(vcpu, ERRSELR_EL1) = 0; > + > + return trap_raz_wi(vcpu, p, r); > + } Same here. > + > + /* > + * If ERRSELR_EL1.SEL is greater than or equal to ERRIDR_EL1.NUM, > + * the ERX* registers are RAZ/WI. > + */ > + if ((vcpu_sys_reg(vcpu, ERRSELR_EL1) & 0xff) >= > + (vcpu_sys_reg(vcpu, ERRIDR_EL1) && 0xff)) > + return trap_raz_wi(vcpu, p, r); The trick here is that ERRIDR_EL1 is read-only, so KVM can choose how many records it emulates. Let's pick zero: > Zero indicates no records can be accessed through the Error Record System > registers. Which lets
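For reference, a minimal sketch of the shape being suggested here, assuming ERRIDR_EL1 reads as zero so that every RAS error record register falls through to the existing trap_raz_wi() helper (illustrative only, not the submitted patch):
==%<==
/*
 * Sketch only: with ERRIDR_EL1.NUM == 0, no error record is ever
 * selectable, so ERRIDR_EL1, ERRSELR_EL1 and all the ERX* registers
 * can be emulated as RAZ/WI without reserving any space in
 * vcpu->arch.ctxt.sys_regs.
 */
static bool access_error_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
			     const struct sys_reg_desc *r)
{
	return trap_raz_wi(vcpu, p, r);
}
==%<==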
Re: [PATCH v3] arm64: Introduce IRQ stack
On 05/10/15 07:37, AKASHI Takahiro wrote: > On 10/04/2015 11:32 PM, Jungseok Lee wrote: >> On Oct 3, 2015, at 1:23 AM, James Morse wrote: >>> One observed change in behaviour: >>> Any stack-unwinding now stops at el1_irq(), which is the bottom of the irq >>> stack. This shows up with perf (using incantation [0]), and with any calls >>> to dump_stack() (which actually stops the frame before el1_irq()). >>> >>> I don't know if this will break something (perf still seems to work), but >>> it makes the panic() output less useful, as all the 'other' cpus print: >> >> Agreed. A process stack should be walked to deliver useful information. >> >> There are two approaches I've tried as experimental. >> >> 1) Link IRQ stack to a process one via frame pointer As saving x29 and elr_el1 into IRQ stack and then updating x29, IRQ stack could be linked to a process one. It is similar to your patch except on some points. However, it might complicate the "stack tracer on ftrace" issue. > Well, as far as object_is_on_stack() works correctly, stack tracer will not > follow an interrupt stack even if unwind_frame() might traverse from > an interrupt stack to a process stack. See check_stack(). > > Under this assumption, I'm going to simplify my "stack tracer" bugfix > by removing interrupt-related nasty hacks that I described in RFC. > > Thanks, > -Takahiro AKASHI > > >> 2) Walk a process stack followed by IRQ one This idea, which is straightforward, comes from x86 implementation [1]. The approach might be orthogonal to the "stack tracer on ftrace" issue. In this case, x86 has to walk interrupt/exception stacks because the order may be: process -> hw_irq -> debug_exception -> double_fault. Since each of these could have its own stack, the code needs to determine the correct order to produce a correct stack trace. Our case is a lot simpler, as we could only ever have two, and know the order. We only need to walk the irq stack if we are currently using it, and we always know the process stack is last. I would go with the first option, being careful of stack corruption when stepping between them. >> unfortunately, a top bit comparison of stack pointer cannot be adopted >> due to >> a necessity of a final snapshot of a process stack pointer, which is struct >> irq_stack::thread_sp in v2 patch. I'm not sure I follow you here. We can check if regs->sp is an irq stack by comparing it with the per-cpu irq_stack value (a top-bits comparison). Then we know that the last frame-pointer (in your (1) above) will point to the process stack, at which point we can walk onto that stack. >> BTW, I have another question. Is it reasonable to introduce THREAD_SIZE as a >> kernel configuration option like page size for the sake of convenience >> because >> a combination of ARM64 and a small RAM is not unusual in real practice? We want the smallest safe value. It's probably best left as it is for now - once we have this feature, we can collect maximum stack-usage sizes for different platforms and workloads, and decide on the smallest safe value. Thanks, James
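To illustrate the top-bits comparison described above, a minimal sketch assuming a per-cpu irq_sp pointer as in the patches in this thread (names are illustrative, not the final code):
==%<==
/*
 * Sketch only: sp lies on this cpu's irq stack iff the
 * THREAD_SIZE-aligned bases match (a "top bits" comparison).
 * This relies on both stacks being THREAD_SIZE aligned.
 */
static inline bool on_irq_stack(unsigned long sp, int cpu)
{
	unsigned long irq_stack_top = per_cpu(irq_sp, cpu);

	return (sp & ~(THREAD_SIZE - 1)) ==
	       (irq_stack_top & ~(THREAD_SIZE - 1));
}
==%<==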
Re: [PATCH v5 1/2] arm64: kvm: allows kvm cpu hotplug
On 29/05/15 06:38, AKASHI Takahiro wrote: > The current kvm implementation on arm64 does cpu-specific initialization > at system boot, and has no way to gracefully shutdown a core in terms of > kvm. This prevents, especially, kexec from rebooting the system on a boot > core in EL2. > > This patch adds a cpu tear-down function and also puts an existing cpu-init > code into a separate function, kvm_arch_hardware_disable() and > kvm_arch_hardware_enable() respectively. > We don't need arm64-specific cpu hotplug hook any more. I think we do... on platforms where cpuidle uses psci to temporarily turn off cores that aren't in use, we lose the el2 state. This hotplug hook restores the state, even if there are no vms running. This patch prevents me from running vms on such a platform, qemu gives: > kvm [1500]: Unsupported exception type: 6264688 KVM internal error. Suberror: 0 kvmtool goes with a more dramatic: > KVM exit reason: 17 ("KVM_EXIT_INTERNAL_ERROR") Disabling CONFIG_ARM_CPUIDLE solves this problem. (Sorry to revive an old thread - I've been using v4 of this patch for the hibernate/suspend-to-disk series). > Since this patch modifies common part of code between arm and arm64, one > stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid > compiling errors. > > Signed-off-by: AKASHI Takahiro > diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S > index fd085ec..afe6263 100644 > --- a/arch/arm64/kvm/hyp.S > +++ b/arch/arm64/kvm/hyp.S > @@ -1136,6 +1136,11 @@ ENTRY(kvm_call_hyp) > ret > ENDPROC(kvm_call_hyp) > > +ENTRY(kvm_call_reset) > + hvc #HVC_RESET > + ret > +ENDPROC(kvm_call_reset) > + > .macro invalid_vector label, target > .align 2 > \label: > @@ -1179,10 +1184,27 @@ el1_sync: // > Guest trapped into EL2 > cmp x18, #HVC_GET_VECTORS > b.ne 1f > mrs x0, vbar_el2 > - b 2f > - > -1: /* Default to HVC_CALL_HYP. */ > + b do_eret > > + /* jump into trampoline code */ > +1: cmp x18, #HVC_RESET > + b.ne 2f > + /* > + * Entry point is: > + * TRAMPOLINE_VA > + * + (__kvm_hyp_reset - (__hyp_idmap_text_start & PAGE_MASK)) > + */ > + adrp x2, __kvm_hyp_reset > + add x2, x2, #:lo12:__kvm_hyp_reset > + adrp x3, __hyp_idmap_text_start > + add x3, x3, #:lo12:__hyp_idmap_text_start > + and x3, x3, PAGE_MASK > + sub x2, x2, x3 > + ldr x3, =TRAMPOLINE_VA > + add x2, x2, x3 > + br x2 // no return > + > +2: /* Default to HVC_CALL_HYP. */ What was the reason not to use kvm_call_hyp(__kvm_hyp_reset, ...)? (You mentioned you wanted to at [0] - I can't find the details in the archive) Thanks, James [0] http://lists.infradead.org/pipermail/kexec/2015-April/335533.html > push lr, xzr > > /* > @@ -1196,7 +1218,9 @@ el1_sync: // > Guest trapped into EL2 > blr lr > > pop lr, xzr > -2: eret > + > +do_eret: > + eret > > el1_trap: > /*
Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
Hi Jungseok, On 12/10/15 15:53, Jungseok Lee wrote: > On Oct 9, 2015, at 11:24 PM, James Morse wrote: >> I think unwind_frame() needs to walk the irq stack too. [2] is an example >> of perf tracing back to userspace (and there are patches on the list to >> do/fix this), so we need to walk back to the start of the first stack for >> the perf accounting to be correct. > > Frankly, I missed the case where perf does backtrace to userspace. > > IMO, this statement supports why the stack trace feature commit should be > written independently. The [1/2] patch would be pretty stable if 64KB pages > are supported. If this hasn't been started yet, here is a build-test-only first-pass at the 64K page support - based on the code in kernel/fork.c: ==%<== diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c index a6bdf4d3a57c..deb057a735ad 100644 --- a/arch/arm64/kernel/irq.c +++ b/arch/arm64/kernel/irq.c @@ -27,8 +27,22 @@ #include #include #include +#include +#include #include +#if THREAD_SIZE >= PAGE_SIZE +#define __alloc_irq_stack(x) (void *)__get_free_pages(THREADINFO_GFP, \ + THREAD_SIZE_ORDER) + +extern struct kmem_cache *irq_stack_cache; /* dummy declaration */ +#else +#define __alloc_irq_stack(cpu) (void *)kmem_cache_alloc_node(irq_stack_cache, \ + THREADINFO_GFP, cpu_to_node(cpu)) + +static struct kmem_cache *irq_stack_cache; +#endif /* THREAD_SIZE >= PAGE_SIZE */ unsigned long irq_err_count; DEFINE_PER_CPU(struct irq_stack, irq_stacks); @@ -128,7 +142,17 @@ int alloc_irq_stack(unsigned int cpu) if (per_cpu(irq_stacks, cpu).stack) return 0; - stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER); + if (THREAD_SIZE < PAGE_SIZE) { + if (!irq_stack_cache) { + irq_stack_cache = kmem_cache_create("irq_stack", + THREAD_SIZE, + THREAD_SIZE, 0, + NULL); + BUG_ON(!irq_stack_cache); + } + } + + stack = __alloc_irq_stack(cpu); if (!stack) return -ENOMEM; ==%<== (my mail client will almost certainly mangle that) Having two kmem_caches for 16K stacks on a 64K page system may be wasteful (especially for systems with few cpus)... The alternative is to define CONFIG_ARCH_THREAD_INFO_ALLOCATOR and allocate all stack memory from arch code. (Largely copied code, prevents irq stacks being a different size, and nothing uses that define today!) Thoughts? > >>> +*/ >>> + if (fp < low || fp > high - 0x10 || fp & 0xf) return -EINVAL; >>> >>> frame->sp = fp + 0x10; >>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c >>> index f93aae5..44b2f828 100644 >>> --- a/arch/arm64/kernel/traps.c >>> +++ b/arch/arm64/kernel/traps.c >>> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs >>> *regs) >>> static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk) >>> { >>> struct stackframe frame; >>> + unsigned int cpu = smp_processor_id(); >> >> I wonder if there is any case where dump_backtrace() is called on another >> cpu? >> >> Setting the cpu value from task_thread_info(tsk)->cpu would protect against >> this. > > IMO, no, but your suggestion makes sense. I will update it. > >>> + bool in_irq = in_irq_stack(cpu); >>> >>> pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk); >>> >>> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, >>> struct task_struct *tsk) >>> } >>> >>> pr_emerg("Call trace:\n"); >>> +repeat: >>> + if (in_irq) >>> + pr_emerg("\n"); >> >> Do we need these? 'el1_irq()' in the trace is a giveaway… > > I borrow this idea from x86 implementation in order to show a separate stack > explicitly.
There is no issue to remove these tags. Ah okay - if it's done elsewhere, it's better to be consistent. Thanks, James
Re: [PATCH v5 1/2] arm64: kvm: allows kvm cpu hotplug
Hi, On 13/10/15 06:38, AKASHI Takahiro wrote: > On 10/12/2015 10:28 PM, James Morse wrote: >> On 29/05/15 06:38, AKASHI Takahiro wrote: >>> The current kvm implementation on arm64 does cpu-specific initialization >>> at system boot, and has no way to gracefully shutdown a core in terms of >>> kvm. This prevents, especially, kexec from rebooting the system on a boot >>> core in EL2. >>> >>> This patch adds a cpu tear-down function and also puts an existing cpu-init >>> code into a separate function, kvm_arch_hardware_disable() and >>> kvm_arch_hardware_enable() respectively. >>> We don't need arm64-specific cpu hotplug hook any more. >> >> I think we do... on platforms where cpuidle uses psci to temporarily turn >> off cores that aren't in use, we lose the el2 state. This hotplug hook >> restores the state, even if there are no vms running. I've just noticed there are two cpu notifiers - we may be referring to different ones. (hyp_init_cpu_pm_nb and hyp_init_cpu_nb) > If I understand you correctly, with or without my patch, kvm doesn't work > under cpuidle anyway. Right? It works with, and without, v4. This patch v5 causes the problem. > If so, saving/restoring cpu states (or at least, kicking cpu hotplug hooks) > is cpuidle driver's responsibility, isn't it? Yes - but with v5, (at least one of) the hotplug hooks isn't having the same effect as before: Before v5, cpu_init_hyp_mode() is called via cpu_notify() each time cpu_suspend() suspends/wakes-up the core. Logically it should be the 'pm' notifier that does this work: > if (cmd == CPU_PM_EXIT && > __hyp_get_vectors() == hyp_default_vectors) { > cpu_init_hyp_mode(NULL); > return NOTIFY_OK; > With v5, kvm_arch_hardware_enable() isn't called each time cpu_suspend() cycles the core. The problem appears to be this hunk, affecting the above code: > - if (cmd == CPU_PM_EXIT && > - __hyp_get_vectors() == hyp_default_vectors) { > - cpu_init_hyp_mode(NULL); > + if (cmd == CPU_PM_EXIT && kvm_arm_get_running_vcpu()) { > + kvm_arch_hardware_enable(); Changing this to just rename cpu_init_hyp_mode() to kvm_arch_hardware_enable() solves the problem. Presumably kvm_arm_get_running_vcpu() evaluates to false before the first vm is started, meaning no vms can be started if pm events occur before starting the first vm. Sorry I blamed the wrong cpu notifier hook - I didn't realise there were two! Thanks, James >> This patch prevents me from running vms on such a platform, qemu gives: >>> kvm [1500]: Unsupported exception type: 6264688 KVM internal error. >> Suberror: 0 >> >> kvmtool goes with a more dramatic: >>> KVM exit reason: 17 ("KVM_EXIT_INTERNAL_ERROR") >> >> Disabling CONFIG_ARM_CPUIDLE solves this problem. >> >> >> (Sorry to revive an old thread - I've been using v4 of this patch for the >> hibernate/suspend-to-disk series). >> >> >>> Since this patch modifies common part of code between arm and arm64, one >>> stub definition, __cpu_reset_hyp_mode(), is added on arm side to avoid >>> compiling errors.
>>> >>> Signed-off-by: AKASHI Takahiro >> >>> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S >>> index fd085ec..afe6263 100644 >>> --- a/arch/arm64/kvm/hyp.S >>> +++ b/arch/arm64/kvm/hyp.S >>> @@ -1136,6 +1136,11 @@ ENTRY(kvm_call_hyp) >>> ret >>> ENDPROC(kvm_call_hyp) >>> >>> +ENTRY(kvm_call_reset) >>> + hvc #HVC_RESET >>> + ret >>> +ENDPROC(kvm_call_reset) >>> + >>> .macro invalid_vector label, target >>> .align 2 >>> \label: >>> @@ -1179,10 +1184,27 @@ el1_sync: // Guest trapped >>> into EL2 >>> cmp x18, #HVC_GET_VECTORS >>> b.ne 1f >>> mrs x0, vbar_el2 >>> - b 2f >>> - >>> -1: /* Default to HVC_CALL_HYP. */ >>> + b do_eret >>> >>> + /* jump into trampoline code */ >>> +1: cmp x18, #HVC_RESET >>> + b.ne 2f >>> + /* >>> + * Entry point is: >>> + * TRAMPOLINE_VA >>> + * + (__kvm_hyp_reset - (__hyp_idmap_text_start & PAGE_MASK)) >>> + */ >>> + adrp x2, __kvm_hyp_reset >>> + add x2, x2, #:lo12:__kvm_hyp_reset >>> + adrp x3, __hyp_idmap_text_start >>> + add x3, x3,
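For clarity, this is roughly the notifier shape being suggested in the reply above - a sketch assuming the names from the quoted code (hyp_default_vectors check kept, cpu_init_hyp_mode() renamed), not the final committed code:
==%<==
static int hyp_init_cpu_pm_notifier(struct notifier_block *self,
				    unsigned long cmd, void *v)
{
	/*
	 * Re-initialise EL2 state whenever a core returns from a
	 * PSCI power-down, whether or not a vcpu is running.
	 */
	if (cmd == CPU_PM_EXIT &&
	    __hyp_get_vectors() == hyp_default_vectors) {
		kvm_arch_hardware_enable();	/* was cpu_init_hyp_mode(NULL) */
		return NOTIFY_OK;
	}

	return NOTIFY_DONE;
}
==%<==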
Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
Hi Jungseok, On 12/10/15 23:13, Jungseok Lee wrote: > On Oct 13, 2015, at 1:34 AM, James Morse wrote: >> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful >> (especially for systems with few cpus)… > > This would be a single concern. To address this issue, I drop the 'static' > keyword in thread_info_cache. Please refer to the below hunk. It's only a problem on systems with 64K pages, which don't have a multiple of 4 cpus. I suspect if you turn on 64K pages, you have many cores with plenty of memory... >> The alternative is to define CONFIG_ARCH_THREAD_INFO_ALLOCATOR and >> allocate all stack memory from arch code. (Largely copied code, prevents >> irq stacks being a different size, and nothing uses that define today!) >> >> >> Thoughts? > > Almost same story I've been testing. > > I'm aligned with yours regarding CONFIG_ARCH_THREAD_INFO_ALLOCATOR. > > Another approach I've tried is the following data structure, but it's not > a good fit for this case due to __per_cpu_offset which is page-size aligned, > not thread-size. > > struct irq_stack { > char stack[THREAD_SIZE]; > char *highest; > } __aligned(THREAD_SIZE); > > DEFINE_PER_CPU(struct irq_stack, irq_stacks); Yes, x86 does this - but it increases the Image size by 16K, as that space could have some initialisation values. This isn't a problem on x86 as no-one uses the uncompressed image. I would avoid this approach due to the bloat! > > 8<- > diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h > index 6ea82e8..d3619b3 100644 > --- a/arch/arm64/include/asm/irq.h > +++ b/arch/arm64/include/asm/irq.h > @@ -1,7 +1,9 @@ > #ifndef __ASM_IRQ_H > #define __ASM_IRQ_H > > +#include > #include > +#include > > #include > > @@ -9,6 +11,21 @@ struct irq_stack { > void *stack; > }; > > +#if THREAD_SIZE >= PAGE_SIZE > +static inline void *__alloc_irq_stack(void) > +{ > + return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO, > + THREAD_SIZE_ORDER); > +} > +#else > +extern struct kmem_cache *thread_info_cache; If this has been made a published symbol, it should go in a header file.
> + > +static inline void *__alloc_irq_stack(void) > +{ > + return kmem_cache_alloc(thread_info_cache, THREADINFO_GFP | > __GFP_ZERO); > +} > +#endif > + > struct pt_regs; > > extern void migrate_irqs(void); > diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c > index a6bdf4d..4e13bdd 100644 > --- a/arch/arm64/kernel/irq.c > +++ b/arch/arm64/kernel/irq.c > @@ -50,10 +50,13 @@ void __init set_handle_irq(void (*handle_irq)(struct > pt_regs *)) > handle_arch_irq = handle_irq; > } > > +static char boot_irq_stack[THREAD_SIZE] __aligned(THREAD_SIZE); > + > void __init init_IRQ(void) > { > - if (alloc_irq_stack(smp_processor_id())) > - panic("Failed to allocate IRQ stack for a boot cpu"); > + unsigned int cpu = smp_processor_id(); > + > + per_cpu(irq_stacks, cpu).stack = boot_irq_stack + THREAD_START_SP; > > irqchip_init(); > if (!handle_arch_irq) > @@ -128,7 +131,7 @@ int alloc_irq_stack(unsigned int cpu) > if (per_cpu(irq_stacks, cpu).stack) > return 0; > > - stack = (void *)__get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER); > + stack = __alloc_irq_stack(); > if (!stack) > return -ENOMEM; > > diff --git a/kernel/fork.c b/kernel/fork.c > index 2845623..9c55f86 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -172,7 +172,7 @@ static inline void free_thread_info(struct thread_info > *ti) > free_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER); > } > # else > -static struct kmem_cache *thread_info_cache; > +struct kmem_cache *thread_info_cache; > > static struct thread_info *alloc_thread_info_node(struct task_struct *tsk, > int node) > 8<- Looks good! Thanks, James
Re: [RFC PATCH 0/3] Implement IRQ stack on ARM64
On 04/09/15 15:23, Jungseok Lee wrote: > ARM64 kernel allocates 16KB kernel stack when creating a process. In case > of low memory platforms with tough workloads on userland, this order-2 > allocation request reaches to memory pressure and performance degradation > simultaneously since VM page allocator falls into slowpath frequently, > which triggers page reclaim and compaction. > > I believe that one of the best solutions is to reduce kernel stack size. > According to the following data from stack tracer with some fixes, [1], > a separate IRQ stack would greatly help to decrease a kernel stack depth. > Hi Jungseok Lee, I was working on a similar patch for irq stack (patch as a follow-up email). I suggest we work together on a single implementation. I think the only major difference is that you're using sp_el0 as a temporary register to store a copy of the stack-pointer to find struct thread_info, whereas I was copying it between stacks (ends up as 2x ldp/stps), which keeps the change restricted to irq_stack setup code. We should get some feedback as to which approach is preferred. Thanks, James Morse
[PATCH] arm64: kernel: Use a separate stack for irq interrupts.
Having to handle interrupts on top of an existing kernel stack means the kernel stack must be large enough to accommodate both the maximum kernel usage and the maximum irq handler usage. Switching to a different stack when processing irqs allows us to make the stack size smaller. Maximum kernel stack usage (running ltp and generating usb+ethernet interrupts) was 7256 bytes. With this patch, the same workload gives a maximum stack usage of 5816 bytes. Signed-off-by: James Morse --- arch/arm64/include/asm/irq.h | 12 + arch/arm64/include/asm/thread_info.h | 8 -- arch/arm64/kernel/entry.S | 33 --- arch/arm64/kernel/irq.c | 52 arch/arm64/kernel/smp.c | 4 +++ arch/arm64/kernel/stacktrace.c | 4 ++- 6 files changed, 107 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h index bbb251b14746..050d4196c736 100644 --- a/arch/arm64/include/asm/irq.h +++ b/arch/arm64/include/asm/irq.h @@ -2,14 +2,20 @@ #define __ASM_IRQ_H #include +#include #include +#include + +DECLARE_PER_CPU(unsigned long, irq_sp); struct pt_regs; extern void migrate_irqs(void); extern void set_handle_irq(void (*handle_irq)(struct pt_regs *)); +extern int alloc_irq_stack(unsigned int cpu); + static inline void acpi_irq_init(void) { /* @@ -21,4 +27,10 @@ static inline void acpi_irq_init(void) } #define acpi_irq_init acpi_irq_init +static inline bool is_irq_stack(unsigned long sp) +{ + struct thread_info *ti = get_thread_info(sp); + return (get_thread_info(per_cpu(irq_sp, ti->cpu)) == ti); +} + #endif diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index dcd06d18a42a..b906254fc400 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -69,12 +69,16 @@ register unsigned long current_stack_pointer asm ("sp"); /* * how to get the thread information struct from C */ +static inline struct thread_info *get_thread_info(unsigned long sp) +{ + return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); +} + static inline struct thread_info *current_thread_info(void) __attribute_const__; static inline struct thread_info *current_thread_info(void) { - return (struct thread_info *) - (current_stack_pointer & ~(THREAD_SIZE - 1)); + return get_thread_info(current_stack_pointer); } #define thread_saved_pc(tsk) \ diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index e16351819fed..d42371f3f5a1 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -190,10 +190,37 @@ tsk .req x28 // current thread_info * Interrupt handling. */ .macro irq_handler - adrp x1, handle_arch_irq - ldr x1, [x1, #:lo12:handle_arch_irq] - mov x0, sp + mrs x21, tpidr_el1 + adr_l x20, irq_sp + add x20, x20, x21 + + ldr x21, [x20] + mov x20, sp + + mov x0, x21 + mov x1, x20 + bl irq_copy_thread_info + + /* test for recursive use of irq_sp */ + cbz w0, 1f + mrs x30, elr_el1 + mov sp, x21 + + /* +* Create a fake stack frame to bump unwind_frame() onto the original +* stack. This relies on x29 not being clobbered by kernel_entry(). 
+*/ + push x29, x30 + +1: ldr_l x1, handle_arch_irq + mov x0, x20 blr x1 + + mov x0, x20 + mov x1, x21 + bl irq_copy_thread_info + mov sp, x20 + .endm .text diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c index 463fa2e7e34c..10b57a006da8 100644 --- a/arch/arm64/kernel/irq.c +++ b/arch/arm64/kernel/irq.c @@ -26,11 +26,14 @@ #include #include #include +#include #include #include unsigned long irq_err_count; +DEFINE_PER_CPU(unsigned long, irq_sp) = 0; + int arch_show_interrupts(struct seq_file *p, int prec) { #ifdef CONFIG_SMP @@ -55,6 +58,10 @@ void __init init_IRQ(void) irqchip_init(); if (!handle_arch_irq) panic("No interrupt controller found."); + + /* Allocate an irq stack for the boot cpu */ + if (alloc_irq_stack(smp_processor_id())) + panic("Failed to allocate irq stack for boot cpu."); } #ifdef CONFIG_HOTPLUG_CPU @@ -117,3 +124,48 @@ void migrate_irqs(void) local_irq_restore(flags); } #endif /* CONFIG_HOTPLUG_CPU */ + +/* Allocate an irq_stack for a cpu that is about to be brought up. */ +int alloc_irq_stack(unsigned int cpu) +{ + struct page *irq_stack_page; + union thread_union *irq_stack; + + /* reuse stack allocated previously */ + if (per_cpu(irq_sp, cpu)) + return 0; + + irq_sta
Re: [RFC PATCH 2/3] arm64: Introduce IRQ stack
On 04/09/15 15:23, Jungseok Lee wrote: > Currently, kernel context and interrupts are handled using a single > kernel stack navigated by sp_el1. This forces many systems to use > 16KB stack, not 8KB one. Low memory platforms naturally suffer from > both memory pressure and performance degradation simultaneously as > VM page allocator falls into slowpath frequently. > > This patch, thus, solves the problem as introducing a separate percpu > IRQ stack to handle both hard and soft interrupts with two ground rules: > > - Utilize sp_el0 in EL1 context, which is not used currently > - Do *not* complicate current_thread_info calculation > > struct thread_info can be tracked easily using sp_el0, not sp_el1 when > this feature is enabled. > > Signed-off-by: Jungseok Lee > --- > arch/arm64/Kconfig.debug | 10 ++ > arch/arm64/include/asm/irq.h | 8 ++ > arch/arm64/include/asm/thread_info.h | 11 ++ > arch/arm64/kernel/asm-offsets.c | 8 ++ > arch/arm64/kernel/entry.S| 83 +++- > arch/arm64/kernel/head.S | 7 ++ > arch/arm64/kernel/irq.c | 18 > 7 files changed, 142 insertions(+), 3 deletions(-) > > diff --git a/arch/arm64/Kconfig.debug b/arch/arm64/Kconfig.debug > index d6285ef..e16d91f 100644 > --- a/arch/arm64/Kconfig.debug > +++ b/arch/arm64/Kconfig.debug > @@ -18,6 +18,16 @@ config ARM64_PTDUMP > kernel. > If in doubt, say "N" > > +config IRQ_STACK > + bool "Use separate kernel stack when handling interrupts" > + depends on ARM64_4K_PAGES > + help > + Say Y here if you want to use separate kernel stack to handle both > + hard and soft interrupts. As reducing memory footprint regarding > + kernel stack, it benefits low memory platforms. > + > + If in doubt, say N. > + I don't think it is necessary to have a debug-only Kconfig option for this. Reducing memory use is good for everyone! This would let you get rid of all the #ifdefs > config STRICT_DEVMEM > bool "Filter access to /dev/mem" > depends on MMU > diff --git a/arch/arm64/include/asm/thread_info.h > b/arch/arm64/include/asm/thread_info.h > index dcd06d1..5345a67 100644 > --- a/arch/arm64/include/asm/thread_info.h > +++ b/arch/arm64/include/asm/thread_info.h > @@ -71,11 +71,22 @@ register unsigned long current_stack_pointer asm ("sp"); > */ > static inline struct thread_info *current_thread_info(void) > __attribute_const__; > > +#ifndef CONFIG_IRQ_STACK > static inline struct thread_info *current_thread_info(void) > { > return (struct thread_info *) > (current_stack_pointer & ~(THREAD_SIZE - 1)); > } > +#else > +static inline struct thread_info *current_thread_info(void) > +{ > + unsigned long sp_el0; > + > + asm volatile("mrs %0, sp_el0" : "=r" (sp_el0)); > + > + return (struct thread_info *)(sp_el0 & ~(THREAD_SIZE - 1)); > +} > +#endif Because sp_el0 is only used as a stack value to find struct thread_info, you could just store the struct thread_info pointer in sp_el0, and save the masking on each read of the value. > > #define thread_saved_pc(tsk) \ > ((unsigned long)(tsk->thread.cpu_context.pc)) > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S > index d23ca0d..f1fdfa9 100644 > --- a/arch/arm64/kernel/entry.S > +++ b/arch/arm64/kernel/entry.S > @@ -88,7 +88,11 @@ > > .if \el == 0 > mrs x21, sp_el0 > +#ifndef CONFIG_IRQ_STACK > get_thread_info tsk // Ensure MDSCR_EL1.SS is clear, > +#else > + get_thread_info \el, tsk > +#endif > ldr x19, [tsk, #TI_FLAGS] // since we can unmask debug > disable_step_tsk x19, x20 // exceptions when scheduling. 
> .endif > @@ -168,11 +172,56 @@ > eret // return to kernel > .endm > > +#ifndef CONFIG_IRQ_STACK > .macro get_thread_info, rd > mov \rd, sp > - and \rd, \rd, #~(THREAD_SIZE - 1) // top of stack > + and \rd, \rd, #~(THREAD_SIZE - 1) // bottom of stack > + .endm > +#else > + .macro get_thread_info, el, rd > + .if \el == 0 > + mov \rd, sp > + .else > + mrs \rd, sp_el0 > + .endif > + and \rd, \rd, #~(THREAD_SIZE - 1) // bottom of thread stack > + .endm > + > + .macro get_irq_stack > + get_thread_info 1, tsk > + ldr w22, [tsk, #TI_CPU] > + adr_l x21, irq_stacks > + mov x23, #IRQ_STACK_SIZE > + madd x21, x22, x23, x21 > .endm Using per_cpu variables would save the multiply here. You then wouldn't need IRQ_STACK_SIZE. > > + .macro irq_stack_entry > + get_irq_stack > + ldr w23, [x21, #IRQ_COUNT] > + cbnz w23, 1f > + mov x23, sp > + str x23, [x21, #IRQ_THREAD_SP] > + ldr x23, [x21, #IRQ_STACK] > + mov sp, x23 > + mov x23, xzr > +1: add w23, w2
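To illustrate the per_cpu suggestion: a sketch assuming the irq stack itself becomes percpu data, so the entry code can reach it via the tpidr_el1 percpu offset instead of the base + cpu * IRQ_STACK_SIZE multiply above (illustrative names, not Jungseok's patch):
==%<==
/* Sketch: one irq stack per cpu, as percpu data. */
DEFINE_PER_CPU(unsigned long [IRQ_STACK_SIZE / sizeof(unsigned long)],
	       irq_stack) __aligned(16);

/* Top of a given cpu's irq stack. */
static inline unsigned long irq_stack_top(int cpu)
{
	return (unsigned long)per_cpu(irq_stack, cpu) + IRQ_STACK_SIZE;
}
==%<==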
Re: [RFC PATCH 1/3] arm64: entry: Remove unnecessary calculation for S_SP in EL1h
On 04/09/15 15:23, Jungseok Lee wrote: > Under EL1h, S_SP data is not seen in kernel_exit. Thus, x21 calculation > is not needed in kernel_entry. Currently, S_SP information is valid only > when sp_el0 is used. > > Signed-off-by: Jungseok Lee > diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S > index e163518..d23ca0d 100644 > --- a/arch/arm64/kernel/entry.S > +++ b/arch/arm64/kernel/entry.S > @@ -91,8 +91,6 @@ > get_thread_info tsk // Ensure MDSCR_EL1.SS is clear, > ldr x19, [tsk, #TI_FLAGS] // since we can unmask debug > disable_step_tsk x19, x20 // exceptions when scheduling. > - .else > - add x21, sp, #S_FRAME_SIZE > .endif > mrs x22, elr_el1 > mrs x23, spsr_el1 > This sp value gets written to the struct pt_regs that is built on the stack, and passed to the fault handlers; see 'el1_sp_pc' in kernel/entry.S, which goes on to call do_sp_pc_abort(), which prints this value out. (Other fault handlers may make decisions based on this value). It should be present and correct. Thanks, James
Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
On 07/09/15 16:48, Jungseok Lee wrote: > On Sep 7, 2015, at 11:36 PM, James Morse wrote: > > Hi James, > >> Having to handle interrupts on top of an existing kernel stack means the >> kernel stack must be large enough to accomodate both the maximum kernel >> usage, and the maximum irq handler usage. Switching to a different stack >> when processing irqs allows us to make the stack size smaller. >> >> Maximum kernel stack usage (running ltp and generating usb+ethernet >> interrupts) was 7256 bytes. With this patch, the same workload gives >> a maximum stack usage of 5816 bytes. > > I'd like to know how to measure the max stack depth. > AFAIK, a stack tracer on ftrace does not work well. Did you dump a stack > region and find or track down an untouched region? I enabled the 'Trace max stack' option under menuconfig 'Kernel Hacking' -> 'Tracers', then looked in debugfs:/tracing/stack_max_size. What problems did you encounter? (I may be missing something...) Thanks, James
Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
On 08/09/15 15:54, Jungseok Lee wrote: > On Sep 7, 2015, at 11:36 PM, James Morse wrote: > > Hi James, > >> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S >> index e16351819fed..d42371f3f5a1 100644 >> --- a/arch/arm64/kernel/entry.S >> +++ b/arch/arm64/kernel/entry.S >> @@ -190,10 +190,37 @@ tsk .req x28 // current thread_info >> * Interrupt handling. >> */ >> .macro irq_handler >> - adrp x1, handle_arch_irq >> - ldr x1, [x1, #:lo12:handle_arch_irq] >> - mov x0, sp >> + mrs x21, tpidr_el1 >> + adr_l x20, irq_sp >> + add x20, x20, x21 >> + >> + ldr x21, [x20] >> + mov x20, sp >> + >> + mov x0, x21 >> + mov x1, x20 >> + bl irq_copy_thread_info >> + >> + /* test for recursive use of irq_sp */ >> + cbz w0, 1f >> + mrs x30, elr_el1 >> + mov sp, x21 >> + >> + /* >> + * Create a fake stack frame to bump unwind_frame() onto the original >> + * stack. This relies on x29 not being clobbered by kernel_entry(). >> + */ >> + push x29, x30 > > It is the most challenging item to handle a frame pointer in this context. > > As I mentioned previously, a stack tracer on ftrace is not supported yet. > How about decoupling "IRQ Stack feature" from "Stack Tracer Fix"? Yes - I discovered today this was more complicated than I thought. I will need to do some more reading. >> + >> +1: ldr_l x1, handle_arch_irq >> + mov x0, x20 >> blr x1 >> + >> + mov x0, x20 >> + mov x1, x21 >> + bl irq_copy_thread_info >> + mov sp, x20 >> + >> .endm >> >> .text >> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c >> index 463fa2e7e34c..10b57a006da8 100644 >> --- a/arch/arm64/kernel/irq.c >> +++ b/arch/arm64/kernel/irq.c >> @@ -26,11 +26,14 @@ >> #include >> #include >> #include >> +#include >> #include >> #include >> >> unsigned long irq_err_count; >> >> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0; >> + >> int arch_show_interrupts(struct seq_file *p, int prec) >> { >> #ifdef CONFIG_SMP >> @@ -55,6 +58,10 @@ void __init init_IRQ(void) >> irqchip_init(); >> if (!handle_arch_irq) >> panic("No interrupt controller found."); >> + >> +/* Allocate an irq stack for the boot cpu */ >> +if (alloc_irq_stack(smp_processor_id())) >> +panic("Failed to allocate irq stack for boot cpu."); >> } >> >> #ifdef CONFIG_HOTPLUG_CPU >> @@ -117,3 +124,48 @@ void migrate_irqs(void) >> local_irq_restore(flags); >> } >> #endif /* CONFIG_HOTPLUG_CPU */ >> + >> +/* Allocate an irq_stack for a cpu that is about to be brought up. */ >> +int alloc_irq_stack(unsigned int cpu) >> +{ >> +struct page *irq_stack_page; >> +union thread_union *irq_stack; >> + >> +/* reuse stack allocated previously */ >> +if (per_cpu(irq_sp, cpu)) >> +return 0; > > I'd like to avoid even this simple check since CPU hotplug could be heavily > used for power management. I don't think it's a problem: __cpu_up() contains a call to wait_for_completion_timeout() (which could eventually end up in the scheduler), so I don't think it could ever be on a 'really hot' path. For really frequent hotplug-like power management, cpu_suspend() makes use of firmware support to power-off cores - from what I can see it doesn't use __cpu_up(). 
>> + >> +irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER); >> +if (!irq_stack_page) { >> +pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n", >> +smp_processor_id(), cpu); >> +return -ENOMEM; >> +} >> +irq_stack = page_address(irq_stack_page); >> + >> +per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack >> + + THREAD_START_SP; >> + >> +return 0; >> +} >> + >> +/* >> + * Copy struct thread_info between two stacks, and update current->stack. >> + * This is used when moving to the irq exception stack. >> + * Changing current->stack is necessary so that non-arch thread_info writers >> + * don't use the new thread_info->task->stack to find the old thread_info >>
Re: [PATCH] arm64: kernel: Use a separate stack for irq interrupts.
On 09/09/15 14:22, Jungseok Lee wrote: > On Sep 9, 2015, at 1:47 AM, James Morse wrote: >> On 08/09/15 15:54, Jungseok Lee wrote: >>> On Sep 7, 2015, at 11:36 PM, James Morse wrote: >>>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c >>>> index 463fa2e7e34c..10b57a006da8 100644 >>>> --- a/arch/arm64/kernel/irq.c >>>> +++ b/arch/arm64/kernel/irq.c >>>> @@ -26,11 +26,14 @@ >>>> #include >>>> #include >>>> #include >>>> +#include >>>> #include >>>> #include >>>> >>>> unsigned long irq_err_count; >>>> >>>> +DEFINE_PER_CPU(unsigned long, irq_sp) = 0; >>>> + >>>> int arch_show_interrupts(struct seq_file *p, int prec) >>>> { >>>> #ifdef CONFIG_SMP >>>> @@ -55,6 +58,10 @@ void __init init_IRQ(void) >>>>irqchip_init(); >>>>if (!handle_arch_irq) >>>>panic("No interrupt controller found."); >>>> + >>>> + /* Allocate an irq stack for the boot cpu */ >>>> + if (alloc_irq_stack(smp_processor_id())) >>>> + panic("Failed to allocate irq stack for boot cpu."); >>>> } >>>> >>>> #ifdef CONFIG_HOTPLUG_CPU >>>> @@ -117,3 +124,48 @@ void migrate_irqs(void) >>>>local_irq_restore(flags); >>>> } >>>> #endif /* CONFIG_HOTPLUG_CPU */ >>>> + >>>> +/* Allocate an irq_stack for a cpu that is about to be brought up. */ >>>> +int alloc_irq_stack(unsigned int cpu) >>>> +{ >>>> + struct page *irq_stack_page; >>>> + union thread_union *irq_stack; >>>> + >>>> + /* reuse stack allocated previously */ >>>> + if (per_cpu(irq_sp, cpu)) >>>> + return 0; >>> >>> I'd like to avoid even this simple check since CPU hotplug could be heavily >>> used for power management. >> >> I don't think it's a problem: >> __cpu_up() contains a call to wait_for_completion_timeout() (which could >> eventually end up in the scheduler), so I don't think it could ever be on a >> 'really hot' path. >> >> For really frequent hotplug-like power management, cpu_suspend() makes use >> of firmware support to power-off cores - from what I can see it doesn't use >> __cpu_up(). > > In case of some platforms, CPU hotplug is triggered via sysfs for power > management > based on user data. What is the advantage of putting stack allocation into this > path? It will only happen for CPUs that are brought up. > IRQ stack allocation is a critical one-shot operation. So, there would be no > issue > to give this work to a booting core. I agree, but: From include/linux/cpumask.h: > * If HOTPLUG is enabled, then cpu_possible_mask is forced to have > * all NR_CPUS bits set, otherwise it is just the set of CPUs that > * ACPI reports present at boot. (This doesn't seem to happen with DT - but might with ACPI.) NR_CPUS could be much bigger than the number of cpus the system ever has. Allocating a stack for every possible cpu would waste memory. It is better to do it just-in-time, when we know the memory will be used. This already happens for the per-cpu idle task (please check I traced these through correctly!) _cpu_up() idle_thread_get() init_idle() fork_idle() copy_process() dup_task_struct() alloc_task_struct_node() alloc_thread_info_node() arch_dup_task_struct() So plenty of memory-allocation occurs during _cpu_up(); idle_init() checks whether the idle task has already been created. >>>> + >>>> + irq_stack_page = alloc_kmem_pages(THREADINFO_GFP, THREAD_SIZE_ORDER); >>>> + if (!irq_stack_page) { >>>> + pr_crit("CPU%u: failed to allocate irq stack for cpu %u\n", >>>> + smp_processor_id(), cpu); >>>> + return -ENOMEM; >>>> + } >>>> + irq_stack = page_address(irq_stack_page); > > I forgot leaving a comment here. How about using __get_free_pages? 
It returns an address instead of a page. Good idea; I was paranoid about the stack not being correctly aligned (as current_thread_info() relies on this). I lifted that alloc_kmem_pages line from kernel/fork.c. >>>> + >>>> + per_cpu(irq_sp, cpu) = (unsigned long)irq_stack->stack >>>> + + THREAD_START_SP; >>>> + >>>> + return 0; >>>> +}
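On the alignment worry: the buddy allocator returns blocks naturally aligned to their order, which is exactly what the THREAD_SIZE masking relies on. A sketch, using the names from the patches above:
==%<==
/*
 * __get_free_pages() returns a block naturally aligned to its order,
 * so an order-2 (16K with 4K pages) allocation is 16K-aligned --
 * which is what the '& ~(THREAD_SIZE - 1)' masking in
 * current_thread_info() depends on.
 */
unsigned long stack = __get_free_pages(THREADINFO_GFP, THREAD_SIZE_ORDER);

BUG_ON(stack & (THREAD_SIZE - 1));	/* cannot happen for an order-2 block */
==%<==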
Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
On 14/10/15 13:12, Jungseok Lee wrote: > On Oct 14, 2015, at 12:00 AM, Jungseok Lee wrote: >> On Oct 13, 2015, at 8:00 PM, James Morse wrote: >>> On 12/10/15 23:13, Jungseok Lee wrote: >>>> On Oct 13, 2015, at 1:34 AM, James Morse wrote: >>>>> Having two kmem_caches for 16K stacks on a 64K page system may be wasteful >>>>> (especially for systems with few cpus)… >>>> >>>> This would be a single concern. To address this issue, I drop the 'static' >>>> keyword in thread_info_cache. Please refer to the below hunk. >>> >>> It's only a problem on systems with 64K pages, which don't have a multiple >>> of 4 cpus. I suspect if you turn on 64K pages, you have many cores with >>> plenty of memory… >> >> Yes, the problem 'two kmem_caches' comes from only 64K page system. >> >> I don't get the statement 'which don't have a multiple of 4 cpus'. >> Could you point out what I am missing? > You're talking about sl{a|u}b allocator behavior. If so, I got what you meant. Yes, with Nx4 cpus, the (currently) 16K irq stacks take up Nx64K - a nice multiple of pages, so no wasted memory. >>> If this has been made a published symbol, it should go in a header file. >> >> Sure. > I had the wrong impression that there is room under include/linux/*. Yes, I see there isn't anywhere obvious to put it... > IMO, this is architectural option whether arch relies on thread_info_cache or > not. > In other words, it would be clear to put this extern under > arch/*/include/asm/*. It's up to the arch whether or not to define CONFIG_ARCH_THREAD_INFO_ALLOCATOR. In the case where it hasn't defined it, and THREAD_SIZE >= PAGE_SIZE, your change is exposing thread_info_cache on all architectures, so it ought to go in a header file accessible to all architectures. Something like this (only build-tested!): =%<= --- a/include/linux/thread_info.h +++ b/include/linux/thread_info.h @@ -10,6 +10,8 @@ #include #include +#include + struct timespec; struct compat_timespec; @@ -145,6 +147,12 @@ static inline bool test_and_clear_restore_sigmask(void) #error "no set_restore_sigmask() provided and default one won't work" #endif +#ifndef CONFIG_ARCH_THREAD_INFO_ALLOCATOR +#if THREAD_SIZE >= PAGE_SIZE +extern struct kmem_cache *thread_info_cache; +#endif /* THREAD_SIZE >= PAGE_SIZE */ +#endif /* CONFIG_ARCH_THREAD_INFO_ALLOCATOR */ + #endif /* __KERNEL__ */ #endif /* _LINUX_THREAD_INFO_H */ =%<= Quite ugly! My concern is there could be push-back from the maintainer of kernel/fork.c, saying "define CONFIG_ARCH_THREAD_INFO_ALLOCATOR if the generic code isn't what you need", and push-back from the arm64 maintainers about copy-pasting that chunk into arch/arm64, both of which are fair; hence my initial version created a second kmem_cache. Thanks, James
Re: [PATCH v4 2/2] arm64: Expand the stack trace feature to support IRQ stack
On 15/10/15 15:24, Jungseok Lee wrote: > On Oct 9, 2015, at 11:24 PM, James Morse wrote: >> I think unwind_frame() needs to walk the irq stack too. [2] is an example >> of perf tracing back to userspace (and there are patches on the list to >> do/fix this), so we need to walk back to the start of the first stack for >> the perf accounting to be correct. > > I plan to re-spin this series without [PATCH 2/2] since 1) Akashi's > approach looks better than mine and 2) you have the perf patches for [2]. They aren't my patches - the ones I saw on the list were for arm: https://lkml.org/lkml/2015/10/1/769 - it's evidently something perf supports, so we shouldn't make it worse... Thanks! James
Re: [PATCH v5] arm64: Introduce IRQ stack
Hi Jungseok, On 17/10/15 15:27, Jungseok Lee wrote: > Currently, kernel context and interrupts are handled using a single > kernel stack navigated by sp_el1. This forces a system to use 16KB > stack, not 8KB one. This restriction makes low memory platforms > suffer from memory pressure accompanied by performance degradation. > > This patch addresses the issue as introducing a separate percpu IRQ > stack to handle both hard and soft interrupts with two ground rules: > > - Utilize sp_el0 in EL1 context, which is not used currently > - Do not complicate current_thread_info calculation > > It is a core concept to directly retrieve struct thread_info from > sp_el0. This approach helps to prevent text section size from being > increased largely as removing masking operation using THREAD_SIZE > in tons of places. I was worried we could end up in schedule() while on the irq stack. This can't happen; just to save anyone else the trip down the rabbit-hole: Q> If TIF_NEED_RESCHED is set, and we have multiple calls to el1_irq() on the same stack - will the most-recent one to exit call el1_preempt() -> preempt_schedule_irq()? A> No, because the code that checks whether TIF_NEED_RESCHED is set also checks that preempt_count is zero, and __do_softirq() increases preempt_count by SOFTIRQ_OFFSET (via __local_bh_disable_ip()) before re-enabling interrupts, so there is no 'gap' in recursive use of the irq_stack where preempt_count can be zero. > --- > I've used Cc', not Tested-by tag, from James, since there is a gap > between v4 and v5. Re-tested, with both 4K and 64K pages. Tested-by: James Morse I also need to test this on top of Akashi Takahiro's series - in isolation this patch only lets perf/dump_stack() unwind as far as el1_irq(). (It would be good to note that dependency in the comments/changelog section - so that they get merged in the right order!) 
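In C terms, the rabbit-hole check above is roughly this - a paraphrase of the el1_preempt path in entry.S, not literal kernel code:
==%<==
/*
 * Paraphrase only: we reschedule from irq-return only when
 * preempt_count is zero, and __do_softirq() holds SOFTIRQ_OFFSET in
 * preempt_count across the window where it re-enables interrupts,
 * so a nested el1_irq() can never reschedule off the irq stack.
 */
if (!preempt_count() && test_thread_flag(TIF_NEED_RESCHED))
	preempt_schedule_irq();
==%<==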
> > Changes since v4: > - Supported 64KB page system > - Introduced IRQ_STACK_* macro, per Catalin > - Rebased on top of for-next/core > > Changes since v3: > - Expanded stack trace to support IRQ stack > - Added more comments > > Changes since v2: > - Optimised current_thread_info function as removing masking operation > and volatile keyword, per James and Catalin > - Reworked irq re-entrance check logic using top-bit comparison of > stacks, per James > - Added sp_el0 update in cpu_resume, per James > - Selected HAVE_IRQ_EXIT_ON_IRQ_STACK to expose this feature explicitly > - Added a Tested-by tag from James > - Added comments on sp_el0 as a helper message > > Changes since v1: > - Rebased on top of v4.3-rc1 > - Removed Kconfig about IRQ stack, per James > - Used PERCPU for IRQ stack, per James > - Tried to allocate IRQ stack when CPU is about to start up, per James > - Moved sp_el0 update into kernel_entry macro, per James > - Dropped S_SP removal patch, per Mark and James > > arch/arm64/Kconfig | 1 + > arch/arm64/include/asm/irq.h | 27 ++ > arch/arm64/include/asm/thread_info.h | 10 +++- > arch/arm64/kernel/entry.S| 42 ++-- > arch/arm64/kernel/head.S | 5 ++ > arch/arm64/kernel/irq.c | 24 + > arch/arm64/kernel/sleep.S| 3 ++ > arch/arm64/kernel/smp.c | 13 - > 8 files changed, 116 insertions(+), 9 deletions(-) > > diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h > index 0916929..2755b2f 100644 > --- a/arch/arm64/include/asm/irq.h > +++ b/arch/arm64/include/asm/irq.h > @@ -1,14 +1,40 @@ > #ifndef __ASM_IRQ_H > #define __ASM_IRQ_H > > +#ifndef CONFIG_ARM64_64K_PAGES > +#define IRQ_STACK_SIZE_ORDER 2 > +#endif > + > +#define IRQ_STACK_SIZE 16384 > +#define IRQ_STACK_START_SP (IRQ_STACK_SIZE - 16) If the plan is to have the irq stack the same size, it would be good to use one definition in the other - just to make it obvious. e.g. #define IRQ_STACK_SIZE THREAD_SIZE Are these used in assembly code? If not, they could go after the ifndef __ASSEMBLY__. > + > +#ifndef __ASSEMBLY__ > + > +#include > #include > +#include > > #include > > +#if IRQ_STACK_SIZE >= PAGE_SIZE > +static inline void *__alloc_irq_stack(void) > +{ > + return (void *)__get_free_pages(THREADINFO_GFP | __GFP_ZERO, > + IRQ_STACK_SIZE_ORDER); Need to include linux/thread_info.h for THREADINFO_GFP, and linux/gfp.h for __GFP_ZERO - although, depending on CONFIG_DEBUG_STACK_USAGE, THREADINFO_GFP already includes __GFP_ZERO. > +} > +#else > +static inline void *__alloc_irq_stack(void) > +{ > + return kmalloc(IRQ_STACK_SIZE, THREADINFO_GFP | __GFP_ZERO); > +} > +#end
Re: [PATCH v5] arm64: Introduce IRQ stack
On 20/10/15 16:05, Jungseok Lee wrote: > On Oct 20, 2015, at 7:05 PM, James Morse wrote: >> On 17/10/15 15:27, Jungseok Lee wrote: >>> diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c >>> index 9f17ec0..13fe8f4 100644 >>> --- a/arch/arm64/kernel/irq.c >>> +++ b/arch/arm64/kernel/irq.c >>> @@ -30,6 +30,8 @@ >>> >>> unsigned long irq_err_count; >>> >>> +DEFINE_PER_CPU(void *, irq_stacks); >>> + >>> int arch_show_interrupts(struct seq_file *p, int prec) >>> { >>> show_ipi_list(p, prec); >>> @@ -47,9 +49,31 @@ void __init set_handle_irq(void (*handle_irq)(struct >>> pt_regs *)) >>> handle_arch_irq = handle_irq; >>> } >>> >>> +static char boot_irq_stack[IRQ_STACK_SIZE] __aligned(IRQ_STACK_SIZE); >> >> Is kmalloc() not available this early? Regardless: >> As Catalin is happy with the Image size increase [0], this could become >> something like: >>> DEFINE_PER_CPU(union thread_union, irq_stack); >> Which will remove the need for the __alloc_irq_stack()s. > We cannot rely on static allocation using percpu in case of 4KB page system. > Since ARM64 utilizes generic setup_per_cpu_areas(), tpidr_el1 is PAGE_SIZE > aligned. That is, IRQ stack is allocated with PAGE_SIZE alignment for > secondary > cores. However, the top-bit method works well under the assumption that IRQ > stack is IRQ_STACK_SIZE aligned. It leads to IRQ re-entrance check failure. Yikes! That is nasty... well caught! Now I understand why you had the per-cpu version #ifdef'd in your example hunk earlier! Do we need the irq stack to be aligned like this? It was originally for the old implementation of current_thread_info(), which this patch changes. If it's just the re-entrance check that needs the alignment, maybe the irq_count approach is better (but count late not early), and drop the alignment requirement on interrupt stacks. We know re-entrant irqs will keep sp_el0, so the new current_thread_info() still works. I think Catalin's comment is to count like x86 (64 bit version) does in arch/x86/entry/entry_64.S:do_softirq_own_stack, and treat this as a re-entrance flag in entry.S. Task stacks still need to be aligned, as when user space is interrupted, we have a kernel stack, and no idea where its struct task_struct is, unless we know it was originally THREAD_SIZE aligned. James
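To spell out the failure mode: the re-entrance test compares the bits above log2(IRQ_STACK_SIZE), so it only works if every irq stack is IRQ_STACK_SIZE-aligned. A sketch of the check (illustrative only):
==%<==
/*
 * Sketch: two stack pointers are on the same irq stack iff their
 * IRQ_STACK_SIZE-aligned bases match. A stack that is only
 * PAGE_SIZE-aligned (as percpu data is, with 4K pages) can straddle
 * an IRQ_STACK_SIZE boundary, so two sps on the *same* stack could
 * compare as different - defeating re-entrance detection.
 */
static inline bool same_irq_stack(unsigned long sp1, unsigned long sp2)
{
	return (sp1 & ~(IRQ_STACK_SIZE - 1)) ==
	       (sp2 & ~(IRQ_STACK_SIZE - 1));
}
==%<==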
Re: [PATCH v3] arm64: Introduce IRQ stack
Hi, On 22/09/15 13:11, Jungseok Lee wrote: > Currently, kernel context and interrupts are handled using a single > kernel stack navigated by sp_el1. This forces a system to use 16KB > stack, not 8KB one. This restriction makes low memory platforms suffer > from memory pressure accompanied by performance degradation. > > This patch addresses the issue as introducing a separate percpu IRQ > stack to handle both hard and soft interrupts with two ground rules: > > - Utilize sp_el0 in EL1 context, which is not used currently > - Do not complicate current_thread_info calculation > > It is a core concept to directly retrieve struct thread_info from > sp_el0. This approach helps to prevent text section size from being > increased largely as removing masking operation using THREAD_SIZE > in tons of places. One observed change in behaviour: Any stack-unwinding now stops at el1_irq(), which is the bottom of the irq stack. This shows up with perf (using incantation [0]), and with any calls to dump_stack() (which actually stops the frame before el1_irq()). I don't know if this will break something (perf still seems to work), but it makes the panic() output less useful, as all the 'other' cpus print: > CPU3: stopping > CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.3.0-rc3+ #223 > Hardware name: ARM Juno development board (r1) (DT) > Call trace: > [] dump_backtrace+0x0/0x164 > [] show_stack+0x1c/0x28 > [] dump_stack+0x88/0xc8 > [] handle_IPI+0x258/0x268 > [] gic_handle_irq+0x88/0xa4 > Exception stack(0x8009769e3fc0 to 0x8009769e40e0) > <...values from stack ...> > CPU4: stopping > CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.3.0-rc3+ #223 > Hardware name: ARM Juno development board (r1) (DT) > Call trace: So we don't get to see what they were doing, as the IPI-irq and subsequent switch to the irq_stack hide the state. I was trying to fix this with the other version (see the changes to kernel/stacktrace.c), but as Akashi Takahiro pointed out, I wasn't quite right... I will try to produce a fragment to tidy this up next week. Thanks, James [0] perf record -e mem::x -ag -- sleep 10; perf report --call-graph --stdio
Re: KVM guest sometimes failed to boot because of kernel stack overflow if KPTI is enabled on a hisilicon ARM64 platform.
Hi Wei, On 20/06/18 16:52, Wei Xu wrote: > On 2018/6/20 22:42, Will Deacon wrote: >> Hmm, I wonder if this is at all related to RAS, since we've just enabled >> that and if we take a fault whilst rewriting swapper then we're going to >> get stuck. What happens if you set CONFIG_ARM64_RAS_EXTN=n in the guest? > > I will try it now. It's not just the Kconfig symbol; could you also revert: f751daa4f9d3 ("arm64: Unconditionally enable IESB on exception entry/return for firmware-first") (reverts and builds cleanly on 4.17) Thanks, James
Re: KVM guest sometimes failed to boot because of kernel stack overflow if KPTI is enabled on a hisilicon ARM64 platform.
Hi Will, Wei, On 20/06/18 17:25, Wei Xu wrote: > On 2018/6/20 23:54, James Morse wrote: > I have disabled CONFIG_ARM64_RAS_EXTN and reverted that commit. > But I still got the stack overflow issue sometimes. > Do you have more hint? > The log is as below: > [ 0.00] Booting Linux on physical CPU 0x00 [0x480fd010] > [ 0.00] Linux version 4.17.0-45865-g2b31fe7-dirty Could you reproduce this with v4.17? This says there are ~45,000 extra patches, and un-committed changes. None of the hashes so far have been commits in mainline, so we have no idea what this tree is. > (joyx@Turing-Arch-b) (gcc version 4.9.1 20140505 (prerelease) (crosstool-NG > linaro-1.13.1-4.9-2014.05 - Linaro GCC 4.9-2014.05)) #10 SMP PREEMPT Wed Jun > 20 > 23:59:05 CST 2018 > [ 0.00] CPU0: using LPI pending table @0x7d86 > [ 0.00] GIC: PPI11 is secure or misconfigured > [ 0.00] arch_timer: WARNING: Invalid trigger for IRQ3, assuming level > low > [ 0.00] arch_timer: WARNING: Please fix your firmware > [ 0.00] arch_timer: cp15 timer(s) running at 100.00MHz (virt). (No idea what these mean, but I doubt they are relevant) > [ 0.042421] Insufficient stack space to handle exception! > [ 0.042423] ESR: 0x9646 -- DABT (current EL) > [ 0.043730] FAR: 0x093a80e0 > [ 0.044714] Task stack: [0x093a8000..0x093ac000] This was a level 2 translation fault on a write, to an address that is within the stack. > [ 0.051113] IRQ stack: [0x0800..0x08004000] > [ 0.057610] Overflow stack: [0x80003efce2f0..0x80003efcf2f0] > [ 0.064003] CPU: 0 PID: 12 Comm: migration/0 Not tainted > 4.17.0-45865-g2b31fe7-dirty #10 > [ 0.072201] Hardware name: linux,dummy-virt (DT) > [ 0.076797] pstate: 604003c5 (nZCv DAIF +PAN -UAO) > [ 0.081727] pc : el1_sync+0x0/0xb0 ... from the vectors. > [ 0.085217] lr : kpti_install_ng_mappings+0x120/0x214 What I think is happening is: we come out of the kpti idmap with the stack unmapped. Shortly after we access the stack, which faults. el1_sync faults as well when it tries to push the registers to the stack, and we keep going until we overflow the stack. I can't reproduce this with kvmtool or qemu in the model. Thanks, James
Re: [PATCH 1/1] arm64/mm: move {idmap_pg_dir,tramp_pg_dir,swapper_pg_dir} to .rodata section
Hi guys, On 21/06/18 07:39, Ard Biesheuvel wrote: > On 21 June 2018 at 04:51, Jun Yao wrote: >> On Wed, Jun 20, 2018 at 12:09:49PM +0200, Ard Biesheuvel wrote: >>> On 20 June 2018 at 10:57, Jun Yao wrote: Move {idmap_pg_dir,tramp_pg_dir,swapper_pg_dir} to .rodata section. And update the swapper_pg_dir by fixmap. >>> >>> I think we may be able to get away with not mapping idmap_pg_dir and >>> tramp_pg_dir at all. >> >> I think we need to move tramp_pg_dir to .rodata. The attacker can write >> a block-mapping(AP=01) to tramp_pg_dir and then he can access kernel >> memory. > Why does it need to be mapped at all? When do we ever access it from the code? (We would want to make its fixmap entry read-only too) >>> As for swapper_pg_dir, it would indeed be nice if we could keep those >>> mappings read-only most of the time, but I'm not sure how useful this >>> is if we apply it to the root level only. >> >> The purpose of it is to make 'KSMA' harder, where an single arbitrary >> write is used to add a block mapping to the page-tables, giving the >> attacker full access to kernel memory. That's why we just apply it to >> the root level only. If the attacker can arbitrary write multiple times, >> I think it's hard to defend. >> > > So the assumption is that the root level is more easy to find? > Otherwise, I'm not sure I understand why being able to write a level 0 > entry is so harmful, given that we don't have block mappings at that > level. I think this thing assumes 3-level page tables with 39bit VA. @@ -417,12 +421,22 @@ static void __init __map_memblock(pgd_t *pgdp, phys_addr_t start, void __init mark_linear_text_alias_ro(void) { + size = (unsigned long)__init_begin - (unsigned long)swapper_pg_end; + update_mapping_prot(__pa_symbol(swapper_pg_end), + (unsigned long)lm_alias(swapper_pg_end), + size, PAGE_KERNEL_RO); >>> I don't think this is necessary. Even if some pages are freed, it >>> doesn't harm to keep a read-only alias of them here since the new >>> owner won't access them via this mapping anyway. So we can keep >>> .rodata as a single region. >> >> To be honest, I didn't think of this issue at first. I later found a >> problem when testing the code on qemu: > > OK, you're right. I missed the fact that this operates on the linear > alias, not the kernel mapping itself. > > What I don't like is that we lose the ability to use block mappings > for the entire .rodata section this way. Isn't it possible to move > these pgdirs to the end of the .rodata segment, perhaps by using a > separate input section name and placing that explicitly? We could even > simply forget about freeing those pages, given that [on 4k pages] the > benefit of freeing 12 KB of space is likely to get lost in the > rounding noise anyway [segments are rounded up to 64 KB in size] I assumed that to move swapper_pg_dir into the .rodata section we would need to break it up. Today it's ~3 levels, which we set up in head.S, then do a dance in paging_init() so that swapper_pg_dir is always the top level. We could generate all levels of the 'init_pg_dir' in the __initdata section, then copy only the top level into swapper_pg_dir, in the .rodata section, during paging_init(). Thanks, James
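For illustration, the copy step described here might end up looking something like the sketch below. This is only a sketch of the idea under discussion, not code from the thread: init_pg_dir, map_kernel()/map_mem() and pgd_set_fixmap() are assumed names following arm64 conventions.

/*
 * Sketch: build the final tables in an __initdata scratch pgd, then
 * copy only the top level into the read-only swapper_pg_dir through
 * a writable fixmap alias. All names here are illustrative.
 */
static pgd_t init_pg_dir[PTRS_PER_PGD] __initdata __aligned(PAGE_SIZE);

void __init paging_init(void)
{
	pgd_t *fixmap_pgdp;

	map_kernel(init_pg_dir);	/* assumed helpers that populate all levels */
	map_mem(init_pg_dir);

	/* swapper_pg_dir is in .rodata, so write it via the fixmap */
	fixmap_pgdp = pgd_set_fixmap(__pa_symbol(swapper_pg_dir));
	memcpy(fixmap_pgdp, init_pg_dir, PGD_SIZE);
	pgd_clear_fixmap();

	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
}

The lower levels referenced by the copied entries stay live; only the scratch top level is thrown away with the rest of __initdata.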
Re: [PATCH 1/1] arm64/mm: move {idmap_pg_dir,tramp_pg_dir,swapper_pg_dir} to .rodata section
Hi Ard, On 21/06/18 10:29, Ard Biesheuvel wrote: > On 21 June 2018 at 10:59, James Morse wrote: >> On 21/06/18 07:39, Ard Biesheuvel wrote: >>> On 21 June 2018 at 04:51, Jun Yao wrote: >>>> On Wed, Jun 20, 2018 at 12:09:49PM +0200, Ard Biesheuvel wrote: >>>>> On 20 June 2018 at 10:57, Jun Yao wrote: >>>>>> Move {idmap_pg_dir,tramp_pg_dir,swapper_pg_dir} to .rodata >>>>>> section. And update the swapper_pg_dir by fixmap. >>>>> >>>>> I think we may be able to get away with not mapping idmap_pg_dir and >>>>> tramp_pg_dir at all. >>>> >>>> I think we need to move tramp_pg_dir to .rodata. The attacker can write >>>> a block-mapping(AP=01) to tramp_pg_dir and then he can access kernel >>>> memory. >> >>> Why does it need to be mapped at all? When do we ever access it from the >>> code? >> >> (We would want to make its fixmap entry read-only too) > > It already is. Sorry, I missed that, >>>>> As for swapper_pg_dir, it would indeed be nice if we could keep those >>>>> mappings read-only most of the time, but I'm not sure how useful this >>>>> is if we apply it to the root level only. >>>> >>>> The purpose of it is to make 'KSMA' harder, where an single arbitrary >>>> write is used to add a block mapping to the page-tables, giving the >>>> attacker full access to kernel memory. That's why we just apply it to >>>> the root level only. If the attacker can arbitrary write multiple times, >>>> I think it's hard to defend. >>> >>> So the assumption is that the root level is more easy to find? >>> Otherwise, I'm not sure I understand why being able to write a level 0 >>> entry is so harmful, given that we don't have block mappings at that >>> level. >> >> I think this thing assumes 3-level page tables with 39bit VA. > The attack, you mean? Because this code is unlikely to build with that > configuration, given that __pgd_populate() BUILD_BUG()s in that case. Yes, the attack. (I struggle to think of it as an 'attack' because you already have arbitrary write...) >>>>> @@ -417,12 +421,22 @@ static void __init __map_memblock(pgd_t *pgdp, >>>>> phys_addr_t start, >>>>>> >>>>>> void __init mark_linear_text_alias_ro(void) >>>>>> { >> >>>>>> + size = (unsigned long)__init_begin - (unsigned >>>>>> long)swapper_pg_end; >>>>>> + update_mapping_prot(__pa_symbol(swapper_pg_end), >>>>>> + (unsigned long)lm_alias(swapper_pg_end), >>>>>> + size, PAGE_KERNEL_RO); >>>>> >>>>> I don't think this is necessary. Even if some pages are freed, it >>>>> doesn't harm to keep a read-only alias of them here since the new >>>>> owner won't access them via this mapping anyway. So we can keep >>>>> .rodata as a single region. >>>> >>>> To be honest, I didn't think of this issue at first. I later found a >>>> problem when testing the code on qemu: >>> >>> OK, you're right. I missed the fact that this operates on the linear >>> alias, not the kernel mapping itself. >>> >>> What I don't like is that we lose the ability to use block mappings >>> for the entire .rodata section this way. Isn't it possible to move >>> these pgdirs to the end of the .rodata segment, perhaps by using a >>> separate input section name and placing that explicitly? We could even >>> simply forget about freeing those pages, given that [on 4k pages] the >>> benefit of freeing 12 KB of space is likely to get lost in the >>> rounding noise anyway [segments are rounded up to 64 KB in size] >> >> I assumed that to move swapper_pg_dir into the .rodata section we would need >> to >> break it up. 
Today it's ~3 levels, which we set up in head.S, then do a dance >> in >> paging_init() so that swapper_pg_dir is always the top level. >> >> We could generate all levels of the 'init_pg_dir' in the __initdata section, >> then >> copy only the top level into swapper_pg_dir, in the .rodata section, during >> paging_init(). > Is that complexity truly justified for a security sensitive piece of > code? Wouldn't this be less complex? (I've probably explained it badly.) Today head.S builds the initial page tables in ~3 levels of swapper_pg_dir, then during paging_init() we build new tables with a temporary top level. We switch to the temporary top level, then copy over the first level of swapper_pg_dir, then switch back to swapper_pg_dir. Finally we free the no-longer-used levels of swapper_pg_dir. This looks like re-inventing __initdata for the bits of page table we eventually free. What I tried to describe is building the head.S/initial-page-tables in a reserved area of the __initdata section. We no longer need a temporary top-level; we can build the final page tables directly in swapper_pg_dir, which means one fewer round of cpu_replace_ttbr1(). > Can't we just drop the memblock_free() and be done with it? That works, I assumed it would be at least frowned on! Thanks, James
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi Boris, On 24/08/18 13:01, Borislav Petkov wrote: > On Fri, Aug 24, 2018 at 10:48:24AM +0100, James Morse wrote: >> so edac_raw_mc_handle_error() has no clue where the error happened. (I >> haven't >> read what it does with this information yet). > > See edac_inc_ce_error(), for example - it uses the layers which are not > negative (-1) to increment the error counts of the respective layer. It > all depends on what granularity of the hardware part you're reporting > the error for: is it a DIMM rank, a whole DIMM or for a channel which > can span multiple DIMM ranks. And so on... > > Look at some of the drivers and how they're doing that layering. It all > depends on whether you can get the precise info from the hw. Hmmm, in this example we need the information from firmware, as that is where ghes-edac gets its information. We already count the module/device/dimms in the smbios table; memory is described as 'EDAC_MC_LAYER_ALL_MEM' with num_dimms. I think all we're missing is which dimm in ghes_edac_report_mem_error(). We have the handle; we just need a number between 1 and num_dimms. If it turns out firmware doesn't populate the handles in its cper records, then we can keep e->enable_per_layer_report false when calling edac_raw_mc_handle_error(). (I suggest we ignore 'card', and just do this for the device/dimms). >> Naively I thought we could generate some index during >> ghes_edac_count_dimms(), >> and use this as e->${whichever}_layer. I hoped there would be something we >> could >> already use as the index, but I can't spot it, so this will be more than the >> one-liner I was hoping for! > > If you can get that info from the hardware and injecting an error into > a DIMM gives you the correct DIMM number so that we can increment the > proper counter, then you're golden. I don't think that works reliably on > x86, though, therefore the lumping together. ... 'correct DIMM number' ... Does x86 have another source of memory-topology information it needs to correlate smbios with? For arm there is nothing else describing the memory-topology, so as long as we can correlate the smbios table and ghes:cper records through the handles, we can get this working for all systems. Thanks, James
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi Fan, On 24/08/18 15:30, wufan wrote: >> Why avoid the layer stuff? Isn't counting DIMM/memory-devices what >> EDAC_MC_LAYER_SLOT is for? > > Borislav has explained it in his response. Here let me elaborate a little > more. > To use the layer information you need an accurate way to pinpoint each > component > in the layer and the parent components in the layers above. For example, to > use > EDAC_MC_LAYER_SLOT you also need information for the parent layer say > EDAC_MC_LAYER_CHANNEL, or another layer on top say EDAC_MC_LAYER_BRANCH. I haven't spotted anything that forces a particular meaning/topology on these types. (There are four of them, but only three 'levels'.) i3000_edac.c has EDAC_MC_LAYER_CHIP_SELECT then EDAC_MC_LAYER_CHANNEL, but i5400_edac.c has EDAC_MC_LAYER_BRANCH then EDAC_MC_LAYER_CHANNEL. pnd2_edac.c agrees that channel comes before slot, but thunderx_edac.c just uses slot directly. ghes-edac already describes memory as 'EDAC_MC_LAYER_ALL_MEM' with num_dimms; I think we just need to get 'the' dimm number. Using the cper-module-handle means we don't have to worry about firmware's count of dimms being different, as we both agree it's smbios-type-17 we're talking about. > There > are no clear ways to get the information from SMBIOS table. In the case of > "memory > channel" we looked at type 37 which has the exact spelling but it was > introduced > to support RamBus and Synclink. Not sure we can readily use it for modern > architecture concept of "channel/slot". I think we should avoid this 'channel' thing as it means different things to different people! We can map ghes:card to smbios:physical-memory-array, as the UEFI spec tells us they are the same, and we don't actually need to know what they mean. > I think it is good enough if we can pin each error to the corresponding DIMM. > At the end of the day DIMMs are what customer can replace in the memory system > and that's all that they care about. For the manufacturers of the board/chips > they have the knowledge to map the specific DIMMs to the upper layer > components, > so they can easily collect error counter data for upper layers. I agree. >> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE >> should provide the information necessary to identify the failing FRU". As >> EDAC has three 'levels', these are what they should correspond to for ghes- >> edac. >> >> I assume NODE means rack/chassis in some distributed system. Let's ignore it >> as it doesn't seem to map to anything in the SMBIOS table. > > How about type 4 "Processor Information"? As the spec doesn't tell us what the field means, we can't really do anything other than print the value out. >> 'Card' doesn't mean much to me, but it maps to SMBIOS:16 "Memory Array >> Structure", which the Memory Device structure also points to. >> Card then must mean "a collection of memory devices (DIMMs) that operate >> together to form an address space". >> >> This might be what I think of as a memory-controller, or it might be >> something more complicated. Regardless, the CPER records think it's relevant. > > Originally I thought "Card" were memory channel. But looking at the definition > of "Card Handle" in CPER: "... this field contains the SMBIOS handle for the > Type 16 Memory Array Structure that represents the memory card". So Card is > memory controller or something similar to that. > Right now ghes-edac assumes > one mc. We probably need to map mc(s) to the type 16 instances in SMBIOS > table.
I think we should ignore 'mc's, and just report the dimm numbers. ghes-edac only cares about the number of dimms today, and this would work on systems that only describe the dimms in the smbios table. Thanks, James
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi Tyler, On 24/08/18 16:14, Tyler Baicar wrote: > On Fri, Aug 24, 2018 at 5:48 AM, James Morse wrote: >> On 23/08/18 16:46, Tyler Baicar wrote: >>> On Thu, Aug 23, 2018 at 5:29 AM James Morse wrote: >>>> On 19/07/18 19:36, Tyler Baicar wrote: >>>>> This seems pretty hacky to me, so if anyone has other suggestions please >>>>> share >>>>> them. >>>> >>>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should >>>> provide >>>> the information necessary to identify the failing FRU". As EDAC has three >>>> 'levels', these are what they should correspond to for ghes-edac. >>>> >>>> I assume NODE means rack/chassis in some distributed system. Let's ignore >>>> it as >>>> it doesn't seem to map to anything in the SMBIOS table. >>> >>> I believe NODE should map to socket number for multi-socket systems. >> >> Isn't the Memory Array Structure still unique in a multi-socket system? If so >> the node isn't telling us anything new. > Yes, the Memory Array structure in SMBIOS is still unique, but the NODE value > is needed in NODE, CARD, MODULE because the CARD number here typically > maps to channel number which each socket has their own channel numbers. /me flinches at 'typically'. Okay, so the hierarchy applies to the numbers, not the handles. How come there isn't a node handle? I think we should ignore the extra hierarchy. The module/devices/dimms are the only replaceable part, and we don't know if the firmware will provide the information. >> Do sockets show up in the SMBIOS table? We would need to know how many there >> are >> in advance. For arm systems the cpu topology from PPTT is the best bet for >> this >> information, but what do we do if that table is missing? (also, does firmware >> count from 1 or 0?) I suspect we can't use this field unless we know what the >> range of values is going to be in advance. > > As Fan mentioned in his response, what the customers really care about > is mapping to > a particular DIMM since that is what they can replace. To do this, the > Memory Device > handle should be enough since those are all unique regardless of > Memory Array handle > and which socket the DIMM is on. The Firmware I've worked with counts > from 0, but I'm > not sure if that is required. If the spec doesn't say, it's difficult to know the range of values until we've seen one of the limits. > That won't matter if we just use the Memory Device handle. I agree. >>> I think the proper way to get this working would be to use these handles. >>> We can >>> avoid populating this layer information and instead have a mapping of type >>> 17 >>> index number (how edac is numbering the DIMMs today) to the handle number. >> >> Why avoid the layer stuff? Isn't counting DIMM/memory-devices what >> EDAC_MC_LAYER_SLOT is for? > > The problem with the layer reporting is that you need to know all the > layer information as Fan mentioned. I haven't spotted what requires this; there seems to be a bit of a mix of numbers-of-layers and what order they go in. thunderx_edac.c just uses SLOT. I think we can get away with using EDAC_MC_LAYER_ALL_MEM, sized by num_dimms (which ghes-edac already does). Providing extra topology may not be useful unless the firmware populates this information. What do we do if we export card+module, but then firmware only populates the module-handle? > SoCs can support multiple board combinations (ie > 1DPC vs. 2DPC) > and there is no standardized way of knowing whether you are booted on a 1DPC > or > 2DPC board.
> >>> Then we will need a new function to increment the counter based on the >>> handle >>> number rather than this layer information. Is that how you are envisioning >>> it? >> >> I'm not familiar with edac's internals, so I didn't have any particular >> vision! >> >> Isn't the problem that ghes_edac_report_mem_error() does this: >> | e->top_layer = -1; >> | e->mid_layer = -1; >> | e->low_layer = -1; > > The other problem is that the sysfs nodes are all setup with a single > layer representing > all of the memory on the board. > > https://elixir.bootlin.com/linux/latest/source/drivers/edac/ghes_edac.c#L469 > > So the DIMM counters exposed in sysfs are all under a single memory > controller and just > numbered from 0 to n-1 based on the order in which the type 17 SMBIOS > entries s
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi Boris, On 29/08/18 08:38, Borislav Petkov wrote: > On Tue, Aug 28, 2018 at 06:09:24PM +0100, James Morse wrote: >> Does x86 have another source of memory-topology information it needs to >> correlate smbios with? > > Bah, pinpointing the DIMM on x86 is a mess. There's no reliable way to > say which DIMM it is in certain cases (interleaving, mirrorring, ...) > and it is all platform-dependent. So we do the layers to dump a memory > location (node, memory controller, ) so that we can at least limit > the number of DIMMs the user needs to replace/try. Right. I'd like ghes-edac to work in the same way for both architectures. I think this is best done by stuffing the dmi-handle in struct dimm_info during ghes_edac_dmidecode(), then populating the struct edac_raw_error_desc layers from the matching mci->dimms 'location'. For EDAC_MC_LAYER_ALL_MEM this boils down to a flat index, so pointer arithmetic on mci->dimms is an appropriate short cut. (We should probably 'FIXME: It shouldn't be hard to also fill the DIMM labels' at the same time so that no-one is tempted to interpret the edac:dimm-idx) > In an ideal world, I'd like to be able to query the SPD chips on the (oh, that can be done?) > DIMMs and build the topology and then when an error happens to say, > "error in DIMM " where silkscreen is what is written on the > motherboard under the DIMM socket. > > But I don't see that happening any time soon... >> For arm there is nothing else describing the memory-topology, so as long as >> we >> can correlate the smbios table and ghes:cper records through the handles, we >> can >> get this working for all systems. > And then make sure vendors fill in the proper info in smbios. Because that's > also a mess on x86. I got educated by the people who look after specifications last time I touched this [0]. SMBIOS tables are required by Arm's 'Server Base Boot Requirements'; it lists the memory-device and physical-memory-array as required. I will drop them a note that we will be depending on the handle, and it should go on the list too... if it's not populated on today's systems we can fall back to !e->enable_per_layer_report as we do today. Thanks, James [0] https://www.spinics.net/lists/arm-kernel/msg653133.html
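Concretely, the 'flat index' lookup described here reduces to a short scan over mci->dimms once the SMBIOS handle has been stashed at ghes_edac_dmidecode() time. A sketch, assuming a smbios_handle field is added to struct dimm_info as proposed (the patch in the message below takes exactly this shape):

/*
 * Sketch: find the dimm whose saved SMBIOS handle matches the CPER
 * module handle. With a single EDAC_MC_LAYER_ALL_MEM layer, the
 * position in mci->dimms is the only 'location' needed.
 */
static int ghes_dimm_index(struct mem_ctl_info *mci, u16 handle)
{
	int i;

	for (i = 0; i < mci->tot_dimms; i++) {
		if (mci->dimms[i]->smbios_handle == handle)
			return i;
	}

	return -1;	/* unknown handle: keep per-layer reporting disabled */
}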
Re: [PATCH] EDAC, ghes: use CPER module handles to locate DIMMs
Hi Fan, On 29/08/18 19:33, Fan Wu wrote: > The current ghes_edac driver does not update per-dimm error > counters when reporting memory errors, because there is no > platform-independent way to find DIMMs based on the error > information provided by firmware. I'd argue there is: it's in the CPER records; we just didn't do anything useful with the information in the past! > This patch offers a solution > for platforms whose firmwares provide valid module handles > (SMBIOS type 17) in error records. In this case ghes_edac will > use the module handles to locate DIMMs and thus makes per-dimm > error reporting possible. > diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c > index 473aeec..db527f0 100644 > --- a/drivers/edac/ghes_edac.c > +++ b/drivers/edac/ghes_edac.c > @@ -81,6 +81,26 @@ static void ghes_edac_count_dimms(const struct dmi_header > *dh, void *arg) > (*num_dimm)++; > } > > +static int ghes_edac_dimm_index(u16 handle) > +{ > + struct mem_ctl_info *mci; > + int i; > + > + if (!ghes_pvt) > + return -1; ghes_edac_report_mem_error() already checked this; as it's the only caller there is no need to check it again. > + mci = ghes_pvt->mci; > + > + if (!mci) > + return -1; Can this happen? ghes_edac_report_mem_error() would have dereferenced this already! If you need the struct mem_ctl_info, you may as well pass it in as the only caller has it to hand. > + > + for (i = 0; i < mci->tot_dimms; i++) { > + if (mci->dimms[i]->smbios_handle == handle) > + return i; > + } > + return -1; > +} > + > static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg) > { > struct ghes_edac_dimm_fill *dimm_fill = arg; > @@ -177,6 +197,8 @@ static void ghes_edac_dmidecode(const struct dmi_header > *dh, void *arg) > entry->total_width, entry->data_width); > } > > + dimm->smbios_handle = entry->handle; We aren't checking for duplicate handles (e.g. they're all zero). I think this is fine as chances are firmware on those systems won't set CPER_MEM_VALID_MODULE_HANDLE. If it does, the handle it gives us is ambiguous, and we pick a dimm, instead of whining about broken firmware tables. (I'm just drawing attention to it in case someone disagrees) > dimm_fill->count++; > } > } > @@ -327,12 +349,20 @@ void ghes_edac_report_mem_error(int sev, struct > cper_sec_mem_err *mem_err) > p += sprintf(p, "bit_pos:%d ", mem_err->bit_pos); > if (mem_err->validation_bits & CPER_MEM_VALID_MODULE_HANDLE) { > const char *bank = NULL, *device = NULL; > + int index = -1; > + > dmi_memdev_name(mem_err->mem_dev_handle, &bank, &device); > + p += sprintf(p, "DIMM DMI handle: 0x%.4x ", > + mem_err->mem_dev_handle); > if (bank != NULL && device != NULL) > p += sprintf(p, "DIMM location:%s %s ", bank, device); > - else > - p += sprintf(p, "DIMM DMI handle: 0x%.4x ", > - mem_err->mem_dev_handle); Why do we now print the handle every time? The handle is pretty meaningless; it can only be used to find the location-strings, and if we get those we print them instead. > + index = ghes_edac_dimm_index(mem_err->mem_dev_handle); > + if (index >= 0) { > + e->top_layer = index; > + e->enable_per_layer_report = true; > + } > + > } > if (p > e->location) > *(p - 1) = '\0'; Looks good to me! Thanks, James
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi guys, (CC: +Fan Wu) On 19/07/18 19:36, Tyler Baicar wrote: > On 7/19/2018 10:46 AM, James Morse wrote: >> On 19/07/18 15:01, Borislav Petkov wrote: >>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote: >>>> Enable per-layer error reporting for ARM systems so that the error >>>> counters are incremented per-DIMM. This 'layer' term seems to be EDAC's artificial view of memory. >> I'm guessing this is the mapping between CPER records and the DMI table data. > Unfortunately the DMI table doesn't actually have channel and DIMM number > values > which makes this more complicated than I originally thought... >>>> information and enable per layer error reporting for ARM systems so that >>>> the EDAC error counters are incremented based on DIMM number as per the >>>> SMBIOS table rather than just incrementing the noinfo counters on the >>>> memory controller. >> Does this work on x86, and is it just that the dmi/cper fields have a subtle >> difference? > There are CPU specific EDAC drivers for a lot of x86 folks and those drivers > populate the layer information in a custom way. Not for GHES surely? > With more investigation and testing it turns out a simple patch like this is > not > going to work. This worked for > me on a 1DPC board since the card number turned out to always be the same as > the > index into the DMI table > to find the proper DIMM. On a 2DPC board this fails completely. The ghes_edac > driver only sets up a single > layer so it is only using the card number with this patch. (DPC == DIMM per Channel?) > So it is only setting up a single layer with all the DIMMs on that layer. In > order to properly enable the layers > to represent channel and DIMM number on that channel, we would need to have a > way of determining the > number of channels (which would be layers[0].size) and the number of DIMMs > each > channel supported > (layers[1].size). There doesn't appear to be a way to determine that > information > at this point. So, we're after a mapping for EDAC:layers that includes 'channel'? What would you do with a channel counter? Isn't that part of the SoC? (it can't be replaced!) > The goal is to be able to enable the per layer error reporting in the > ghes_edac > driver so that the per dimm counters exposed in the EDAC sysfs nodes are > properly > updated. What do you mean by layer? I can't find anything in the ACPI/UEFI/SMBIOS specs that uses this term... If it's just 'per dimm counters' you're after, this looks straightforward. > The other obvious but more messy way would be to have notifiers register to > be called > by ghes_edac and have a custom EDAC driver for each CPU to properly populate > their layer > information. Yuck. This isn't platform specific, it's firmware specific. You're hooking the driver to say "my firmware thinks 'card' means this". Where would this information come from? We don't want per-soc, per-firmware-version code mangling what was supposed to be a standard interface. ... we can't be the first people to try and do this ... [re-ordered hunk:] > This seems pretty hacky to me, so if anyone has other suggestions please share > them. CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should provide the information necessary to identify the failing FRU". As EDAC has three 'levels', these are what they should correspond to for ghes-edac. I assume NODE means rack/chassis in some distributed system. Let's ignore it as it doesn't seem to map to anything in the SMBIOS table.
The CPER record's card and module numbers are useless to us, as we need to know how many there will be in advance. (Does this version of firmware count from 0 or 1?) ... but CPER also gives us a 'Card Handle' and 'Module Handle'. 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is a word-value in the structure, so it doesn't depend on the layout/parse-order of the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some level-idx, then use the handle to find which level-idx to use for this DIMM. ghes_edac_report_mem_error() already picks up the module-handle, but only uses it to print the bank/device. 'Card' doesn't mean much to me, but it maps to SMBIOS:16 "Memory Array Structure", which the Memory Device structure also points to. Card then must mean "a collection of memory devices (DIMMs) that operate together to form an address space". This might be what I think of as a memory-controller, or it might be something more complicated. Regardless, the CPER records think it's relevant. For the edac:layers, we could walk the DMI table to find these structures, and build the layers from them. If the Memory-array-structures are missing, we can use the existing 1:NUM_DIMMS approach. Hope this makes sense! Thanks, James
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi Tyler, On 23/08/18 16:46, Tyler Baicar wrote: > On Thu, Aug 23, 2018 at 5:29 AM James Morse wrote: >> On 19/07/18 19:36, Tyler Baicar wrote: >>> On 7/19/2018 10:46 AM, James Morse wrote: >>>> On 19/07/18 15:01, Borislav Petkov wrote: >>>>> On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote: >>>>>> Enable per-layer error reporting for ARM systems so that the error >>>>>> counters are incremented per-DIMM. >> >> This 'layer' term seems to be EDAC's artificial view of memory. >> > > Yes, it's just the terminology that EDAC uses for locating a DIMM. > > "Layer" can mean several things here: > > https://elixir.bootlin.com/linux/latest/source/include/linux/edac.h#L318 Aha, it's an enum. I thought it was an upper/middle/lower mapping at the whim of the edac driver. [...] >> [re-ordered hunk:] >>> This seems pretty hacky to me, so if anyone has other suggestions please >>> share >>> them. >> >> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE should >> provide >> the information necessary to identify the failing FRU". As EDAC has three >> 'levels', these are what they should correspond to for ghes-edac. >> >> I assume NODE means rack/chassis in some distributed system. Let's ignore it >> as >> it doesn't seem to map to anything in the SMBIOS table. > > I believe NODE should map to socket number for multi-socket systems. Isn't the Memory Array Structure still unique in a multi-socket system? If so the node isn't telling us anything new. Do sockets show up in the SMBIOS table? We would need to know how many there are in advance. For arm systems the cpu topology from PPTT is the best bet for this information, but what do we do if that table is missing? (also, does firmware count from 1 or 0?) I suspect we can't use this field unless we know what the range of values is going to be in advance. I assumed this node must be a level of information above Card/Memory-Array's address-space. Somehow the Card handle is no longer unique; we need the node number too. If the CPER records were all being pumped at a single agent (a shared BMC in a blade/chassis thing), then this might matter. I suspect we can ignore it in Linux. >> The CPER record's card and module numbers are useless to us, as we need to >> know >> how many there will be in advance. (Does this version of firmware count from >> 0 >> or 1?) >> >> ... but CPER also gives us a 'Card Handle' and 'Module Handle'. >> 'Module Handle' maps to SMBIOS:17 Memory Device (aka, a DIMM). The Handle is >> a >> word-value in the structure, so it doesn't depend on the layout/parse-order >> of >> the SMBIOS tables. When we count the DIMMs in edac-ghes we can give them some >> level-idx, then use the handle to find which level-idx to use for this DIMM. >> >> ghes_edac_report_mem_error() already picks up the module-handle, but only >> uses >> it to print the bank/device. >> >> 'Card' doesn't mean much to me, but it maps to SMBIOS:16 "Memory Array >> Structure", which the Memory Device structure also points to. >> Card then must mean "a collection of memory devices (DIMMs) that operate >> together to form an address space". >> >> This might be what I think of as a memory-controller, or it might be >> something >> more complicated. Regardless, the CPER records think it's relevant. >> >> For the edac:layers, we could walk the DMI table to find these structures, >> and >> build the layers from them. If the Memory-array-structures are missing, we >> can >> use the existing 1:NUM_DIMMS approach.
> I think the proper way to get this working would be to use these handles. We > can > avoid populating this layer information and instead have a mapping of type 17 > index number (how edac is numbering the DIMMs today) to the handle number. Why avoid the layer stuff? Isn't counting DIMM/memory-devices what EDAC_MC_LAYER_SLOT is for? > Then we will need a new function to increment the counter based on the handle > number rather than this layer information. Is that how you are envisioning it? I'm not familiar with edac's internals, so I didn't have any particular vision! Isn't the problem that ghes_edac_report_mem_error() does this: | e->top_layer = -1; | e->mid_layer = -1; | e->low_layer = -1; so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't read what it does with this information yet). ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if it's set, it uses the handle to find the bank/device strings and prints them out. Naively I thought we could generate some index during ghes_edac_count_dimms(), and use this as e->${whichever}_layer. I hoped there would be something we could already use as the index, but I can't spot it, so this will be more than the one-liner I was hoping for! Thanks, James
[RFC PATCH 12/20] x86/intel_rdt: Correct the closid when staging configuration changes
Now that apply_config() and update_domains() know the code/data/both value of what they are writing, and ctrl_val is correctly sized: use the hardware closid slot, based on the configuration type. This means cbm_idx() and its illusionary cache-properties can go. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 18 +--- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 32 ++--- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c| 2 +- include/linux/resctrl.h | 6 ++-- 4 files changed, 25 insertions(+), 33 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 8d3544b6c149..6466c172c045 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -75,8 +75,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 3, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 1, - .cbm_idx_offset = 0, }, .domains= domain_init(RDT_RESOURCE_L3), .parse_ctrlval = parse_cbm, @@ -95,8 +93,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 3, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 2, - .cbm_idx_offset = 0, }, .domains= domain_init(RDT_RESOURCE_L3DATA), .parse_ctrlval = parse_cbm, @@ -116,8 +112,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 3, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 2, - .cbm_idx_offset = 1, }, .domains= domain_init(RDT_RESOURCE_L3CODE), .parse_ctrlval = parse_cbm, @@ -136,8 +130,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 2, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 1, - .cbm_idx_offset = 0, }, .domains= domain_init(RDT_RESOURCE_L2), .parse_ctrlval = parse_cbm, @@ -156,8 +148,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 2, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 2, - .cbm_idx_offset = 0, }, .domains= domain_init(RDT_RESOURCE_L2DATA), .parse_ctrlval = parse_cbm, @@ -176,8 +166,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .cache_level= 2, .cache = { .min_cbm_bits = 1, - .cbm_idx_mult = 2, - .cbm_idx_offset = 1, }, .domains= domain_init(RDT_RESOURCE_L2CODE), .parse_ctrlval = parse_cbm, @@ -204,10 +192,6 @@ struct rdt_hw_resource rdt_resources_all[] = { }, }; -static unsigned int cbm_idx(struct rdt_resource *r, unsigned int closid) -{ - return closid * r->cache.cbm_idx_mult + r->cache.cbm_idx_offset; -} /* * cache_alloc_hsw_probe() - Have to probe for Intel haswell server CPUs @@ -408,7 +392,7 @@ cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); for (i = m->low; i < m->high; i++) - wrmsrl(hw_res->msr_base + cbm_idx(r, i), hw_dom->ctrl_val[i]); + wrmsrl(hw_res->msr_base + i, hw_dom->ctrl_val[i]); } struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index bab6032704c3..05c14d9f797c 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -28,6 +28,16 @@ #include #include "intel_rdt.h" +static u32 resctrl_closid_cdp_map(u32 closid, enum resctrl_conf_type t) +{ + if (t == CDP_CODE) + return (closid * 2) + 1; + else if (t == CDP_DATA) + return (closid * 2); +
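The hunk above is cut off mid-function. For readability, here is the complete mapping it plausibly introduces, reconstructed from the two visible cases; the final fall-through is an assumption, not part of the quoted diff:

/*
 * Reconstruction: with CDP enabled, resctrl closid N uses hardware
 * slots 2N (data) and 2N+1 (code). The last return is assumed for
 * the no-CDP/'both' case.
 */
static u32 resctrl_closid_cdp_map(u32 closid, enum resctrl_conf_type t)
{
	if (t == CDP_CODE)
		return (closid * 2) + 1;
	else if (t == CDP_DATA)
		return (closid * 2);

	return closid;	/* assumed: CDP disabled, use the slot as-is */
}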
[RFC PATCH 13/20] x86/intel_rdt: Allow different CODE/DATA configurations to be staged
Now that the staged configuration holds its CDP type and hardware closid, allow resctrl to stage more than one configuration at a time for a single resource. To detect the same schema being specified twice when the schemata file is written, the same slot in the staged_config array must be used for each schema. Use the cdp_type enum directly as an index. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 4 ++-- include/linux/resctrl.h | 4 +++- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index 05c14d9f797c..f80a838cc36d 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -78,7 +78,7 @@ int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d, enum resctrl_conf_type t, u32 closid) { unsigned long data; - struct resctrl_staged_config *cfg = &d->staged_config[0]; + struct resctrl_staged_config *cfg = &d->staged_config[t]; if (cfg->have_new_ctrl) { rdt_last_cmd_printf("duplicate domain %d\n", d->id); @@ -144,7 +144,7 @@ int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d, enum resctrl_conf_type t, u32 closid) { unsigned long data; - struct resctrl_staged_config *cfg = &d->staged_config[0]; + struct resctrl_staged_config *cfg = &d->staged_config[t]; if (cfg->have_new_ctrl) { rdt_last_cmd_printf("duplicate domain %d\n", d->id); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index dad266f9b0fe..ede5c40756b4 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -12,6 +12,8 @@ enum resctrl_conf_type { CDP_CODE, CDP_DATA, }; +#define NUM_CDP_TYPES CDP_DATA + 1 + /** * struct resctrl_staged_config - parsed configuration to be applied @@ -39,7 +41,7 @@ struct rdt_domain { int id; struct cpumask cpu_mask; - struct resctrl_staged_config staged_config[1]; + struct resctrl_staged_config staged_config[NUM_CDP_TYPES]; }; /** -- 2.18.0
[RFC PATCH 02/20] x86/intel_rdt: Split struct rdt_domain
resctrl is the de facto Linux ABI for SoC resource partitioning features. To support it on another architecture, we need to abstract it from Intel RDT, and move it to /fs/. Split struct rdt_domain up too. Move everything that is particular to resctrl into a new header file. resctrl code paths touching a 'hw' struct indicate where an abstraction is needed. No change in behaviour; this patch just moves types around. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 87 +++-- arch/x86/kernel/cpu/intel_rdt.h | 30 --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 12 ++- arch/x86/kernel/cpu/intel_rdt_monitor.c | 55 +++-- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 14 +++- include/linux/resctrl.h | 17 +++- 6 files changed, 127 insertions(+), 88 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 8cb2639b8a56..c4e6dcdd235b 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -377,21 +377,23 @@ static void mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { unsigned int i; + struct rdt_hw_domain *hw_dom = rc_dom_to_rdt(d); struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); /* Write the delay values for mba. */ for (i = m->low; i < m->high; i++) - wrmsrl(hw_res->msr_base + i, delay_bw_map(d->ctrl_val[i], r)); + wrmsrl(hw_res->msr_base + i, delay_bw_map(hw_dom->ctrl_val[i], r)); } static void cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r) { unsigned int i; + struct rdt_hw_domain *hw_dom = rc_dom_to_rdt(d); struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); for (i = m->low; i < m->high; i++) - wrmsrl(hw_res->msr_base + cbm_idx(r, i), d->ctrl_val[i]); + wrmsrl(hw_res->msr_base + cbm_idx(r, i), hw_dom->ctrl_val[i]); } struct rdt_domain *get_domain_from_cpu(int cpu, struct rdt_resource *r) @@ -476,21 +478,22 @@ void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm) static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) { struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); + struct rdt_hw_domain *hw_dom = rc_dom_to_rdt(d); struct msr_param m; u32 *dc, *dm; - dc = kmalloc_array(r->num_closid, sizeof(*d->ctrl_val), GFP_KERNEL); + dc = kmalloc_array(r->num_closid, sizeof(*hw_dom->ctrl_val), GFP_KERNEL); if (!dc) return -ENOMEM; - dm = kmalloc_array(r->num_closid, sizeof(*d->mbps_val), GFP_KERNEL); + dm = kmalloc_array(r->num_closid, sizeof(*hw_dom->mbps_val), GFP_KERNEL); if (!dm) { kfree(dc); return -ENOMEM; } - d->ctrl_val = dc; - d->mbps_val = dm; + hw_dom->ctrl_val = dc; + hw_dom->mbps_val = dm; setup_default_ctrlval(r, dc, dm); m.low = 0; @@ -502,36 +505,37 @@ static int domain_setup_ctrlval(struct rdt_resource *r, struct rdt_domain *d) static int domain_setup_mon_state(struct rdt_resource *r, struct rdt_domain *d) { size_t tsize; + struct rdt_hw_domain *hw_dom = rc_dom_to_rdt(d); if (is_llc_occupancy_enabled()) { - d->rmid_busy_llc = kcalloc(BITS_TO_LONGS(r->num_rmid), + hw_dom->rmid_busy_llc = kcalloc(BITS_TO_LONGS(r->num_rmid), sizeof(unsigned long), GFP_KERNEL); - if (!d->rmid_busy_llc) + if (!hw_dom->rmid_busy_llc) return -ENOMEM; - INIT_DELAYED_WORK(&d->cqm_limbo, cqm_handle_limbo); + INIT_DELAYED_WORK(&hw_dom->cqm_limbo, cqm_handle_limbo); } if (is_mbm_total_enabled()) { - tsize = sizeof(*d->mbm_total); - d->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL); - if (!d->mbm_total) { - kfree(d->rmid_busy_llc); + tsize = sizeof(*hw_dom->mbm_total); + hw_dom->mbm_total = kcalloc(r->num_rmid, tsize, GFP_KERNEL);
+ if (!hw_dom->mbm_total) { + kfree(hw_dom->rmid_busy_llc); return -ENOMEM; } } if (is_mbm_local_enabled()) { - tsize = sizeof(*d->mbm_local); - d->mbm_local = kcalloc(r->num_rmid, tsize, GFP_KERNEL); - if (!d->mbm_local) { - kfree(d->rmid_busy_llc); - kfree(d->mbm_total); + tsize = sizeof(*hw_dom->mbm_local);
[RFC PATCH 01/20] x86/intel_rdt: Split struct rdt_resource
resctrl is the de facto Linux ABI for SoC resource partitioning features. To support it on another architecture, we need to abstract it from Intel RDT, and move it to /fs/. Let's start by splitting struct rdt_resource (the name is kept for now to keep the noise down), and add some type-trickery to keep the foreach helpers working. Move everything that is particular to resctrl into a new header file, keeping the x86 msr specific stuff where it is. resctrl code paths touching a 'hw' struct indicate where an abstraction is needed. We split rdt_domain up in a similar way in the next patch. No change in behaviour; this patch just moves types around. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 193 +++- arch/x86/kernel/cpu/intel_rdt.h | 112 +++- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 6 +- arch/x86/kernel/cpu/intel_rdt_monitor.c | 23 ++- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 37 ++-- include/linux/resctrl.h | 103 +++ 6 files changed, 275 insertions(+), 199 deletions(-) create mode 100644 include/linux/resctrl.h diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index ec4754f81cbd..8cb2639b8a56 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -64,122 +64,137 @@ mba_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); static void cat_wrmsr(struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); -#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].domains) +#define domain_init(id) LIST_HEAD_INIT(rdt_resources_all[id].resctrl.domains) -struct rdt_resource rdt_resources_all[] = { +struct rdt_hw_resource rdt_resources_all[] = { [RDT_RESOURCE_L3] = { .rid= RDT_RESOURCE_L3, - .name = "L3", - .domains= domain_init(RDT_RESOURCE_L3), + .resctrl = { + .name = "L3", + .cache_level= 3, + .cache = { + .min_cbm_bits = 1, + .cbm_idx_mult = 1, + .cbm_idx_offset = 0, + }, + .domains= domain_init(RDT_RESOURCE_L3), + .parse_ctrlval = parse_cbm, + .format_str = "%d=%0*x", + .fflags = RFTYPE_RES_CACHE, + }, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, - .cache_level= 3, - .cache = { - .min_cbm_bits = 1, - .cbm_idx_mult = 1, - .cbm_idx_offset = 0, - }, - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, }, [RDT_RESOURCE_L3DATA] = { .rid= RDT_RESOURCE_L3DATA, - .name = "L3DATA", - .domains= domain_init(RDT_RESOURCE_L3DATA), + .resctrl = { + .name = "L3DATA", + .cache_level= 3, + .cache = { + .min_cbm_bits = 1, + .cbm_idx_mult = 2, + .cbm_idx_offset = 0, + }, + .domains= domain_init(RDT_RESOURCE_L3DATA), + .parse_ctrlval = parse_cbm, + .format_str = "%d=%0*x", + .fflags = RFTYPE_RES_CACHE, + }, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, - .cache_level= 3, - .cache = { - .min_cbm_bits = 1, - .cbm_idx_mult = 2, - .cbm_idx_offset = 0, - }, - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, + }, [RDT_RESOURCE_L3CODE] = { .rid= RDT_RESOURCE_L3CODE, - .name = "L3CODE", - .domains= domain_init(RDT_RESOURCE_L3CODE), + .resctrl = { + .name
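The 'type-trickery' the commit message mentions is presumably the usual embedded-struct pattern: the arch-private struct wraps the fs-visible one, so helpers can convert between the two without a lookup. A sketch of what resctrl_to_rdt() amounts to, inferred from the diff's usage rather than copied from the patch:

/*
 * Sketch: struct rdt_hw_resource embeds the fs-visible struct
 * rdt_resource as its 'resctrl' member (as the initialisers above
 * show), so recovering the wrapper is just container_of().
 */
static inline struct rdt_hw_resource *resctrl_to_rdt(struct rdt_resource *r)
{
	return container_of(r, struct rdt_hw_resource, resctrl);
}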
[RFC PATCH 06/20] x86/intel_rdt: Add a helper to read a closid's configuration for show_doms()
The configuration values used by the arch code may not be the same as the bitmaps generated by resctrl; resctrl shouldn't read or write them directly. update_domains() and the staged config are suitable for letting the arch code perform any conversion. Add a helper to read the current configuration. This will allow another architecture to scale the bitmaps if necessary, and possibly use controls that don't take a bitmap at all. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 17 + arch/x86/kernel/cpu/intel_rdt_monitor.c | 2 +- include/linux/resctrl.h | 3 +++ 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index 01ffd455313a..ec3c15ee3473 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -322,21 +322,30 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, return ret ?: nbytes; } +void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, + u32 closid, u32 *value) +{ + struct rdt_hw_domain *hw_dom = rc_dom_to_rdt(d); + + if (!is_mba_sc(r)) + *value = hw_dom->ctrl_val[closid]; + else + *value = hw_dom->mbps_val[closid]; +} + static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid) { - struct rdt_hw_domain *hw_dom; + struct rdt_domain *dom; bool sep = false; u32 ctrl_val; seq_printf(s, "%*s:", max_name_width, r->name); list_for_each_entry(dom, &r->domains, list) { - hw_dom = rc_dom_to_rdt(dom); if (sep) seq_puts(s, ";"); - ctrl_val = (!is_mba_sc(r) ? hw_dom->ctrl_val[closid] : - hw_dom->mbps_val[closid]); + resctrl_arch_get_config(r, dom, closid, &ctrl_val); seq_printf(s, r->format_str, dom->id, max_data_width, ctrl_val); sep = true; diff --git a/arch/x86/kernel/cpu/intel_rdt_monitor.c b/arch/x86/kernel/cpu/intel_rdt_monitor.c index c05f1cecf6cd..42ddcefc7065 100644 --- a/arch/x86/kernel/cpu/intel_rdt_monitor.c +++ b/arch/x86/kernel/cpu/intel_rdt_monitor.c @@ -390,7 +390,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm) hw_dom_mba = rc_dom_to_rdt(dom_mba); cur_bw = pmbm_data->prev_bw; - user_bw = hw_dom_mba->mbps_val[closid]; + resctrl_arch_get_config(r_mba, dom_mba, closid, &user_bw); delta_bw = pmbm_data->delta_bw; cur_msr_val = hw_dom_mba->ctrl_val[closid]; diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 370db085ee77..03d9fbc230af 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -125,4 +125,7 @@ struct rdt_resource { }; +void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, +u32 closid, u32 *value); + #endif /* __LINUX_RESCTRL_H */ -- 2.18.0
[RFC PATCH 08/20] x86/intel_rdt: Make cdp enable/disable global
If the CPU supports Intel's Code and Data Prioritization (CDP), software can specify a separate bitmap for code and data. This feature needs enabling in a model-specific-register, and changes the properties of the cache-controls: it halves the effective number of closids. This changes how closids are allocated, and so applies to all alloc_enabled caches. If a system has multiple levels of RDT-like controls CDP should be enabled/disabled across them all. Make the CDP enable/disable calls global. Add CDP capable/enabled flags, and unify the enable/disable behind a single resctrl_arch_set_cdp_enabled(true/false) call. Architectures that have nothing to do here can just update the flags. This subtly changes resctrl's '-o cdp' (l3) and '-o cdpl2' parameters to mean enable globally if this level supports cdp. The difference can't be seen on a system which only has one of the two. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 1 + arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 72 include/linux/resctrl.h | 7 +++ 3 files changed, 57 insertions(+), 23 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index c4e6dcdd235b..0e651447956e 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -331,6 +331,7 @@ static void rdt_get_cdp_config(int level, int type) * By default, CDP is disabled. CDP can be enabled by mount parameter * "cdp" during resctrl file system mount time. */ + r_l->cdp_capable = true; r->alloc_enabled = false; } diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 3ed88d4fedd0..f4f76c193495 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -1081,7 +1081,7 @@ static int cdp_enable(int level, int data_type, int code_type) int ret; if (!r_l->alloc_capable || !r_ldata->alloc_capable || - !r_lcode->alloc_capable) + !r_lcode->alloc_capable || !r_l->cdp_capable) return -EINVAL; ret = set_cache_qos_cfg(level, true); @@ -1089,51 +1089,77 @@ static int cdp_enable(int level, int data_type, int code_type) r_l->alloc_enabled = false; r_ldata->alloc_enabled = true; r_lcode->alloc_enabled = true; + + r_l->cdp_enabled = true; + r_ldata->cdp_enabled = true; + r_lcode->cdp_enabled = true; } return ret; } -static int cdpl3_enable(void) -{ - return cdp_enable(RDT_RESOURCE_L3, RDT_RESOURCE_L3DATA, - RDT_RESOURCE_L3CODE); -} - -static int cdpl2_enable(void) -{ - return cdp_enable(RDT_RESOURCE_L2, RDT_RESOURCE_L2DATA, - RDT_RESOURCE_L2CODE); -} - static void cdp_disable(int level, int data_type, int code_type) { struct rdt_resource *r = &rdt_resources_all[level].resctrl; + if (!r->cdp_enabled) + return; + r->alloc_enabled = r->alloc_capable; if (rdt_resources_all[data_type].resctrl.alloc_enabled) { rdt_resources_all[data_type].resctrl.alloc_enabled = false; rdt_resources_all[code_type].resctrl.alloc_enabled = false; set_cache_qos_cfg(level, false); + + r->cdp_enabled = false; + rdt_resources_all[data_type].resctrl.cdp_enabled = false; + rdt_resources_all[code_type].resctrl.cdp_enabled = false; } } -static void cdpl3_disable(void) +int resctrl_arch_set_cdp_enabled(bool enable) { - cdp_disable(RDT_RESOURCE_L3, RDT_RESOURCE_L3DATA, RDT_RESOURCE_L3CODE); + int ret = -EINVAL; + struct rdt_hw_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3]; + struct rdt_hw_resource *l2 = &rdt_resources_all[RDT_RESOURCE_L2]; + + if (l3 && l3->resctrl.cdp_capable) { + if (!enable) { + cdp_disable(RDT_RESOURCE_L3, RDT_RESOURCE_L3DATA, + 
RDT_RESOURCE_L3CODE); + ret = 0; + } else { + ret = cdp_enable(RDT_RESOURCE_L3, RDT_RESOURCE_L3DATA, + RDT_RESOURCE_L3CODE); + } + } + if (l2 && l2->resctrl.cdp_capable) { + if (!enable) { + cdp_disable(RDT_RESOURCE_L2, RDT_RESOURCE_L2DATA, + RDT_RESOURCE_L2CODE); + ret = 0; + } else { + ret = cdp_enable(RDT_RESOURCE_L2, RDT_RESOURCE_L2DATA, + RDT_RESOURCE_L2CODE); + } + } + + return ret; } -st
[RFC PATCH 00/20] x86/intel_rdt: Start abstraction for a second arch
Hi folks, ARM have some upcoming CPU features that are similar to Intel RDT. Resctrl is the de facto ABI for this sort of thing, but it lives under arch/x86. To get existing software working, we need to make resctrl work with arm64. This series is the first chunk of that. The aim is to move the filesystem/ABI parts into /fs/resctrl, and implement a second arch backend. What are the ARM features? Future ARM SoCs may have a feature called MPAM: Memory Partitioning and Monitoring. This is an umbrella term like RDT, and covers a range of controls (like CAT) and monitors (like MBM, CMT). This series is almost all about CDP. MPAM has equivalent functionality, but it doesn't need enabling, and doesn't affect the available closids. (I'll try and use Intel terms). MPAM expects the equivalent of the IA32_PQR_ASSOC MSR to be configured with an Instruction closid and a Data closid. These are the same for no-CDP, and different otherwise. There is no need for them to be adjacent. To avoid emulating CDP in arm64's arch code, this series moves all the ABI parts of the CDP behaviour (half the closid-space, each having two configurations) into the filesystem parts of resctrl. These will eventually be moved to /fs/. MPAM's control and monitor configuration is all memory-mapped; the base addresses are discovered via firmware tables, so we won't have a table of possible resources that just need alloc_enabling. Is this it? No... there are another two series of a similar size that abstract the MBM/CMT overflow threads and avoid 'fs' code accessing things that have moved into the 'hw' arch specific struct. I'm after feedback on the general approach taken here, on bugs (as there are certainly subtleties I've missed), and on any strong opinions about what should be arch-specific and what shouldn't. This series is based on v4.18, and can be retrieved from: git://linux-arm.org/linux-jm.git -b mpam/resctrl_rework/rfc_1 Thanks, James Morse (20): x86/intel_rdt: Split struct rdt_resource x86/intel_rdt: Split struct rdt_domain x86/intel_rdt: Group staged configuration into a separate struct x86/intel_rdt: Add closid to the staged config x86/intel_rdt: make update_domains() learn the affected closids x86/intel_rdt: Add a helper to read a closid's configuration for show_doms() x86/intel_rdt: Expose update_domains() as an arch helper x86/intel_rdt: Make cdp enable/disable global x86/intel_rdt: Track the actual number of closids separately x86/intel_rdt: Let resctrl change the resource's num_closid x86/intel_rdt: Pass in the code/data/both configuration value when parsing x86/intel_rdt: Correct the closid when staging configuration changes x86/intel_rdt: Allow different CODE/DATA configurations to be staged x86/intel_rdt: Add a separate resource list for resctrl x86/intel_rdt: Walk the resctrl schema list instead of the arch's resource list x86/intel_rdt: Move the schemata names into struct resctrl_schema x86/intel_rdt: Stop using Lx CODE/DATA resources x86/intel_rdt: Remove the CODE/DATA illusionary caches x86/intel_rdt: Kill off alloc_enabled x86/intel_rdt: Merge cdp enable/disable calls arch/x86/kernel/cpu/intel_rdt.c | 298 +++- arch/x86/kernel/cpu/intel_rdt.h | 161 --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 142 +++--- arch/x86/kernel/cpu/intel_rdt_monitor.c | 78 ++--- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 216 +- include/linux/resctrl.h | 166 +++ 6 files changed, 621 insertions(+), 440 deletions(-) create mode 100644 include/linux/resctrl.h -- 2.18.0
[RFC PATCH 05/20] x86/intel_rdt: make update_domains() learn the affected closids
Now that the closid is present in the staged configuration, update_domains() can learn which low/high values it should update. Remove the single passed in closid, and update msr_param as we apply each staged config. Once the L2/L2CODE/L2DATA resources are merged this will allow update_domains() to be called once for the single resource, even when CDP is in use. This results in both CODE and DATA configurations being applied and the two consecutive closids being updated with a single smp_call_function_many(). This will let us keep the CDP odd/even behaviour inside resctrl so that architectures that don't do this don't need to emulate it. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.h | 4 ++-- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 21 - 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index 5e271e0fe1f5..8df549ef016d 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -241,8 +241,8 @@ static inline struct rdt_hw_domain *rc_dom_to_rdt(struct rdt_domain *r) */ struct msr_param { struct rdt_resource *res; - int low; - int high; + u32 low; + u32 high; }; static inline bool is_llc_occupancy_enabled(void) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index 0c849653a99d..01ffd455313a 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -193,22 +193,21 @@ static void apply_config(struct rdt_hw_domain *hw_dom, } } -static int update_domains(struct rdt_resource *r, int closid) +static int update_domains(struct rdt_resource *r) { struct resctrl_staged_config *cfg; struct rdt_hw_domain *hw_dom; + bool msr_param_init = false; struct msr_param msr_param; cpumask_var_t cpu_mask; struct rdt_domain *d; bool mba_sc; + u32 closid; int i, cpu; if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM; - /* TODO: learn these two by looping the config */ - msr_param.low = closid; - msr_param.high = msr_param.low + 1; msr_param.res = r; mba_sc = is_mba_sc(r); @@ -220,9 +219,21 @@ static int update_domains(struct rdt_resource *r, int closid) continue; apply_config(hw_dom, cfg, cpu_mask, mba_sc); + + closid = cfg->closid; + if (!msr_param_init) { + msr_param.low = closid; + msr_param.high = closid; + msr_param_init = true; + } else { + msr_param.low = min(msr_param.low, closid); + msr_param.high = max(msr_param.high, closid); + } } } + msr_param.high += 1; + /* * Avoid writing the control msr with control values when * MBA software controller is enabled @@ -301,7 +312,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, } for_each_alloc_enabled_rdt_resource(r) { - ret = update_domains(r, closid); + ret = update_domains(r); if (ret) goto out; } -- 2.18.0
[RFC PATCH 09/20] x86/intel_rdt: Track the actual number of closids separately
num_closid is different for the illusionary CODE/DATA caches, and these resources' ctrlval arrays are sized on this parameter. When it comes to writing the configuration values into hardware, a correction is applied. The next step in moving this behaviour into the resctrl code is to make the arch code always work with the full range of closids, and size its ctrlval arrays based on this number. This means another architecture doesn't need to emulate CDP. Add a separate hw_num_closid field to hold this and use it in the arch code. The CODE/DATA caches use the full range for their hardware struct, but the half-sized version for the resctrl-visible part. This means the ctrlval array is the full size, but only the first half is used. A later patch will correct the closid when the configuration is written, at which point we can merge the illusionary caches. A short-lived quirk of this: when a resource is reset(), both the code and data illusionary caches reset the full closid range. This disappears in a later patch that merges the caches together. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 19 ++- arch/x86/kernel/cpu/intel_rdt.h | 2 ++ arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 3 ++- 3 files changed, 18 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 0e651447956e..c035280b4398 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -223,7 +223,8 @@ static unsigned int cbm_idx(struct rdt_resource *r, unsigned int closid) */ static inline void cache_alloc_hsw_probe(void) { - struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].resctrl; + struct rdt_hw_resource *hw_res = &rdt_resources_all[RDT_RESOURCE_L3]; + struct rdt_resource *r = &hw_res->resctrl; u32 l, h, max_cbm = BIT_MASK(20) - 1; if (wrmsr_safe(IA32_L3_CBM_BASE, max_cbm, 0)) @@ -235,6 +236,7 @@ static inline void cache_alloc_hsw_probe(void) return; r->num_closid = 4; + hw_res->hw_num_closid = 4; r->default_ctrl = max_cbm; r->cache.cbm_len = 20; r->cache.shareable_bits = 0xc; @@ -276,12 +278,14 @@ static inline bool rdt_get_mb_table(struct rdt_resource *r) static bool rdt_get_mem_config(struct rdt_resource *r) { + struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); union cpuid_0x10_3_eax eax; union cpuid_0x10_x_edx edx; u32 ebx, ecx; cpuid_count(0x0010, 3, &eax.full, &ebx, &ecx, &edx.full); r->num_closid = edx.split.cos_max + 1; + hw_res->hw_num_closid = r->num_closid; r->membw.max_delay = eax.split.max_delay + 1; r->default_ctrl = MAX_MBA_BW; if (ecx & MBA_IS_LINEAR) { @@ -302,12 +306,14 @@ static bool rdt_get_mem_config(struct rdt_resource *r) static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) { + struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); union cpuid_0x10_1_eax eax; union cpuid_0x10_x_edx edx; u32 ebx, ecx; cpuid_count(0x0010, idx, &eax.full, &ebx, &ecx, &edx.full); r->num_closid = edx.split.cos_max + 1; + hw_res->hw_num_closid = r->num_closid; r->cache.cbm_len = eax.split.cbm_len + 1; r->default_ctrl = BIT_MASK(eax.split.cbm_len + 1) - 1; r->cache.shareable_bits = ebx & r->default_ctrl; @@ -319,9 +325,11 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) static void rdt_get_cdp_config(int level, int type) { struct rdt_resource *r_l = &rdt_resources_all[level].resctrl; - struct rdt_resource *r = &rdt_resources_all[type].resctrl; + struct rdt_hw_resource *hw_res_t = &rdt_resources_all[type]; + struct rdt_resource *r = &hw_res_t->resctrl; r->num_closid = r_l->num_closid / 2; + 
hw_res_t->hw_num_closid = r_l->num_closid; r->cache.cbm_len = r_l->cache.cbm_len; r->default_ctrl = r_l->default_ctrl; r->cache.shareable_bits = r_l->cache.shareable_bits; @@ -463,6 +471,7 @@ struct rdt_domain *rdt_find_domain(struct rdt_resource *r, int id, void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm) { int i; + struct rdt_hw_resource *hw_res = resctrl_to_rdt(r); /* * Initialize the Control MSRs to having no control. @@ -470,7 +479,7 @@ void setup_default_ctrlval(struct rdt_resource *r, u32 *dc, u32 *dm) * For Memory Allocation: Set b/w requested to 100% * and the bandwidth in MBps to U32_MAX */ - for (i = 0; i < r->num_closid; i++, dc++, dm++) { + for (i = 0; i < hw_res->hw_num_closid; i++, dc++, dm++) { *dc = r->default_ctrl;
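To put numbers on the description above (values illustrative): on a system advertising 16 closids for L3, enabling CDP leaves hw_num_closid = 16 on the hardware structs, while the L3CODE/L3DATA resources expose num_closid = 8 to resctrl. The ctrlval arrays are always allocated for 16 entries; until the closid correction lands, only the first 8 entries of each illusionary cache are used.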
[RFC PATCH 10/20] x86/intel_rdt: Let resctrl change the resources' num_closid
Today we switch between different alloc_enabled resources, which have differing preset num_closid to account for CDP. We want to merge these illusionary caches together, at which point something needs to change resctrl's view of num_closid. The arch code now has its own idea of how many closids there are, and as the two configurations for one rdtgroup are part of resctrl's ABI, we should get resctrl to change it. We change the num_closid on the L2/L3 resources, which aren't yet in use when CDP is enabled, then change them back afterwards. Once we merge the illusionary caches, resctrl will see the value it changed here. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 22 +- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 58dceaad6863..e2a9202674f3 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -1149,16 +1149,36 @@ int resctrl_arch_set_cdp_enabled(bool enable) static int try_to_enable_cdp(int level) { + int ret; struct rdt_resource *r = &rdt_resources_all[level].resctrl; + struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].resctrl; + struct rdt_resource *l2 = &rdt_resources_all[RDT_RESOURCE_L2].resctrl; if (!r->cdp_capable) return -EINVAL; + if (r->cdp_enabled) + return 0; - return resctrl_arch_set_cdp_enabled(true); + ret = resctrl_arch_set_cdp_enabled(true); + if (!ret) { + if (l2->cdp_enabled) + l2->num_closid /= 2; + if (l3->cdp_enabled) + l3->num_closid /= 2; + } + + return ret; } static void cdp_disable_all(void) { + struct rdt_resource *l2 = &rdt_resources_all[RDT_RESOURCE_L2].resctrl; + struct rdt_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3].resctrl; + + if (l2->cdp_enabled) + l2->num_closid *= 2; + if (l3->cdp_enabled) + l3->num_closid *= 2; resctrl_arch_set_cdp_enabled(false); } -- 2.18.0
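The user-visible effect is the existing resctrl ABI (paths are real, the values are illustrative):

# mount -t resctrl resctrl -o cdp /sys/fs/resctrl
# cat /sys/fs/resctrl/info/L3CODE/num_closids
8
# umount /sys/fs/resctrl
# mount -t resctrl resctrl /sys/fs/resctrl
# cat /sys/fs/resctrl/info/L3/num_closids
16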
[RFC PATCH 17/20] x86/intel_rdt: Stop using Lx CODE/DATA resources
Now that CDP enable/disable is global, and the closid offset correction is based on the configuration being applied, we can use the same Lx resource twice for CDP's CODE/DATA schema. This keeps the illusion of separate caches in the resctrl code. When CDP is enabled for a cache, create two schemata, generating the names and setting the configuration type. We can now remove the initialisation of the illusionary hw_resources: 'cdp_capable' just requires setting a flag; resctrl knows what to do from there. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 49 ++-- arch/x86/kernel/cpu/intel_rdt.h | 1 - arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 98 +--- 3 files changed, 58 insertions(+), 90 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 3a0d7de15afa..96b1aab36053 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -81,7 +81,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_BOTH, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -99,7 +98,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_DATA, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, @@ -118,7 +116,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_CODE, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -136,7 +133,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_BOTH, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -154,7 +150,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_DATA, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -172,7 +167,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, - .cdp_type = CDP_CODE, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -312,39 +306,6 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) r->alloc_enabled = true; } -static void rdt_get_cdp_config(int level, int type) -{ - struct rdt_resource *r_l = &rdt_resources_all[level].resctrl; - struct rdt_hw_resource *hw_res_t = &rdt_resources_all[type]; - struct rdt_resource *r = &hw_res_t->resctrl; - - r->num_closid = r_l->num_closid / 2; - hw_res_t->hw_num_closid = r_l->num_closid; - r->cache.cbm_len = r_l->cache.cbm_len; - r->default_ctrl = r_l->default_ctrl; - r->cache.shareable_bits = r_l->cache.shareable_bits; - r->data_width = (r->cache.cbm_len + 3) / 4; - r->alloc_capable = true; - /* -* By default, CDP is disabled. CDP can be enabled by mount parameter -* "cdp" during resctrl file system mount time. -*/ - r_l->cdp_capable = true; - r->alloc_enabled = false; -} - -static void rdt_get_cdp_l3_config(void) -{ - rdt_get_cdp_config(RDT_RESOURCE_L3, RDT_RESOURCE_L3DATA); - rdt_get_cdp_config(RDT_RESOURCE_L3, RDT_RESOURCE_L3CODE); -} - -static void rdt_get_cdp_l2_config(void) -{ - rdt_get_cdp_config(RDT_RESOURCE_L2, RDT_RESOURCE_L2DATA); - rdt_get_cdp_config(RDT_RESOURCE_L2, RDT_RESOURCE_L2CODE); -} - static int get_cache_id(int cpu, int level) { struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu); @@ -813,6 +774,8 @@ static bool __init rdt_cpu_has(int flag) static __init bool get_rdt_alloc_resources(void) {
[RFC PATCH 16/20] x86/intel_rdt: Move the schemata names into struct resctrl_schema
Move the names used for the schemata file out of the resource and into struct resctrl_schema. This lets us give one resource two different names, based on the other schema properties. For now we copy the name; once we merge the L2/L2CODE/L2DATA resources, resctrl will generate it. Remove the arch code's max_name_width; this is now resctrl's problem. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 9 ++--- arch/x86/kernel/cpu/intel_rdt.h | 2 +- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 4 ++-- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +++- include/linux/resctrl.h | 7 +++ 5 files changed, 15 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 6466c172c045..3a0d7de15afa 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -48,10 +48,10 @@ DEFINE_MUTEX(rdtgroup_mutex); DEFINE_PER_CPU(struct intel_pqr_state, pqr_state); /* - * Used to store the max resource name width and max resource data width + * Used to store the max resource data width * to display the schemata in a tabular format */ -int max_name_width, max_data_width; +int max_data_width; /* * Global boolean for rdt_alloc which is true if any @@ -722,13 +722,8 @@ static int intel_rdt_offline_cpu(unsigned int cpu) static __init void rdt_init_padding(void) { struct rdt_resource *r; - int cl; for_each_alloc_capable_rdt_resource(r) { - cl = strlen(r->name); - if (cl > max_name_width) - max_name_width = cl; - if (r->data_width > max_data_width) max_data_width = r->data_width; } diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index cc8dea58b74f..b72448186532 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -146,7 +146,7 @@ struct rdtgroup { /* List of all resource groups */ extern struct list_head rdt_all_groups; -extern int max_name_width, max_data_width; +extern int max_data_width; int __init rdtgroup_init(void); diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index 3038ecfdeec0..e8264637a4d3 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -275,7 +275,7 @@ static int rdtgroup_parse_resource(char *resname, char *tok, int closid) list_for_each_entry(s, &resctrl_all_schema, list) { r = s->res; - if (!strcmp(resname, r->name) && closid < r->num_closid) + if (!strcmp(resname, s->name) && closid < r->num_closid) return parse_line(tok, r, s->conf_type, closid); } rdt_last_cmd_printf("unknown/unsupported resource name '%s'\n", resname); @@ -358,7 +358,7 @@ static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int clo bool sep = false; u32 ctrl_val, hw_closid; - seq_printf(s, "%*s:", max_name_width, r->name); + seq_printf(s, "%*s:", sizeof(schema->name), schema->name); list_for_each_entry(dom, &r->domains, list) { if (sep) seq_puts(s, ";"); diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 0bd748defc73..b3d3acbb2ef7 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -932,7 +932,7 @@ static int rdtgroup_create_info_dir(struct kernfs_node *parent_kn) list_for_each_entry(s, &resctrl_all_schema, list) { r = s->res; fflags = r->fflags | RF_CTRL_INFO; - ret = rdtgroup_mkdir_info_resdir(r, r->name, fflags); + ret = rdtgroup_mkdir_info_resdir(r, s->name, fflags); if (ret) goto out_destroy; } @@ -1306,6 +1306,8 @@ static int
create_schemata_list(void) s->res = r; s->conf_type = resctrl_to_rdt(r)->cdp_type; + snprintf(s->name, sizeof(s->name), "%s", r->name); + INIT_LIST_HEAD(&s->list); list_add(&s->list, &resctrl_all_schema); } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 9ed0beb241d8..8b06ed8e7407 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -7,6 +7,11 @@ #include #include +/* + * The longest name we expect in the schemata file: + */ +#define RESCTRL_NAME_LEN 7 + enum resctrl_conf_type { CDP_BOTH = 0, CDP_CODE, @@ -147,11 +152,13 @@ int resctrl_arch_set_cdp_enabled(bool enable); /** * @li
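On the chosen length: "L3CODE" and "L2DATA" are the longest schemata names at six characters, so RESCTRL_NAME_LEN of 7 presumably allows for the longest expected name plus its NUL terminator.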
[RFC PATCH 15/20] x86/intel_rdt: Walk the resctrl schema list instead of the arch's resource list
Now that resctrl has a list of resources it is using, walk that list instead of the architecture's list. This lets us keep schema properties with the resource that is using them. Most users of for_each_alloc_enabled_rdt_resource() are per-schema; switch these to walk the schema list. The remainder are working with a per-resource property. Previously we littered resctrl_to_rdt() wherever we needed to know the cdp_type of a cache. Now that this has a home, fix all those callers to read the value from the relevant schema entry. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 24 + arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +++- include/linux/resctrl.h | 2 +- 3 files changed, 19 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index f80a838cc36d..3038ecfdeec0 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -271,10 +271,12 @@ int resctrl_arch_update_domains(struct rdt_resource *r) static int rdtgroup_parse_resource(char *resname, char *tok, int closid) { struct rdt_resource *r; + struct resctrl_schema *s; - for_each_alloc_enabled_rdt_resource(r) { + list_for_each_entry(s, &resctrl_all_schema, list) { + r = s->res; if (!strcmp(resname, r->name) && closid < r->num_closid) - return parse_line(tok, r, resctrl_to_rdt(r)->cdp_type, closid); + return parse_line(tok, r, s->conf_type, closid); } rdt_last_cmd_printf("unknown/unsupported resource name '%s'\n", resname); return -EINVAL; @@ -283,6 +285,7 @@ static int rdtgroup_parse_resource(char *resname, char *tok, int closid) ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, char *buf, size_t nbytes, loff_t off) { + struct resctrl_schema *s; struct rdtgroup *rdtgrp; struct rdt_domain *dom; struct rdt_resource *r; @@ -303,9 +306,10 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, closid = rdtgrp->closid; - for_each_alloc_enabled_rdt_resource(r) { - list_for_each_entry(dom, &r->domains, list) + list_for_each_entry(s, &resctrl_all_schema, list) { + list_for_each_entry(dom, &s->res->domains, list) { memset(dom->staged_config, 0, sizeof(dom->staged_config)); + } } while ((tok = strsep(&buf, "\n")) != NULL) { @@ -347,9 +351,9 @@ void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, *value = hw_dom->mbps_val[hw_closid]; } -static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid) +static void show_doms(struct seq_file *s, struct resctrl_schema *schema, int closid) { - + struct rdt_resource *r = schema->res; struct rdt_domain *dom; bool sep = false; u32 ctrl_val, hw_closid; @@ -359,7 +363,7 @@ static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid) if (sep) seq_puts(s, ";"); - hw_closid = resctrl_closid_cdp_map(closid, resctrl_to_rdt(r)->cdp_type); + hw_closid = resctrl_closid_cdp_map(closid, schema->conf_type); resctrl_arch_get_config(r, dom, hw_closid, &ctrl_val); seq_printf(s, r->format_str, dom->id, max_data_width, ctrl_val); @@ -371,6 +375,7 @@ static void show_doms(struct seq_file *s, struct rdt_resource *r, int closid) int rdtgroup_schemata_show(struct kernfs_open_file *of, struct seq_file *s, void *v) { + struct resctrl_schema *schema; struct rdtgroup *rdtgrp; struct rdt_resource *r; int ret = 0; @@ -379,9 +384,10 @@ int rdtgroup_schemata_show(struct kernfs_open_file *of, rdtgrp = rdtgroup_kn_lock_live(of->kn); if (rdtgrp) { closid = rdtgrp->closid; - for_each_alloc_enabled_rdt_resource(r) { + 
list_for_each_entry(schema, &resctrl_all_schema, list) { + r = schema->res; if (closid < r->num_closid) - show_doms(s, r, closid); + show_doms(s, schema, closid); } } else { ret = -ENOENT; diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 2015d99ca388..0bd748defc73 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -913,6 +913,7 @@ static int rdtgroup_mkdir_info_resd
[RFC PATCH 14/20] x86/intel_rdt: Add a separate resource list for resctrl
We want to merge the L2/L2CODE/L2DATA resources together so that there is one resource per cache. The CDP properties are then part of the configuration. Currently the cdp type to use with the configuration is hidden in the resource. This needs to be part of the schema, but resctrl doesn't have a structure for this (it's all flattened out into extra resources). Create a list of schemata that resctrl presents via the schemata file. We want to move the illusion of an "L2CODE" cache into resctrl so that this part of the ABI is dealt with by core code. This change will allow us to have the same resource represented twice as code/data, with the appropriate cdp_type for configuration. This will also let us generate the names in resctrl. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 45 +++- include/linux/resctrl.h | 13 +++ 2 files changed, 57 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index f3dfed9c609a..2015d99ca388 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -43,6 +43,9 @@ static struct kernfs_root *rdt_root; struct rdtgroup rdtgroup_default; LIST_HEAD(rdt_all_groups); +/* list of entries for the schemata file */ +LIST_HEAD(resctrl_all_schema); + /* Kernel fs node for "info" directory under root */ static struct kernfs_node *kn_info; @@ -1287,6 +1290,37 @@ static int mkdir_mondata_all(struct kernfs_node *parent_kn, struct rdtgroup *prgrp, struct kernfs_node **mon_data_kn); + +static int create_schemata_list(void) +{ + struct rdt_resource *r; + struct resctrl_schema *s; + + for_each_alloc_enabled_rdt_resource(r) { + s = kzalloc(sizeof(*s), GFP_KERNEL); + if (!s) + return -ENOMEM; + + s->res = r; + s->conf_type = resctrl_to_rdt(r)->cdp_type; + + INIT_LIST_HEAD(&s->list); + list_add(&s->list, &resctrl_all_schema); + } + + return 0; +} + +static void destroy_schemata_list(void) +{ + struct resctrl_schema *s, *tmp; + + list_for_each_entry_safe(s, tmp, &resctrl_all_schema, list) { + list_del(&s->list); + kfree(s); + } +} + static struct dentry *rdt_mount(struct file_system_type *fs_type, int flags, const char *unused_dev_name, void *data) @@ -1312,12 +1346,18 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type, goto out_cdp; } + ret = create_schemata_list(); + if (ret) { + dentry = ERR_PTR(ret); + goto out_schemata_free; + } + closid_init(); ret = rdtgroup_create_info_dir(rdtgroup_default.kn); if (ret) { dentry = ERR_PTR(ret); - goto out_cdp; + goto out_schemata_free; } if (rdt_mon_capable) { @@ -1370,6 +1410,8 @@ static struct dentry *rdt_mount(struct file_system_type *fs_type, kernfs_remove(kn_mongrp); out_info: kernfs_remove(kn_info); +out_schemata_free: + destroy_schemata_list(); out_cdp: cdp_disable_all(); out: @@ -1538,6 +1580,7 @@ static void rdt_kill_sb(struct super_block *sb) reset_all_ctrls(r); cdp_disable_all(); rmdir_all_sub(); + destroy_schemata_list(); static_branch_disable_cpuslocked(&rdt_alloc_enable_key); static_branch_disable_cpuslocked(&rdt_mon_enable_key); static_branch_disable_cpuslocked(&rdt_enable_key); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index ede5c40756b4..071b2cc9c402 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -145,4 +145,17 @@ void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, /* Enable/Disable CDP on all applicable resources */ int resctrl_arch_set_cdp_enabled(bool enable); +/** + * @list: Member of resctrl's schema 
list + * @cdp_type: Whether this entry is for code/data/both + * @res: The rdt_resource for this entry + */ +struct resctrl_schema { + struct list_headlist; + enum resctrl_conf_type conf_type; + struct rdt_resource *res; +}; + +extern struct list_head resctrl_all_schema; + #endif /* __LINUX_RESCTRL_H */ -- 2.18.0
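For illustration (assuming CDP enabled on L3 only, and with the later patches that merge the illusionary caches applied): the list built by create_schemata_list() would hold one entry per schemata line, roughly { "L3CODE"/CDP_CODE, "L3DATA"/CDP_DATA, "L2"/CDP_BOTH, "MB"/CDP_BOTH }, each pointing at a single underlying rdt_resource. At this point in the series it still holds one entry per alloc_enabled resource, with conf_type copied from the arch's temporary cdp_type label.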
[RFC PATCH 04/20] x86/intel_rdt: Add closid to the staged config
Once we merge the L2/L2CODE/L2DATA resources, we still want to have two configurations staged for one resource when CDP is enabled. These two configurations would have different closid as far as the hardware is concerned. In preparation, add closid as a staged parameter, and pass it down when the schema is being parsed. In the future this will be the hardware closid, with the CDP correction already applied by resctrl. This allows another architecture to work with resctrl, without having to emulate CDP. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.h | 4 ++-- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 21 - include/linux/resctrl.h | 4 +++- 3 files changed, 17 insertions(+), 12 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index 7c17d74fd36c..5e271e0fe1f5 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -294,8 +294,8 @@ static inline struct rdt_hw_resource *resctrl_to_rdt(struct rdt_resource *r) return container_of(r, struct rdt_hw_resource, resctrl); } -int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d); -int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d); +int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid); +int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid); extern struct mutex rdtgroup_mutex; diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index 1068a19e03c5..0c849653a99d 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -64,7 +64,7 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r) return true; } -int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d) +int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid) { unsigned long data; struct resctrl_staged_config *cfg = &d->staged_config[0]; @@ -76,6 +76,7 @@ int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d) if (!bw_validate(buf, &data, r)) return -EINVAL; + cfg->closid = closid; cfg->new_ctrl = data; cfg->have_new_ctrl = true; @@ -127,7 +128,7 @@ static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r) * Read one cache bit mask (hex). Check that it is valid for the current * resource type. */ -int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d) +int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid) { unsigned long data; struct resctrl_staged_config *cfg = &d->staged_config[0]; @@ -139,6 +140,7 @@ int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d) if(!cbm_validate(buf, &data, r)) return -EINVAL; + cfg->closid = closid; cfg->new_ctrl = data; cfg->have_new_ctrl = true; @@ -151,7 +153,7 @@ int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d) * separated by ";". The "id" is in decimal, and must match one of * the "id"s for this resource. 
*/ -static int parse_line(char *line, struct rdt_resource *r) +static int parse_line(char *line, struct rdt_resource *r, u32 closid) { char *dom = NULL, *id; struct rdt_domain *d; @@ -169,7 +171,7 @@ static int parse_line(char *line, struct rdt_resource *r) dom = strim(dom); list_for_each_entry(d, &r->domains, list) { if (d->id == dom_id) { - if (r->parse_ctrlval(dom, r, d)) + if (r->parse_ctrlval(dom, r, d, closid)) return -EINVAL; goto next; } @@ -178,15 +180,15 @@ static int parse_line(char *line, struct rdt_resource *r) } static void apply_config(struct rdt_hw_domain *hw_dom, -struct resctrl_staged_config *cfg, int closid, +struct resctrl_staged_config *cfg, cpumask_var_t cpu_mask, bool mba_sc) { u32 *dc = !mba_sc ? hw_dom->ctrl_val : hw_dom->mbps_val; - if (cfg->new_ctrl != dc[closid]) { + if (cfg->new_ctrl != dc[cfg->closid]) { cpumask_set_cpu(cpumask_any(&hw_dom->resctrl.cpu_mask), cpu_mask); - dc[closid] = cfg->new_ctrl; + dc[cfg->closid] = cfg->new_ctrl; cfg->have_new_ctrl = false; } } @@ -204,6 +206,7 @@ static int update_domains(struct rdt_resource *r, int closid) if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM; +
[RFC PATCH 11/20] x86/intel_rdt: Pass in the code/data/both configuration value when parsing
The illusion of three types of cache at each level is a neat trick to allow a static table of resources to be used. This is a problem if the cache topology and partitioning abilities have to be discovered at boot. We want to fold the three code/data/both caches into one, and move the CDP configuration details to be a property of the configuration and its closid, not the cache. The resctrl filesystem can then re-create the illusion of separate caches. Temporarily label the configuration property of the cache, and pass this value down to the configuration helpers. Eventually we will move this label up to become a property of the schema. A later patch will correct the closid for CDP when the configuration is staged, which will let us merge the three types of resource. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 6 ++ arch/x86/kernel/cpu/intel_rdt.h | 7 +-- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 15 ++- include/linux/resctrl.h | 11 ++- 4 files changed, 31 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index c035280b4398..8d3544b6c149 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -83,6 +83,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_BOTH, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -102,6 +103,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_DATA, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, @@ -122,6 +124,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_CODE, .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -141,6 +144,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_BOTH, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -160,6 +164,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_DATA, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, @@ -179,6 +184,7 @@ struct rdt_hw_resource rdt_resources_all[] = { .format_str = "%d=%0*x", .fflags = RFTYPE_RES_CACHE, }, + .cdp_type = CDP_CODE, .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index 92822ff99f1a..cc8dea58b74f 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -285,6 +285,7 @@ struct rdt_hw_resource { struct rdt_resource resctrl; int rid; u32 hw_num_closid; + enum resctrl_conf_type cdp_type; // temporary unsigned intmsr_base; void (*msr_update) (struct rdt_domain *d, struct msr_param *m, struct rdt_resource *r); @@ -296,8 +297,10 @@ static inline struct rdt_hw_resource *resctrl_to_rdt(struct rdt_resource *r) return container_of(r, struct rdt_hw_resource, resctrl); } -int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid); -int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d, u32 closid); +int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d, + enum resctrl_conf_type t, u32 closid); +int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d, +enum resctrl_conf_type t, u32 closid); extern struct mutex rdtgroup_mutex; diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/ar
[RFC PATCH 18/20] x86/intel_rdt: Remove the CODE/DATA illusionary caches
Now that nothing uses these caches, remove them. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 69 - arch/x86/kernel/cpu/intel_rdt.h | 4 -- 2 files changed, 73 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index 96b1aab36053..f6f1eceb366f 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -84,41 +84,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .msr_base = IA32_L3_CBM_BASE, .msr_update = cat_wrmsr, }, - [RDT_RESOURCE_L3DATA] = - { - .rid= RDT_RESOURCE_L3DATA, - .resctrl = { - .name = "L3DATA", - .cache_level= 3, - .cache = { - .min_cbm_bits = 1, - }, - .domains= domain_init(RDT_RESOURCE_L3DATA), - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, - }, - .msr_base = IA32_L3_CBM_BASE, - .msr_update = cat_wrmsr, - - }, - [RDT_RESOURCE_L3CODE] = - { - .rid= RDT_RESOURCE_L3CODE, - .resctrl = { - .name = "L3CODE", - .cache_level= 3, - .cache = { - .min_cbm_bits = 1, - }, - .domains= domain_init(RDT_RESOURCE_L3CODE), - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, - }, - .msr_base = IA32_L3_CBM_BASE, - .msr_update = cat_wrmsr, - }, [RDT_RESOURCE_L2] = { .rid= RDT_RESOURCE_L2, @@ -136,40 +101,6 @@ struct rdt_hw_resource rdt_resources_all[] = { .msr_base = IA32_L2_CBM_BASE, .msr_update = cat_wrmsr, }, - [RDT_RESOURCE_L2DATA] = - { - .rid= RDT_RESOURCE_L2DATA, - .resctrl = { - .name = "L2DATA", - .cache_level= 2, - .cache = { - .min_cbm_bits = 1, - }, - .domains= domain_init(RDT_RESOURCE_L2DATA), - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, - }, - .msr_base = IA32_L2_CBM_BASE, - .msr_update = cat_wrmsr, - }, - [RDT_RESOURCE_L2CODE] = - { - .rid= RDT_RESOURCE_L2CODE, - .resctrl = { - .name = "L2CODE", - .cache_level= 2, - .cache = { - .min_cbm_bits = 1, - }, - .domains= domain_init(RDT_RESOURCE_L2CODE), - .parse_ctrlval = parse_cbm, - .format_str = "%d=%0*x", - .fflags = RFTYPE_RES_CACHE, - }, - .msr_base = IA32_L2_CBM_BASE, - .msr_update = cat_wrmsr, - }, [RDT_RESOURCE_MBA] = { .rid= RDT_RESOURCE_MBA, diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index fd5c0b3dc797..a4aba005cfea 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -311,11 +311,7 @@ int __init rdtgroup_init(void); enum { RDT_RESOURCE_L3, - RDT_RESOURCE_L3DATA, - RDT_RESOURCE_L3CODE, RDT_RESOURCE_L2, - RDT_RESOURCE_L2DATA, - RDT_RESOURCE_L2CODE, RDT_RESOURCE_MBA, /* Must be the last */ -- 2.18.0
[RFC PATCH 07/20] x86/intel_rdt: Expose update_domains() as an arch helper
update_domains() applies the staged configuration to the hw_dom's configuration array and updates the hardware. Make it part of the interface between resctrl and the arch code. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 4 ++-- include/linux/resctrl.h | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index ec3c15ee3473..766c3e62ad91 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -193,7 +193,7 @@ static void apply_config(struct rdt_hw_domain *hw_dom, } } -static int update_domains(struct rdt_resource *r) +int resctrl_arch_update_domains(struct rdt_resource *r) { struct resctrl_staged_config *cfg; struct rdt_hw_domain *hw_dom; @@ -312,7 +312,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, } for_each_alloc_enabled_rdt_resource(r) { - ret = update_domains(r); + ret = resctrl_arch_update_domains(r); if (ret) goto out; } diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 03d9fbc230af..9fe7d7de53d7 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -125,6 +125,7 @@ struct rdt_resource { }; +int resctrl_arch_update_domains(struct rdt_resource *r); void resctrl_arch_get_config(struct rdt_resource *r, struct rdt_domain *d, u32 closid, u32 *value); -- 2.18.0
[RFC PATCH 03/20] x86/intel_rdt: Group staged configuration into a separate struct
We want to merge the L2/L2CODE/L2DATA resources to be a single resource, but we still need to be able to apply separate CODE and DATA schema to the domain. Move the new_ctrl bitmap value and flag into a struct, and create an array of them. Today there is only one element in the array, but eventually resctrl will use the array slots for CODE/DATA/BOTH to detect a duplicate schema being written. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 44 +++-- include/linux/resctrl.h | 16 ++-- 2 files changed, 43 insertions(+), 17 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index e3dcb5161122..1068a19e03c5 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -67,16 +67,17 @@ static bool bw_validate(char *buf, unsigned long *data, struct rdt_resource *r) int parse_bw(char *buf, struct rdt_resource *r, struct rdt_domain *d) { unsigned long data; + struct resctrl_staged_config *cfg = &d->staged_config[0]; - if (d->have_new_ctrl) { + if (cfg->have_new_ctrl) { rdt_last_cmd_printf("duplicate domain %d\n", d->id); return -EINVAL; } if (!bw_validate(buf, &data, r)) return -EINVAL; - d->new_ctrl = data; - d->have_new_ctrl = true; + cfg->new_ctrl = data; + cfg->have_new_ctrl = true; return 0; } @@ -129,16 +130,17 @@ static bool cbm_validate(char *buf, unsigned long *data, struct rdt_resource *r) int parse_cbm(char *buf, struct rdt_resource *r, struct rdt_domain *d) { unsigned long data; + struct resctrl_staged_config *cfg = &d->staged_config[0]; - if (d->have_new_ctrl) { + if (cfg->have_new_ctrl) { rdt_last_cmd_printf("duplicate domain %d\n", d->id); return -EINVAL; } if(!cbm_validate(buf, &data, r)) return -EINVAL; - d->new_ctrl = data; - d->have_new_ctrl = true; + cfg->new_ctrl = data; + cfg->have_new_ctrl = true; return 0; } @@ -175,15 +177,29 @@ static int parse_line(char *line, struct rdt_resource *r) return -EINVAL; } +static void apply_config(struct rdt_hw_domain *hw_dom, +struct resctrl_staged_config *cfg, int closid, +cpumask_var_t cpu_mask, bool mba_sc) +{ + u32 *dc = !mba_sc ? hw_dom->ctrl_val : hw_dom->mbps_val; + + if (cfg->new_ctrl != dc[closid]) { + cpumask_set_cpu(cpumask_any(&hw_dom->resctrl.cpu_mask), + cpu_mask); + dc[closid] = cfg->new_ctrl; + cfg->have_new_ctrl = false; + } +} + static int update_domains(struct rdt_resource *r, int closid) { + struct resctrl_staged_config *cfg; struct rdt_hw_domain *hw_dom; struct msr_param msr_param; cpumask_var_t cpu_mask; struct rdt_domain *d; bool mba_sc; - u32 *dc; - int cpu; + int i, cpu; if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM; @@ -195,10 +211,12 @@ static int update_domains(struct rdt_resource *r, int closid) mba_sc = is_mba_sc(r); list_for_each_entry(d, &r->domains, list) { hw_dom = rc_dom_to_rdt(d); - dc = !mba_sc ? 
hw_dom->ctrl_val : hw_dom->mbps_val; - if (d->have_new_ctrl && d->new_ctrl != dc[closid]) { - cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); - dc[closid] = d->new_ctrl; + for (i = 0; i < ARRAY_SIZE(d->staged_config); i++) { + cfg = &hw_dom->resctrl.staged_config[i]; + if (!cfg->have_new_ctrl) + continue; + + apply_config(hw_dom, cfg, closid, cpu_mask, mba_sc); } } @@ -259,7 +277,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, for_each_alloc_enabled_rdt_resource(r) { list_for_each_entry(dom, &r->domains, list) - dom->have_new_ctrl = false; + memset(dom->staged_config, 0, sizeof(dom->staged_config)); } while ((tok = strsep(&buf, "\n")) != NULL) { diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 5950c30fcc30..0b57a55f4b3d 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -7,21 +7,29 @@ #include #include +/** + * struct resctrl_staged_config - parsed configuration to be applied + * @new_ctrl: new ctrl value to be loaded + * @have_new_ctrl:
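For reference, this is the struct the hunk above adds, reconstructed from the fields the code uses (the kernel-doc comment is cut short in this copy):

struct resctrl_staged_config {
	u32	new_ctrl;
	bool	have_new_ctrl;
};

struct rdt_domain then carries a staged_config[] array with a single element; a later patch grows it to one slot per configuration type, and another adds the closid field.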
[RFC PATCH 20/20] x86/intel_rdt: Merge cdp enable/disable calls
Now that the cdp_enable() and cdp_disable() calls are basically the same, merge them into cdp_set_enabled(true/false). All these functions are behind resctrl_arch_set_cdp_enabled(), so they can take the rdt_hw_resource directly. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 49 ++-- 1 file changed, 12 insertions(+), 37 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 38cd463443e8..b9b7375ef8a9 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -1017,10 +1017,9 @@ static inline bool is_mba_linear(void) return rdt_resources_all[RDT_RESOURCE_MBA].resctrl.membw.delay_linear; } -static int set_cache_qos_cfg(int level, bool enable) +static int set_cache_qos_cfg(struct rdt_hw_resource *hw_res, bool enable) { void (*update)(void *arg); - struct rdt_resource *r_l; cpumask_var_t cpu_mask; struct rdt_domain *d; int cpu; @@ -1028,15 +1027,14 @@ static int set_cache_qos_cfg(int level, bool enable) if (!zalloc_cpumask_var(&cpu_mask, GFP_KERNEL)) return -ENOMEM; - if (level == RDT_RESOURCE_L3) + if (hw_res == &rdt_resources_all[RDT_RESOURCE_L3]) update = l3_qos_cfg_update; - else if (level == RDT_RESOURCE_L2) + else if (hw_res == &rdt_resources_all[RDT_RESOURCE_L2]) update = l2_qos_cfg_update; else return -EINVAL; - r_l = &rdt_resources_all[level].resctrl; - list_for_each_entry(d, &r_l->domains, list) { + list_for_each_entry(d, &hw_res->resctrl.domains, list) { /* Pick one CPU from each domain instance to update MSR */ cpumask_set_cpu(cpumask_any(&d->cpu_mask), cpu_mask); } @@ -1078,53 +1076,30 @@ static int set_mba_sc(bool mba_sc) return 0; } -static int cdp_enable(int level) +static int cdp_set_enabled(struct rdt_hw_resource *hw_res, bool enable) { - struct rdt_resource *r_l = &rdt_resources_all[level].resctrl; int ret; - if (!r_l->alloc_capable || !r_l->cdp_capable) + if (!hw_res->resctrl.cdp_capable) return -EINVAL; - ret = set_cache_qos_cfg(level, true); + ret = set_cache_qos_cfg(hw_res, enable); if (!ret) - r_l->cdp_enabled = true; + hw_res->resctrl.cdp_enabled = enable; return ret; } -static void cdp_disable(int level) -{ - struct rdt_resource *r = &rdt_resources_all[level].resctrl; - - if (r->cdp_enabled) { - set_cache_qos_cfg(level, false); - r->cdp_enabled = false; - } -} - int resctrl_arch_set_cdp_enabled(bool enable) { int ret = -EINVAL; struct rdt_hw_resource *l3 = &rdt_resources_all[RDT_RESOURCE_L3]; struct rdt_hw_resource *l2 = &rdt_resources_all[RDT_RESOURCE_L2]; - if (l3 && l3->resctrl.cdp_capable) { - if (!enable) { - cdp_disable(RDT_RESOURCE_L3); - ret = 0; - } else { - ret = cdp_enable(RDT_RESOURCE_L3); - } - } - if (l2 && l2->resctrl.cdp_capable) { - if (!enable) { - cdp_disable(RDT_RESOURCE_L2); - ret = 0; - } else { - ret = cdp_enable(RDT_RESOURCE_L2); - } - } + if (l3 && l3->resctrl.cdp_capable) + ret = cdp_set_enabled(l3, enable); + if (l2 && l2->resctrl.cdp_capable) + ret = cdp_set_enabled(l2, enable); return ret; } -- 2.18.0
[RFC PATCH 19/20] x86/intel_rdt: Kill off alloc_enabled
Now that the L2/L2CODE/L2DATA resources are merged together, alloc_enabled doesn't mean anything; it's the same as alloc_capable, which indicates CAT is supported by this cache. Take the opportunity to kill off alloc_enabled and its helpers. Signed-off-by: James Morse --- arch/x86/kernel/cpu/intel_rdt.c | 3 --- arch/x86/kernel/cpu/intel_rdt.h | 5 - arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c | 2 +- arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 6 +++--- include/linux/resctrl.h | 2 -- 5 files changed, 4 insertions(+), 14 deletions(-) diff --git a/arch/x86/kernel/cpu/intel_rdt.c b/arch/x86/kernel/cpu/intel_rdt.c index f6f1eceb366f..982897839711 100644 --- a/arch/x86/kernel/cpu/intel_rdt.c +++ b/arch/x86/kernel/cpu/intel_rdt.c @@ -157,7 +157,6 @@ static inline void cache_alloc_hsw_probe(void) r->cache.shareable_bits = 0xc; r->cache.min_cbm_bits = 2; r->alloc_capable = true; - r->alloc_enabled = true; rdt_alloc_capable = true; } @@ -214,7 +213,6 @@ static bool rdt_get_mem_config(struct rdt_resource *r) r->data_width = 3; r->alloc_capable = true; - r->alloc_enabled = true; return true; } @@ -234,7 +232,6 @@ static void rdt_get_cache_alloc_cfg(int idx, struct rdt_resource *r) r->cache.shareable_bits = ebx & r->default_ctrl; r->data_width = (r->cache.cbm_len + 3) / 4; r->alloc_capable = true; - r->alloc_enabled = true; } static int get_cache_id(int cpu, int level) diff --git a/arch/x86/kernel/cpu/intel_rdt.h b/arch/x86/kernel/cpu/intel_rdt.h index a4aba005cfea..735d7bd4bcd9 100644 --- a/arch/x86/kernel/cpu/intel_rdt.h +++ b/arch/x86/kernel/cpu/intel_rdt.h @@ -342,11 +342,6 @@ static inline struct rdt_resource *resctrl_inc(struct rdt_resource *res) r = resctrl_inc(r)) \ if (r->mon_capable) -#define for_each_alloc_enabled_rdt_resource(r) \ - for (r = &rdt_resources_all[0].resctrl; r < &rdt_resources_all[RDT_NUM_RESOURCES].resctrl;\ -r = resctrl_inc(r)) \ - if (r->alloc_enabled) - #define for_each_mon_enabled_rdt_resource(r) \ for (r = &rdt_resources_all[0].resctrl; r < &rdt_resources_all[RDT_NUM_RESOURCES].resctrl;\ r = resctrl_inc(r)) \ diff --git a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c index e8264637a4d3..ab8bc97dee0b 100644 --- a/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c +++ b/arch/x86/kernel/cpu/intel_rdt_ctrlmondata.c @@ -329,7 +329,7 @@ ssize_t rdtgroup_schemata_write(struct kernfs_open_file *of, goto out; } - for_each_alloc_enabled_rdt_resource(r) { + for_each_alloc_capable_rdt_resource(r) { ret = resctrl_arch_update_domains(r); if (ret) goto out; diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c index 39038bdfa7d6..38cd463443e8 100644 --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c @@ -103,7 +103,7 @@ static void closid_init(void) int rdt_min_closid = 32; /* Compute rdt_min_closid across all resources */ - for_each_alloc_enabled_rdt_resource(r) + for_each_alloc_capable_rdt_resource(r) rdt_min_closid = min(rdt_min_closid, r->num_closid); closid_free_map = BIT_MASK(rdt_min_closid) - 1; @@ -1307,7 +1307,7 @@ static int create_schemata_list(void) int ret = 0; struct rdt_resource *r; - for_each_alloc_enabled_rdt_resource(r) { + for_each_alloc_capable_rdt_resource(r) { if (r->cdp_enabled) { ret = add_schema(CDP_CODE, r); ret |= add_schema(CDP_DATA, r); @@ -1586,7 +1586,7 @@ static void rdt_kill_sb(struct super_block *sb) set_mba_sc(false); /*Put everything back to default values. 
*/ - for_each_alloc_enabled_rdt_resource(r) + for_each_capable_rdt_resource(r) reset_all_ctrls(r); cdp_disable_all(); rmdir_all_sub(); diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h index 8b06ed8e7407..7a955ce93988 100644 --- a/include/linux/resctrl.h +++ b/include/linux/resctrl.h @@ -86,7 +86,6 @@ struct resctrl_membw { }; /** - * @alloc_enabled: Is allocation enabled on this machine * @mon_enabled: Is monitoring enabled for this feature * @cdp_enabledIs CDP enabled for this resource * @alloc_capable: Is allocation available on this machine @@ -113,7 +112,6 @@ struct resctr
Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM
Hi guys, On 19/07/18 15:01, Borislav Petkov wrote: > On Mon, Jul 16, 2018 at 01:26:49PM -0400, Tyler Baicar wrote: >> Enable per-layer error reporting for ARM systems so that the error >> counters are incremented per-DIMM. >> >> On ARM systems that use firmware first error handling it is understood understood by whom? Is this written down somewhere, or is it just convention? (In which case, let's get it written down somewhere.) >> that card=channel and module=DIMM on that channel. Populate that I'm guessing this is the mapping between CPER records and the DMI table data. >> information and enable per layer error reporting for ARM systems so that >> the EDAC error counters are incremented based on DIMM number as per the >> SMBIOS table rather than just incrementing the noinfo counters on the >> memory controller. Does this work on x86, and is it just that the dmi/cper fields have a subtle difference? > I guess. > > James? I don't know anything about this stuff. Looks like the SMBIOS specification is my new bedtime reading. Thanks, James
Re: [PATCH v3 5/5] arm64/mm: Move {idmap_pg_dir, swapper_pg_dir} to .rodata section
Hi Jun, On 02/07/18 12:16, Jun Yao wrote: > Move {idmap_pg_dir, swapper_pg_dir} to .rodata section and > populate swapper_pg_dir by fixmap. (any chance you could split the fixmap bits into a separate patch so that the rodata move comes last? This will make review and bisecting any problems easier.) > diff --git a/arch/arm64/include/asm/pgalloc.h > b/arch/arm64/include/asm/pgalloc.h > index 2e05bcd944c8..a0ce7d0f81c5 100644 > --- a/arch/arm64/include/asm/pgalloc.h > +++ b/arch/arm64/include/asm/pgalloc.h > @@ -29,6 +29,23 @@ > #define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO) > #define PGD_SIZE (PTRS_PER_PGD * sizeof(pgd_t)) > > +static inline int in_swapper_dir(void *addr) > +{ > + if ((unsigned long)addr >= (unsigned long)swapper_pg_dir && > + (unsigned long)addr < (unsigned long)swapper_pg_end) { > + return 1; > + } > + return 0; > +} Making this a bool and returning the condition feels more natural. We know swapper_pg_dir and swapper_pg_end are both page aligned; if we can tell the compiler, it can generate better code. Something like: | return ((long)addr & PAGE_MASK) == ((long)swapper_pg_dir & PAGE_MASK); (if you agree it's the same) > +static inline void *swapper_mirror_addr(void *start, void *addr) > +{ > + unsigned long offset; > + > + offset = (unsigned long)addr - (unsigned long)swapper_pg_dir; > + return start + offset; > +} I think you've re-invented __set_fixmap_offset, ... > #if CONFIG_PGTABLE_LEVELS > 2 > > static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr) > @@ -49,6 +66,17 @@ static inline void __pud_populate(pud_t *pudp, phys_addr_t > pmdp, pudval_t prot) > > static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t > *pmdp) > { > +#ifdef __PAGETABLE_PUD_FOLDED > + if ((mm == &init_mm) && in_swapper_dir(pudp)) { > + pud_t *pud; > + pud = pud_set_fixmap(__pa_symbol(swapper_pg_dir)); > + pud = (pud_t *)swapper_mirror_addr(pud, pudp); This is calculating the corresponding pudp in the fixmap mapping of swapper. If you give pud_set_fixmap() the physical address of the original pudp it will calculate this for you. I think this could fold down to something like: | pud_t *fixmap_pudp = pud_set_fixmap(__kimg_to_phys((long)pudp)); (please check I've dug through all that properly!) On a wider issue: these p?d_populate() helpers only put table entries down; these are used by vmalloc(). Unfortunately ioremap() will try to add block mappings if it can, which don't use this helper. (bits and pieces for testing this in [0]). Systems with nvdimms probably do this all the time... We could add the same in_swapper_dir() tests and fixmap operations to p?d_set_huge(), but it may be worth pushing them down to set_p?d(), which is where all these paths join up. This would also cover pgd_clear(), which is called via p?d_none_or_clear_bad(). I think we should do something about concurrent calls. Today that ~should work, providing they aren't modifying the same locations, but with these changes both CPUs would try and clear_fixmap()/tlbi the range from underneath each other. The p?d_alloc() helpers take the mm lock around the p?d_populate() calls, but it doesn't look like there is any other lock for the ioremap work calling p?d_set_huge(), so we will need one of our own, especially for p?d_none_or_clear_bad().
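Pulling those suggestions together, something like this (untested, ignoring the __PAGETABLE_PUD_FOLDED guard for brevity, and assuming your swapper_pg_end marker):

| static inline bool in_swapper_dir(void *addr)
| {
| 	unsigned long p = (unsigned long)addr;
|
| 	return p >= (unsigned long)swapper_pg_dir &&
| 	       p < (unsigned long)swapper_pg_end;
| }
|
| static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmdp)
| {
| 	if (mm == &init_mm && in_swapper_dir(pudp)) {
| 		/* poke the copy of pudp in the fixmap mapping of swapper */
| 		pud_t *fixmap_pudp = pud_set_fixmap(__kimg_to_phys((long)pudp));
|
| 		__pud_populate(fixmap_pudp, __pa(pmdp), PMD_TYPE_TABLE);
| 		pud_clear_fixmap();
| 		return;
| 	}
| 	__pud_populate(pudp, __pa(pmdp), PMD_TYPE_TABLE);
| }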
> diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > index 3b408f21fe2e..b479d1b997c2 100644 > --- a/arch/arm64/include/asm/pgtable.h > +++ b/arch/arm64/include/asm/pgtable.h > @@ -475,6 +475,9 @@ static inline phys_addr_t pmd_page_paddr(pmd_t pmd) > */ > #define mk_pte(page,prot)pfn_pte(page_to_pfn(page),prot) > > +#define pmd_set_fixmap(addr) ((pmd_t *)set_fixmap_offset(FIX_PMD, > addr)) > +#define pmd_clear_fixmap() clear_fixmap(FIX_PMD) > + Ooer, we shouldn't need to move these around. If they're wrong, it should be the subject of a separate patch, but it's more likely we're using the wrong ones... > diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S > index d69e11ad92e3..beff018bf0f9 100644 > --- a/arch/arm64/kernel/vmlinux.lds.S > +++ b/arch/arm64/kernel/vmlinux.lds.S > @@ -223,21 +223,25 @@ SECTIONS > BSS_SECTION(0, 0, 0) > > . = ALIGN(PAGE_SIZE); > - idmap_pg_dir = .; > - . += IDMAP_DIR_SIZE; > + > + .rodata : { Now I'm confused. I thought this named section was output by RO_DATA() further up, just before __init_begin, which mark_rodata_ro() uses as its end marker. I thought named sections had to be unique. Are we expecting the linker to realise we want it to move this stuff up there, and not move the read-only contents down here, outside the markers? (maybe the linker always takes the first definition?) Can we move these to sit near INIT_DIR/INIT_PG_TABLES as a macro, which we then put after NOTES? (something
Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl
Hi Dongjiu Geng, On 06/01/18 16:02, Dongjiu Geng wrote: > The ARM64 RAS SError Interrupt(SEI) syndrome value is specific to the > guest and user space needs a way to tell KVM this value. So we add a > new ioctl. Before user space specifies the Exception Syndrome Register > ESR(ESR), it firstly checks that whether KVM has the capability to > set the guest ESR, If has, will set it. Otherwise, nothing to do. > > For this ESR specifying, Only support for AArch64, not support AArch32. After this patch user-space can trigger an SError in the guest. If it wants to migrate the guest, how does the pending SError get migrated? I think we need to fix migration first. Andrew Jones suggested using KVM_GET/SET_VCPU_EVENTS: https://www.spinics.net/lists/arm-kernel/msg616846.html Given KVM uses kvm_inject_vabt() on v8.0 hardware too, we should cover systems without the v8.2 RAS Extensions with the same API. I think this means a bit to read/write whether SError is pending, and another to indicate the ESR should be set/read (a sketch of what this could look like is at the end of this mail). CPUs without the v8.2 RAS Extensions can reject pending-SError that had an ESR. User-space can then use the 'for migration' calls to make a 'new' SError pending. Now that the cpufeature bits are queued, I think this can be split up into two separate series for v4.16-rc1, one to tackle NOTIFY_SEI and the associated plumbing. The second for the KVM 'make SError pending' API. > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c > index 5c7f657..738ae90 100644 > --- a/arch/arm64/kvm/guest.c > +++ b/arch/arm64/kvm/guest.c > @@ -277,6 +277,11 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, > return -EINVAL; > } > > +int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome) > +{ > + return -EINVAL; > +} Does nothing in the patch that adds the support? This is a bit odd. (oh, it's hiding in patch 6...)
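On that migration point, KVM_GET/SET_VCPU_EVENTS could carry something shaped like this (a sketch of the idea, not a proposal for the final layout):

| struct kvm_vcpu_events {
| 	struct {
| 		__u8	serror_pending;
| 		__u8	serror_has_esr;	/* rejected on CPUs without the v8.2 RAS Extensions */
| 		__u8	pad[6];
| 		__u64	serror_esr;
| 	} exception;
| };

user-space reads this on the source of the migration, and writes it on the destination to make the same SError pending again.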
Re: [PATCH v9 6/7] arm64: kvm: Set Virtual SError Exception Syndrome for guest
Hi Dongjiu Geng, On 06/01/18 16:02, Dongjiu Geng wrote: > RAS Extension add a VSESR_EL2 register which can provide > the syndrome value reported to software on taking a virtual > SError interrupt exception. This patch supports to specify > this Syndrome. > > In the RAS Extensions we can not set all-zero syndrome value > for SError, which means 'RAS error: Uncategorized' instead of > 'no valid ISS'. So set it to IMPLEMENTATION DEFINED syndrome > by default. > > We also need to support userspace to specify a valid syndrome > value, Because in some case, the recovery is driven by userspace. > This patch can support that userspace specify it. > > In the guest/host world switch, restore this value to VSESR_EL2 > only when HCR_EL2.VSE is set. This value no need to be saved > because it is stale vale when guest exit. A version of this patch has been queued by Catalin. Now that the cpufeature bits are queued, I think this can be split up into two separate series for v4.16-rc1, one to tackle NOTIFY_SEI and the associated plumbing. The second for the KVM 'make SError pending' API. > Signed-off-by: Dongjiu Geng > [Set an impdef ESR for Virtual-SError] > Signed-off-by: James Morse I didn't sign off this patch. If you pick some bits from another version and want to credit someone else you can 'CC:' them or just mention it in the commit-message. > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h > index 47b967d..3b035cc 100644 > --- a/arch/arm64/include/asm/sysreg.h > +++ b/arch/arm64/include/asm/sysreg.h > @@ -86,6 +86,9 @@ > #define REG_PSTATE_PAN_IMM sys_reg(0, 0, 4, 0, 4) > #define REG_PSTATE_UAO_IMM sys_reg(0, 0, 4, 0, 3) > > +/* virtual SError exception syndrome register */ > +#define REG_VSESR_EL2 sys_reg(3, 4, 5, 2, 3) Irrelevant-Nit: sys-regs usually have a 'SYS_' prefix, and are in instruction encoding order lower down the file. (These PSTATE PAN things are a bit odd as they were used to generate an instruction before the fancy {read,write}_sysreg() helpers were added). > #define SET_PSTATE_PAN(x) __emit_inst(0xd500 | REG_PSTATE_PAN_IMM | > \ > (!!x)<<8 | 0x1f) > #define SET_PSTATE_UAO(x) __emit_inst(0xd500 | REG_PSTATE_UAO_IMM | > \ > (!!x)<<8 | 0x1f) > diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c > index 738ae90..ffad42b 100644 > --- a/arch/arm64/kvm/guest.c > +++ b/arch/arm64/kvm/guest.c > @@ -279,7 +279,16 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, > > int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome) Bits of this are spread between patches 5 and 6. If you put them in the other order this wouldn't happen. (but after a rebase most of this patch should disappear) > { > - return -EINVAL; > + u64 reg = *syndrome; > + > + /* inject virtual system Error or asynchronous abort */ > + kvm_inject_vabt(vcpu); So this writes an impdef ESR, because it's the existing code-path in KVM. > + if (reg) > + /* set vsesr_el2[24:0] with value that user space specified */ > + kvm_vcpu_set_vsesr(vcpu, reg & ESR_ELx_ISS_MASK); And then you overwrite it. 
Which is a bit odd as there is a helper to do both in one go: > + > + return 0; > } > int __attribute_const__ kvm_target_cpu(void) > diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c > index 3556715..fb94b5e 100644 > --- a/arch/arm64/kvm/inject_fault.c > +++ b/arch/arm64/kvm/inject_fault.c > @@ -246,14 +246,25 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu) > inject_undef64(vcpu); > } > > +static void pend_guest_serror(struct kvm_vcpu *vcpu, u64 esr) > +{ > + kvm_vcpu_set_vsesr(vcpu, esr); > + vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE); > +} How come you don't use this in kvm_arm_set_sei_esr()?
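i.e. something like (sketch only, modulo keeping the impdef default for a zero value):

| int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome)
| {
| 	pend_guest_serror(vcpu, *syndrome & ESR_ELx_ISS_MASK);
|
| 	return 0;
| }

Thanks, James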
Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8
Hi Tyler,

On 08/05/17 20:59, Baicar, Tyler wrote:
> On 5/8/2017 11:28 AM, James Morse wrote:
>> I was tidying up the masking/unmasking in entry.S, something I wasn't aware
>> of that leads to a bug: entry.S will unmask interrupts for instruction/data
>> aborts that came from a context with interrupts enabled. This makes sense
>> for get_user() and friends...
>> For do_sea() we pull in nmi_enter() as this can interrupt interrupts-masked
>> code, such as APEI, but if we end up in here with interrupts unmasked we
>> can take an IRQ from this 'NMI' context, which will inherit the in_nmi()
>> and could lead to the deadlock we were originally trying to avoid.
>>
>> Teaching entry.S to spot external aborts is messy. I think the two choices
>> are to either mask interrupts when calling nmi_enter() (as these things
>> should be mutually exclusive), or to conditionally call nmi_enter() based
>> on interrupts_enabled(regs). I prefer the second one as it matches the
>> notify_sea() while interruptible that happens when KVM takes one of these.

Thinking about this some more: the KVM case is different as we know it was a
guest that triggered the external abort. Nothing the host kernel does is
likely to trigger either the same error or a related one. But I can't think
of a way this would trip twice on the host... yes, your suggestion looks
fine.

(When we add SError/SEI support too we will need to change it, as SEA may
interrupt SEI, and nmi_enter() has a BUG_ON(in_nmi()), so this nesting will
need explicit checking.)

Thanks,

James
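A minimal sketch of that second option, keeping do_sea()'s existing shape in
arch/arm64/mm/fault.c; this illustrates the idea, it is not the patch that
was eventually merged:

	static int do_sea(unsigned long addr, unsigned int esr,
			  struct pt_regs *regs)
	{
		int ret = -ENOENT;

		if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
			/*
			 * Only claim NMI context if we interrupted code that
			 * had interrupts masked (e.g. APEI holding a lock).
			 * Otherwise notify APEI interruptibly, the same way
			 * KVM does.
			 */
			bool nmi = !interrupts_enabled(regs);

			if (nmi)
				nmi_enter();
			ret = ghes_notify_sea();
			if (nmi)
				nmi_exit();
		}

		return ret;
	}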
Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error
Hi gengdongjiu,

On 05/05/17 13:31, gengdongjiu wrote:
> when the guest OS takes an SEA, my current solution is shown below:
>
> (1) host EL3 firmware firstly handles the SEA error and generates the CPER
> record.
> (2) EL3 firmware separately copies the esr_el3, elr_el3, SPSR_el3, far_el3
> to the esr_el2, elr_el2, SPSR_el2, far_el2.

Copying {ELR,SPSR,FAR}_EL3 to the EL2 registers rings some alarm bells: I'm
sure you exclude values from EL3 or the secure-world; we should never hand
those to the normal world.

> (3) then jump to the EL2 hypervisor
> so the EL2 hypervisor uses the ESR that comes from esr_el3, here the
> ESR(esr_el3) value may be different from the existing KVM API's ESR.

The ESR may be different between EL3 and EL2. The ESR contains the severity
of the event, the CPU will choose this when it takes the SError to EL3. If it
had taken the SError to EL2, the CPU may have classified the error
differently.

Firmware may need to generate a more severe ESR if it receives an error that
would be propagated by delivering SEI to a lower exception level, for example
if an EL2 system register is 'infected'.

This is the same for Qemu/kvmtool. A contained error at EL2 may be an
uncontained error if we hand it to guest EL1. Linux's RAS code will decide
this with its choice of signal to send (and possibly which code to set).
Qemu/kvmtool need to choose an appropriate APEI notification, which may
involve generating a relevant ESR.

Also relevant is the problem we discussed earlier with trying to deliver fake
Physical-SError from software at EL3: if the SError is routed to EL2, and EL2
has PSTATE.A masked, EL3 has to wait and try again later. This is another
case where firmware may have to upgrade the classification of an error to
uncontainable.

Thanks,

James
Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error
Hi gengdongjiu,

On 10/05/17 09:44, gengdongjiu wrote:
> On 2017/5/9 1:28, James Morse wrote:
>>>> (hwpoison for KVM is a corner case as Qemu's memory effectively has two
>>>> users, Qemu and KVM. This isn't the example of how user-space gets
>>>> signalled.)
>>
>> KVM creates guests as if they were additional users of Qemu's memory. The
>> code in mm/memory-failure.c may find that Qemu didn't have the affected
>> page mapped to user-space - but it may have been in use by stage2.
>>
>> The KVM+SIGBUS patch hides this difference, meaning Qemu gets a signal
>> when the guest touches the hwpoison page as if Qemu had touched the page
>> itself.
>>
>> Signals from KVM are a corner case; for firmware-first, decisions should
>> happen in the APEI code based on CPER records.
>>> If so, how does KVM handle SEA types other than hwpoison?
>> To deliver to a guest? It shouldn't have to know, user space should use a
>> KVM API to drive this.
>>
>> When received from hardware? It shouldn't have to care, these things
>> should be passed into the APEI code for handling. KVM just needs to put
>> the host registers back.

> Recently I confirmed with the hardware team. They said almost all the SEA
> errors have the poison flag, so maybe there is no need to consider SEA
> errors other than hwpoison; only considering SEA hwpoison errors can be
> enough.

We should be careful here: by hwpoison I meant the Linux feature.
From Documentation/vm/hwpoison.txt:

> Upcoming Intel CPUs have support for recovering from some memory errors
> (``MCA recovery''). This requires the OS to declare a page "poisoned",
> kill the processes associated with it and avoid using it in the future.

We were talking about KVM's reaction to 'the OS declaring a page poisoned'.
Let's try to call this one memory-failure, as that is its Kconfig name. (now
I understand why we've been confusing each other!)

Your hwpoison looks like something the CPU reports in the ERRSTATUS registers
(4.6.10 of DDI0587). This is something firmware should read, then describe to
the OS via CPER records. Depending on these CPER records linux may invoke its
memory-failure code.

>>> Injecting an SEA is no more than setting some registers: elr_el1, PC,
>>> PSTATE, SPSR_el1, far_el1, esr_el1. I've seen this KVM API do the same
>>> thing as Qemu. Did you find that calling this API has an issue, making it
>>> necessary to choose another ESR value?
>>
>> Should we let user-space pick the ESR to deliver to the guest? Yes,
>> letting user-space specify the ESR gives the most flexibility to do
>> something clever in the future. An obvious choice for SEA is between the
>> external-abort and 'parity or ECC error' codes. If we tell user-space
>> which of these happened (I don't think Linux does today) then Qemu can
>> relay that information to the guest.

> Maybe the ESR is delivered by KVM:
> (1) guest OS EL0 takes an SEA due to hwpoison
> (2) CPU traps to EL3 firmware, which updates the ESR_EL3
> (3) the EL3 firmware copies the ESR_EL3 to ESR_EL2
> (4) then jumps to the EL2 hypervisor, which uses the ESR_EL2 to inject the
> SEA.
>
> Maybe the esr_el2 can provide the accurate error information.
> Or do you think user-space specifying the ESR instead of esr_el2 is better?

I think the severity needs to be considered as the notification is handled by
each exception level. There are cases where it will need to be upgraded from
'contained' to 'uncontained'. (more discussion on another part of the
thread).

Thanks,

James
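For reference, the 'KVM+SIGBUS patch' mentioned above does something along
these lines when a stage2 fault finds a memory-failure page (a simplified
sketch, not the exact mainline code):

	static void kvm_send_hwpoison_signal(unsigned long address,
					     struct vm_area_struct *vma)
	{
		siginfo_t info;

		info.si_signo = SIGBUS;
		info.si_errno = 0;
		info.si_code = BUS_MCEERR_AR;
		info.si_addr = (void __user *)address;

		/* Tell user-space how much memory is gone */
		if (is_vm_hugetlb_page(vma))
			info.si_addr_lsb = huge_page_shift(hstate_vma(vma));
		else
			info.si_addr_lsb = PAGE_SHIFT;

		send_sig_info(SIGBUS, &info, current);
	}

Qemu receives this as if it had touched the page itself, which is the 'two
users of Qemu's memory' corner case being discussed.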
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi gengdongjiu,

On 15/12/17 03:30, gengdongjiu wrote:
> On 2017/12/7 14:37, gengdongjiu wrote:
>>> We need to tackle (1) and (3) separately. For (3) we need some API that
>>> lets Qemu _trigger_ an SError in the guest, with a specified ESR. But, we
>>> don't have a way of migrating a pending SError yet... which is where I
>>> got stuck last time I was looking at this.
>>
>> I understand most of your idea.
>>
>> But in Qemu one signal type can only correspond to one behavior, not two,
>> otherwise Qemu will not know what to do.
>>
>> For Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate the
>> CPER records and inject an SEA to the guest through the KVM IOCTL
>> "KVM_SET_ONE_REG"; if it receives the SIGBUS_MCEERR_AO signal, it will
>> record the CPER and trigger an IRQ to notify the guest, as shown below:
>>
>> SIGBUS_MCEERR_AR triggers a Synchronous External Abort.
>> SIGBUS_MCEERR_AO triggers a GPIO IRQ.
>>
>> For SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specified the
>> trigger method, neither of which involves _triggering_ an SError.
>>
>> so there is no chance for Qemu to trigger the SError when it gets
>> SIGBUS_MCEERR_A{O,R}.
>
> As I explained above:
>
> If Qemu receives SIGBUS_MCEERR_AR, it will record CPER and trigger a
> Synchronous External Abort;
> If Qemu receives SIGBUS_MCEERR_AO, it will record CPER and trigger a GPIO
> IRQ;
> So Qemu does not know when to _trigger_ an SError.

There is no answer to this. How the CPU decides is specific to the CPU
design. How Qemu decides is going to be specific to the machine it emulates.

My understanding is there is some overlap for which RAS errors are reported
as synchronous external abort, and which use SError. (Obviously the imprecise
ones are all SError). Which one the CPU uses depends on how the CPU is
designed.

When you take a SIGBUS_MCEERR_AR from KVM, it's because KVM can't complete a
stage2 fault because the page is marked with PG_poisoned. These started out
as a synchronous exception, but you could still report these with SError.

We don't have a way to signal user-space about imprecise exceptions, this
isn't a KVM specific problem.

> so here I "return an error" to Qemu if ghes_notify_sei() returns failure in
> [1]; if you oppose KVM "returning an error", do you have a better idea
> about it? thanks

If ghes_notify_sei() fails to claim the error, we should drop through to
kernel-first-handling. We don't have that yet, just the stub that ignores
errors where we can make progress.

If neither firmware-first nor kernel-first claim a RAS error, we're in
trouble. I'd like to panic() as we got a RAS notification but no description
of the error. We can't do this until we have kernel-first support, hence that
stub.

> About the way of migrating a pending SError, I think it is a separate case,
> because Qemu still does not know how and when to trigger the SError.

I agree, but I think we should fix this first before we add another user of
this unmigratable hypervisor state. (I recall someone saying migration is
needed for any new KVM/cpu features, but I can't find the thread)

> [1]:
> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> {
> ...
> +	case ESR_ELx_AET_UER: /* The error has not been propagated */
> +		/*
> +		 * Userspace only handles the guest SError Interrupt (SEI)
> +		 * if the error has not been propagated
> +		 */
> +		run->exit_reason = KVM_EXIT_EXCEPTION;
> +		run->ex.exception = ESR_ELx_EC_SERROR;

I'm against telling user space RAS errors ever happened, only the final
user-visible error when the kernel can't fix it.

This is inventing something new for RAS errors not claimed by firmware-first.
If we have kernel-first too, this will never happen. (unless your system is
losing the error description).

Your system has firmware-first, why isn't it claiming the notification? If
it's not finding CPER records written by firmware, check firmware and the
UEFI memory map agree on the attributes to be used when read/writing that
area.

> +		run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> +		return 0;

Thanks,

James
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi gengdongjiu,

On 16/12/17 04:47, gengdongjiu wrote:
> [...]
>>
>>> +	case ESR_ELx_AET_UER: /* The error has not been propagated */
>>> +		/*
>>> +		 * Userspace only handles the guest SError Interrupt (SEI)
>>> +		 * if the error has not been propagated
>>> +		 */
>>> +		run->exit_reason = KVM_EXIT_EXCEPTION;
>>> +		run->ex.exception = ESR_ELx_EC_SERROR;
>>> +		run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> +		return 0;
>>
>> We should not pass RAS notifications to user space. The kernel either
>> handles them, or it panics(). User space shouldn't even know if the kernel
>> supports RAS
>
> For the ESR_ELx_AET_UER (Recoverable error), let us see its definition
> below, which comes from [0]

[..]

> so we can see the exception is precise and the PE can recover execution
> from the preferred return address of the exception, so letting the guest
> handle it is better. For example, if it is a guest application RAS error,
> we can kill the guest application instead of panicking the whole OS; if it
> is a guest kernel RAS error, the guest will panic.

If the kernel takes an unhandled RAS error it should panic - we don't know
where the error is. I understand you want to kill off guest tasks as a result
of RAS errors, but this needs to go through the whole
APEI->memory_failure()->sigbus machinery so that the kernel knows the kernel
can keep running.

This saves us signalling user-space when we don't need to. An example:
code-corruption. Linux can happily re-read affected user-space executables
from disk, there is absolutely nothing user-space can do about it. Handling
errors first in the kernel allows us to do recovery for all the affected
processes, not just the one that happens to be running right now.

> The host does not know which application of the guest has the error, so
> the host can not handle it,

It has to work this out, otherwise the errors we can handle never get a
chance. The kernel is expected to look at the error description, (which for
some reason we aren't talking about here), e.g. the CPER records, and
determine what recovery action is necessary for this error.

For memory errors this may be re-reading from disk, or at the worst case,
unmapping from all user-space users (including KVM's stage2) and raining
signals on all affected processes.

For a memory error the important piece of information is the physical
address. Only the kernel can do anything with this, it determines who owns
the affected memory and what needs doing to recover from the error.

If you pass the notification to user-space, all it can do is signal the guest
to "stop doing whatever it is you're doing". The guest may have been able to
re-read pages from disk, or otherwise handle the error. Has the error been
handled? No: the error remains latent in the system.

> panicking the OS is not a good choice for a Recoverable error.

If we don't know where the error is, and we can't make progress, it's the
only sane choice. This code is never expected to run! (why are we arguing
about it?)

We should get RAS errors as GHES notifications from firmware via some
mechanism. If those are NOTIFY_SEI then APEI should claim the notification
and kick off the appropriate handling based on the CPER records. If/when we
get kernel-first, that can claim the SError.

What we're left with is RAS notifications that no-one claimed because there
was no error-description found.

James
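For reference, the APEI side of that machinery already exists: ghes.c does
roughly the following with the physical address from a CPER memory-error
section (a simplified sketch of ghes_handle_memory_failure(), with the
pfn_valid() checks trimmed):

	static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
					       int sev)
	{
		unsigned long pfn;
		int flags = -1;
		int sec_sev = ghes_severity(gdata->error_severity);
		struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

		/* No physical address: nothing the kernel can recover */
		if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
			return;

		pfn = mem_err->physical_addr >> PAGE_SHIFT;

		if (sec_sev == GHES_SEV_CORRECTED &&
		    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
			flags = MF_SOFT_OFFLINE;
		if (sec_sev == GHES_SEV_RECOVERABLE)
			flags = 0;

		/* memory_failure() works out who owns the page and signals them */
		if (flags != -1)
			memory_failure_queue(pfn, 0, flags);
	}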
Re: [PATCH v5 1/3] arm64/ras: support sea error recovery
Hi Xie XiuQi,

On 26/01/18 12:31, Xie XiuQi wrote:
> With the ARM v8.2 RAS Extension, SEAs are usually triggered when memory
> errors are consumed. According to the existing process, errors that occur
> in the kernel lead directly to panic; if they occur in user-space, we just
> kill the process.
>
> But there is a class of error where it is, in fact, not necessary to kill
> the process; you can recover and continue to run the process. For example
> when instruction data is corrupted: the memory page might be read-only and
> has not been modified, the disk might have the correct data, so you can
> directly drop the page, and reload it when necessary.

With firmware-first support, we do all this...

> So this patchset just tries to solve such a problem: if the error is
> consumed in user-space and the error occurs on a clean page, you can
> directly drop the memory page without killing the process.
>
> If the corrupted page is clean, just drop it and return to user-space
> without side effects. And if the corrupted page is dirty, memory_failure()
> will send SIGBUS with code=BUS_MCEERR_AR. While without this patchset,
> do_sea() will just send SIGBUS, so the process is killed in the same place.

... but this happens too. I agree it's something we should fix, but I don't
think this is the best way to do it.

This series is pulling the memory-failure-queue details back into the
arch-code to build a second list, that gets processed as extra work when we
return to user-space.

The root of the issue is ghes_notify_sea() claims the notification as
something APEI has dealt with, ... but it hasn't done it yet. The signals
will be generated by something currently stuck in a queue. (Evidently x86
doesn't handle synchronous errors like this using firmware-first).

I think a smaller fix is to give the queues that may be holding the
memory_failure() work a kick as part of the code that calls
ghes_notify_sea(). This means that by the time we return to do_sea()
ghes_notify_sea()'s claim that APEI has dealt with it is true as any
generated signals are pending. We can then skip the existing SIGBUS
generation code.

> Because memory_failure() may sleep, we can not call it directly in SEA

(this one is more serious, I've attempted to fix it by moving all NMI-like
GHES-notifications to use the estatus queue [0]).

> exception context. So we saved the faulting physical address associated
> with the process in the ghes handler and set __TIF_SEA_NOTIFY. When we
> return from the SEA exception context and get into do_notify_resume()
> before the process runs, we can check it and call memory_failure() to do
> the recovery.
> It's safe, because we are in process context.

I think this is the trick. When we take a Synchronous-external-abort out of
userspace, we're in process context too. We can add helpers to drain the
memory_failure_queue which can be called from do_sea() when we know we're
preemptible and interrupts-et-al are unmasked.

Thanks,

James

[0] https://www.spinics.net/lists/linux-acpi/msg80149.html

> ---
>  arch/arm64/Kconfig                   |  11 +++
>  arch/arm64/include/asm/ras.h         |  23 ++
>  arch/arm64/include/asm/thread_info.h |   4 +-
>  arch/arm64/kernel/Makefile           |   1 +
>  arch/arm64/kernel/ras.c              | 142 +++
>  arch/arm64/kernel/signal.c           |   7 ++
>  arch/arm64/mm/fault.c                |  27 +-
>  drivers/acpi/apei/ghes.c             |   8 +-
>  include/acpi/ghes.h                  |   3 +
>  9 files changed, 216 insertions(+), 10 deletions(-)
>  create mode 100644 arch/arm64/include/asm/ras.h
>  create mode 100644 arch/arm64/kernel/ras.c
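A sketch of what such a helper could look like; memory_failure_queue_kick()
is a hypothetical name here (the queue is per-cpu, so the caller records
which CPU the work was queued on), and the point is only that the queued work
is flushed while we are preemptible, before do_sea() returns:

	/*
	 * Hypothetical helper: drain the memory_failure() work APEI queued
	 * on @cpu, so any signals are pending before we return to
	 * user-space. Must be called from process context with interrupts
	 * unmasked, as memory_failure() may sleep.
	 */
	void ghes_drain_memory_failure_queue(int cpu)
	{
		if (!IS_ENABLED(CONFIG_MEMORY_FAILURE))
			return;

		memory_failure_queue_kick(cpu);
	}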
Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl
Hi gengdongjiu,

On 24/01/18 20:06, gengdongjiu wrote:
>> On 06/01/18 16:02, Dongjiu Geng wrote:
>>> The ARM64 RAS SError Interrupt (SEI) syndrome value is specific to the
>>> guest and user space needs a way to tell KVM this value. So we add a new
>>> ioctl. Before user space specifies the Exception Syndrome Register (ESR),
>>> it first checks whether KVM has the capability to set the guest ESR. If
>>> it does, it will set it; otherwise there is nothing to do.
>>>
>>> This ESR specifying is only supported for AArch64, not AArch32.
>>
>> After this patch user-space can trigger an SError in the guest. If it
>> wants to migrate the guest, how does the pending SError get migrated?
>>
>> I think we need to fix migration first. Andrew Jones suggested using
>> KVM_GET/SET_VCPU_EVENTS:
>> https://www.spinics.net/lists/arm-kernel/msg616846.html
>>
>> Given KVM uses kvm_inject_vabt() on v8.0 hardware too, we should cover
>> systems without the v8.2 RAS Extensions with the same API. I think this
>> means a bit to read/write whether SError is pending, and another to
>> indicate the ESR should be set/read.
>> CPUs without the v8.2 RAS Extensions can reject pending-SError that had
>> an ESR.
>
> For the CPUs without the v8.2 RAS Extensions, the ESR is always 0, we can
> only inject an SError with ESR 0 to the guest, we cannot set its ESR.

0? It's always implementation-defined. On Juno it seems to be always-0, but
other systems may behave differently. (Juno may generate another ESR value
when I'm not watching it...)

Just because we can't control the ESR doesn't mean injecting an SError isn't
something user-space may want to do.

If we tackle migration of pending-SError first, I think that will give us the
API to create a new pending SError with/without an ESR as appropriate.

> About using KVM_GET/SET_VCPU_EVENTS, I will check the code, and consider
> your suggestion at the same time.

(Not my suggestion, it was Andrew Jones' idea.)

> The IOCTL KVM_GET/SET_VCPU_EVENTS is already used by x86.

We would be re-using the struct to have values with slightly different
meanings. But for migration the upshot is the same: call KVM_GET_VCPU_EVENTS
on one system, and pass the struct to KVM_SET_VCPU_EVENTS on the new system.
If we're lucky Qemu may be able to do this in shared x86/arm64 code.

>>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
>>> index 5c7f657..738ae90 100644
>>> --- a/arch/arm64/kvm/guest.c
>>> +++ b/arch/arm64/kvm/guest.c
>>> @@ -277,6 +277,11 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu
>>> *vcpu,
>>>  	return -EINVAL;
>>>  }
>>>
>>> +int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome)
>>> +{
>>> +	return -EINVAL;
>>> +}
>>
>> Does nothing in the patch that adds the support? This is a bit odd.
>> (oh, it's hiding in patch 6...)
>
> To make this patch simple and small, I added it in patch 6.

But that made the functionality of this patch: a new API to return -EINVAL
from the kernel. Swapping the patches round would have avoided this.

Regardless, I think this will fold out in a rebase.

Thanks,

James
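For illustration, one possible shape for an arm64 version of the struct,
assuming we mirror x86's ioctl but only carry SError state; this is a sketch
of the suggestion, not an agreed ABI:

	/* Hypothetical arm64 layout for KVM_GET/SET_VCPU_EVENTS */
	struct kvm_vcpu_events {
		struct {
			__u8 serror_pending;	/* an SError is pending */
			__u8 serror_has_esr;	/* serror_esr is valid */
			__u8 pad[6];
			__u64 serror_esr;	/* CPUs without the v8.2 RAS
						   Extensions reject this */
		} exception;
		__u32 reserved[12];
	};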
Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8
Hi gengdongjiu,

On 23/01/18 09:23, gengdongjiu wrote:
> On 2018/1/23 3:39, James Morse wrote:
>> gengdongjiu wrote:
>>> This error source parsing and handling method is similar to the SEA.
>>
>> There are problems with doing this:
>>
>> Oct. 18, 2017, 10:26 a.m. James Morse wrote:
>> | How do SEA and SEI interact?
>> |
>> | As far as I can see they can both interrupt each other, which isn't
>> | something the single in_nmi() path in APEI can handle. I think we should
>> | fix this first.
>>
>> [..]
>>
>> | SEA gets away with a lot of things because it's synchronous. SEI isn't.
>> | Xie XiuQi pointed to the memory_failure_queue() code. We can use this
>> | directly from SEA, but not SEI. (what happens if an SError arrives while
>> | we are queueing memory_failure work from an IRQ).
>> |
>> | The one that scares me is the trace-point reporting stuff. What happens
>> | if an SError arrives while we are enabling a trace point? (these are
>> | static-keys right?)
>> |
>> | I don't think we can just plumb SEI in like this and be done with it.
>> | (I'm looking at teasing out the estatus cache code from being x86:NMI
>> | only. This way we solve the same 'can't do this from NMI context' with
>> | the same code.)
>>
>> I will post what I've got for this estatus-cache thing as an RFC, it's
>> not ready to be considered yet.

> Yes, I know you are doing that. Your series will consider all the above
> things, right?

Assuming I got it right, yes. It currently makes the race Xie XiuQi spotted
worse, which I want to fix too. (details on the cover letter)

> If your patch can consider all that, this patch can be based on your
> patchset. thanks.

I'd like to pick these patches onto the end of that series, but first I want
to know what NOTIFY_SEI means for any OS. The ACPI spec doesn't say, and
because it's asynchronous, route-able and mask-able, there are many more
corners than NOTIFY_SEA.

This thing is a notification using an emulated SError exception. (emulated
because physical-SError must be routed to EL3 for firmware-first, and
virtual-SError belongs to EL2).

Does your firmware emulate SError exactly as the TakeException() pseudo code
in the Arm-Arm? Is the emulated SError routed following the routing rules for
HCR_EL2.{AMO, TGE}? What does your firmware do when it wants to emulate
SError but it's masked? (e.g.1: The physical-SError interrupted EL2 and the
SPSR shows EL2 had PSTATE.A set. e.g.2: The physical-SError interrupted EL2
but HCR_EL2 indicates the emulated SError should go to EL1. This effectively
masks SError.)

Answers to these let us determine whether a bug is in the firmware or the
kernel. If firmware is expecting the OS to do something special, I'd like to
know about it from the beginning!

>>> Expose API ghes_notify_sei() to external users. External modules can
>>> call this exposed API to parse the APEI table and handle the SEI
>>> notification.
>>
>> external modules? You mean called by the arch code when it gets this
>> NOTIFY_SEI?
> yes, called by kernel ARCH code, such as below, I remember I have discussed
> this with you.

Sure. The phrase 'external modules' usually means the '.ko' files that live
in /lib/modules, nothing outside the kernel tree should be doing this stuff.

Thanks,

James
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi gengdongjiu,

On 21/01/18 02:45, gengdongjiu wrote:
> For the ESR_ELx_AET_UER, this exception is precise, so closing the VM may
> be better[1].
> But if you think panic is better until we support kernel-first, that is
> also OK with me.

I'm not convinced an SError while a guest was running means only guest memory
could be affected. Mechanisms like KSM mean the error could affect multiple
guests.

Both firmware-first and kernel-first will give us the address, with which we
can know which processes are affected, isolate the memory and signal affected
processes. Until we have one of these, panic() is the only way we have to
contain an error, but it's an interim fix.

Not panic()ing the host for an error that should be contained to the guest is
a fudge, we don't actually know it's safe (KSM, page-table etc). I want to
improve on this with {firmware, kernel}-first support (or both!), I don't
want to expose that this is happening to user-space, as once we have one of
{firmware, kernel}-first, it shouldn't happen.

>> This is inventing something new for RAS errors not claimed by
>> firmware-first. If we have kernel-first too, this will never happen.
>> (unless your system is losing the error description).
>
> In fact, if we have kernel-first, I think we still need to judge the error
> type by ESR, right?

The kernel-first mechanism should consider the ESR/FAR, yes, but once the
error has been claimed and handled, KVM shouldn't care about any of these
values. (maybe we'll sanity check for uncontained errors, just in case the
error escaped the RAS code...)

My point here was exposing 'unhandled' (ignored) RAS errors to user-space
creates an ABI: someone will complain once we start handling the error, and
they no longer get a notification via this 'unhandled' interface. Code
written to use this interface becomes useless/untested.

> If handle_guest_sei() fails, maybe the system does not support
> firmware-first, so we judge the ESR value,

...and panic()/ignore as appropriate. I agree not all systems will support
firmware-first, (big-endian is the obvious example), but if we get
kernel-first support this ESR guessing can disappear, I'm against exposing it
to user-space in the meantime.

Thanks,

James
Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization
Hi gengdongjiu,

On 16/12/17 03:44, gengdongjiu wrote:
> On 2017/12/16 2:52, James Morse wrote:
>>> signal, it will record the CPER and trigger an IRQ to notify the guest,
>>> as shown below:
>>>
>>> SIGBUS_MCEERR_AR triggers a Synchronous External Abort.
>>> SIGBUS_MCEERR_AO triggers a GPIO IRQ.
>>>
>>> For SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specified the
>>> trigger method, neither of which involves _triggering_ an SError.
>>
>> It's a policy choice. How does your virtual CPU notify RAS errors to its
>> virtual software? You could use SError for SIGBUS_MCEERR_AR, it depends on
>> what type of CPU you are trying to emulate.
>>
>> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems
>> where the guest doesn't take the SError immediately, instead tries to
>> re-execute the
> I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.
>> code KVM has unmapped from stage2 because it's corrupt. (You could detect
>> this happening in Qemu and try something else)
> For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR?

Sorry, that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A
set, in which case the CPU will patiently wait for it to unmask, (or consume
it with an ESB-instruction), before delivering the notification. The guest
may not have unmasked SError because it's hammering the same page, taking the
same fault again and again. Pending the asynchronous notification and
re-running the vcpu doesn't guarantee progress will be made.

In this case user-space can spot that it has pended an asynchronous
notification (for the same address!) more than once in the last few seconds,
and try something else, like firing a guest:reboot watchdog on another CPU.

> In the current implementation, it seems we only have this case of "KVM has
> unmapped from stage2", do you think we have something else?

I'm wary that this only works for errors where we know the guest PC accessed
the faulting location. The arch code will send this signal too if user-space
touches the PG_poisoned page. (I recall you checked Qemu spots this case and
acts differently). Migration is the obvious example for Qemu read/writing
guest memory. On x86 the MachineCheck code sends these signals too, so our
kernel-first implementation may do the same.

As a response to a RAS error notified by synchronous-external-abort, this is
fine. But we need to remember '_AR' implies the error is related to the code
the signal interrupted, which wouldn't be true for an error notified by
SError.

>> Synchronous/asynchronous external abort matters to the CPU, but once the
>> error has been notified to software the reasons for this distinction
>> disappear. Once the error has been handled, all trace of this distinction
>> is gone.
>>
>> CPER records only describe component failures. You are trying to re-create
>> some state that disappeared with one of the firmware-first abstractions.
>> Trying to re-create this information isn't worth the effort as the
>> distinction doesn't matter to linux, only to the CPU.
>>
>>> so there is no chance for Qemu to trigger the SError when it gets
>>> SIGBUS_MCEERR_A{O,R}.
>>
>> You mean there is no reason for Qemu to trigger an SError when it gets a
>> signal from the kernel.
>>
>> The reasons the CPU might have to generate an SError don't apply to linux
>> and KVM user space. User-space will never get a signal for an uncontained
>> error, we will always panic(). We can't give user-space a signal for
>> imprecise exceptions, as it can't return from the signal. The classes of
>> error that are left are covered by polled/irq and NOTIFY_SEA.
>>
>> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really
>> wants to, (but I don't think you should, the kernel may have unmapped the
>> page at PC from stage2 due to corruption).
> yes, you also said you do not want to generate RAS SErrors for
> SIGBUS_MCEERR_AR, so Qemu does not know in which condition to generate RAS
> SErrors.

There are two things going on here, firstly the guest may have masked
PSTATE.A, and be hammering an unmapped page. (this is the 'sorry, that was
unclear' case above). This would happen if the exception-entry code or stack
became corrupt when an exception was taken.

The second is what does existing non-RAS-aware software do? For SError it
panic()s, whereas for synchronous external abort there are some cases that
can be handled. (e.g. on linux: synchronous external abort from user-space).
Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8
Hi Dongjiu Geng,

(versions of patches 1, 2 and 4 have been queued by Catalin)

(Nit: 'ACPI / APEI:' is the normal subject prefix for ghes.c, this helps the
maintainers know which patches they need to pay attention to when you are
touching multiple trees)

On 06/01/18 16:02, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension. This extension adds
> the SEI (SError Interrupt) notification type. This patch adds new GHES
> error source SEI handling functions.

This reads as if this patch is handling SError RAS notifications generated by
a CPU with the RAS extensions. These are about CPU->Software notifications.
APEI and GHES are a firmware-first mechanism, which is Software->Software.
Reading the v8.2 documents won't help anyone with the APEI/GHES code.

Please describe this from the ACPI view, "ACPI 6.x adds support for
NOTIFY_SEI as a GHES notification mechanism... ", it's up to the arch code to
spot a v8.2 RAS Error based on the cpu caps.

> This error source parsing and handling method is similar to the SEA.

There are problems with doing this:

Oct. 18, 2017, 10:26 a.m. James Morse wrote:
| How do SEA and SEI interact?
|
| As far as I can see they can both interrupt each other, which isn't
| something the single in_nmi() path in APEI can handle. I think we should
| fix this first.

[..]

| SEA gets away with a lot of things because it's synchronous. SEI isn't.
| Xie XiuQi pointed to the memory_failure_queue() code. We can use this
| directly from SEA, but not SEI. (what happens if an SError arrives while
| we are queueing memory_failure work from an IRQ).
|
| The one that scares me is the trace-point reporting stuff. What happens if
| an SError arrives while we are enabling a trace point? (these are
| static-keys right?)
|
| I don't think we can just plumb SEI in like this and be done with it.
| (I'm looking at teasing out the estatus cache code from being x86:NMI
| only. This way we solve the same 'can't do this from NMI context' with the
| same code.)

I will post what I've got for this estatus-cache thing as an RFC, it's not
ready to be considered yet.

> Expose API ghes_notify_sei() to external users. External modules can call
> this exposed API to parse the APEI table and handle the SEI notification.

external modules? You mean called by the arch code when it gets this
NOTIFY_SEI?

Thanks,

James
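Presumably the patch's ghes_notify_sei() mirrors the existing
ghes_notify_sea(); a sketch of that shape, assuming a ghes_sei list
maintained the same way as ghes_sea (the list name is an assumption):

	/* Walk the SEI error sources, asking each to process its records */
	int ghes_notify_sei(void)
	{
		struct ghes *ghes;
		int ret = -ENOENT;

		rcu_read_lock();
		list_for_each_entry_rcu(ghes, &ghes_sei, list) {
			if (!ghes_proc(ghes))
				ret = 0;	/* someone claimed it */
		}
		rcu_read_unlock();

		return ret;
	}

The concerns above are about this walk being entered from an NMI-like
context, and about SEA/SEI interrupting each other part-way through it.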
Re: [PATCH v2 11/11] arm64: Implement branch predictor hardening for affected Cortex-A CPUs
Hi Marc, Will,

(SOB-chain suggests a missing From: tag on this and patch 7)

On 05/01/18 13:12, Will Deacon wrote:
> Cortex-A57, A72, A73 and A75 are susceptible to branch predictor aliasing
> and can theoretically be attacked by malicious code.
>
> This patch implements a PSCI-based mitigation for these CPUs when
> available. The call into firmware will invalidate the branch predictor
> state, preventing any malicious entries from affecting other victim
> contexts.
>
> Signed-off-by: Marc Zyngier
> Signed-off-by: Will Deacon

> diff --git a/arch/arm64/kernel/bpi.S b/arch/arm64/kernel/bpi.S
> index 06a931eb2673..2e9146534174 100644
> --- a/arch/arm64/kernel/bpi.S
> +++ b/arch/arm64/kernel/bpi.S
> @@ -53,3 +53,27 @@ ENTRY(__bp_harden_hyp_vecs_start)
>  	vectors __kvm_hyp_vector
>  	.endr
>  ENTRY(__bp_harden_hyp_vecs_end)
> +ENTRY(__psci_hyp_bp_inval_start)
> +	sub	sp, sp, #(8 * 18)

Where does 18 come from? Isn't this storing 9 sets of 16 bytes?

> +	stp	x16, x17, [sp, #(16 * 0)]
> +	stp	x14, x15, [sp, #(16 * 1)]
> +	stp	x12, x13, [sp, #(16 * 2)]
> +	stp	x10, x11, [sp, #(16 * 3)]
> +	stp	x8, x9, [sp, #(16 * 4)]
> +	stp	x6, x7, [sp, #(16 * 5)]
> +	stp	x4, x5, [sp, #(16 * 6)]
> +	stp	x2, x3, [sp, #(16 * 7)]
> +	stp	x0, x1, [sp, #(18 * 8)]

16->18 typo?

> +	mov	x0, #0x84000000
> +	smc	#0
> +	ldp	x16, x17, [sp, #(16 * 0)]
> +	ldp	x14, x15, [sp, #(16 * 1)]
> +	ldp	x12, x13, [sp, #(16 * 2)]
> +	ldp	x10, x11, [sp, #(16 * 3)]
> +	ldp	x8, x9, [sp, #(16 * 4)]
> +	ldp	x6, x7, [sp, #(16 * 5)]
> +	ldp	x4, x5, [sp, #(16 * 6)]
> +	ldp	x2, x3, [sp, #(16 * 7)]
> +	ldp	x0, x1, [sp, #(18 * 8)]
> +	add	sp, sp, #(8 * 18)

(and here?)

> +ENTRY(__psci_hyp_bp_inval_end)

Thanks,

James
Re: [PATCH v2 07/11] arm64: Add skeleton to harden the branch predictor against aliasing attacks
Hi Will, Marc,

On 05/01/18 13:12, Will Deacon wrote:
> Aliasing attacks against CPU branch predictors can allow an attacker to
> redirect speculative control flow on some CPUs and potentially divulge
> information from one context to another.
>
> This patch adds initial skeleton code behind a new Kconfig option to
> enable implementation-specific mitigations against these attacks for
> CPUs that are affected.

[...]

> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 6f7bdb89817f..6dd83d75b82a 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -41,6 +41,43 @@ static inline bool arm64_kernel_unmapped_at_el0(void)
> +static inline struct bp_hardening_data *arm64_get_bp_hardening_data(void)
> +{
> +	return this_cpu_ptr(&bp_hardening_data);
> +}
> +
> +static inline void arm64_apply_bp_hardening(void)
> +{
> +	struct bp_hardening_data *d;
> +
> +	if (!cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR))
> +		return;
> +
> +	d = arm64_get_bp_hardening_data();
> +	if (d->fn)
> +		d->fn();
> +}

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 22168cd0dde7..5203b6040cb6 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -318,6 +318,7 @@ static void __do_user_fault(struct task_struct *tsk,
> unsigned long addr,
>  		lsb = PAGE_SHIFT;
>  	si.si_addr_lsb = lsb;
>
> +	arm64_apply_bp_hardening();

Due to the this_cpu_ptr() call:

| BUG: using smp_processor_id() in preemptible [] code: print_my_pa/2093
| caller is debug_smp_processor_id+0x1c/0x24
| CPU: 0 PID: 2093 Comm: print_my_pa Tainted: G W 4.15.0-rc3-00044-g7f0aaec94f27-dirty #8950
| Call trace:
|  dump_backtrace+0x0/0x164
|  show_stack+0x14/0x1c
|  dump_stack+0xa4/0xdc
|  check_preemption_disabled+0xfc/0x100
|  debug_smp_processor_id+0x1c/0x24
|  __do_user_fault+0xcc/0x180
|  do_page_fault+0x14c/0x364
|  do_translation_fault+0x40/0x48
|  do_mem_abort+0x40/0xb8
|  el0_da+0x20/0x24

Make it a TIF flag?

(Seen with arm64's kpti-base tag and this series)

>  	force_sig_info(sig, &si, tsk);
>  }

Thanks,

James
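One way to make the per-cpu access legal, sketched under the assumption that
the callback is safe to run with preemption disabled (the TIF-flag suggestion
above is an alternative; this is not necessarily the fix that was merged):

	static inline void arm64_apply_bp_hardening(void)
	{
		struct bp_hardening_data *d;

		if (!cpus_have_const_cap(ARM64_HARDEN_BRANCH_PREDICTOR))
			return;

		/* Stay on this CPU while we use its per-cpu data */
		preempt_disable();
		d = arm64_get_bp_hardening_data();
		if (d->fn)
			d->fn();
		preempt_enable();
	}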
Re: [PATCH -next] firmware: arm_sdei: Fix return value check in sdei_present_dt()
Hi Wei,

On 15/01/18 10:41, Wei Yongjun wrote:
> In case of error, the function of_platform_device_create() returns a NULL
> pointer, not ERR_PTR(). The IS_ERR() test in the return value check should
> be replaced with a NULL test.

Bother, so it does! Thanks for catching this.

Acked-by: James Morse

Thanks,

James
Re: [PATCH 07/11] signal/arm64: Document conflicts with SI_USER and SIGFPE, SIGTRAP, SIGBUS
Hi Dave,

Thanks for going through all these,

On 15/01/18 16:30, Dave Martin wrote:
> On Thu, Jan 11, 2018 at 06:59:36PM -0600, Eric W. Biederman wrote:
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index 9b7f89df49db..abe200587334 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -596,7 +596,7 @@ static int do_sea(unsigned long addr, unsigned int
>> esr, struct pt_regs *regs)
>> +	{ do_sea, SIGBUS, BUS_FIXME, "synchronous external abort" },
>
> This si_code seems to be a fallback for if ACPI is absent or doesn't
> know what to do with this error.
>
> -> SIGBUS/BUS_OBJERR?
>
> Can probably legitimately happen for userspace for suitable MMIO mappings.

It can happen for normal memory too, there are specific ESR values for
parity/checksum errors when read/writing memory. I think this first one is
'other/unknown', and it's up to the CPU how to classify them.

> Perhaps it's more serious though in the presence of ACPI. Do we expect
> that ACPI can diagnose all localisable errors?

It's not just ACPI, the CPU's v8.2 RAS Extensions use this
synchronous-external-abort as notification of a RAS error, (the other details
are written to memory-mapped nodes).

With the v8.2 RAS Extensions the ESR tells us if the error was contained. For
ACPI we rely on firmware to set an appropriate severity in the CPER records
generated by firmware. The APEI helpers will call panic() if they find a
fatal error.

For systems with neither {firmware,kernel}-first RAS, BUS_OBJERR looks like a
good choice.

>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 0 (translation table walk)" },
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 1 (translation table walk)" },
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 2 (translation table walk)" },
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 3 (translation table walk)" },
>
> Pagetable screwup or kernel/system/CPU bug -> SIGKILL, or panic().

(RAS mechanisms may claim this and send their own signals, if not:)
SIGKILL is probably a better choice here, while we do have an address, there
is nothing user-space can do about it.

>> +	{ do_sea, SIGBUS, BUS_FIXME, "synchronous parity or ECC error" },
>> 	// Reserved when RAS is implemented
>
> Possibly SIGBUS/BUS_MCEERR_AR (though I don't know exactly what
> userspace is supposed to do with this or whether this implies the
> existence or certain kernel features for managing the error that
> may not be present on arm64...)

I'd like to keep the MCEERR signals for errors that we know are contained,
that the kernel has understood and handled. (These features do exist for
arm64, enabling CONFIG_MEMORY_FAILURE and a few APEI options allows all this
to work today with suitable firmware. My Seattle claims to support it).

> Otherwise, SIGKILL.

Sounds good,

>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 0 synchronous parity error
>> (translation table walk)" },	// Reserved when RAS is implemented
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 1 synchronous parity error
>> (translation table walk)" },	// Reserved when RAS is implemented
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 2 synchronous parity error
>> (translation table walk)" },	// Reserved when RAS is implemented
>> +	{ do_sea, SIGBUS, BUS_FIXME, "level 3 synchronous parity error
>> (translation table walk)" },	// Reserved when RAS is implemented
>
> Process page tables corrupt: if the kernel couldn't fix this, the
> process can't reasonably fix it -> SIGKILL
>
> Since this is a RAS-type error it could be triggered by a cosmic ray
> rather than requiring a kernel or system bug or other major failure, so
> we probably shouldn't panic the system if the error is localisable to a
> particular process.

Without the RAS-Extensions severity to tell us the error is contained, I'm
not sure what we can expect. But given the page-tables are per-process, and
we never swap them to disk etc, it's probably a safe bet that it doesn't
matter either way for these.

Thanks,

James
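Pulling those suggestions together, the rows discussed above might end up
something like this (a sketch of where the discussion points, not the final
table):

	/* Hypothetical fault_info rows with the si_codes suggested above */
	{ do_sea, SIGBUS, BUS_OBJERR, "synchronous external abort" },
	{ do_sea, SIGKILL, SI_KERNEL, "level 0 (translation table walk)" },
	{ do_sea, SIGKILL, SI_KERNEL, "level 1 (translation table walk)" },
	{ do_sea, SIGKILL, SI_KERNEL, "level 2 (translation table walk)" },
	{ do_sea, SIGKILL, SI_KERNEL, "level 3 (translation table walk)" },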
Re: [PATCH 1/5] arm64: entry: isb in el1_irq
Hi Yury,

On 05/04/18 18:17, Yury Norov wrote:
> The kernel text patching framework relies on an IPI to ensure that other
> SMP cores observe the change. The target core calls isb() in the IPI
> handler

(Odd, if it's just to synchronize the CPU, taking the IPI should be enough).

> path, but not at the beginning of the el1_irq entry. There's a chance
> that the modified instruction will appear prior to the isb(), and so will
> not be observed.
>
> This patch inserts an isb early in the el1_irq entry to avoid that chance.

> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index ec2ee720e33e..9c06b4b80060 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -593,6 +593,7 @@ ENDPROC(el1_sync)
>
>  	.align 6
>  el1_irq:
> +	isb				// pairs with aarch64_insn_patch_text
>  	kernel_entry 1
>  	enable_da_f
>  #ifdef CONFIG_TRACE_IRQFLAGS
>

An ISB at the beginning of the vectors? This is odd, taking an IRQ to get in
here would be a context-synchronization-event too, so the ISB is superfluous.

The ARM-ARM has a list of 'Context-Synchronization events' (Glossary on page
6480 of DDI0487B.b), paraphrasing:
 * ISB
 * Taking an exception
 * ERET
 * (...loads of debug stuff...)

Thanks,

James
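For context, the IPI-handler path being referred to is the stop_machine()
callback in arch/arm64/kernel/insn.c, which at the time looked roughly like
this (reproduced from memory, treat the details as approximate):

	static int __kprobes aarch64_insn_patch_text_cb(void *arg)
	{
		int i, ret = 0;
		struct aarch64_insn_patch *pp = arg;

		/* The first CPU becomes master */
		if (atomic_inc_return(&pp->cpu_count) == 1) {
			for (i = 0; ret == 0 && i < pp->insn_cnt; i++)
				ret = aarch64_insn_patch_text_nosync(pp->text_addrs[i],
								     pp->new_insns[i]);
			/* Let the waiting CPUs go */
			atomic_inc(&pp->cpu_count);
		} else {
			while (atomic_read(&pp->cpu_count) <= num_online_cpus())
				cpu_relax();
			/* The isb() the patch's comment pairs with */
			isb();
		}

		return ret;
	}

Every secondary CPU executes the isb() before returning from the IPI, and the
exception return itself is a context-synchronization-event, which is the
point above about the extra ISB in el1_irq being superfluous.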