[RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
From: Barry Song

On x86, batched and deferred tlb shootdown has led to a 90% performance
increase in tlb shootdown. On arm64, HW can do tlb shootdown without
software IPI, but a sync tlbi is still quite expensive.

Even running the simplest program which requires swapout can prove this
is true:

  #include <sys/types.h>
  #include <unistd.h>
  #include <sys/mman.h>
  #include <string.h>

  int main()
  {
  #define SIZE (1 * 1024 * 1024)
          volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

          memset(p, 0x88, SIZE);

          for (int k = 0; k < 1; k++) {
                  /* swap in */
                  for (int i = 0; i < SIZE; i += 4096) {
                          (void)p[i];
                  }

                  /* swap out */
                  madvise(p, SIZE, MADV_PAGEOUT);
          }
  }

Perf result on snapdragon 888 with 8 cores, using zRAM as the swap
block device:

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ......................................
 #
     21.07%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irq
      8.23%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
      6.67%  a.out    [kernel.kallsyms]  [k] filemap_map_pages
      6.16%  a.out    [kernel.kallsyms]  [k] __zram_bvec_write
      5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush
      3.71%  a.out    [kernel.kallsyms]  [k] _raw_spin_lock
      3.49%  a.out    [kernel.kallsyms]  [k] memset64
      1.63%  a.out    [kernel.kallsyms]  [k] clear_page
      1.42%  a.out    [kernel.kallsyms]  [k] _raw_spin_unlock
      1.26%  a.out    [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
      1.23%  a.out    [kernel.kallsyms]  [k] xas_load
      1.15%  a.out    [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the page is
mapped by multiple processes (typically more than 100 on a phone), the
overhead would be much higher, as we have to run the tlb flush 100
times for one single page. Plus, tlb flush overhead will increase with
the number of CPU cores due to the bad scalability of tlb shootdown in
HW, so ARM64 servers should expect much higher overhead.

Further perf annotate shows that 95% of the cpu time of
ptep_clear_flush() is actually spent in the final dsb() waiting for
the completion of the tlb flush. This gives us a very good chance to
leverage the existing batched tlb support in the kernel: the minimal
modification is to only send an async tlbi in the first stage, and
only send the dsb when we have to sync in the second stage.

With the above simple micro benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o patch:

 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed

w/ patch:

 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation. Tested with the
benchmark in the commit on a Kunpeng920 arm64 server, an improvement
of around 12.5% is observed with `time ./swap_bench`:

        w/o             w/
 real   0m13.460s       0m11.771s
 user   0m0.248s        0m0.279s
 sys    0m12.039s       0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was noticed, which
has been eliminated by this patch:

 [root@localhost yang]# perf record -- ./swap_bench && perf report
 [...]
     16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It has been tested on 4-, 8- and 128-CPU platforms and shows to be
beneficial on large systems, but may not bring an improvement on small
systems like a 4-CPU platform. So make
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT for this
stage, and add a runtime tunable to allow disabling it according to
the scenario.

This patch also improves the performance of page migration. Using
pmbench and migrating the pages of pmbench between node 0 and node 1
for 20 times, this patch decreases the time used by more than 50% and
saves the time used by ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take the address of the
target page, to support the feature on arm64. It also renames it to
arch_tlbbatch_add_pending() to better match its function, since we
don't need to handle the mm on arm64 and add_mm is not a proper name.
add_pending makes sense for both: on x86 we're pending the TLB flush
operations, while on arm64 we're pending the synchronize operations.
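[Editor's note: the arm64 diff itself is cut off in this archive. A
minimal sketch of the two hooks the commit message describes might look
like the following; this is an illustration, not the patch itself. Here
__flush_tlb_page_nosync() stands for an assumed mm-based variant of
arm64's existing flush_tlb_page_nosync(), and the real patch
additionally gates batching behind a CPU-count threshold, the runtime
tunable and the ARM64_WORKAROUND_REPEAT_TLBI erratum check.]

static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
					     struct mm_struct *mm,
					     unsigned long uaddr)
{
	/* stage 1: issue the broadcast invalidate, but do not wait for it */
	__flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
	/* stage 2: a single barrier waits for all pending invalidates at once */
	dsb(ish);
}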
[RESEND PATCH v9 0/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration
From: Yicong Yang

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simple micro benchmark shows that even on
snapdragon 888 with only 8 cores, the overhead of ptep_clear_flush()
is huge even for paging out one page mapped by only one process:

     5.36%  a.out    [kernel.kallsyms]  [k] ptep_clear_flush

While pages are mapped by multiple processes or the HW has more CPUs,
the cost should become even higher due to the bad scalability of tlb
shootdown. The same benchmark results in 16.99% CPU consumption on an
ARM64 server with around 100 cores, according to Yicong's test on
patch 2/2.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by:
1. only sending tlbi instructions in the first stage -
   arch_tlbbatch_add_mm()
2. waiting for the completion of tlbi by dsb while doing tlbbatch
   sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush() is
removed by the patchset. The micro benchmark becomes 5% faster even
for one page mapped by a single process on snapdragon 888. This
support also optimizes page migration by more than 50% with the
support of batched TLB flushing [*].

[*] https://lore.kernel.org/linux-mm/20230213123444.155149-1-ying.hu...@intel.com/

-v9:
1. Use a runtime tunable to control batched TLB flush, per Catalin in
   v7. Sorry for missing this in v8.
Link: https://lore.kernel.org/all/20230329035512.57392-1-yangyic...@huawei.com/

-v8:
1. Rebase on 6.3-rc4
2. Tested the optimization on page migration and mentioned it in the
   commit
3. Thanks for the review from Anshuman.
Link: https://lore.kernel.org/linux-mm/20221117082648.47526-1-yangyic...@huawei.com/

-v7:
1. Rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as
   suggested, since it takes an extra address for arm64, per Nadav and
   Anshuman. Also mentioned in the commit.
2. Add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyic...@huawei.com/

-v6:
1. Comment that we don't defer TLB flush on platforms affected by
   ARM64_WORKAROUND_REPEAT_TLBI
2. Use cpus_have_const_cap() instead of this_cpu_has_cap()
3. Add tags from Punit, thanks.
4. Enable the feature by default when cpus >= 8 rather than > 8, since
   the original improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyic...@huawei.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this
   stage on arm64.
2. Add a threshold of CPU numbers for enabling batched TLB flush on
   arm64.
Link: https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyic...@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, thanks.
2. Limit the TLB batch/defer to systems with >4 CPUs, per Anshuman.
3. Merge previous patches 1, 2-3 into one, per Anshuman.
Link: https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/

-v3:
1. Declare the arch's tlbbatch defer support via
   arch_tlbbatch_should_defer() instead of ARCH_HAS_MM_CPUMASK, per
   Barry and Kefeng.
2. Add Tested-by from Xin Hao.
Link: https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/

-v2:
1. Collected Yicong's test result on a Kunpeng920 ARM64 server.
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen.
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask is
   empty, according to the comments of Nadav Amit.
Thanks Peter, Dave and Nadav for your testing, reviewing and comments.

-v1: https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during
    page reclamation/migration

 .../features/vm/TLB/arch-support.txt |  2 +-
 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/tlbbatch.h    | 12 +++
 arch/arm64/include/asm/tlbflush.h    | 33 ++++++-
 arch/arm64/mm/flush.c                | 69 ++++++++++++++
 arch/x86/include/asm/tlbflush.h      | 17 +++-
 include/linux/mm_types_task.h        |  4 +-
 mm/rmap.c                            | 21 ++---
 8 files changed, 139 insertions(+), 20 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0
[RESEND PATCH v9 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()
From: Anshuman Khandual

The entire scheme of deferred TLB flush in the reclaim path rests on
the fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPIs to remote CPUs. But architectures
can have different ways to evaluate that. Hence, apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang
Reviewed-by: Anshuman Khandual
Reviewed-by: Barry Song
Reviewed-by: Xin Hao
Tested-by: Punit Agrawal
---
 arch/x86/include/asm/tlbflush.h | 12 ++++++++++++
 mm/rmap.c                       |  9 +--------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa421030..46bdff73217c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -260,6 +260,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 	flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+	bool should_defer = false;
+
+	/* If remote CPUs need to be flushed then defer batch the flush */
+	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+		should_defer = true;
+	put_cpu();
+
+	return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
 	/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 19392e090bec..b45f95ab0c04 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -688,17 +688,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-	bool should_defer = false;
-
 	if (!(flags & TTU_BATCH_FLUSH))
 		return false;
 
-	/* If remote CPUs need to be flushed then defer batch the flush */
-	if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-		should_defer = true;
-	put_cpu();
-
-	return should_defer;
+	return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0
Re: [PATCH v2 00/25] iommu: Make default_domain's mandatory
On 16/05/2023 01:00, Jason Gunthorpe wrote:
> This is on github:
> https://github.com/jgunthorpe/linux/commits/iommu_all_defdom

Tested-by: Steven Price

Works fine on my Firefly-RK3288.

Thanks,
Steve
Re: [PATCH v2 00/34] Split ptdesc from struct page
On Mon, May 01, 2023 at 12:27:55PM -0700, Vishal Moola (Oracle) wrote:
> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct
> ptdesc.
>
> This patchset introduces ptdesc, splits ptdesc from struct page, and
> converts many callers of page table constructor/destructors to use
> ptdescs.

Lightly related food for future thought - based on some discussions at
LSF/MM, it would be really nice if an end result of this was that a
rcu_head was always available in the ptdesc, so we don't need to
allocate memory to free a page table.

Jason
Re: [PATCH AUTOSEL 6.3 6/7] powerpc/fsl_uli1575: Allow to disable FSL_ULI1575 support
On Tue, May 09, 2023 at 09:18:35AM +0200, Pali Rohár wrote:
> On Tuesday 09 May 2023 17:14:48 Michael Ellerman wrote:
> > Randy Dunlap writes:
> > > Hi--
> > >
> > > Just a heads up. This patch can cause build errors.
> > > I sent a patch for these on 2023-APR-28:
> > > https://lore.kernel.org/linuxppc-dev/20230429043519.19807-1-rdun...@infradead.org/
> > >
> > > Michael, I think this is your area if I'm not mistaken.
> >
> > Yes. The fix is in my fixes branch as:
> >   536d948a8dee ("powerpc/fsl_uli1575: fix kconfig warnings and build errors")
> >
> > But I don't think this commit (22fdf79171e8) really warrants going
> > to stable, it's a nice-to-have but doesn't fix any pressing bugs.
>
> Exactly. And also this patch alone without 1/8 would not work, as in
> 1/8 https://lore.kernel.org/all/20230409000812.18904-2-p...@kernel.org/
> a static inline variant of the function was added which is used when
> ULI is disabled.

I'll drop it, thanks!

-- 
Thanks,
Sasha
[PATCH v3 03/12] powerpc/dexcr: Add initial Dynamic Execution Control Register (DEXCR) support
ISA 3.1B introduces the Dynamic Execution Control Register (DEXCR). It
is a per-cpu register that allows control over various CPU behaviours,
including branch hint usage, indirect branch speculation, and
hashst/hashchk support.

Add some definitions and basic support for the DEXCR in the kernel.
Right now it just:

* Zero-initialises the DEXCR and HASHKEYR when a CPU comes online.
* Clears them in reset_sprs().
* Detects when the NPHIE aspect is supported (the others don't get
  looked at in this series, so there's no need to waste a CPU_FTR on
  them).

We initialise the HASHKEYR to ensure that all cores have the same key,
so an HV-enforced NPHIE combined with swapping cores doesn't randomly
crash a process using hash instructions. The stores to HASHKEYR are
unconditional because the ISA makes no mention of the SPR being missing
if support for doing the hashes isn't present, so all that would happen
is the HASHKEYR value gets ignored. This helps slightly if NPHIE
detection fails; e.g., we currently only detect it on pseries.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v2:	* More definitions for ptrace
v1:	* Only make a CPU feature for NPHIE. We only need to know if the
	  hashst/hashchk functionality is supported for a static DEXCR.
	* Initialise the DEXCR to 0 when each CPU comes online. Remove
	  the dexcr_init() and get_thread_dexcr() functions.
	* No longer track the DEXCR in a per-thread field.
	* Remove the made-up Opal features

---
 arch/powerpc/include/asm/book3s/64/kexec.h |  5 +++++
 arch/powerpc/include/asm/cputable.h        |  4 +++-
 arch/powerpc/include/asm/reg.h             | 10 ++++++++++
 arch/powerpc/kernel/cpu_setup_power.c      |  8 ++++++++
 arch/powerpc/kernel/prom.c                 |  1 +
 5 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kexec.h b/arch/powerpc/include/asm/book3s/64/kexec.h
index d4b9d476ecba..df37a76c1e9f 100644
--- a/arch/powerpc/include/asm/book3s/64/kexec.h
+++ b/arch/powerpc/include/asm/book3s/64/kexec.h
@@ -21,6 +21,11 @@ static inline void reset_sprs(void)
 		plpar_set_ciabr(0);
 	}
 
+	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+		mtspr(SPRN_DEXCR, 0);
+		mtspr(SPRN_HASHKEYR, 0);
+	}
+
 	/*  Do we need isync()? We are going via a kexec reset */
 	isync();
 }
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 757dbded11dc..443a9d482b15 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_RADIX_PREFETCH_BUG	LONG_ASM_CONST(0x0002000000000000)
 #define CPU_FTR_ARCH_31			LONG_ASM_CONST(0x0004000000000000)
 #define CPU_FTR_DAWR1			LONG_ASM_CONST(0x0008000000000000)
+#define CPU_FTR_DEXCR_NPHIE		LONG_ASM_CONST(0x0010000000000000)
 
 #ifndef __ASSEMBLY__
 
@@ -451,7 +452,8 @@ static inline void cpu_feature_keys_init(void) { }
 	    CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
 	    CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
 	    CPU_FTR_ARCH_300 | CPU_FTR_ARCH_31 | \
-	    CPU_FTR_DAWR | CPU_FTR_DAWR1)
+	    CPU_FTR_DAWR | CPU_FTR_DAWR1 | \
+	    CPU_FTR_DEXCR_NPHIE)
 #define CPU_FTRS_CELL	(CPU_FTR_LWSYNC | \
 	    CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
 	    CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 6372e5f55ef0..a30c5bdca568 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -382,7 +382,17 @@
 #define SPRN_HIOR	0x137	/* 970 Hypervisor interrupt offset */
 #define SPRN_RMOR	0x138	/* Real mode offset register */
 #define SPRN_HRMOR	0x139	/* Real mode offset register */
+#define SPRN_HDEXCR_RO	0x1C7	/* Hypervisor DEXCR (non-privileged, readonly) */
+#define SPRN_HASHKEYR	0x1D4	/* Non-privileged hashst/hashchk key register */
+#define SPRN_HDEXCR	0x1D7	/* Hypervisor dynamic execution control register */
+#define SPRN_DEXCR_RO	0x32C	/* DEXCR (non-privileged, readonly) */
 #define SPRN_ASDR	0x330	/* Access segment descriptor register */
+#define SPRN_DEXCR	0x33C	/* Dynamic execution control register */
+#define   DEXCR_PR_BIT(aspect)	PPC_BIT(32 + (aspect))
+#define   DEXCR_PR_SBHE		DEXCR_PR_BIT(0)	/* Speculative Branch Hint Enable */
+#define   DEXCR_PR_IBRTPD	DEXCR_PR_BIT(3)	/* Indirect Branch Recurrent Target Prediction Disable */
+#define   DEXCR_PR_SRAPD	DEXCR_PR_BIT(4)	/* Subroutine Return Address Prediction Disable */
+#define   DEXCR_PR_NPHIE	DEXCR_PR_BIT(5)	/* Non-Privileged Hash Instruction Enable */
 #define SPRN_IC		0x350	/* Virtual Instruction Count */
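[Editor's note: the remaining hunks (cpu_setup_power.c, prom.c) are cut
off in this archive. As a worked check on the bit definitions above,
recall that PPC_BIT() uses MSB0 numbering over 64 bits, i.e.
PPC_BIT(x) == 1UL << (63 - x), so for example:

	DEXCR_PR_NPHIE == DEXCR_PR_BIT(5) == PPC_BIT(32 + 5) == PPC_BIT(37)
	               == 1UL << 26 == 0x0000000004000000

That is, the problem-state aspects land in the low word of the 64-bit
DEXCR, and NPHIE reads back as 0x04000000 in a 32-bit userspace view.]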
[PATCH v3 04/12] powerpc/dexcr: Handle hashchk exception
Recognise and pass the appropriate signal to the user program when a
hashchk instruction triggers. This is independent of allowing
configuration of DEXCR[NPHIE], as a hypervisor can enforce this aspect
regardless of the kernel.

The signal mirrors how ARM reports their similar check failure. For
example, their FPAC handler in arch/arm64/kernel/traps.c
do_el0_fpac() does this. When we fail to read the instruction that
caused the fault we send a segfault, similar to how emulate_math()
does it.

Signed-off-by: Benjamin Gray

---

v3:	* Inline hashchk detection, remove dexcr.c and associated files
v1:	* Refactor the hashchk check to return 0 on success, an error
	  code on failure. Determine what to do based on the specific
	  error code.
	* Motivate signal and code

---
 arch/powerpc/include/asm/ppc-opcode.h |  1 +
 arch/powerpc/kernel/traps.c           | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index ca5a0da7df4e..ef6972aa33b9 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -222,6 +222,7 @@
 #define OP_31_XOP_STFSX		663
 #define OP_31_XOP_STFSUX	695
 #define OP_31_XOP_STFDX		727
+#define OP_31_XOP_HASHCHK	754
 #define OP_31_XOP_STFDUX	759
 #define OP_31_XOP_LHBRX		790
 #define OP_31_XOP_LFIWAX	855
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 9bdd79aa51cf..e59ec6d32d37 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1516,6 +1516,22 @@ static void do_program_check(struct pt_regs *regs)
 			return;
 		}
 	}
+
+	if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) && user_mode(regs)) {
+		ppc_inst_t insn;
+
+		if (get_user_instr(insn, (void __user *)regs->nip)) {
+			_exception(SIGSEGV, regs, SEGV_MAPERR, regs->nip);
+			return;
+		}
+
+		if (ppc_inst_primary_opcode(insn) == 31 &&
+		    get_xop(ppc_inst_val(insn)) == OP_31_XOP_HASHCHK) {
+			_exception(SIGILL, regs, ILL_ILLOPN, regs->nip);
+			return;
+		}
+	}
+
 	_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
 	return;
 }
-- 
2.40.1
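[Editor's note: not part of the patch, but for illustration, userspace
can observe the documented signal along these lines. This is a hedged
sketch: the handler body and exit codes are arbitrary, and actually
firing a hashchk failure requires corrupting a saved link register.]

#include <signal.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *info, void *ctx)
{
	/* do_program_check() above delivers SIGILL with si_code ==
	 * ILL_ILLOPN when a hashchk fails; si_addr points at the
	 * faulting instruction. */
	if (info->si_code == ILL_ILLOPN)
		_exit(42);
	_exit(1);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGILL, &sa, NULL);
	/* ... corrupt a saved link register, then return, to fire hashchk ... */
	return 0;
}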
[PATCH v3 00/12] Add static DEXCR support
The DEXCR is an SPR that allows control over various execution
'aspects', such as indirect branch prediction and enabling the
hashst/hashchk instructions. Further details are in ISA 3.1B Book 3
chapter 9.

This series adds static (compile time) support for initialising the
DEXCR, and basic HASHKEYR bookkeeping.

v3:	* Address Russell's comments
v2:	* Add ptrace/coredump support

Previous versions:
v2: https://lore.kernel.org/all/20230330055040.434133-1-bg...@linux.ibm.com/
v1: https://lore.kernel.org/all/20230322054612.1340573-1-bg...@linux.ibm.com/
RFC: https://lore.kernel.org/all/20221128024458.46121-1-bg...@linux.ibm.com/

Benjamin Gray (12):
  powerpc/book3s: Add missing include
  powerpc/ptrace: Add missing include
  powerpc/dexcr: Add initial Dynamic Execution Control Register (DEXCR)
    support
  powerpc/dexcr: Handle hashchk exception
  powerpc/dexcr: Support userspace ROP protection
  powerpc/dexcr: Support custom default DEXCR value
  powerpc/ptrace: Expose DEXCR and HDEXCR registers to ptrace
  powerpc/ptrace: Expose HASHKEYR register to ptrace
  Documentation: Document PowerPC kernel DEXCR interface
  selftests/powerpc: Add more utility macros
  selftests/powerpc/dexcr: Add hashst/hashchk test
  selftests/powerpc/dexcr: Add DEXCR status utility lsdexcr

 Documentation/powerpc/dexcr.rst                      |  58 +++++
 Documentation/powerpc/index.rst                      |   1 +
 arch/powerpc/Kconfig                                 |  14 ++
 arch/powerpc/include/asm/book3s/64/kexec.h           |   5 +
 arch/powerpc/include/asm/book3s/64/kup.h             |   1 +
 arch/powerpc/include/asm/cputable.h                  |   4 +-
 arch/powerpc/include/asm/ppc-opcode.h                |   1 +
 arch/powerpc/include/asm/processor.h                 |   1 +
 arch/powerpc/include/asm/reg.h                       |  10 +
 arch/powerpc/include/uapi/asm/elf.h                  |   2 +
 arch/powerpc/kernel/cpu_setup_power.c                |   9 +
 arch/powerpc/kernel/process.c                        |  17 ++
 arch/powerpc/kernel/prom.c                           |   1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h             |   6 +
 arch/powerpc/kernel/ptrace/ptrace-view.c             |  67 +++++-
 arch/powerpc/kernel/traps.c                          |  16 ++
 include/uapi/linux/elf.h                             |   2 +
 tools/testing/selftests/powerpc/Makefile             |   1 +
 tools/testing/selftests/powerpc/dexcr/.gitignore     |   2 +
 tools/testing/selftests/powerpc/dexcr/Makefile       |   9 +
 tools/testing/selftests/powerpc/dexcr/dexcr.c        | 132 +++++++++++
 tools/testing/selftests/powerpc/dexcr/dexcr.h        |  49 ++++
 tools/testing/selftests/powerpc/dexcr/hashchk_test.c | 227 +++++++++++++++++++
 tools/testing/selftests/powerpc/dexcr/lsdexcr.c      | 141 ++++++++++++
 tools/testing/selftests/powerpc/include/reg.h        |   4 +
 tools/testing/selftests/powerpc/include/utils.h      |  31 ++-
 .../selftests/powerpc/pmu/sampling_tests/misc.h      |   2 -
 tools/testing/selftests/powerpc/utils.c              |  24 ++
 28 files changed, 832 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/powerpc/dexcr.rst
 create mode 100644 tools/testing/selftests/powerpc/dexcr/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/dexcr/Makefile
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.h
 create mode 100644 tools/testing/selftests/powerpc/dexcr/hashchk_test.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/lsdexcr.c

base-commit: 547124f858ea52b2f7e58e8c0d39170a9fa66b4b
-- 
2.40.1
[PATCH v3 09/12] Documentation: Document PowerPC kernel DEXCR interface
Describe the DEXCR and document how to configure it.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v2:	* Document coredump & ptrace support
v1:	* Remove the dynamic control docs, describe the static config
	  option

This documentation is a little bare for now, but will be expanded on
when dynamic DEXCR control is added.

---
 Documentation/powerpc/dexcr.rst | 58 +++++++++++++++++++++++++++++++++
 Documentation/powerpc/index.rst |  1 +
 2 files changed, 59 insertions(+)
 create mode 100644 Documentation/powerpc/dexcr.rst

diff --git a/Documentation/powerpc/dexcr.rst b/Documentation/powerpc/dexcr.rst
new file mode 100644
index ..75f1efaa8dc5
--- /dev/null
+++ b/Documentation/powerpc/dexcr.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+===========================================
+DEXCR (Dynamic Execution Control Register)
+===========================================
+
+Overview
+========
+
+The DEXCR is a privileged special purpose register (SPR) introduced in
+PowerPC ISA 3.1B (Power10) that allows per-cpu control over several dynamic
+execution behaviours. These behaviours include speculation (e.g., indirect
+branch target prediction) and enabling return-oriented programming (ROP)
+protection instructions.
+
+The execution control is exposed in hardware as up to 32 bits ('aspects') in
+the DEXCR. Each aspect controls a certain behaviour, and can be set or cleared
+to enable/disable the aspect. There are several variants of the DEXCR for
+different purposes:
+
+DEXCR
+    A privileged SPR that can control aspects for userspace and kernel space
+HDEXCR
+    A hypervisor-privileged SPR that can control aspects for the hypervisor
+    and enforce aspects for the kernel and userspace.
+UDEXCR
+    An optional ultravisor-privileged SPR that can control aspects for the
+    ultravisor.
+
+Userspace can examine the current DEXCR state using a dedicated SPR that
+provides a non-privileged read-only view of the userspace DEXCR aspects.
+There is also an SPR that provides a read-only view of the hypervisor-enforced
+aspects, which ORed with the userspace DEXCR view gives the effective DEXCR
+state for a process.
+
+
+Kernel Config
+=============
+
+The kernel supports a static default DEXCR value determined at config time.
+Set the ``PPC_DEXCR_DEFAULT`` config to the value you want all processes to
+use.
+
+
+coredump and ptrace
+===================
+
+The userspace values of the DEXCR and HDEXCR (in this order) are exposed under
+``NT_PPC_DEXCR``. These are each 32 bits and readonly, and are intended to
+assist with core dumps. The DEXCR may be made writable in future.
+
+If the kernel config ``CONFIG_CHECKPOINT_RESTORE`` is enabled, then
+``NT_PPC_HASHKEYR`` is available and exposes the HASHKEYR value of the process
+for reading and writing. This is a tradeoff between increased security and
+checkpoint/restore support: a process should normally have no need to know its
+secret key, but restoring a process requires setting its original key. The key
+therefore appears in core dumps, and an attacker may be able to retrieve it
+from a coredump and effectively bypass ROP protection on any threads that
+share this key (potentially all threads from the same parent that have not
+run ``exec()``).
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..d33b554ca7ba 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -15,6 +15,7 @@ powerpc
     cxl
     cxlflash
     dawr-power9
+    dexcr
     dscr
     eeh-pci-error-recovery
     elf_hwcaps
-- 
2.40.1
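[Editor's note: as a concrete, non-authoritative illustration of the
read-only views described above, userspace can read them directly. The
SPR numbers are those defined earlier in this series (DEXCR_RO = 0x32C,
HDEXCR_RO = 0x1C7); this faults on hardware or kernels without DEXCR
support, so a real program should guard it, as the series' selftests
do.]

#define MFSPR(spr)					\
({							\
	unsigned long __v;				\
	asm volatile("mfspr %0, %1" : "=r"(__v) : "i"(spr)); \
	__v;						\
})

unsigned long udexcr    = MFSPR(0x32c);	/* userspace DEXCR view */
unsigned long hdexcr    = MFSPR(0x1c7);	/* hypervisor-enforced view */
unsigned long effective = udexcr | hdexcr;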
[PATCH v3 05/12] powerpc/dexcr: Support userspace ROP protection
The ISA 3.1B hashst and hashchk instructions use a per-cpu SPR HASHKEYR
to hold a key used in the hash calculation. This key should be
different for each process, to make it harder for a malicious process
to recreate valid hash values for a victim process.

Add support for storing a per-thread hash key, and setting/clearing
HASHKEYR appropriately.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v1:	* Guard HASHKEYR update behind change check
	* HASHKEYR reset moved earlier to patch 2

---
 arch/powerpc/include/asm/processor.h |  1 +
 arch/powerpc/kernel/process.c        | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index e96c9b8c2a60..8a6754ffdc7e 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -264,6 +264,7 @@ struct thread_struct {
 	unsigned long	mmcr3;
 	unsigned long	sier2;
 	unsigned long	sier3;
+	unsigned long	hashkeyr;
 
 #endif
 };
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 1fefafb2b29b..b68898ac07e1 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1182,6 +1182,9 @@ static inline void save_sprs(struct thread_struct *t)
 		 */
 		t->tar = mfspr(SPRN_TAR);
 	}
+
+	if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
+		t->hashkeyr = mfspr(SPRN_HASHKEYR);
 #endif
 }
 
@@ -1260,6 +1263,10 @@ static inline void restore_sprs(struct thread_struct *old_thread,
 	if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
 	    old_thread->tidr != new_thread->tidr)
 		mtspr(SPRN_TIDR, new_thread->tidr);
+
+	if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) &&
+	    old_thread->hashkeyr != new_thread->hashkeyr)
+		mtspr(SPRN_HASHKEYR, new_thread->hashkeyr);
 #endif
 }
 
@@ -1867,6 +1874,10 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
 	}
 
 	p->thread.tidr = 0;
+#endif
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
+		p->thread.hashkeyr = current->thread.hashkeyr;
 #endif
 	return 0;
 }
@@ -1984,6 +1995,12 @@ void start_thread(struct pt_regs *regs, unsigned long start, unsigned long sp)
 	current->thread.tm_tfiar = 0;
 	current->thread.load_tm = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC_BOOK3S_64
+	if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE)) {
+		current->thread.hashkeyr = get_random_long();
+		mtspr(SPRN_HASHKEYR, current->thread.hashkeyr);
+	}
+#endif /* CONFIG_PPC_BOOK3S_64 */
 }
 EXPORT_SYMBOL(start_thread);
-- 
2.40.1
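[Editor's note: to see why the per-process key matters, the following
is an illustrative sketch (not from the patch) of roughly how compiled
code uses the key. With GCC's -mrop-protect, the exact registers and
stack offsets are the compiler's choice.]

/*
 *   mflr    r0
 *   std     r0, 16(r1)
 *   hashst  r0, -8(r1)    <- store a hash of LR, keyed by HASHKEYR
 *   ...
 *   ld      r0, 16(r1)
 *   hashchk r0, -8(r1)    <- recompute; traps if the saved LR changed
 *   mtlr    r0
 *   blr
 *
 * Swapping HASHKEYR in restore_sprs() keeps these stack hashes valid
 * within a process while making them useless across processes.
 */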
[PATCH v3 12/12] selftests/powerpc/dexcr: Add DEXCR status utility lsdexcr
Add a utility 'lsdexcr' to print the current DEXCR status. Useful for
quickly checking the status, such as when debugging test failures or
verifying that the new default DEXCR does what you want (for userspace
at least). Example output:

 # ./lsdexcr
        uDEXCR: 04000000 (NPHIE)
        HDEXCR: 00000000
     Effective: 04000000 (NPHIE)

        SBHE   (0): clear  (Speculative branch hint enable)
        IBRTPD (3): clear  (Indirect branch recurrent target ...)
        SRAPD  (4): clear  (Subroutine return address ...)
        NPHIE *(5): set    (Non-privileged hash instruction enable)
        PHIE   (6): clear  (Privileged hash instruction enable)

 DEXCR[NPHIE] enabled: hashst/hashchk working

Signed-off-by: Benjamin Gray

---

v1:	* Report if hashst/hashchk actually does something

---
 .../selftests/powerpc/dexcr/.gitignore          |   1 +
 tools/testing/selftests/powerpc/dexcr/Makefile  |   2 +
 tools/testing/selftests/powerpc/dexcr/lsdexcr.c | 141 ++++++++++++++
 3 files changed, 144 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/dexcr/lsdexcr.c

diff --git a/tools/testing/selftests/powerpc/dexcr/.gitignore b/tools/testing/selftests/powerpc/dexcr/.gitignore
index d12e4560aca9..b82f45dd46b9 100644
--- a/tools/testing/selftests/powerpc/dexcr/.gitignore
+++ b/tools/testing/selftests/powerpc/dexcr/.gitignore
@@ -1 +1,2 @@
 hashchk_test
+lsdexcr
diff --git a/tools/testing/selftests/powerpc/dexcr/Makefile b/tools/testing/selftests/powerpc/dexcr/Makefile
index 16c8b489948a..76210f2bcec3 100644
--- a/tools/testing/selftests/powerpc/dexcr/Makefile
+++ b/tools/testing/selftests/powerpc/dexcr/Makefile
@@ -1,7 +1,9 @@
 TEST_GEN_PROGS := hashchk_test
+TEST_GEN_FILES := lsdexcr
 
 include ../../lib.mk
 
 $(OUTPUT)/hashchk_test: CFLAGS += -fno-pie $(call cc-option,-mno-rop-protect)
 
 $(TEST_GEN_PROGS): ../harness.c ../utils.c ./dexcr.c
+$(TEST_GEN_FILES): ../utils.c ./dexcr.c
diff --git a/tools/testing/selftests/powerpc/dexcr/lsdexcr.c b/tools/testing/selftests/powerpc/dexcr/lsdexcr.c
new file mode 100644
index ..94abbfcc389e
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/lsdexcr.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include <stddef.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "dexcr.h"
+#include "utils.h"
+
+static unsigned int dexcr;
+static unsigned int hdexcr;
+static unsigned int effective;
+
+struct dexcr_aspect {
+	const char *name;
+	const char *desc;
+	unsigned int index;
+};
+
+static const struct dexcr_aspect aspects[] = {
+	{
+		.name = "SBHE",
+		.desc = "Speculative branch hint enable",
+		.index = 0,
+	},
+	{
+		.name = "IBRTPD",
+		.desc = "Indirect branch recurrent target prediction disable",
+		.index = 3,
+	},
+	{
+		.name = "SRAPD",
+		.desc = "Subroutine return address prediction disable",
+		.index = 4,
+	},
+	{
+		.name = "NPHIE",
+		.desc = "Non-privileged hash instruction enable",
+		.index = 5,
+	},
+	{
+		.name = "PHIE",
+		.desc = "Privileged hash instruction enable",
+		.index = 6,
+	},
+};
+
+static void print_list(const char *list[], size_t len)
+{
+	for (size_t i = 0; i < len; i++) {
+		printf("%s", list[i]);
+		if (i + 1 < len)
+			printf(", ");
+	}
+}
+
+static void print_dexcr(char *name, unsigned int bits)
+{
+	const char *enabled_aspects[ARRAY_SIZE(aspects) + 1] = {NULL};
+	size_t j = 0;
+
+	printf("%s: %08x", name, bits);
+
+	if (bits == 0) {
+		printf("\n");
+		return;
+	}
+
+	for (size_t i = 0; i < ARRAY_SIZE(aspects); i++) {
+		unsigned int mask = DEXCR_PR_BIT(aspects[i].index);
+
+		if (bits & mask) {
+			enabled_aspects[j++] = aspects[i].name;
+			bits &= ~mask;
+		}
+	}
+
+	if (bits)
+		enabled_aspects[j++] = "unknown";
+
+	printf(" (");
+	print_list(enabled_aspects, j);
+	printf(")\n");
+}
+
+static void print_aspect(const struct dexcr_aspect *aspect)
+{
+	const char *attributes[8] = {NULL};
+	size_t j = 0;
+	unsigned long mask;
+
+	mask = DEXCR_PR_BIT(aspect->index);
+	if (dexcr & mask)
+		attributes[j++] = "set";
+	if (hdexcr & mask)
+		attributes[j++] = "set (hypervisor)";
+	if (!(effective & mask))
+		attributes[j++] = "clear";
+
+	printf("%12s %c (%d): ", aspect->name, effective & mask ? '*' : ' ', aspect->index);
+	print_list(attributes, j);
+	printf(" \t(%s)\n", aspect->desc);
+}
+
+int main(int argc, char *argv[])
+{
+	if (!dexcr_exists()) {
+		printf
[PATCH v3 10/12] selftests/powerpc: Add more utility macros
Adds _MSG assertion variants to provide more context behind why a
failure occurred. Also include unistd.h for _exit() and stdio.h for
fprintf(), and move the ARRAY_SIZE macro to utils.h.

The _MSG variants and ARRAY_SIZE will be used by the following DEXCR
selftests.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Reword commit message
	* Add ruscur reviewed-by
v1:	* Remove the signal handler variants
	* Describe why headers are included

---
 .../testing/selftests/powerpc/include/utils.h | 27 ++++++++++++++++++-
 .../powerpc/pmu/sampling_tests/misc.h         |  2 --
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/powerpc/include/utils.h b/tools/testing/selftests/powerpc/include/utils.h
index 44bfd48b93d6..9dc53c4fbfe3 100644
--- a/tools/testing/selftests/powerpc/include/utils.h
+++ b/tools/testing/selftests/powerpc/include/utils.h
@@ -9,11 +9,17 @@
 #define __cacheline_aligned __attribute__((aligned(128)))
 
 #include <stdint.h>
+#include <stdio.h>
 #include <stdbool.h>
 #include <sys/signal.h>
 #include <linux/auxvec.h>
 #include <linux/perf_event.h>
 #include <asm/cputable.h>
 #include "reg.h"
+#include <unistd.h>
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
 
 /* Avoid headaches with PRI?64 - just use %ll? always */
 typedef unsigned long long u64;
@@ -67,7 +73,6 @@ struct perf_event_read {
 };
 
 #if !defined(__GLIBC_PREREQ) || !__GLIBC_PREREQ(2, 30)
-#include <unistd.h>
 #include <sys/syscall.h>
 
 static inline pid_t gettid(void)
@@ -116,6 +121,16 @@ do {								\
 	}							\
 } while (0)
 
+#define FAIL_IF_MSG(x, msg)					\
+do {								\
+	if ((x)) {						\
+		fprintf(stderr,					\
+			"[FAIL] Test FAILED on line %d: %s\n",	\
+			__LINE__, msg);				\
+		return 1;					\
+	}							\
+} while (0)
+
 #define FAIL_IF_EXIT(x)						\
 do {								\
 	if ((x)) {						\
@@ -125,6 +140,16 @@ do {								\
 	}							\
 } while (0)
 
+#define FAIL_IF_EXIT_MSG(x, msg)				\
+do {								\
+	if ((x)) {						\
+		fprintf(stderr,					\
+			"[FAIL] Test FAILED on line %d: %s\n",	\
+			__LINE__, msg);				\
+		_exit(1);					\
+	}							\
+} while (0)
+
 /* The test harness uses this, yes it's gross */
 #define MAGIC_SKIP_RETURN_VALUE	99
 
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
index 4181755cf5a0..64e25cce1435 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
@@ -18,8 +18,6 @@
 #define MMCR1_RSQ	0x2000ULL	/* radix scope qual field */
 #define BHRB_DISABLE	0x20ULL		/* MMCRA BHRB DISABLE bit */
 
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
-
 extern int ev_mask_pmcxsel, ev_shift_pmcxsel;
 extern int ev_mask_marked, ev_shift_marked;
 extern int ev_mask_comb, ev_shift_comb;
-- 
2.40.1
[PATCH v3 07/12] powerpc/ptrace: Expose DEXCR and HDEXCR registers to ptrace
The DEXCR register is of interest when ptracing processes. Currently
it is static, but eventually it will be dynamically controllable by a
process. If a process can control its own, then it is useful for it to
be ptrace-able too (e.g., for checkpoint-restore functionality). It is
also relevant to core dumps (the NPHIE aspect in particular), which use
the ptrace mechanism (or is it the other way around?) to decide what to
dump.

The HDEXCR is useful here too, as the NPHIE aspect may be set in the
HDEXCR without being set in the DEXCR. Although the HDEXCR is per-cpu
and we don't track it in the task struct (it's useless in normal
operation), it would be difficult to imagine why a hypervisor would set
it to different values within a guest. A hypervisor cannot safely set
NPHIE differently at least, as that would break programs.

Expose a read-only view of the userspace DEXCR and HDEXCR to ptrace.
The HDEXCR is always readonly, and is useful for diagnosing core dumps
(as the HDEXCR may set NPHIE without the DEXCR setting it).

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v2:	* New in v2

---
 arch/powerpc/include/uapi/asm/elf.h      |  1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h |  1 +
 arch/powerpc/kernel/ptrace/ptrace-view.c | 31 +++++++++++++++++++++++-
 include/uapi/linux/elf.h                 |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h
index dbc4a5b8d02d..e0d323c808dd 100644
--- a/arch/powerpc/include/uapi/asm/elf.h
+++ b/arch/powerpc/include/uapi/asm/elf.h
@@ -98,6 +98,7 @@
 #define ELF_NEBB	3	/* includes ebbrr, ebbhr, bescr */
 #define ELF_NPMU	5	/* includes siar, sdar, sier, mmcr2, mmcr0 */
 #define ELF_NPKEY	3	/* includes amr, iamr, uamor */
+#define ELF_NDEXCR	2	/* includes dexcr, hdexcr */
 
 typedef unsigned long elf_greg_t64;
 typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG];
diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index 463a63eb8cc7..998a84f64804 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -57,6 +57,7 @@ enum powerpc_regset {
 	REGSET_TAR,		/* TAR register */
 	REGSET_EBB,		/* EBB registers */
 	REGSET_PMR,		/* Performance Monitor Registers */
+	REGSET_DEXCR,		/* DEXCR registers */
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
 	REGSET_PKEY,		/* AMR register */
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 5fff0d04b23f..d3304fb932fa 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -454,7 +454,31 @@ static int pmu_set(struct task_struct *target, const struct user_regset *regset,
 					 5 * sizeof(unsigned long));
 	return ret;
 }
-#endif
+
+static int dexcr_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_31))
+		return -ENODEV;
+
+	return regset->n;
+}
+
+static int dexcr_get(struct task_struct *target, const struct user_regset *regset,
+		     struct membuf to)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_31))
+		return -ENODEV;
+
+	membuf_store(&to, (unsigned int)CONFIG_PPC_DEXCR_DEFAULT);
+
+	/*
+	 * Technically the HDEXCR is per-cpu, but a hypervisor can't reasonably
+	 * change it between CPUs of the same guest.
+	 */
+	return membuf_store(&to, (unsigned int)mfspr(SPRN_HDEXCR_RO));
+}
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_PPC_MEM_KEYS
 static int pkey_active(struct task_struct *target, const struct user_regset *regset)
@@ -615,6 +639,11 @@ static const struct user_regset native_regsets[] = {
 		.size = sizeof(u64), .align = sizeof(u64),
 		.active = pmu_active, .regset_get = pmu_get, .set = pmu_set
 	},
+	[REGSET_DEXCR] = {
+		.core_note_type = NT_PPC_DEXCR, .n = ELF_NDEXCR,
+		.size = sizeof(u32), .align = sizeof(u32),
+		.active = dexcr_active, .regset_get = dexcr_get
+	},
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
 	[REGSET_PKEY] = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index ac3da855fb19..cfa31f1eb5d7 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -403,6 +403,7 @@ typedef struct elf64_shdr {
 #define NT_PPC_TM_CPPR	0x10e		/* TM checkpointed Program Priority Register */
 #define NT_PPC_TM_CDSCR	0x10f		/* TM checkpointed Data Stream Control Register */
 #define NT_PPC_PKEY	0x110		/* Memory Protection Keys registers */
+#define NT_PPC_DEXCR	0x111		/* PowerPC DEXCR registers */
 #define NT_386_TLS	0x200
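[Editor's note: for example, a debugger could fetch the two 32-bit
values with PTRACE_GETREGSET. This is a sketch: error handling is
elided, `pid` is a traced child, and 0x111 is the NT_PPC_DEXCR value
from this patch.]

#include <elf.h>
#include <sys/ptrace.h>
#include <sys/uio.h>

unsigned int dexcr[2];	/* [0] = DEXCR, [1] = HDEXCR, in that order */
struct iovec iov = {
	.iov_base = dexcr,
	.iov_len = sizeof(dexcr),
};

ptrace(PTRACE_GETREGSET, pid, (void *)0x111 /* NT_PPC_DEXCR */, &iov);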
[PATCH v3 02/12] powerpc/ptrace: Add missing include
ptrace-decl.h uses user_regset_get2_fn (among other things) from
<linux/regset.h>. While all current users of ptrace-decl.h include
regset.h before it anyway, this adds an implicit ordering dependency
and breaks source tooling that tries to inspect ptrace-decl.h by
itself.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v2:	* New in v2

---
 arch/powerpc/kernel/ptrace/ptrace-decl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index eafe5f0f6289..463a63eb8cc7 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 
+#include <linux/regset.h>
+
 /*
  * Set of msr bits that gdb can change on behalf of a process.
  */
-- 
2.40.1
[PATCH v3 08/12] powerpc/ptrace: Expose HASHKEYR register to ptrace
The HASHKEYR register contains a secret per-process key to enable
unique hashes per process. In general it should not be exposed to
userspace at all, and a regular process has no need to know its key.

However, checkpoint restore in userspace (CRIU) functionality requires
that a process be able to set the HASHKEYR of another process,
otherwise existing hashes on the stack would be invalidated by a new
random key.

Exposing HASHKEYR in this way also makes it appear in core dumps,
which is a security concern. Multiple threads may share a key, for
example just after a fork() call, where the kernel cannot know if the
child is going to return back along the parent's stack. If such a
thread is coerced into making a core dump, then the HASHKEYR value
will be readable and able to be used against all other threads sharing
that key, effectively undoing any protection offered by
hashst/hashchk.

Therefore we expose HASHKEYR to ptrace when CONFIG_CHECKPOINT_RESTORE
is enabled, providing a choice of increased security or migratable
ROP-protected processes. This is similar to how ARM exposes its PAC
keys.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v2:	* New in v2

---
 arch/powerpc/include/uapi/asm/elf.h      |  1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h |  3 ++
 arch/powerpc/kernel/ptrace/ptrace-view.c | 36 ++++++++++++++++++++++++
 include/uapi/linux/elf.h                 |  1 +
 4 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h
index e0d323c808dd..a5377f494fa3 100644
--- a/arch/powerpc/include/uapi/asm/elf.h
+++ b/arch/powerpc/include/uapi/asm/elf.h
@@ -99,6 +99,7 @@
 #define ELF_NPMU	5	/* includes siar, sdar, sier, mmcr2, mmcr0 */
 #define ELF_NPKEY	3	/* includes amr, iamr, uamor */
 #define ELF_NDEXCR	2	/* includes dexcr, hdexcr */
+#define ELF_NHASHKEYR	1	/* includes hashkeyr */
 
 typedef unsigned long elf_greg_t64;
 typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG];
diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index 998a84f64804..4171a5727197 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -58,6 +58,9 @@ enum powerpc_regset {
 	REGSET_EBB,		/* EBB registers */
 	REGSET_PMR,		/* Performance Monitor Registers */
 	REGSET_DEXCR,		/* DEXCR registers */
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	REGSET_HASHKEYR,	/* HASHKEYR register */
+#endif
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
 	REGSET_PKEY,		/* AMR register */
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index d3304fb932fa..acbb8ec11b1e 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -478,6 +478,35 @@ static int dexcr_get(struct task_struct *target, const struct user_regset *regse
 	return membuf_store(&to, (unsigned int)mfspr(SPRN_HDEXCR_RO));
 }
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int hashkeyr_active(struct task_struct *target, const struct user_regset *regset)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_31))
+		return -ENODEV;
+
+	return regset->n;
+}
+
+static int hashkeyr_get(struct task_struct *target, const struct user_regset *regset,
+			struct membuf to)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_31))
+		return -ENODEV;
+
+	return membuf_store(&to, target->thread.hashkeyr);
+}
+
+static int hashkeyr_set(struct task_struct *target, const struct user_regset *regset,
+			unsigned int pos, unsigned int count, const void *kbuf,
+			const void __user *ubuf)
+{
+	if (!cpu_has_feature(CPU_FTR_ARCH_31))
+		return -ENODEV;
+
+	return user_regset_copyin(&pos, &count, &kbuf, &ubuf, &target->thread.hashkeyr,
+				  0, sizeof(unsigned long));
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE */
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_PPC_MEM_KEYS
@@ -644,6 +673,13 @@ static const struct user_regset native_regsets[] = {
 		.size = sizeof(u32), .align = sizeof(u32),
 		.active = dexcr_active, .regset_get = dexcr_get
 	},
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	[REGSET_HASHKEYR] = {
+		.core_note_type = NT_PPC_HASHKEYR, .n = ELF_NHASHKEYR,
+		.size = sizeof(u64), .align = sizeof(u64),
+		.active = hashkeyr_active, .regset_get = hashkeyr_get, .set = hashkeyr_set
+	},
+#endif
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
 	[REGSET_PKEY] = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index cfa31f1eb5d7..b705b301d88f 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -404,6 +404,7 @@ typedef struct elf64_shdr {
 #define NT_PPC_TM_CDSCR
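[Editor's note: the elf.h hunk is cut off in this archive. As a hedged
sketch of the intended use case, a checkpoint/restore tool might
restore a dumped key along these lines; `pid` and `saved_key` are
hypothetical (the key would have been captured from the dump), error
handling is elided, and NT_PPC_HASHKEYR is the note type this patch
adds.]

#include <elf.h>
#include <sys/ptrace.h>
#include <sys/uio.h>

unsigned long key = saved_key;	/* hypothetical: key recorded at dump time */
struct iovec iov = {
	.iov_base = &key,
	.iov_len = sizeof(key),
};

ptrace(PTRACE_SETREGSET, pid, (void *)NT_PPC_HASHKEYR, &iov);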
[PATCH v3 06/12] powerpc/dexcr: Support custom default DEXCR value
Make the DEXCR value configurable at config time. Intentionally don't
limit the possible values, to support future aspects without needing
kernel updates.

The default config value enables hashst/hashchk in problem state. This
should be safe, as generally software needs to request these
instructions be included in the first place.

Signed-off-by: Benjamin Gray
Reviewed-by: Russell Currey

---

v3:	* Fix hashchk typo, provide minimum ISA version
	* Add ruscur reviewed-by
v1:	* New in v1

Preface with: I'm not sure on the best place to put the config. I also
don't think there's any need to zero out unknown/unsupported bits.
Reserved implies they are ignored by the hardware (from my
understanding of the ISA). Current P10s boot with all bits set;
lsdexcr (later patch) reports

    uDEXCR: ff000000 (SBHE, IBRTPD, SRAPD, NPHIE, PHIE, unknown)

when you try to read it back. Leaving them be also makes it easier to
support newer aspects without a kernel update. If arbitrary value
support isn't important, it's probably a nicer interface to make each
aspect an entry in a menu.

Future work may include dynamic DEXCR controls via prctl() and sysfs.
The dynamic controls would be able to override this default DEXCR on a
per-process basis. A stronger "PPC_ENFORCE_USER_ROP_PROTECTION" config
may be required at such a time to prevent dynamically disabling the
hash checks.

---
 arch/powerpc/Kconfig                  | 14 ++++++++++++++
 arch/powerpc/kernel/cpu_setup_power.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 539d1f03ff42..b96df37e4171 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1039,6 +1039,20 @@ config PPC_MEM_KEYS
 
 	  If unsure, say y.
 
+config PPC_DEXCR_DEFAULT
+	hex "Default DEXCR value"
+	default 0x0000000004000000
+	depends on PPC_BOOK3S_64
+	help
+	  Power10 introduces the Dynamic Execution Control Register (DEXCR)
+	  to provide fine grained control over various speculation and
+	  security capabilities. This is used as the default DEXCR value.
+
+	  It is a 64 bit value that splits into 32 bits for supervisor mode
+	  and 32 bits for problem state. The default config value enables
+	  the hashst/hashchk instructions in userspace. See the ISA (3.1B or
+	  later) for specifics of what each bit controls.
+
 config PPC_SECURE_BOOT
 	prompt "Enable secure boot support"
 	bool
diff --git a/arch/powerpc/kernel/cpu_setup_power.c b/arch/powerpc/kernel/cpu_setup_power.c
index c00721801a1b..814c825a0661 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 
@@ -128,7 +129,7 @@ static void init_PMU_ISA31(void)
 
 static void init_DEXCR(void)
 {
-	mtspr(SPRN_DEXCR, 0);
+	mtspr(SPRN_DEXCR, CONFIG_PPC_DEXCR_DEFAULT);
 	mtspr(SPRN_HASHKEYR, 0);
 }
-- 
2.40.1
[PATCH v3 11/12] selftests/powerpc/dexcr: Add hashst/hashchk test
Test the kernel DEXCR[NPHIE] interface and hashchk exception handling.
Introduces with it a DEXCR utils library for common DEXCR operations.

volatile is used to prevent the compiler optimising away the signal
tests.

Signed-off-by: Benjamin Gray

---

v1:	* Clean up dexcr makefile
	* Include kernel headers in CFLAGS
	* Use numeric literals for hashst/hashchk to support older
	  toolchains
	* A lot of other refactoring

---
 tools/testing/selftests/powerpc/Makefile             |   1 +
 .../selftests/powerpc/dexcr/.gitignore               |   1 +
 tools/testing/selftests/powerpc/dexcr/Makefile       |   7 +
 tools/testing/selftests/powerpc/dexcr/dexcr.c        | 132 +++++++++++
 tools/testing/selftests/powerpc/dexcr/dexcr.h        |  49 ++++
 tools/testing/selftests/powerpc/dexcr/hashchk_test.c | 227 +++++++++++++++++++
 tools/testing/selftests/powerpc/include/reg.h        |   4 +
 tools/testing/selftests/powerpc/include/utils.h      |   4 +
 tools/testing/selftests/powerpc/utils.c              |  24 ++
 9 files changed, 449 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/dexcr/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/dexcr/Makefile
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.h
 create mode 100644 tools/testing/selftests/powerpc/dexcr/hashchk_test.c

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index ae2bfc0d822f..49f2ad1793fd 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -17,6 +17,7 @@ SUB_DIRS = alignment		\
 	   benchmarks		\
 	   cache_shape		\
 	   copyloops		\
+	   dexcr		\
 	   dscr			\
 	   mm			\
 	   nx-gzip		\
diff --git a/tools/testing/selftests/powerpc/dexcr/.gitignore b/tools/testing/selftests/powerpc/dexcr/.gitignore
new file mode 100644
index ..d12e4560aca9
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/.gitignore
@@ -0,0 +1 @@
+hashchk_test
diff --git a/tools/testing/selftests/powerpc/dexcr/Makefile b/tools/testing/selftests/powerpc/dexcr/Makefile
new file mode 100644
index ..16c8b489948a
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/Makefile
@@ -0,0 +1,7 @@
+TEST_GEN_PROGS := hashchk_test
+
+include ../../lib.mk
+
+$(OUTPUT)/hashchk_test: CFLAGS += -fno-pie $(call cc-option,-mno-rop-protect)
+
+$(TEST_GEN_PROGS): ../harness.c ../utils.c ./dexcr.c
diff --git a/tools/testing/selftests/powerpc/dexcr/dexcr.c b/tools/testing/selftests/powerpc/dexcr/dexcr.c
new file mode 100644
index ..65ec5347de98
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/dexcr.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "dexcr.h"
+#include "reg.h"
+#include "utils.h"
+
+static jmp_buf generic_signal_jump_buf;
+
+static void generic_signal_handler(int signum, siginfo_t *info, void *context)
+{
+	longjmp(generic_signal_jump_buf, 0);
+}
+
+bool dexcr_exists(void)
+{
+	struct sigaction old;
+	volatile bool exists;
+
+	old = push_signal_handler(SIGILL, generic_signal_handler);
+	if (setjmp(generic_signal_jump_buf))
+		goto out;
+
+	/*
+	 * If the SPR is not recognised by the hardware it triggers
+	 * a hypervisor emulation interrupt. If the kernel does not
+	 * recognise/try to emulate it, we receive a SIGILL signal.
+	 *
+	 * If we do not receive a signal, assume we have the SPR or the
+	 * kernel is trying to emulate it correctly.
+	 */
+	exists = false;
+	mfspr(SPRN_DEXCR_RO);
+	exists = true;
+
+out:
+	pop_signal_handler(SIGILL, old);
+	return exists;
+}
+
+/*
+ * Just test if a bad hashchk triggers a signal, without checking
+ * for support or if the NPHIE aspect is enabled.
+ */
+bool hashchk_triggers(void)
+{
+	struct sigaction old;
+	volatile bool triggers;
+
+	old = push_signal_handler(SIGILL, generic_signal_handler);
+	if (setjmp(generic_signal_jump_buf))
+		goto out;
+
+	triggers = true;
+	do_bad_hashchk();
+	triggers = false;
+
+out:
+	pop_signal_handler(SIGILL, old);
+	return triggers;
+}
+
+unsigned int get_dexcr(enum dexcr_source source)
+{
+	switch (source) {
+	case DEXCR:
+		return mfspr(SPRN_DEXCR_RO);
+	case HDEXCR:
+		return mfspr(SPRN_HDEXCR_RO);
+	case EFFECTIVE:
+		return mfspr(SPRN_DEXCR_RO) | mfspr(SPRN_HDEXCR_RO);
+	default:
+		FAIL_IF_EXIT_MSG(true, "bad enum dexcr_source");
+	}
+}
+
+void await_child_success(pid_t pid)
+{
+	int wstatus;
+
+	FAIL_IF_EXIT_MSG(pid == -1, "fork failed");
+	FAIL_IF_EXIT_MSG(waitpid(pid, &wstatus, 0)
Re: [PATCH] mm: kfence: Fix false positives on big endian
Andrew Morton writes:
> On Fri, 5 May 2023 16:02:17 +0000 David Laight wrote:
>> From: Michael Ellerman
>>> Sent: 05 May 2023 04:51
>>>
>>> Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>>> __kfence_alloc() and __kfence_free()"), kfence reports failures in
>>> random places at boot on big endian machines.
>>>
>>> The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the
>>> address of each byte in its value, so it needs to be byte swapped on big
>>> endian machines.
>>>
>>> The compiler is smart enough to do the le64_to_cpu() at compile time, so
>>> there is no runtime overhead.
>>>
>>> Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>>> __kfence_alloc() and __kfence_free()")
>>> Signed-off-by: Michael Ellerman
>>> ---
>>>  mm/kfence/kfence.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
>>> index 2aafc46a4aaf..392fb273e7bd 100644
>>> --- a/mm/kfence/kfence.h
>>> +++ b/mm/kfence/kfence.h
>>> @@ -29,7 +29,7 @@
>>>   * canary of every 8 bytes is the same. 64-bit memory can be filled and checked
>>>   * at a time instead of byte by byte to improve performance.
>>>   */
>>> -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(0x0706050403020100))
>>> +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
>>
>> What are the (u64) casts for?
>> The constants should probably have a ul (or ull) suffix.
>
> I tried that, didn't fix the sparse warnings described at
> https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
>
> Michael, have you looked into this?

I haven't sorry, been chasing other bugs.

> I'll merge it upstream - I guess we can live with the warnings for a while.

Thanks, yeah spurious WARNs are more of a pain than some sparse
warnings.

Maybe using le64_to_cpu() is too fancy, could just do it with an ifdef?
eg.

diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 392fb273e7bd..510355a5382b 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -29,7 +29,11 @@
  * canary of every 8 bytes is the same. 64-bit memory can be filled and checked
  * at a time instead of byte by byte to improve performance.
  */
-#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
+#ifdef __LITTLE_ENDIAN__
+#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL)
+#else
+#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0001020304050607ULL)
+#endif
 
 /* Maximum stack depth for reports. */
 #define KFENCE_STACK_DEPTH 64

cheers
[PATCH v3 01/12] powerpc/book3s: Add missing include
The functions here use struct thread_struct fields, so we need to
import the full definition from <linux/sched.h>. The header that
defines current only forward declares struct thread_struct.

Failing to include this header leads to a compilation error when a
translation unit does not also include it indirectly.

Signed-off-by: Benjamin Gray
Reviewed-by: Nicholas Piggin
Reviewed-by: Russell Currey

---

v3:	* Add ruscur reviewed-by
v1:	* Add npiggin reviewed-by

---
 arch/powerpc/include/asm/book3s/64/kup.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h b/arch/powerpc/include/asm/book3s/64/kup.h
index 54cf46808157..84c09e546115 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -194,6 +194,7 @@
 #else /* !__ASSEMBLY__ */
 
 #include <linux/jump_label.h>
+#include <linux/sched.h>
 
 DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
-- 
2.40.1
Re: [PATCH] mm: kfence: Fix false positives on big endian
On 18/05/2023 at 00:20, Andrew Morton wrote:
> On Fri, 5 May 2023 16:02:17 +0000 David Laight wrote:
>> From: Michael Ellerman
>>> Sent: 05 May 2023 04:51
>>>
>>> Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>>> __kfence_alloc() and __kfence_free()"), kfence reports failures in
>>> random places at boot on big endian machines.
>>>
>>> The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the
>>> address of each byte in its value, so it needs to be byte swapped on big
>>> endian machines.
>>>
>>> The compiler is smart enough to do the le64_to_cpu() at compile time, so
>>> there is no runtime overhead.
>>>
>>> Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>>> __kfence_alloc() and __kfence_free()")
>>> Signed-off-by: Michael Ellerman
>>> ---
>>>  mm/kfence/kfence.h | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
>>> index 2aafc46a4aaf..392fb273e7bd 100644
>>> --- a/mm/kfence/kfence.h
>>> +++ b/mm/kfence/kfence.h
>>> @@ -29,7 +29,7 @@
>>>   * canary of every 8 bytes is the same. 64-bit memory can be filled and checked
>>>   * at a time instead of byte by byte to improve performance.
>>>   */
>>> -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(0x0706050403020100))
>>> +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
>>
>> What are the (u64) casts for?
>> The constants should probably have a ul (or ull) suffix.
>
> I tried that, didn't fix the sparse warnings described at
> https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
>
> Michael, have you looked into this?
>
> I'll merge it upstream - I guess we can live with the warnings for a while.

The sparse warning goes away with:

#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ le64_to_cpu((__force __le64)0x0706050403020100))

Christophe
Re: [PATCH] mm: kfence: Fix false positives on big endian
On Fri, 2023-05-19 at 15:14 +1000, Michael Ellerman wrote:
> Andrew Morton writes:
> > On Fri, 5 May 2023 16:02:17 +0000 David Laight wrote:
> > > From: Michael Ellerman
> > > > Sent: 05 May 2023 04:51
> > > >
> > > > Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance
> > > > of __kfence_alloc() and __kfence_free()"), kfence reports
> > > > failures in random places at boot on big endian machines.
> > > >
> > > > The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes
> > > > the address of each byte in its value, so it needs to be byte
> > > > swapped on big endian machines.
> > > >
> > > > The compiler is smart enough to do the le64_to_cpu() at compile
> > > > time, so there is no runtime overhead.
> > > >
> > > > Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of
> > > > __kfence_alloc() and __kfence_free()")
> > > > Signed-off-by: Michael Ellerman
> > > > ---
> > > >  mm/kfence/kfence.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> > > > index 2aafc46a4aaf..392fb273e7bd 100644
> > > > --- a/mm/kfence/kfence.h
> > > > +++ b/mm/kfence/kfence.h
> > > > @@ -29,7 +29,7 @@
> > > >  * canary of every 8 bytes is the same. 64-bit memory can be
> > > >  * filled and checked at a time instead of byte by byte to
> > > >  * improve performance.
> > > >  */
> > > > -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^
> > > > (u64)(0x0706050403020100))
> > > > +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^
> > > > (u64)(le64_to_cpu(0x0706050403020100)))
> > >
> > > What are the (u64) casts for?
> > > The constants should probably have a ul (or ull) suffix.
> >
> > I tried that, didn't fix the sparse warnings described at
> > https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
> >
> > Michael, have you looked into this?
>
> I haven't sorry, been chasing other bugs.
>
> > I'll merge it upstream - I guess we can live with the warnings for
> > a while.
>
> Thanks, yeah spurious WARNs are more of a pain than some sparse
> warnings.
>
> Maybe using le64_to_cpu() is too fancy, could just do it with an
> ifdef? eg.
>
> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> index 392fb273e7bd..510355a5382b 100644
> --- a/mm/kfence/kfence.h
> +++ b/mm/kfence/kfence.h
> @@ -29,7 +29,11 @@
>   * canary of every 8 bytes is the same. 64-bit memory can be filled
>   * and checked at a time instead of byte by byte to improve
>   * performance.
>   */
> -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^
> (u64)(le64_to_cpu(0x0706050403020100)))
> +#ifdef __LITTLE_ENDIAN__
> +#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^
> 0x0706050403020100ULL)
> +#else
> +#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^
> 0x0001020304050607ULL)
> +#endif
>
>  /* Maximum stack depth for reports. */
>  #define KFENCE_STACK_DEPTH 64
>
> cheers

(for the sparse errors)

As I understand, we require memory to look like "00 01 02 03 04 05 06
07", such that iterating byte-by-byte gives 00, 01, etc. (with
everything XORed with 0xaa).

I think it would be most semantically correct to use cpu_to_le64 on
KFENCE_CANARY_PATTERN_U64 and annotate the values being compared
against it as __le64. This is because we want the integer literal
0x0706050403020100 to be stored as "00 01 02 03 04 05 06 07", which is
the definition of little endian. Masking this with an #ifdef leaves the
type as cpu endian, which could result in future issues.

(or I've just misunderstood and can disregard this)
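[Editor's note: a quick userspace demonstration of that byte layout, as
a sketch (not kernel code):]

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
	uint64_t pattern = 0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL;
	unsigned char bytes[8];
	int i;

	memcpy(bytes, &pattern, sizeof(bytes));

	/* On little endian this prints "aa ab a8 a9 ae af ac ad", i.e.
	 * byte i == 0xaa ^ i, matching kfence's per-byte canary. On big
	 * endian the bytes come out reversed, which is why the constant
	 * must be byte-swapped there. */
	for (i = 0; i < 8; i++)
		printf("%02x ", bytes[i]);
	printf("\n");

	return 0;
}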