[RESEND PATCH v9 2/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-05-18 Thread Yicong Yang
From: Barry Song 

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without a software IPI, but a synchronous tlbi
is still quite expensive.

Even running the simplest program which requires swapout can
prove this is true:
 #include <sys/mman.h>
 #include <string.h>
 #include <stdlib.h>
 #include <unistd.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
 volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

 memset(p, 0x88, SIZE);

 for (int k = 0; k < 10000; k++) {
 /* swap in */
 for (int i = 0; i < SIZE; i += 4096) {
 (void)p[i];
 }

 /* swap out */
 madvise(p, SIZE, MADV_PAGEOUT);
 }
 }

Perf result on snapdragon 888 with 8 cores, using zRAM
as the swap block device:

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object      Symbol
 # ........  .......  .................  ..............................
 #
21.07%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
 8.23%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 6.67%  a.out  [kernel.kallsyms]  [k] filemap_map_pages
 6.16%  a.out  [kernel.kallsyms]  [k] __zram_bvec_write
 5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush
 3.71%  a.out  [kernel.kallsyms]  [k] _raw_spin_lock
 3.49%  a.out  [kernel.kallsyms]  [k] memset64
 1.63%  a.out  [kernel.kallsyms]  [k] clear_page
 1.42%  a.out  [kernel.kallsyms]  [k] _raw_spin_unlock
 1.26%  a.out  [kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
 1.23%  a.out  [kernel.kallsyms]  [k] xas_load
 1.15%  a.out  [kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% of CPU time in this micro-benchmark,
which swaps in/out a page mapped by only one process. If the page
is mapped by multiple processes (typically more than 100 on a
phone), the overhead is much higher, as we have to run the tlb
flush 100 times for one single page. On top of that, tlb flush
overhead increases with the number of CPU cores due to the bad
scalability of tlb shootdown in HW, so ARM64 servers should expect
much higher overhead.

Further perf annotate shows that 95% of the CPU time in
ptep_clear_flush() is actually spent in the final dsb(), waiting
for the tlb flush to complete. This gives us a very good chance to
leverage the existing batched tlb support in the kernel: the
minimal modification is to issue only an asynchronous tlbi in the
first stage, and issue the dsb only when we have to sync in the
second stage.
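
To illustrate the shape of the change, here is a simplified sketch
of the two-stage scheme, not the exact hunks of this patch (see the
diff for the real implementation); __flush_tlb_page_nosync() stands
for the tlbi-without-dsb helper this approach relies on:

	/* First stage: issue the tlbi for one page, don't wait for it. */
	static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
						     struct mm_struct *mm,
						     unsigned long uaddr)
	{
		__flush_tlb_page_nosync(mm, uaddr);
	}

	/* Second stage: a single barrier waits for every pending tlbi. */
	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		dsb(ish);
	}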

With the above micro benchmark, the elapsed time to finish the
program decreases by around 5%.

Typical elapsed time w/o the patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation: tested with the
benchmark in this commit on a Kunpeng920 arm64 server, an
improvement of around 12.5% is observed with the command
`time ./swap_bench`.

        w/o             w/
real    0m13.460s       0m11.771s
user    0m0.248s        0m0.279s
sys     0m12.039s       0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was observed,
which is eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

This has been tested on 4, 8 and 128 CPU platforms, and shows a
benefit on large systems but possibly no improvement on small
systems such as a 4 CPU platform. So make
ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on CONFIG_EXPERT for this
stage, and add a runtime tunable to allow disabling it according to
the scenario.

This patch also improves the performance of page migration. Using
pmbench and migrating its pages between node 0 and node 1 for 20
times, this patch decreases the time used by more than 50% and
saves the time previously spent in ptep_clear_flush().

This patch extends arch_tlbbatch_add_mm() to take the address of
the target page, which is needed to support the feature on arm64.
It also renames the function to arch_tlbbatch_add_pending() to
better match what it does, since we don't need to handle the mm on
arm64 and add_mm is not a proper name anymore. add_pending makes
sense for both architectures: on x86 we are pending the TLB flush
operations, while on arm64 we are pending the synchronisation
operations.

[RESEND PATCH v9 0/2] arm64: support batched/deferred tlb shootdown during page reclamation/migration

2023-05-18 Thread Yicong Yang
From: Yicong Yang 

Though ARM64 has hardware to do tlb shootdown, the hardware
broadcasting is not free.
A simple micro benchmark shows that even on a snapdragon 888 with
only 8 cores, the overhead of ptep_clear_flush is huge, even when
paging out one page mapped by only one process:
5.36%  a.out  [kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes, or the HW has more
CPUs, the cost becomes even higher due to the bad scalability of
tlb shootdown.

The same benchmark results in 16.99% CPU consumption on an ARM64
server with around 100 cores, according to Yicong's test on patch
2/2.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by:
1. only sending tlbi instructions in the first stage -
   arch_tlbbatch_add_mm()
2. waiting for the completion of the tlbi by dsb while doing the
   tlbbatch sync in arch_tlbbatch_flush()

Testing on snapdragon shows the overhead of ptep_clear_flush is
removed by the patchset. The micro benchmark becomes 5% faster even
for one page mapped by a single process on snapdragon 888.

This support also improves page migration by more than 50% with
batched TLB flushing [*].

[*] 
https://lore.kernel.org/linux-mm/20230213123444.155149-1-ying.hu...@intel.com/

-v9:
1. Using a runtime tunable to control batched TLB flush, per Catalin in v7.
   Sorry for missing this on v8.
Link: https://lore.kernel.org/all/20230329035512.57392-1-yangyic...@huawei.com/

-v8:
1. Rebase on 6.3-rc4
2. Tested the optimization on page migration and mentioned it in the commit
3. Thanks to Anshuman for the review.
Link: 
https://lore.kernel.org/linux-mm/20221117082648.47526-1-yangyic...@huawei.com/

-v7:
1. rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as suggested,
   since it takes an extra address for arm64, per Nadav and Anshuman. Also
   mentioned in the commit.
2. add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyic...@huawei.com/

-v6:
1. comment that we don't defer TLB flush on platforms affected by
   ARM64_WORKAROUND_REPEAT_TLBI
2. use cpus_have_const_cap() instead of this_cpu_has_cap()
3. add tags from Punit, Thanks.
4. default enable the feature when cpus >= 8 rather than > 8, since the original
   improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyic...@huawei.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this stage on
   arm64.
2. Make a threshold of CPU numbers for enabling batched TLB flush on arm64
Link: 
https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyic...@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, Thanks.
2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: 
https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
   of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: 
https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks Peter, Dave and Nadav for your testing, reviewing and
comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page
reclamation/migration

 .../features/vm/TLB/arch-support.txt  |  2 +-
 arch/arm64/Kconfig                    |  1 +
 arch/arm64/include/asm/tlbbatch.h | 12 
 arch/arm64/include/asm/tlbflush.h | 33 -
 arch/arm64/mm/flush.c | 69 +++
 arch/x86/include/asm/tlbflush.h   | 17 -
 include/linux/mm_types_task.h |  4 +-
 mm/rmap.c | 21 +++---
 8 files changed, 139 insertions(+), 20 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0



[RESEND PATCH v9 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2023-05-18 Thread Yicong Yang
From: Anshuman Khandual 

The entire scheme of deferred TLB flush in the reclaim path rests on
the fact that the cost to refill TLB entries is less than that of
flushing out individual entries by sending IPIs to remote CPUs. But
architectures can have different ways to evaluate that. Hence, apart
from checking TTU_BATCH_FLUSH in the TTU flags, the rest of the
decision should be architecture specific.

Signed-off-by: Anshuman Khandual 
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang 
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Barry Song 
Reviewed-by: Xin Hao 
Tested-by: Punit Agrawal 
---
 arch/x86/include/asm/tlbflush.h | 12 
 mm/rmap.c   |  9 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 75bfaa421030..46bdff73217c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -260,6 +260,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+   bool should_defer = false;
+
+   /* If remote CPUs need to be flushed then defer batch the flush */
+   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+   should_defer = true;
+   put_cpu();
+
+   return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 19392e090bec..b45f95ab0c04 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -688,17 +688,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-   bool should_defer = false;
-
if (!(flags & TTU_BATCH_FLUSH))
return false;
 
-   /* If remote CPUs need to be flushed then defer batch the flush */
-   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-   should_defer = true;
-   put_cpu();
-
-   return should_defer;
+   return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0



Re: [PATCH v2 00/25] iommu: Make default_domain's mandatory

2023-05-18 Thread Steven Price
On 16/05/2023 01:00, Jason Gunthorpe wrote:
> This is on github: 
> https://github.com/jgunthorpe/linux/commits/iommu_all_defdom

Tested-by: Steven Price 

Works fine on my Firefly-RK3288.

Thanks,

Steve



Re: [PATCH v2 00/34] Split ptdesc from struct page

2023-05-18 Thread Jason Gunthorpe
On Mon, May 01, 2023 at 12:27:55PM -0700, Vishal Moola (Oracle) wrote:
> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
> 
> This patchset introduces ptdesc, splits ptdesc from struct page, and
> converts many callers of page table constructor/destructors to use ptdescs.

Lightly related food for future thought - based on some discussions at
LSF/MM it would be really nice if an end result of this was that an
rcu_head was always available in the ptdesc, so we don't need to
allocate memory to free a page table.

Jason
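
A hypothetical sketch of the end state described above, assuming a
ptdesc layout that reserves the space up front; none of these field
or function names come from the series itself:

	struct ptdesc {
		unsigned long flags;
		/* Fields dead at free time can be unioned with this. */
		struct rcu_head pt_rcu_head;
		/* ... */
	};

	static void pt_free_rcu(struct rcu_head *head)
	{
		struct ptdesc *pt = container_of(head, struct ptdesc, pt_rcu_head);

		/* Return the backing page to the allocator here. */
	}

	static void pt_free_deferred(struct ptdesc *pt)
	{
		/* No extra allocation needed on the free path. */
		call_rcu(&pt->pt_rcu_head, pt_free_rcu);
	}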


Re: [PATCH AUTOSEL 6.3 6/7] powerpc/fsl_uli1575: Allow to disable FSL_ULI1575 support

2023-05-18 Thread Sasha Levin

On Tue, May 09, 2023 at 09:18:35AM +0200, Pali Rohár wrote:

On Tuesday 09 May 2023 17:14:48 Michael Ellerman wrote:

Randy Dunlap  writes:
> Hi--
>
> Just a heads up. This patch can cause build errors.
> I sent a patch for these on 2023-APR-28:
>   
https://lore.kernel.org/linuxppc-dev/20230429043519.19807-1-rdun...@infradead.org/
>
> Michael, I think this is your area if I'm not mistaken.

Yes. The fix is in my fixes branch as:
  536d948a8dee ("powerpc/fsl_uli1575: fix kconfig warnings and build errors")

But I don't think this commit (22fdf79171e8) really warrants going to
stable, it's a nice-to-have but doesn't fix any pressing bugs.


Exactly. And also this patch alone without 1/8 would not work, as 1/8
(https://lore.kernel.org/all/20230409000812.18904-2-p...@kernel.org/)
added a static inline variant of the function which is used when ULI
is disabled.


I'll drop it, thanks!

--
Thanks,
Sasha


[PATCH v3 03/12] powerpc/dexcr: Add initial Dynamic Execution Control Register (DEXCR) support

2023-05-18 Thread Benjamin Gray
ISA 3.1B introduces the Dynamic Execution Control Register (DEXCR). It
is a per-cpu register that allows control over various CPU behaviours
including branch hint usage, indirect branch speculation, and
hashst/hashchk support.

Add some definitions and basic support for the DEXCR in the kernel.
Right now it just

  * Zero initialises the DEXCR and HASHKEYR when a CPU onlines.
  * Clears them in reset_sprs().
  * Detects when the NPHIE aspect is supported (the others don't get
looked at in this series, so there's no need to waste a CPU_FTR
on them).

We initialise the HASHKEYR to ensure that all cores have the same key,
so an HV enforced NPHIE + swapping cores doesn't randomly crash a
process using hash instructions. The stores to HASHKEYR are
unconditional because the ISA makes no mention of the SPR being missing
if support for doing the hashes isn't present. So all that would happen
is the HASHKEYR value gets ignored. This helps slightly if NPHIE
detection fails; e.g., we currently only detect it on pseries.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---
v3: * Add ruscur reviewed-by
v2: * More definitions for ptrace
v1: * Only make a CPU feature for NPHIE. We only need to know if the
  hashst/hashchk functionality is supported for a static DEXCR.
* Initialise the DEXCR to 0 when each CPU comes online. Remove
  the dexcr_init() and get_thread_dexcr() functions.
* No longer track the DEXCR in a per-thread field.
* Remove the made-up Opal features
---
 arch/powerpc/include/asm/book3s/64/kexec.h |  5 +
 arch/powerpc/include/asm/cputable.h|  4 +++-
 arch/powerpc/include/asm/reg.h | 10 ++
 arch/powerpc/kernel/cpu_setup_power.c  |  8 
 arch/powerpc/kernel/prom.c |  1 +
 5 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kexec.h b/arch/powerpc/include/asm/book3s/64/kexec.h
index d4b9d476ecba..df37a76c1e9f 100644
--- a/arch/powerpc/include/asm/book3s/64/kexec.h
+++ b/arch/powerpc/include/asm/book3s/64/kexec.h
@@ -21,6 +21,11 @@ static inline void reset_sprs(void)
plpar_set_ciabr(0);
}
 
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DEXCR, 0);
+   mtspr(SPRN_HASHKEYR, 0);
+   }
+
/*  Do we need isync()? We are going via a kexec reset */
isync();
 }
diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index 757dbded11dc..443a9d482b15 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -192,6 +192,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_RADIX_PREFETCH_BUG  LONG_ASM_CONST(0x0002)
 #define CPU_FTR_ARCH_31                LONG_ASM_CONST(0x0004)
 #define CPU_FTR_DAWR1  LONG_ASM_CONST(0x0008)
+#define CPU_FTR_DEXCR_NPHIELONG_ASM_CONST(0x0010)
 
 #ifndef __ASSEMBLY__
 
@@ -451,7 +452,8 @@ static inline void cpu_feature_keys_init(void) { }
CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \
CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \
CPU_FTR_ARCH_300 | CPU_FTR_ARCH_31 | \
-   CPU_FTR_DAWR | CPU_FTR_DAWR1)
+   CPU_FTR_DAWR | CPU_FTR_DAWR1 | \
+   CPU_FTR_DEXCR_NPHIE)
 #define CPU_FTRS_CELL  (CPU_FTR_LWSYNC | \
CPU_FTR_PPCAS_ARCH_V2 | CPU_FTR_CTRL | \
CPU_FTR_ALTIVEC_COMP | CPU_FTR_MMCRA | CPU_FTR_SMT | \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 6372e5f55ef0..a30c5bdca568 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -382,7 +382,17 @@
 #define SPRN_HIOR  0x137   /* 970 Hypervisor interrupt offset */
 #define SPRN_RMOR  0x138   /* Real mode offset register */
 #define SPRN_HRMOR 0x139   /* Real mode offset register */
+#define SPRN_HDEXCR_RO 0x1C7   /* Hypervisor DEXCR (non-privileged, readonly) */
+#define SPRN_HASHKEYR  0x1D4   /* Non-privileged hashst/hashchk key register */
+#define SPRN_HDEXCR    0x1D7   /* Hypervisor dynamic execution control register */
+#define SPRN_DEXCR_RO  0x32C   /* DEXCR (non-privileged, readonly) */
 #define SPRN_ASDR      0x330   /* Access segment descriptor register */
+#define SPRN_DEXCR     0x33C   /* Dynamic execution control register */
+#define   DEXCR_PR_BIT(aspect) PPC_BIT(32 + (aspect))
+#define   DEXCR_PR_SBHE        DEXCR_PR_BIT(0) /* Speculative Branch Hint Enable */
+#define   DEXCR_PR_IBRTPD      DEXCR_PR_BIT(3) /* Indirect Branch Recurrent Target Prediction Disable */
+#define   DEXCR_PR_SRAPD       DEXCR_PR_BIT(4) /* Subroutine Return Address Prediction Disable */
+#define   DEXCR_PR_NPHIE       DEXCR_PR_BIT(5) /* Non-Privileged Hash Instruction Enable */
 #define SPRN_IC        0x350   /* Virtual Instruct

[PATCH v3 04/12] powerpc/dexcr: Handle hashchk exception

2023-05-18 Thread Benjamin Gray
Recognise and pass the appropriate signal to the user program when a
hashchk instruction triggers. This is independent of allowing
configuration of DEXCR[NPHIE], as a hypervisor can enforce this aspect
regardless of the kernel.

The signal mirrors how ARM reports their similar check failure. For
example, their FPAC handler in arch/arm64/kernel/traps.c do_el0_fpac()
does this. When we fail to read the instruction that caused the fault
we send a segfault, similar to how emulate_math() does it.

Signed-off-by: Benjamin Gray 

---

v3: * Inline hashchk detection, remove dexcr.c and associated files
v1: * Refactor the hashchk check to return 0 on success, an error
  code on failure. Determine what to do based on specific error
  code.
* Motivate signal and code
---
 arch/powerpc/include/asm/ppc-opcode.h |  1 +
 arch/powerpc/kernel/traps.c   | 16 
 2 files changed, 17 insertions(+)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h b/arch/powerpc/include/asm/ppc-opcode.h
index ca5a0da7df4e..ef6972aa33b9 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -222,6 +222,7 @@
 #define OP_31_XOP_STFSX     663
 #define OP_31_XOP_STFSUX    695
 #define OP_31_XOP_STFDX     727
+#define OP_31_XOP_HASHCHK   754
 #define OP_31_XOP_STFDUX    759
 #define OP_31_XOP_LHBRX     790
 #define OP_31_XOP_LFIWAX    855
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 9bdd79aa51cf..e59ec6d32d37 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1516,6 +1516,22 @@ static void do_program_check(struct pt_regs *regs)
return;
}
}
+
+   if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) && user_mode(regs)) {
+   ppc_inst_t insn;
+
+   if (get_user_instr(insn, (void __user *)regs->nip)) {
+   _exception(SIGSEGV, regs, SEGV_MAPERR, regs->nip);
+   return;
+   }
+
+   if (ppc_inst_primary_opcode(insn) == 31 &&
+   get_xop(ppc_inst_val(insn)) == OP_31_XOP_HASHCHK) {
+   _exception(SIGILL, regs, ILL_ILLOPN, regs->nip);
+   return;
+   }
+   }
+
_exception(SIGTRAP, regs, TRAP_BRKPT, regs->nip);
return;
}
-- 
2.40.1



[PATCH v3 00/12] Add static DEXCR support

2023-05-18 Thread Benjamin Gray
The DEXCR is a SPR that allows control over various execution
'aspects', such as indirect branch prediction and enabling the
hashst/hashchk instructions. Further details are in ISA 3.1B
Book 3 chapter 9.

This series adds static (compile time) support for initialising
the DEXCR, and basic HASHKEYR bookkeeping.

v3: * Address Russell's comments
v2: * Add ptrace/coredump support

Previous versions:
v2: https://lore.kernel.org/all/20230330055040.434133-1-bg...@linux.ibm.com/
v1: https://lore.kernel.org/all/20230322054612.1340573-1-bg...@linux.ibm.com/
RFC: https://lore.kernel.org/all/20221128024458.46121-1-bg...@linux.ibm.com/


Benjamin Gray (12):
  powerpc/book3s: Add missing <asm/processor.h> include
  powerpc/ptrace: Add missing <linux/regset.h> include
  powerpc/dexcr: Add initial Dynamic Execution Control Register (DEXCR)
support
  powerpc/dexcr: Handle hashchk exception
  powerpc/dexcr: Support userspace ROP protection
  powerpc/dexcr: Support custom default DEXCR value
  powerpc/ptrace: Expose DEXCR and HDEXCR registers to ptrace
  powerpc/ptrace: Expose HASHKEYR register to ptrace
  Documentation: Document PowerPC kernel DEXCR interface
  selftests/powerpc: Add more utility macros
  selftests/powerpc/dexcr: Add hashst/hashchk test
  selftests/powerpc/dexcr: Add DEXCR status utility lsdexcr

 Documentation/powerpc/dexcr.rst   |  58 +
 Documentation/powerpc/index.rst   |   1 +
 arch/powerpc/Kconfig  |  14 ++
 arch/powerpc/include/asm/book3s/64/kexec.h|   5 +
 arch/powerpc/include/asm/book3s/64/kup.h  |   1 +
 arch/powerpc/include/asm/cputable.h   |   4 +-
 arch/powerpc/include/asm/ppc-opcode.h |   1 +
 arch/powerpc/include/asm/processor.h  |   1 +
 arch/powerpc/include/asm/reg.h|  10 +
 arch/powerpc/include/uapi/asm/elf.h   |   2 +
 arch/powerpc/kernel/cpu_setup_power.c |   9 +
 arch/powerpc/kernel/process.c |  17 ++
 arch/powerpc/kernel/prom.c|   1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h  |   6 +
 arch/powerpc/kernel/ptrace/ptrace-view.c  |  67 +-
 arch/powerpc/kernel/traps.c   |  16 ++
 include/uapi/linux/elf.h  |   2 +
 tools/testing/selftests/powerpc/Makefile  |   1 +
 .../selftests/powerpc/dexcr/.gitignore|   2 +
 .../testing/selftests/powerpc/dexcr/Makefile  |   9 +
 tools/testing/selftests/powerpc/dexcr/dexcr.c | 132 ++
 tools/testing/selftests/powerpc/dexcr/dexcr.h |  49 
 .../selftests/powerpc/dexcr/hashchk_test.c| 227 ++
 .../testing/selftests/powerpc/dexcr/lsdexcr.c | 141 +++
 tools/testing/selftests/powerpc/include/reg.h |   4 +
 .../testing/selftests/powerpc/include/utils.h |  31 ++-
 .../powerpc/pmu/sampling_tests/misc.h |   2 -
 tools/testing/selftests/powerpc/utils.c   |  24 ++
 28 files changed, 832 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/powerpc/dexcr.rst
 create mode 100644 tools/testing/selftests/powerpc/dexcr/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/dexcr/Makefile
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.h
 create mode 100644 tools/testing/selftests/powerpc/dexcr/hashchk_test.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/lsdexcr.c


base-commit: 547124f858ea52b2f7e58e8c0d39170a9fa66b4b
--
2.40.1


[PATCH v3 09/12] Documentation: Document PowerPC kernel DEXCR interface

2023-05-18 Thread Benjamin Gray
Describe the DEXCR and document how to configure it.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v2: * Document coredump & ptrace support
v1: * Remove the dynamic control docs, describe the static config
  option

This documentation is a little bare for now, but will be expanded on
when dynamic DEXCR control is added.
---
 Documentation/powerpc/dexcr.rst | 58 +
 Documentation/powerpc/index.rst |  1 +
 2 files changed, 59 insertions(+)
 create mode 100644 Documentation/powerpc/dexcr.rst

diff --git a/Documentation/powerpc/dexcr.rst b/Documentation/powerpc/dexcr.rst
new file mode 100644
index 000000000000..75f1efaa8dc5
--- /dev/null
+++ b/Documentation/powerpc/dexcr.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+==========================================
+DEXCR (Dynamic Execution Control Register)
+==========================================
+
+Overview
+========
+
+The DEXCR is a privileged special purpose register (SPR) introduced in
+PowerPC ISA 3.1B (Power10) that allows per-cpu control over several dynamic
+execution behaviours. These behaviours include speculation (e.g., indirect
+branch target prediction) and enabling return-oriented programming (ROP)
+protection instructions.
+
+The execution control is exposed in hardware as up to 32 bits ('aspects') in
+the DEXCR. Each aspect controls a certain behaviour, and can be set or cleared
+to enable/disable the aspect. There are several variants of the DEXCR for
+different purposes:
+
+DEXCR
+A privileged SPR that can control aspects for userspace and kernel space
+HDEXCR
+A hypervisor-privileged SPR that can control aspects for the hypervisor and
+enforce aspects for the kernel and userspace.
+UDEXCR
+An optional ultravisor-privileged SPR that can control aspects for the 
ultravisor.
+
+Userspace can examine the current DEXCR state using a dedicated SPR that
+provides a non-privileged read-only view of the userspace DEXCR aspects.
+There is also an SPR that provides a read-only view of the hypervisor enforced
+aspects, which ORed with the userspace DEXCR view gives the effective DEXCR
+state for a process.
+
+
+Kernel Config
+=============
+
+The kernel supports a static default DEXCR value determined at config time.
+Set the ``PPC_DEXCR_DEFAULT`` config to the value you want all processes to
+use.
+
+
+coredump and ptrace
+===================
+
+The userspace values of the DEXCR and HDEXCR (in this order) are exposed under
+``NT_PPC_DEXCR``. These are each 32 bits and readonly, and are intended to
+assist with core dumps. The DEXCR may be made writable in future.
+
+If the kernel config ``CONFIG_CHECKPOINT_RESTORE`` is enabled, then
+``NT_PPC_HASHKEYR`` is available and exposes the HASHKEYR value of the process
+for reading and writing. This is a tradeoff between increased security and
+checkpoint/restore support: a process should normally have no need to know its
+secret key, but restoring a process requires setting its original key. The key
+therefore appears in core dumps, and an attacker may be able to retrieve it 
from
+a coredump and effectively bypass ROP protection on any threads that share this
+key (potentially all threads from the same parent that have not run 
``exec()``).
diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..d33b554ca7ba 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -15,6 +15,7 @@ powerpc
 cxl
 cxlflash
 dawr-power9
+dexcr
 dscr
 eeh-pci-error-recovery
 elf_hwcaps
-- 
2.40.1



[PATCH v3 05/12] powerpc/dexcr: Support userspace ROP protection

2023-05-18 Thread Benjamin Gray
The ISA 3.1B hashst and hashchk instructions use a per-cpu SPR HASHKEYR
to hold a key used in the hash calculation. This key should be different
for each process to make it harder for a malicious process to recreate
valid hash values for a victim process.

Add support for storing a per-thread hash key, and setting/clearing
HASHKEYR appropriately.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v1: * Guard HASHKEYR update behind change check
* HASHKEYR reset moved earlier to patch 2
---
 arch/powerpc/include/asm/processor.h |  1 +
 arch/powerpc/kernel/process.c| 17 +
 2 files changed, 18 insertions(+)

diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index e96c9b8c2a60..8a6754ffdc7e 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -264,6 +264,7 @@ struct thread_struct {
unsigned long   mmcr3;
unsigned long   sier2;
unsigned long   sier3;
+   unsigned long   hashkeyr;
 
 #endif
 };
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 1fefafb2b29b..b68898ac07e1 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1182,6 +1182,9 @@ static inline void save_sprs(struct thread_struct *t)
 */
t->tar = mfspr(SPRN_TAR);
}
+
+   if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
+   t->hashkeyr = mfspr(SPRN_HASHKEYR);
 #endif
 }
 
@@ -1260,6 +1263,10 @@ static inline void restore_sprs(struct thread_struct 
*old_thread,
if (cpu_has_feature(CPU_FTR_P9_TIDR) &&
old_thread->tidr != new_thread->tidr)
mtspr(SPRN_TIDR, new_thread->tidr);
+
+   if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE) &&
+   old_thread->hashkeyr != new_thread->hashkeyr)
+   mtspr(SPRN_HASHKEYR, new_thread->hashkeyr);
 #endif
 
 }
@@ -1867,6 +1874,10 @@ int copy_thread(struct task_struct *p, const struct 
kernel_clone_args *args)
}
 
p->thread.tidr = 0;
+#endif
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE))
+   p->thread.hashkeyr = current->thread.hashkeyr;
 #endif
return 0;
 }
@@ -1984,6 +1995,12 @@ void start_thread(struct pt_regs *regs, unsigned long 
start, unsigned long sp)
current->thread.tm_tfiar = 0;
current->thread.load_tm = 0;
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (cpu_has_feature(CPU_FTR_DEXCR_NPHIE)) {
+   current->thread.hashkeyr = get_random_long();
+   mtspr(SPRN_HASHKEYR, current->thread.hashkeyr);
+   }
+#endif /* CONFIG_PPC_BOOK3S_64 */
 }
 EXPORT_SYMBOL(start_thread);
 
-- 
2.40.1



[PATCH v3 12/12] selftests/powerpc/dexcr: Add DEXCR status utility lsdexcr

2023-05-18 Thread Benjamin Gray
Add a utility 'lsdexcr' to print the current DEXCR status. Useful for
quickly checking the status such as when debugging test failures or
verifying the new default DEXCR does what you want (for userspace at
least). Example output:

# ./lsdexcr
   uDEXCR: 04000000 (NPHIE)
   HDEXCR: 00000000
Effective: 04000000 (NPHIE)

SBHE   (0): clear   (Speculative branch hint enable)
  IBRTPD   (3): clear   (Indirect branch recurrent target ...)
   SRAPD   (4): clear   (Subroutine return address ...)
   NPHIE * (5): set (Non-privileged hash instruction enable)
PHIE   (6): clear   (Privileged hash instruction enable)

DEXCR[NPHIE] enabled: hashst/hashchk working

Signed-off-by: Benjamin Gray 

---

v1: * Report if hashst/hashchk actually does something
---
 .../selftests/powerpc/dexcr/.gitignore|   1 +
 .../testing/selftests/powerpc/dexcr/Makefile  |   2 +
 .../testing/selftests/powerpc/dexcr/lsdexcr.c | 141 ++
 3 files changed, 144 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/dexcr/lsdexcr.c

diff --git a/tools/testing/selftests/powerpc/dexcr/.gitignore b/tools/testing/selftests/powerpc/dexcr/.gitignore
index d12e4560aca9..b82f45dd46b9 100644
--- a/tools/testing/selftests/powerpc/dexcr/.gitignore
+++ b/tools/testing/selftests/powerpc/dexcr/.gitignore
@@ -1 +1,2 @@
 hashchk_test
+lsdexcr
diff --git a/tools/testing/selftests/powerpc/dexcr/Makefile b/tools/testing/selftests/powerpc/dexcr/Makefile
index 16c8b489948a..76210f2bcec3 100644
--- a/tools/testing/selftests/powerpc/dexcr/Makefile
+++ b/tools/testing/selftests/powerpc/dexcr/Makefile
@@ -1,7 +1,9 @@
 TEST_GEN_PROGS := hashchk_test
+TEST_GEN_FILES := lsdexcr
 
 include ../../lib.mk
 
 $(OUTPUT)/hashchk_test: CFLAGS += -fno-pie $(call cc-option,-mno-rop-protect)
 
 $(TEST_GEN_PROGS): ../harness.c ../utils.c ./dexcr.c
+$(TEST_GEN_FILES): ../utils.c ./dexcr.c
diff --git a/tools/testing/selftests/powerpc/dexcr/lsdexcr.c b/tools/testing/selftests/powerpc/dexcr/lsdexcr.c
new file mode 100644
index 000000000000..94abbfcc389e
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/lsdexcr.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include <errno.h>
+#include <stddef.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "dexcr.h"
+#include "utils.h"
+
+static unsigned int dexcr;
+static unsigned int hdexcr;
+static unsigned int effective;
+
+struct dexcr_aspect {
+   const char *name;
+   const char *desc;
+   unsigned int index;
+};
+
+static const struct dexcr_aspect aspects[] = {
+   {
+   .name = "SBHE",
+   .desc = "Speculative branch hint enable",
+   .index = 0,
+   },
+   {
+   .name = "IBRTPD",
+   .desc = "Indirect branch recurrent target prediction disable",
+   .index = 3,
+   },
+   {
+   .name = "SRAPD",
+   .desc = "Subroutine return address prediction disable",
+   .index = 4,
+   },
+   {
+   .name = "NPHIE",
+   .desc = "Non-privileged hash instruction enable",
+   .index = 5,
+   },
+   {
+   .name = "PHIE",
+   .desc = "Privileged hash instruction enable",
+   .index = 6,
+   },
+};
+
+static void print_list(const char *list[], size_t len)
+{
+   for (size_t i = 0; i < len; i++) {
+   printf("%s", list[i]);
+   if (i + 1 < len)
+   printf(", ");
+   }
+}
+
+static void print_dexcr(char *name, unsigned int bits)
+{
+   const char *enabled_aspects[ARRAY_SIZE(aspects) + 1] = {NULL};
+   size_t j = 0;
+
+   printf("%s: %08x", name, bits);
+
+   if (bits == 0) {
+   printf("\n");
+   return;
+   }
+
+   for (size_t i = 0; i < ARRAY_SIZE(aspects); i++) {
+   unsigned int mask = DEXCR_PR_BIT(aspects[i].index);
+
+   if (bits & mask) {
+   enabled_aspects[j++] = aspects[i].name;
+   bits &= ~mask;
+   }
+   }
+
+   if (bits)
+   enabled_aspects[j++] = "unknown";
+
+   printf(" (");
+   print_list(enabled_aspects, j);
+   printf(")\n");
+}
+
+static void print_aspect(const struct dexcr_aspect *aspect)
+{
+   const char *attributes[8] = {NULL};
+   size_t j = 0;
+   unsigned long mask;
+
+   mask = DEXCR_PR_BIT(aspect->index);
+   if (dexcr & mask)
+   attributes[j++] = "set";
+   if (hdexcr & mask)
+   attributes[j++] = "set (hypervisor)";
+   if (!(effective & mask))
+   attributes[j++] = "clear";
+
+   printf("%12s %c (%d): ", aspect->name, effective & mask ? '*' : ' ', aspect->index);
+   print_list(attributes, j);
+   printf("  \t(%s)\n", aspect->desc);
+}
+
+int main(int argc, char *argv[])
+{
+   if (!dexcr_exists()) {
+   printf

[PATCH v3 10/12] selftests/powerpc: Add more utility macros

2023-05-18 Thread Benjamin Gray
Adds _MSG assertion variants to provide more context behind why a
failure occurred. Also include unistd.h for _exit() and stdio.h for
fprintf(), and move ARRAY_SIZE macro to utils.h.

The _MSG variants and ARRAY_SIZE will be used by the following
DEXCR selftests.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Reword commit message
* Add ruscur reviewed-by
v1: * Remove the signal handler variants
* Describe why headers are included
---
 .../testing/selftests/powerpc/include/utils.h | 27 ++-
 .../powerpc/pmu/sampling_tests/misc.h |  2 --
 2 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/powerpc/include/utils.h b/tools/testing/selftests/powerpc/include/utils.h
index 44bfd48b93d6..9dc53c4fbfe3 100644
--- a/tools/testing/selftests/powerpc/include/utils.h
+++ b/tools/testing/selftests/powerpc/include/utils.h
@@ -9,11 +9,17 @@
 #define __cacheline_aligned __attribute__((aligned(128)))
 
 #include 
+#include <stdio.h>
 #include 
 #include 
 #include 
 #include 
 #include "reg.h"
+#include <unistd.h>
+
+#ifndef ARRAY_SIZE
+# define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
 
 /* Avoid headaches with PRI?64 - just use %ll? always */
 typedef unsigned long long u64;
@@ -67,7 +73,6 @@ struct perf_event_read {
 };
 
 #if !defined(__GLIBC_PREREQ) || !__GLIBC_PREREQ(2, 30)
-#include <unistd.h>
 #include <sys/syscall.h>
 
 static inline pid_t gettid(void)
@@ -116,6 +121,16 @@ do {   \
}   \
 } while (0)
 
+#define FAIL_IF_MSG(x, msg)\
+do {   \
+   if ((x)) {  \
+   fprintf(stderr, \
+   "[FAIL] Test FAILED on line %d: %s\n",  \
+   __LINE__, msg); \
+   return 1;   \
+   }   \
+} while (0)
+
 #define FAIL_IF_EXIT(x)\
 do {   \
if ((x)) {  \
@@ -125,6 +140,16 @@ do {   \
}   \
 } while (0)
 
+#define FAIL_IF_EXIT_MSG(x, msg)   \
+do {   \
+   if ((x)) {  \
+   fprintf(stderr, \
+   "[FAIL] Test FAILED on line %d: %s\n",  \
+   __LINE__, msg); \
+   _exit(1);   \
+   }   \
+} while (0)
+
 /* The test harness uses this, yes it's gross */
 #define MAGIC_SKIP_RETURN_VALUE99
 
diff --git a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
index 4181755cf5a0..64e25cce1435 100644
--- a/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
+++ b/tools/testing/selftests/powerpc/pmu/sampling_tests/misc.h
@@ -18,8 +18,6 @@
 #define MMCR1_RSQ   0x2000ULL /* radix scope qual field */
 #define BHRB_DISABLE0x20ULL /* MMCRA BHRB DISABLE bit */
 
-#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
-
 extern int ev_mask_pmcxsel, ev_shift_pmcxsel;
 extern int ev_mask_marked, ev_shift_marked;
 extern int ev_mask_comb, ev_shift_comb;
-- 
2.40.1



[PATCH v3 07/12] powerpc/ptrace: Expose DEXCR and HDEXCR registers to ptrace

2023-05-18 Thread Benjamin Gray
The DEXCR register is of interest when ptracing processes. Currently it
is static, but eventually it will be dynamically controllable by a
process. If a process can control its own, then it is useful for it to
be ptrace-able too (e.g., for checkpoint-restore functionality).

It is also relevant to core dumps (the NPHIE aspect in particular),
which use the ptrace mechanism (or is it the other way around?) to
decide what to dump. The HDEXCR is useful here too, as the NPHIE aspect
may be set in the HDEXCR without being set in the DEXCR. Although the
HDEXCR is per-cpu and we don't track it in the task struct (it's useless
in normal operation), it would be difficult to imagine why a hypervisor
would set it to different values within a guest. A hypervisor cannot
safely set NPHIE differently at least, as that would break programs.

Expose a read-only view of the userspace DEXCR and HDEXCR to ptrace.
The HDEXCR is always readonly, and is useful for diagnosing the core
dumps (as the HDEXCR may set NPHIE without the DEXCR setting it).

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v2: * New in v2
---
 arch/powerpc/include/uapi/asm/elf.h  |  1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h |  1 +
 arch/powerpc/kernel/ptrace/ptrace-view.c | 31 +++-
 include/uapi/linux/elf.h |  1 +
 4 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h
index dbc4a5b8d02d..e0d323c808dd 100644
--- a/arch/powerpc/include/uapi/asm/elf.h
+++ b/arch/powerpc/include/uapi/asm/elf.h
@@ -98,6 +98,7 @@
 #define ELF_NEBB   3   /* includes ebbrr, ebbhr, bescr */
 #define ELF_NPMU   5   /* includes siar, sdar, sier, mmcr2, mmcr0 */
 #define ELF_NPKEY  3   /* includes amr, iamr, uamor */
+#define ELF_NDEXCR 2   /* includes dexcr, hdexcr */
 
 typedef unsigned long elf_greg_t64;
 typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG];
diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index 463a63eb8cc7..998a84f64804 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -57,6 +57,7 @@ enum powerpc_regset {
REGSET_TAR, /* TAR register */
REGSET_EBB, /* EBB registers */
REGSET_PMR, /* Performance Monitor Registers */
+   REGSET_DEXCR,   /* DEXCR registers */
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
REGSET_PKEY,/* AMR register */
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index 5fff0d04b23f..d3304fb932fa 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -454,7 +454,31 @@ static int pmu_set(struct task_struct *target, const struct user_regset *regset,
 5 * sizeof(unsigned long));
return ret;
 }
-#endif
+
+static int dexcr_active(struct task_struct *target, const struct user_regset *regset)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -ENODEV;
+
+   return regset->n;
+}
+
+static int dexcr_get(struct task_struct *target, const struct user_regset *regset,
+                    struct membuf to)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -ENODEV;
+
+   membuf_store(&to, (unsigned int)CONFIG_PPC_DEXCR_DEFAULT);
+
+   /*
+* Technically the HDEXCR is per-cpu, but a hypervisor can't reasonably
+* change it between CPUs of the same guest.
+*/
+   return membuf_store(&to, (unsigned int)mfspr(SPRN_HDEXCR_RO));
+}
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_PPC_MEM_KEYS
 static int pkey_active(struct task_struct *target, const struct user_regset *regset)
@@ -615,6 +639,11 @@ static const struct user_regset native_regsets[] = {
.size = sizeof(u64), .align = sizeof(u64),
.active = pmu_active, .regset_get = pmu_get, .set = pmu_set
},
+   [REGSET_DEXCR] = {
+   .core_note_type = NT_PPC_DEXCR, .n = ELF_NDEXCR,
+   .size = sizeof(u32), .align = sizeof(u32),
+   .active = dexcr_active, .regset_get = dexcr_get
+   },
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
[REGSET_PKEY] = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index ac3da855fb19..cfa31f1eb5d7 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -403,6 +403,7 @@ typedef struct elf64_shdr {
 #define NT_PPC_TM_CPPR  0x10e   /* TM checkpointed Program Priority Register */
 #define NT_PPC_TM_CDSCR 0x10f   /* TM checkpointed Data Stream Control Register */
 #define NT_PPC_PKEY0x110   /* Memory Protection Keys registers */
+#define NT_PPC_DEXCR   0x111   /* PowerPC DEXCR registers */
 #define NT_386_TLS 0x200 

[PATCH v3 02/12] powerpc/ptrace: Add missing <linux/regset.h> include

2023-05-18 Thread Benjamin Gray
ptrace-decl.h uses user_regset_get2_fn (among other things) from
regset.h. While all current users of ptrace-decl.h include regset.h
before it anyway, it adds an implicit ordering dependency and breaks
source tooling that tries to inspect ptrace-decl.h by itself.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v2: * New in v2
---
 arch/powerpc/kernel/ptrace/ptrace-decl.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index eafe5f0f6289..463a63eb8cc7 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0-or-later */
 
+#include <linux/regset.h>
+
 /*
  * Set of msr bits that gdb can change on behalf of a process.
  */
-- 
2.40.1



[PATCH v3 08/12] powerpc/ptrace: Expose HASHKEYR register to ptrace

2023-05-18 Thread Benjamin Gray
The HASHKEYR register contains a secret per-process key to enable unique
hashes per process. In general it should not be exposed to userspace
at all and a regular process has no need to know its key.

However, checkpoint restore in userspace (CRIU) functionality requires
that a process be able to set the HASHKEYR of another process, otherwise
existing hashes on the stack would be invalidated by a new random key.

Exposing HASHKEYR in this way also makes it appear in core dumps, which
is a security concern. Multiple threads may share a key, for example
just after a fork() call, where the kernel cannot know if the child is
going to return back along the parent's stack. If such a thread is
coerced into making a core dump, then the HASHKEYR value will be
readable and able to be used against all other threads sharing that key,
effectively undoing any protection offered by hashst/hashchk.

Therefore we expose HASHKEYR to ptrace when CONFIG_CHECKPOINT_RESTORE is
enabled, providing a choice of increased security or migratable ROP
protected processes. This is similar to how ARM exposes its PAC keys.

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v2: * New in v2
---
 arch/powerpc/include/uapi/asm/elf.h  |  1 +
 arch/powerpc/kernel/ptrace/ptrace-decl.h |  3 ++
 arch/powerpc/kernel/ptrace/ptrace-view.c | 36 
 include/uapi/linux/elf.h |  1 +
 4 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/elf.h b/arch/powerpc/include/uapi/asm/elf.h
index e0d323c808dd..a5377f494fa3 100644
--- a/arch/powerpc/include/uapi/asm/elf.h
+++ b/arch/powerpc/include/uapi/asm/elf.h
@@ -99,6 +99,7 @@
 #define ELF_NPMU   5   /* includes siar, sdar, sier, mmcr2, mmcr0 */
 #define ELF_NPKEY  3   /* includes amr, iamr, uamor */
 #define ELF_NDEXCR 2   /* includes dexcr, hdexcr */
+#define ELF_NHASHKEYR  1   /* includes hashkeyr */
 
 typedef unsigned long elf_greg_t64;
 typedef elf_greg_t64 elf_gregset_t64[ELF_NGREG];
diff --git a/arch/powerpc/kernel/ptrace/ptrace-decl.h b/arch/powerpc/kernel/ptrace/ptrace-decl.h
index 998a84f64804..4171a5727197 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-decl.h
+++ b/arch/powerpc/kernel/ptrace/ptrace-decl.h
@@ -58,6 +58,9 @@ enum powerpc_regset {
REGSET_EBB, /* EBB registers */
REGSET_PMR, /* Performance Monitor Registers */
REGSET_DEXCR,   /* DEXCR registers */
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   REGSET_HASHKEYR,/* HASHKEYR register */
+#endif
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
REGSET_PKEY,/* AMR register */
diff --git a/arch/powerpc/kernel/ptrace/ptrace-view.c b/arch/powerpc/kernel/ptrace/ptrace-view.c
index d3304fb932fa..acbb8ec11b1e 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-view.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-view.c
@@ -478,6 +478,35 @@ static int dexcr_get(struct task_struct *target, const struct user_regset *regse
return membuf_store(&to, (unsigned int)mfspr(SPRN_HDEXCR_RO));
 }
 
+#ifdef CONFIG_CHECKPOINT_RESTORE
+static int hashkeyr_active(struct task_struct *target, const struct user_regset *regset)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -ENODEV;
+
+   return regset->n;
+}
+
+static int hashkeyr_get(struct task_struct *target, const struct user_regset *regset,
+                       struct membuf to)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -ENODEV;
+
+   return membuf_store(&to, target->thread.hashkeyr);
+}
+
+static int hashkeyr_set(struct task_struct *target, const struct user_regset *regset,
+   unsigned int pos, unsigned int count, const void *kbuf,
+   const void __user *ubuf)
+{
+   if (!cpu_has_feature(CPU_FTR_ARCH_31))
+   return -ENODEV;
+
+   return user_regset_copyin(&pos, &count, &kbuf, &ubuf, &target->thread.hashkeyr,
+ 0, sizeof(unsigned long));
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE */
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #ifdef CONFIG_PPC_MEM_KEYS
@@ -644,6 +673,13 @@ static const struct user_regset native_regsets[] = {
.size = sizeof(u32), .align = sizeof(u32),
.active = dexcr_active, .regset_get = dexcr_get
},
+#ifdef CONFIG_CHECKPOINT_RESTORE
+   [REGSET_HASHKEYR] = {
+   .core_note_type = NT_PPC_HASHKEYR, .n = ELF_NHASHKEYR,
+   .size = sizeof(u64), .align = sizeof(u64),
+   .active = hashkeyr_active, .regset_get = hashkeyr_get, .set = hashkeyr_set
+   },
+#endif
 #endif
 #ifdef CONFIG_PPC_MEM_KEYS
[REGSET_PKEY] = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index cfa31f1eb5d7..b705b301d88f 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -404,6 +404,7 @@ typedef struct elf64_shdr {
 #define NT_PPC_TM_CDSCR 

[PATCH v3 06/12] powerpc/dexcr: Support custom default DEXCR value

2023-05-18 Thread Benjamin Gray
Make the DEXCR value configurable at config time. Intentionally don't
limit possible values to support future aspects without needing kernel
updates.

The default config value enables hashst/hashchk in problem state.
This should be safe, as generally software needs to request these
instructions be included in the first place.
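
As a worked example of where that default comes from (assuming the
PPC_BIT() convention from <asm/bitops.h>, where PPC_BIT(n) selects
the n-th most significant of 64 bits, i.e. 1UL << (63 - n)): NPHIE
is aspect 5, so

	DEXCR_PR_BIT(5) = PPC_BIT(32 + 5) = 1UL << (63 - 37) = 0x0000000004000000

which is NPHIE set in the problem-state half of the register and
every other aspect left clear.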

Signed-off-by: Benjamin Gray 
Reviewed-by: Russell Currey 

---

v3: * Fix hashchk typo, provide minimum ISA version
* Add ruscur reviewed-by
v1: * New in v1

Preface with: I'm not sure on the best place to put the config.

I also don't think there's any need to zero out unknown/unsupported
bits. Reserved implies they are ignored by the hardware (from my
understanding of the ISA). Current P10s boot with all bits set; lsdexcr
(later patch) reports

   uDEXCR: ffffffff (SBHE, IBRTPD, SRAPD, NPHIE, PHIE, unknown)

when you try to read it back. Leaving them be also makes it easier to
support newer aspects without a kernel update.

If arbitrary value support isn't important, it's probably a nicer
interface to make each aspect an entry in a menu.

Future work may include dynamic DEXCR controls via prctl() and sysfs.
The dynamic controls would be able to override this default DEXCR on a
per-process basis. A stronger "PPC_ENFORCE_USER_ROP_PROTECTION" config
may be required at such a time to prevent dynamically disabling the
hash checks.
---
 arch/powerpc/Kconfig  | 14 ++
 arch/powerpc/kernel/cpu_setup_power.c |  3 ++-
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 539d1f03ff42..b96df37e4171 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -1039,6 +1039,20 @@ config PPC_MEM_KEYS
 
  If unsure, say y.
 
+config PPC_DEXCR_DEFAULT
+   hex "Default DEXCR value"
   default 0x0000000004000000
+   depends on PPC_BOOK3S_64
+   help
+ Power10 introduces the Dynamic Execution Control Register (DEXCR)
+ to provide fine grained control over various speculation and
+ security capabilities. This is used as the default DEXCR value.
+
+ It is a 64 bit value that splits into 32 bits for supervisor mode
+ and 32 bits for problem state. The default config value enables
+ the hashst/hashchk instructions in userspace. See the ISA (3.1B or
+ later) for specifics of what each bit controls.
+
 config PPC_SECURE_BOOT
prompt "Enable secure boot support"
bool
diff --git a/arch/powerpc/kernel/cpu_setup_power.c b/arch/powerpc/kernel/cpu_setup_power.c
index c00721801a1b..814c825a0661 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -128,7 +129,7 @@ static void init_PMU_ISA31(void)
 
 static void init_DEXCR(void)
 {
-   mtspr(SPRN_DEXCR, 0);
+   mtspr(SPRN_DEXCR, CONFIG_PPC_DEXCR_DEFAULT);
mtspr(SPRN_HASHKEYR, 0);
 }
 
-- 
2.40.1



[PATCH v3 11/12] selftests/powerpc/dexcr: Add hashst/hashchk test

2023-05-18 Thread Benjamin Gray
Test the kernel DEXCR[NPHIE] interface and hashchk exception handling.

Introduces with it a DEXCR utils library for common DEXCR operations.

Volatile is used to prevent the compiler optimising away the signal
tests.

Signed-off-by: Benjamin Gray 

---
v1: * Clean up dexcr makefile
* Include kernel headers in CFLAGS
* Use numeric literals for hashst/hashchk to support older
  toolchains
* A lot of other refactoring
---
 tools/testing/selftests/powerpc/Makefile  |   1 +
 .../selftests/powerpc/dexcr/.gitignore|   1 +
 .../testing/selftests/powerpc/dexcr/Makefile  |   7 +
 tools/testing/selftests/powerpc/dexcr/dexcr.c | 132 ++
 tools/testing/selftests/powerpc/dexcr/dexcr.h |  49 
 .../selftests/powerpc/dexcr/hashchk_test.c| 227 ++
 tools/testing/selftests/powerpc/include/reg.h |   4 +
 .../testing/selftests/powerpc/include/utils.h |   4 +
 tools/testing/selftests/powerpc/utils.c   |  24 ++
 9 files changed, 449 insertions(+)
 create mode 100644 tools/testing/selftests/powerpc/dexcr/.gitignore
 create mode 100644 tools/testing/selftests/powerpc/dexcr/Makefile
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.c
 create mode 100644 tools/testing/selftests/powerpc/dexcr/dexcr.h
 create mode 100644 tools/testing/selftests/powerpc/dexcr/hashchk_test.c

diff --git a/tools/testing/selftests/powerpc/Makefile b/tools/testing/selftests/powerpc/Makefile
index ae2bfc0d822f..49f2ad1793fd 100644
--- a/tools/testing/selftests/powerpc/Makefile
+++ b/tools/testing/selftests/powerpc/Makefile
@@ -17,6 +17,7 @@ SUB_DIRS = alignment  \
   benchmarks   \
   cache_shape  \
   copyloops\
+  dexcr\
   dscr \
   mm   \
   nx-gzip  \
diff --git a/tools/testing/selftests/powerpc/dexcr/.gitignore b/tools/testing/selftests/powerpc/dexcr/.gitignore
new file mode 100644
index 000000000000..d12e4560aca9
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/.gitignore
@@ -0,0 +1 @@
+hashchk_test
diff --git a/tools/testing/selftests/powerpc/dexcr/Makefile b/tools/testing/selftests/powerpc/dexcr/Makefile
new file mode 100644
index 000000000000..16c8b489948a
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/Makefile
@@ -0,0 +1,7 @@
+TEST_GEN_PROGS := hashchk_test
+
+include ../../lib.mk
+
+$(OUTPUT)/hashchk_test: CFLAGS += -fno-pie $(call cc-option,-mno-rop-protect)
+
+$(TEST_GEN_PROGS): ../harness.c ../utils.c ./dexcr.c
diff --git a/tools/testing/selftests/powerpc/dexcr/dexcr.c b/tools/testing/selftests/powerpc/dexcr/dexcr.c
new file mode 100644
index 000000000000..65ec5347de98
--- /dev/null
+++ b/tools/testing/selftests/powerpc/dexcr/dexcr.c
@@ -0,0 +1,132 @@
+// SPDX-License-Identifier: GPL-2.0+
+
+#include <setjmp.h>
+#include <signal.h>
+#include <stdbool.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+
+#include "dexcr.h"
+#include "reg.h"
+#include "utils.h"
+
+static jmp_buf generic_signal_jump_buf;
+
+static void generic_signal_handler(int signum, siginfo_t *info, void *context)
+{
+   longjmp(generic_signal_jump_buf, 0);
+}
+
+bool dexcr_exists(void)
+{
+   struct sigaction old;
+   volatile bool exists;
+
+   old = push_signal_handler(SIGILL, generic_signal_handler);
+   if (setjmp(generic_signal_jump_buf))
+   goto out;
+
+   /*
+* If the SPR is not recognised by the hardware it triggers
+* a hypervisor emulation interrupt. If the kernel does not
+* recognise/try to emulate it, we receive a SIGILL signal.
+*
+* If we do not receive a signal, assume we have the SPR or the
+* kernel is trying to emulate it correctly.
+*/
+   exists = false;
+   mfspr(SPRN_DEXCR_RO);
+   exists = true;
+
+out:
+   pop_signal_handler(SIGILL, old);
+   return exists;
+}
+
+/*
+ * Just test if a bad hashchk triggers a signal, without checking
+ * for support or if the NPHIE aspect is enabled.
+ */
+bool hashchk_triggers(void)
+{
+   struct sigaction old;
+   volatile bool triggers;
+
+   old = push_signal_handler(SIGILL, generic_signal_handler);
+   if (setjmp(generic_signal_jump_buf))
+   goto out;
+
+   triggers = true;
+   do_bad_hashchk();
+   triggers = false;
+
+out:
+   pop_signal_handler(SIGILL, old);
+   return triggers;
+}
+
+unsigned int get_dexcr(enum dexcr_source source)
+{
+   switch (source) {
+   case DEXCR:
+   return mfspr(SPRN_DEXCR_RO);
+   case HDEXCR:
+   return mfspr(SPRN_HDEXCR_RO);
+   case EFFECTIVE:
+   return mfspr(SPRN_DEXCR_RO) | mfspr(SPRN_HDEXCR_RO);
+   default:
+   FAIL_IF_EXIT_MSG(true, "bad enum dexcr_source");
+   }
+}
+
+void await_child_success(pid_t pid)
+{
+   int wstatus;
+
+   FAIL_IF_EXIT_MSG(pid == -1, "fork failed");
+   FAIL_IF_EXIT_MSG(waitpid(pid, &wstatus, 0)

Re: [PATCH] mm: kfence: Fix false positives on big endian

2023-05-18 Thread Michael Ellerman
Andrew Morton  writes:
> On Fri, 5 May 2023 16:02:17 + David Laight  
> wrote:
>
>> From: Michael Ellerman
>> > Sent: 05 May 2023 04:51
>> > 
>> > Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>> > __kfence_alloc() and __kfence_free()"), kfence reports failures in
>> > random places at boot on big endian machines.
>> > 
>> > The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the
>> > address of each byte in its value, so it needs to be byte swapped on big
>> > endian machines.
>> > 
>> > The compiler is smart enough to do the le64_to_cpu() at compile time, so
>> > there is no runtime overhead.
>> > 
>> > Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of 
>> > __kfence_alloc() and __kfence_free()")
>> > Signed-off-by: Michael Ellerman 
>> > ---
>> >  mm/kfence/kfence.h | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
>> > index 2aafc46a4aaf..392fb273e7bd 100644
>> > --- a/mm/kfence/kfence.h
>> > +++ b/mm/kfence/kfence.h
>> > @@ -29,7 +29,7 @@
>> >   * canary of every 8 bytes is the same. 64-bit memory can be filled and 
>> > checked
>> >   * at a time instead of byte by byte to improve performance.
>> >   */
>> > -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(0x0706050403020100))
>> > +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
>> 
>> What are the (u64) casts for?
>> The constants should probably have a ul (or ull) suffix.
>> 
>
> I tried that, didn't fix the sparse warnings described at
> https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
>
> Michael, have you looked into this?

I haven't sorry, been chasing other bugs.

> I'll merge it upstream - I guess we can live with the warnings for a while.

Thanks, yeah spurious WARNs are more of a pain than some sparse warnings.

Maybe using le64_to_cpu() is too fancy, could just do it with an ifdef? eg.

diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
index 392fb273e7bd..510355a5382b 100644
--- a/mm/kfence/kfence.h
+++ b/mm/kfence/kfence.h
@@ -29,7 +29,11 @@
  * canary of every 8 bytes is the same. 64-bit memory can be filled and checked
  * at a time instead of byte by byte to improve performance.
  */
-#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
+#ifdef __LITTLE_ENDIAN__
+#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL)
+#else
+#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0001020304050607ULL)
+#endif
 
 /* Maximum stack depth for reports. */
 #define KFENCE_STACK_DEPTH 64


cheers


[PATCH v3 01/12] powerpc/book3s: Add missing <asm/processor.h> include

2023-05-18 Thread Benjamin Gray
The functions here use struct thread_struct fields, so need to import
the full definition from <asm/processor.h>. The <asm/current.h> header
that defines current only forward declares struct thread_struct.

Failing to include this <asm/processor.h> header leads to a compilation
error when a translation unit does not also include <asm/processor.h>
indirectly.

Signed-off-by: Benjamin Gray 
Reviewed-by: Nicholas Piggin 
Reviewed-by: Russell Currey 

---

v3: * Add ruscur reviewed-by
v1: * Add npiggin reviewed-by
---
 arch/powerpc/include/asm/book3s/64/kup.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 54cf46808157..84c09e546115 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -194,6 +194,7 @@
 #else /* !__ASSEMBLY__ */
 
 #include 
+#include <asm/processor.h>
 
 DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
 
-- 
2.40.1



Re: [PATCH] mm: kfence: Fix false positives on big endian

2023-05-18 Thread Christophe Leroy


Le 18/05/2023 à 00:20, Andrew Morton a écrit :
> On Fri, 5 May 2023 16:02:17 + David Laight  
> wrote:
> 
>> From: Michael Ellerman
>>> Sent: 05 May 2023 04:51
>>>
>>> Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance of
>>> __kfence_alloc() and __kfence_free()"), kfence reports failures in
>>> random places at boot on big endian machines.
>>>
>>> The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes the
>>> address of each byte in its value, so it needs to be byte swapped on big
>>> endian machines.
>>>
>>> The compiler is smart enough to do the le64_to_cpu() at compile time, so
>>> there is no runtime overhead.
>>>
>>> Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of 
>>> __kfence_alloc() and __kfence_free()")
>>> Signed-off-by: Michael Ellerman 
>>> ---
>>>   mm/kfence/kfence.h | 2 +-
>>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
>>> index 2aafc46a4aaf..392fb273e7bd 100644
>>> --- a/mm/kfence/kfence.h
>>> +++ b/mm/kfence/kfence.h
>>> @@ -29,7 +29,7 @@
>>>* canary of every 8 bytes is the same. 64-bit memory can be filled and 
>>> checked
>>>* at a time instead of byte by byte to improve performance.
>>>*/
>>> -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(0x0706050403020100))
>>> +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
>>
>> What are the (u64) casts for?
>> The constants should probably have a ul (or ull) suffix.
>>
> 
> I tried that, didn't fix the sparse warnings described at
> https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
> 
> Michael, have you looked into this?
> 
> I'll merge it upstream - I guess we can live with the warnings for a while.
> 

sparse warning goes away with:

#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ le64_to_cpu((__force __le64)0x0706050403020100))

Christophe


Re: [PATCH] mm: kfence: Fix false positives on big endian

2023-05-18 Thread Benjamin Gray
On Fri, 2023-05-19 at 15:14 +1000, Michael Ellerman wrote:
> Andrew Morton  writes:
> > On Fri, 5 May 2023 16:02:17 + David Laight
> >  wrote:
> > 
> > > From: Michael Ellerman
> > > > Sent: 05 May 2023 04:51
> > > > 
> > > > Since commit 1ba3cbf3ec3b ("mm: kfence: improve the performance
> > > > of
> > > > __kfence_alloc() and __kfence_free()"), kfence reports failures
> > > > in
> > > > random places at boot on big endian machines.
> > > > 
> > > > The problem is that the new KFENCE_CANARY_PATTERN_U64 encodes
> > > > the
> > > > address of each byte in its value, so it needs to be byte
> > > > swapped on big
> > > > endian machines.
> > > > 
> > > > The compiler is smart enough to do the le64_to_cpu() at compile
> > > > time, so
> > > > there is no runtime overhead.
> > > > 
> > > > Fixes: 1ba3cbf3ec3b ("mm: kfence: improve the performance of
> > > > __kfence_alloc() and __kfence_free()")
> > > > Signed-off-by: Michael Ellerman 
> > > > ---
> > > >  mm/kfence/kfence.h | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> > > > index 2aafc46a4aaf..392fb273e7bd 100644
> > > > --- a/mm/kfence/kfence.h
> > > > +++ b/mm/kfence/kfence.h
> > > > @@ -29,7 +29,7 @@
> > > >   * canary of every 8 bytes is the same. 64-bit memory can be
> > > > filled and checked
> > > >   * at a time instead of byte by byte to improve performance.
> > > >   */
> > > > -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(0x0706050403020100))
> > > > +#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
> > > 
> > > What are the (u64) casts for?
> > > The constants should probably have a ul (or ull) suffix.
> > > 
> > 
> > I tried that, didn't fix the sparse warnings described at
> > https://lkml.kernel.org/r/202305132244.dwzbucud-...@intel.com.
> > 
> > Michael, have you looked into this?
> 
> I haven't sorry, been chasing other bugs.
> 
> > I'll merge it upstream - I guess we can live with the warnings for
> > a while.
> 
> Thanks, yeah spurious WARNs are more of a pain than some sparse
> warnings.
> 
> Maybe using le64_to_cpu() is too fancy, could just do it with an
> ifdef? eg.
> 
> diff --git a/mm/kfence/kfence.h b/mm/kfence/kfence.h
> index 392fb273e7bd..510355a5382b 100644
> --- a/mm/kfence/kfence.h
> +++ b/mm/kfence/kfence.h
> @@ -29,7 +29,11 @@
>   * canary of every 8 bytes is the same. 64-bit memory can be filled
> and checked
>   * at a time instead of byte by byte to improve performance.
>   */
> -#define KFENCE_CANARY_PATTERN_U64 ((u64)0xaaaaaaaaaaaaaaaa ^ (u64)(le64_to_cpu(0x0706050403020100)))
> +#ifdef __LITTLE_ENDIAN__
> +#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL)
> +#else
> +#define KFENCE_CANARY_PATTERN_U64 (0xaaaaaaaaaaaaaaaaULL ^ 0x0001020304050607ULL)
> +#endif
>  
>  /* Maximum stack depth for reports. */
>  #define KFENCE_STACK_DEPTH 64
> 
> 
> cheers

(for the sparse errors)

As I understand, we require memory to look like "00 01 02 03 04 05 06
07" such that iterating byte-by-byte gives 00, 01, etc. (with
everything XORed with 0xaa).

I think it would be most semantically correct to use cpu_to_le64 on
KFENCE_CANARY_PATTERN_U64 and annotate the values being compared
against it as __le64. This is because we want the integer literal
0x0706050403020100 to be stored as "00 01 02 03 04 05 06 07", which is
the definition of little endian.

Masking this with an #ifdef leaves the type as cpu endian, which could
result in future issues.

(or I've just misunderstood and can disregard this)
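
To make the byte-order reasoning above concrete, here is a small
stand-alone user-space sketch (illustrative only, not kernel code;
CANARY_BYTE mirrors kfence's per-byte pattern of 0xaa ^ (addr & 0x7)):

	#include <stdint.h>
	#include <stdio.h>
	#include <string.h>

	/* Per-byte canary: 0xaa ^ (byte address & 0x7). */
	#define CANARY_BYTE(addr) ((uint8_t)0xaa ^ (uint8_t)((uintptr_t)(addr) & 0x7))

	int main(void)
	{
		/* The unswapped 64-bit pattern from the patch under discussion. */
		uint64_t pattern = 0xaaaaaaaaaaaaaaaaULL ^ 0x0706050403020100ULL;
		uint8_t mem[8];

		memcpy(mem, &pattern, sizeof(mem));	/* fill 8 bytes at once */

		/* Matches byte-by-byte on little endian; reversed on big endian. */
		for (int i = 0; i < 8; i++)
			printf("byte %d: got %02x, want %02x\n", i, mem[i], CANARY_BYTE(i));

		return 0;
	}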