Re: [patch 34/39] PCI/MSI: Reject multi-MSI early

2022-11-17 Thread Thomas Gleixner
On Wed, Nov 16 2022 at 10:31, Bjorn Helgaas wrote:
> On Fri, Nov 11, 2022 at 02:55:09PM +0100, Thomas Gleixner wrote:
>>  
>> +/**
>> + * pci_msi_domain_supports - Check for support of a particular feature flag
>> + * @pdev:   The PCI device to operate on
>> + * @feature_mask:   The feature mask to check for (full match)
>> + * @mode:   If ALLOW_LEGACY this grants the feature when there is no irq domain
>> + *  associated to the device. If DENY_LEGACY the lack of an irq domain
>> + *  makes the feature unsupported
>
> Looks like some of these might be wider than 80 columns, which I think
> was the typical width of this file.

I accommodated to the general sentiment that 80 columns is yesterday. My
new cutoff is 100.

Thanks,

tglx
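
For context, patch 34/39 uses the helper documented above to reject multi-MSI
before any vector allocation happens. A hedged sketch of that kind of early
check (the exact placement and the "return 1" convention are assumptions here,
not quoted from the patch):

	/* Sketch: refuse multi-MSI early when the irq domain cannot do it.
	 * ALLOW_LEGACY keeps devices without an irq domain working. */
	if (nvec > 1 &&
	    !pci_msi_domain_supports(pdev, MSI_FLAG_MULTI_PCI_MSI, ALLOW_LEGACY))
		return 1;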


[PATCH v7 0/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-11-17 Thread Yicong Yang
From: Yicong Yang 

Though ARM64 has the hardware to do tlb shootdown, the hardware
broadcasting is not free. A simple micro benchmark shows that even on
a snapdragon 888 with only 8 cores, the overhead of ptep_clear_flush
is huge, even for paging out one page mapped by only one process:
5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush

When pages are mapped by multiple processes or the HW has more CPUs,
the cost becomes even higher due to the bad scalability of
tlb shootdown.

The same benchmark can result in 16.99% CPU consumption on an ARM64
server with around 100 cores, according to Yicong's test on patch
4/4.

This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
1. only sending tlbi instructions in the first stage -
arch_tlbbatch_add_mm()
2. waiting for the completion of the tlbi by dsb while doing the tlbbatch
sync in arch_tlbbatch_flush()
Testing on snapdragon shows the overhead of ptep_clear_flush
is removed by the patchset. The micro benchmark becomes 5% faster
even for one page mapped by a single process on snapdragon 888.
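
For orientation, a simplified sketch of the two arm64 hooks described above
(the helper name used to issue a tlbi without the trailing dsb is an
assumption here; the real implementation, including the rename to
arch_tlbbatch_add_pending() mentioned in the v7 changelog, is in patch 2/2):

	/* Stage 1: queue a per-page tlbi broadcast without waiting.
	 * __flush_tlb_page_nosync() is assumed to issue the tlbi only. */
	static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
						     struct mm_struct *mm,
						     unsigned long uaddr)
	{
		__flush_tlb_page_nosync(mm, uaddr);
	}

	/* Stage 2: one dsb waits for all previously issued tlbi to complete. */
	static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
	{
		dsb(ish);
	}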

With this support it becomes possible to do more optimization for memory
reclamation and migration[*].

[*] 
https://lore.kernel.org/lkml/393d6318-aa38-01ed-6ad8-f9eac89bf...@linux.alibaba.com/

-v7:
1. rename arch_tlbbatch_add_mm() to arch_tlbbatch_add_pending() as suggested, since it
   takes an extra address for arm64, per Nadav and Anshuman. Also mentioned in the commit.
2. add tags from Xin Hao, thanks.
Link: https://lore.kernel.org/lkml/20221115031425.44640-1-yangyic...@huawei.com/

-v6:
1. comment we don't defer TLB flush on platforms affected by 
ARM64_WORKAROUND_REPEAT_TLBI
2. use cpus_have_const_cap() instead of this_cpu_has_cap()
3. add tags from Punit, Thanks.
4. default enable the feature when cpus >= 8 rather than > 8, since the original
   improvement is observed on snapdragon 888 with 8 cores.
Link: https://lore.kernel.org/lkml/20221028081255.19157-1-yangyic...@huawei.com/

-v5:
1. Make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend on EXPERT for this stage on arm64.
2. Add a threshold of CPU numbers for enabling batched TLB flush on arm64
Link: 
https://lore.kernel.org/linux-arm-kernel/20220921084302.43631-1-yangyic...@huawei.com/T/

-v4:
1. Add tags from Kefeng and Anshuman, Thanks.
2. Limit the TLB batch/defer on systems with >4 CPUs, per Anshuman
3. Merge previous Patch 1,2-3 into one, per Anshuman
Link: 
https://lore.kernel.org/linux-mm/20220822082120.8347-1-yangyic...@huawei.com/

-v3:
1. Declare arch's tlbbatch defer support by arch_tlbbatch_should_defer() instead
   of ARCH_HAS_MM_CPUMASK, per Barry and Kefeng
2. Add Tested-by from Xin Hao
Link: 
https://lore.kernel.org/linux-mm/20220711034615.482895-1-21cn...@gmail.com/

-v2:
1. Collected Yicong's test result on kunpeng920 ARM64 server;
2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
   according to the comments of Peter Zijlstra and Dave Hansen
3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
   is empty according to the comments of Nadav Amit

Thanks to Peter, Dave and Nadav for your testing, reviewing and comments.

-v1:
https://lore.kernel.org/lkml/20220707125242.425242-1-21cn...@gmail.com/

Anshuman Khandual (1):
  mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

Barry Song (1):
  arm64: support batched/deferred tlb shootdown during page reclamation

 .../features/vm/TLB/arch-support.txt  |  2 +-
 arch/arm64/Kconfig|  6 +++
 arch/arm64/include/asm/tlbbatch.h | 12 +
 arch/arm64/include/asm/tlbflush.h | 52 ++-
 arch/x86/include/asm/tlbflush.h   | 17 +-
 include/linux/mm_types_task.h |  4 +-
 mm/rmap.c | 19 +++
 7 files changed, 93 insertions(+), 19 deletions(-)
 create mode 100644 arch/arm64/include/asm/tlbbatch.h

-- 
2.24.0



[PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation

2022-11-17 Thread Yicong Yang
From: Barry Song 

On x86, batched and deferred tlb shootdown has led to a 90%
performance increase on tlb shootdown. On arm64, HW can do
tlb shootdown without a software IPI, but a synchronous tlbi is
still quite expensive.

Even running the simplest program which requires swapout can
prove this is true:
 #include <sys/mman.h>
 #include <string.h>
 #include <stdlib.h>
 #include <stdio.h>

 int main()
 {
 #define SIZE (1 * 1024 * 1024)
 volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
  MAP_SHARED | MAP_ANONYMOUS, -1, 0);

 memset(p, 0x88, SIZE);

 for (int k = 0; k < 1; k++) {
 /* swap in */
 for (int i = 0; i < SIZE; i += 4096) {
 (void)p[i];
 }

 /* swap out */
 madvise(p, SIZE, MADV_PAGEOUT);
 }
 }

Perf result on snapdragon 888 with 8 cores by using zRAM
as the swap block device.

 ~ # perf record taskset -c 4 ./a.out
 [ perf record: Woken up 10 times to write data ]
 [ perf record: Captured and wrote 2.297 MB perf.data (60084 samples) ]
 ~ # perf report
 # To display the perf.data header info, please use --header/--header-only options.
 #
 #
 # Total Lost Samples: 0
 #
 # Samples: 60K of event 'cycles'
 # Event count (approx.): 35706225414
 #
 # Overhead  Command  Shared Object  Symbol
 # ........  .......  .................  ......................................
 #
21.07%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irq
 8.23%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 6.67%  a.out[kernel.kallsyms]  [k] filemap_map_pages
 6.16%  a.out[kernel.kallsyms]  [k] __zram_bvec_write
 5.36%  a.out[kernel.kallsyms]  [k] ptep_clear_flush
 3.71%  a.out[kernel.kallsyms]  [k] _raw_spin_lock
 3.49%  a.out[kernel.kallsyms]  [k] memset64
 1.63%  a.out[kernel.kallsyms]  [k] clear_page
 1.42%  a.out[kernel.kallsyms]  [k] _raw_spin_unlock
 1.26%  a.out[kernel.kallsyms]  [k] mod_zone_state.llvm.8525150236079521930
 1.23%  a.out[kernel.kallsyms]  [k] xas_load
 1.15%  a.out[kernel.kallsyms]  [k] zram_slot_lock

ptep_clear_flush() takes 5.36% CPU in the micro-benchmark
swapping in/out a page mapped by only one process. If the
page is mapped by multiple processes, typically more than
100 on a phone, the overhead would be much higher as we
have to run the tlb flush 100 times for one single page.
Plus, the tlb flush overhead will increase with the number
of CPU cores due to the bad scalability of tlb shootdown
in HW, so those ARM64 servers should expect much higher
overhead.

Further perf annotate shows 95% of the cpu time of ptep_clear_flush
is actually used by the final dsb() to wait for the completion
of the tlb flush. This provides us a very good chance to leverage
the existing batched tlb infrastructure in the kernel. The minimum
modification is that we only send an async tlbi in the first stage
and issue the dsb when we have to sync in the second stage.

With the above micro benchmark, the elapsed time to
finish the program decreases by around 5%.

Typical elapsed time w/o patch:
 ~ # time taskset -c 4 ./a.out
 0.21user 14.34system 0:14.69elapsed
w/ patch:
 ~ # time taskset -c 4 ./a.out
 0.22user 13.45system 0:13.80elapsed

Also, Yicong Yang added the following observation: tested with the
benchmark in the commit on a Kunpeng920 arm64 server, an improvement
of around 12.5% was observed with the command `time ./swap_bench`.

        w/o         w/
real    0m13.460s   0m11.771s
user    0m0.248s    0m0.279s
sys     0m12.039s   0m11.458s

Originally a 16.99% overhead of ptep_clear_flush() was noticed,
which has been eliminated by this patch:

[root@localhost yang]# perf record -- ./swap_bench && perf report
[...]
16.99%  swap_bench  [kernel.kallsyms]  [k] ptep_clear_flush

It has been tested on 4, 8 and 128 CPU platforms and shows a benefit on
large systems but may not bring an improvement on small systems like
a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depend
on CONFIG_EXPERT for this stage and keep it disabled on systems
with fewer than 8 CPUs. Users can modify this threshold according to
their own platforms via CONFIG_NR_CPUS_FOR_BATCHED_TLB.

This patch extends arch_tlbbatch_add_mm() to take an address of the
target page to support the feature on arm64. It is also renamed to
arch_tlbbatch_add_pending() to better match its function, since we
don't need to handle the mm on arm64 and add_mm is no longer an
accurate name. add_pending makes sense for both: on x86 we are pending
the TLB flush operations while on arm64 we are pending the synchronize
operations.

Cc: Anshuman Khandual 
Cc: Jonathan Corbet 
Cc: Nadav Amit 
Cc: Mel Gorman 
Tested-by: Yicong Yang 
Tested-by: Xin Hao 
Tested-by

[PATCH v7 1/2] mm/tlbbatch: Introduce arch_tlbbatch_should_defer()

2022-11-17 Thread Yicong Yang
From: Anshuman Khandual 

The entire scheme of deferred TLB flush in the reclaim path rests on the
fact that the cost to refill TLB entries is less than flushing out
individual entries by sending IPIs to remote CPUs. But architectures
can have different ways to evaluate that. Hence, apart from checking
TTU_BATCH_FLUSH in the TTU flags, the rest of the decision should be
architecture specific.

Signed-off-by: Anshuman Khandual 
[https://lore.kernel.org/linuxppc-dev/20171101101735.2318-2-khand...@linux.vnet.ibm.com/]
Signed-off-by: Yicong Yang 
[Rebase and fix incorrect return value type]
Reviewed-by: Kefeng Wang 
Reviewed-by: Anshuman Khandual 
Reviewed-by: Barry Song 
Reviewed-by: Xin Hao 
Tested-by: Punit Agrawal 
---
 arch/x86/include/asm/tlbflush.h | 12 
 mm/rmap.c   |  9 +
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index cda3118f3b27..8a497d902c16 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -240,6 +240,18 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+   bool should_defer = false;
+
+   /* If remote CPUs need to be flushed then defer batch the flush */
+   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
+   should_defer = true;
+   put_cpu();
+
+   return should_defer;
+}
+
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 2ec925e5fa6a..a9ab10bc0144 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -685,17 +685,10 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, bool writable)
  */
 static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
 {
-   bool should_defer = false;
-
if (!(flags & TTU_BATCH_FLUSH))
return false;
 
-   /* If remote CPUs need to be flushed then defer batch the flush */
-   if (cpumask_any_but(mm_cpumask(mm), get_cpu()) < nr_cpu_ids)
-   should_defer = true;
-   put_cpu();
-
-   return should_defer;
+   return arch_tlbbatch_should_defer(mm);
 }
 
 /*
-- 
2.24.0



Re: [RFC PATCH v2 4/8] smp: Trace IPIs sent via arch_send_call_function_ipi_mask()

2022-11-17 Thread Peter Zijlstra
On Wed, Nov 02, 2022 at 06:33:32PM +, Valentin Schneider wrote:
> This simply wraps around the arch function and prepends it with a
> tracepoint, similar to send_call_function_single_ipi().
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/smp.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index e2ca1e2f31274..c4d561cf50d45 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -160,6 +160,13 @@ void __init call_function_init(void)
>   smpcfd_prepare_cpu(smp_processor_id());
>  }
>  
> +static inline void

Given the use of _RET_IP_, I would strongly recommend you use
__always_inline.

> +send_call_function_ipi_mask(const struct cpumask *mask)
> +{
> + trace_ipi_send_cpumask(mask, _RET_IP_, func);

What's func?

> + arch_send_call_function_ipi_mask(mask);
> +}
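
For reference, a hedged sketch of the __always_inline variant being suggested
here (the func argument only exists in a later patch of the series, as
Valentin notes further down the thread):

	/*
	 * Sketch only: with __always_inline the helper is guaranteed to be
	 * inlined into its caller, so _RET_IP_ consistently resolves to the
	 * outer call site that requested the cross-call, instead of depending
	 * on whether the compiler happened to inline the helper.
	 */
	static __always_inline void
	send_call_function_ipi_mask(const struct cpumask *mask, smp_call_func_t func)
	{
		trace_ipi_send_cpumask(mask, _RET_IP_, func);
		arch_send_call_function_ipi_mask(mask);
	}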


Re: [RFC PATCH v2 5/8] irq_work: Trace self-IPIs sent via arch_irq_work_raise()

2022-11-17 Thread Peter Zijlstra
On Wed, Nov 02, 2022 at 06:33:33PM +, Valentin Schneider wrote:
> IPIs sent to remove CPUs via irq_work_queue_on() are now covered by
> trace_ipi_send_cpumask(), add another instance of the tracepoint to cover
> self-IPIs.
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/irq_work.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index 7afa40fe5cc43..aec38c294ce68 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -22,6 +22,8 @@
>  #include 
>  #include 
>  
> +#include 
> +
>  static DEFINE_PER_CPU(struct llist_head, raised_list);
>  static DEFINE_PER_CPU(struct llist_head, lazy_list);
>  static DEFINE_PER_CPU(struct task_struct *, irq_workd);
> @@ -74,6 +76,16 @@ void __weak arch_irq_work_raise(void)
>*/
>  }
>  
> +static inline void irq_work_raise(struct irq_work *work)

__always_inline, unless you want to occasionally only see it point to
__irq_work_queue_local().

> +{
> + if (trace_ipi_send_cpumask_enabled() && arch_irq_work_has_interrupt()) {
> + trace_ipi_send_cpumask(cpumask_of(smp_processor_id()),
> +_RET_IP_,
> +work->func);
}
> +
> + arch_irq_work_raise();
> +}
> +
>  /* Enqueue on current CPU, work must already be claimed and preempt disabled 
> */
>  static void __irq_work_queue_local(struct irq_work *work)
>  {
> @@ -99,7 +111,7 @@ static void __irq_work_queue_local(struct irq_work *work)
>  
>   /* If the work is "lazy", handle it from next tick if any */
>   if (!lazy_work || tick_nohz_tick_stopped())
> - arch_irq_work_raise();
> + irq_work_raise(work);
>  }
>  
>  /* Enqueue the irq work @work on the current CPU */
> -- 
> 2.31.1
> 


Re: [RFC PATCH v2 6/8] treewide: Trace IPIs sent via smp_send_reschedule()

2022-11-17 Thread Peter Zijlstra
On Wed, Nov 02, 2022 at 06:33:34PM +, Valentin Schneider wrote:

> diff --git a/kernel/smp.c b/kernel/smp.c
> index c4d561cf50d45..44fa4b9b1f46b 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -167,6 +167,14 @@ send_call_function_ipi_mask(const struct cpumask *mask)
>   arch_send_call_function_ipi_mask(mask);
>  }
>  
> +void smp_send_reschedule(int cpu)
> +{
> + /* XXX scheduler_ipi is inline :/ */
> + trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
> + arch_smp_send_reschedule(cpu);
> +}
> +EXPORT_SYMBOL_GPL(smp_send_reschedule);

Yeah, no.. I see some crazy archs do this, but no we're not exporting
this in generic.


[PATCH mm-unstable v1 00/20] mm/gup: remove FOLL_FORCE usage from drivers (reliable R/O long-term pinning)

2022-11-17 Thread David Hildenbrand
So far, we have not supported reliable R/O long-term pinning in COW mappings.
That means, if we were to trigger R/O long-term pinning in a MAP_PRIVATE
mapping, we could end up pinning the (R/O-mapped) shared zeropage or a
pagecache page.

The next write access would trigger a write fault and replace the pinned
page by an exclusive anonymous page in the process page table; whatever the
process would write to that private page copy would not be visible by the
owner of the previous page pin: for example, RDMA could read stale data.
The end result is essentially an unexpected and hard-to-debug memory
corruption.

Some drivers tried working around that limitation by using
"FOLL_FORCE|FOLL_WRITE|FOLL_LONGTERM" for R/O long-term pinning for now.
FOLL_WRITE would trigger a write fault, if required, and break COW before
pinning the page. FOLL_FORCE is required because the VMA might lack write
permissions, and drivers wanted to make that work as well, just like
one would expect (no write access, but still triggering a write access to
break COW).

However, that is not a practical solution, because
(1) Drivers that don't stick to that undocumented and debatable pattern
would still run into that issue. For example, VFIO only uses
FOLL_LONGTERM for R/O long-term pinning.
(2) Using FOLL_WRITE just to work around a COW mapping + page pinning
limitation is unintuitive. FOLL_WRITE would, for example, mark the
page softdirty or trigger uffd-wp, even though there actually isn't
going to be any write access.
(3) The purpose of FOLL_FORCE is debug access, not access by arbitrary
drivers despite a lack of VMA permissions.

So instead, make R/O long-term pinning work as expected, by breaking COW
in a COW mapping early, such that we can remove any FOLL_FORCE usage from
drivers and make FOLL_FORCE ptrace-specific (renaming it to FOLL_PTRACE).
More details in patch #8.

Patches #1--#3 add COW tests for non-anonymous pages.
Patches #4--#7 prepare core MM for extended FAULT_FLAG_UNSHARE support in
COW mappings.
Patch #8 implements reliable R/O long-term pinning in COW mappings
Patches #9--#19 remove any FOLL_FORCE usage from drivers.
Patch #20 renames FOLL_FORCE to FOLL_PTRACE.

I'm refraining from CCing all driver/arch maintainers on the whole patch
set, but only CC them on the cover letter and the applicable patch
(I know, I know, someone is always unhappy ... sorry).

RFC -> v1:
* Use term "ptrace" instead of "debuggers" in patch descriptions
* Added ACK/Tested-by
* "mm/frame-vector: remove FOLL_FORCE usage"
 -> Adjust description
* "mm: rename FOLL_FORCE to FOLL_PTRACE"
 -> Added

David Hildenbrand (20):
  selftests/vm: anon_cow: prepare for non-anonymous COW tests
  selftests/vm: cow: basic COW tests for non-anonymous pages
  selftests/vm: cow: R/O long-term pinning reliability tests for
non-anon pages
  mm: add early FAULT_FLAG_UNSHARE consistency checks
  mm: add early FAULT_FLAG_WRITE consistency checks
  mm: rework handling in do_wp_page() based on private vs. shared
mappings
  mm: don't call vm_ops->huge_fault() in wp_huge_pmd()/wp_huge_pud() for
private mappings
  mm: extend FAULT_FLAG_UNSHARE support to anything in a COW mapping
  mm/gup: reliable R/O long-term pinning in COW mappings
  RDMA/umem: remove FOLL_FORCE usage
  RDMA/usnic: remove FOLL_FORCE usage
  RDMA/siw: remove FOLL_FORCE usage
  media: videobuf-dma-sg: remove FOLL_FORCE usage
  drm/etnaviv: remove FOLL_FORCE usage
  media: pci/ivtv: remove FOLL_FORCE usage
  mm/frame-vector: remove FOLL_FORCE usage
  drm/exynos: remove FOLL_FORCE usage
  RDMA/hw/qib/qib_user_pages: remove FOLL_FORCE usage
  habanalabs: remove FOLL_FORCE usage
  mm: rename FOLL_FORCE to FOLL_PTRACE

 arch/alpha/kernel/ptrace.c|   6 +-
 arch/arm64/kernel/mte.c   |   2 +-
 arch/ia64/kernel/ptrace.c |  10 +-
 arch/mips/kernel/ptrace32.c   |   4 +-
 arch/mips/math-emu/dsemul.c   |   2 +-
 arch/powerpc/kernel/ptrace/ptrace32.c |   4 +-
 arch/sparc/kernel/ptrace_32.c |   4 +-
 arch/sparc/kernel/ptrace_64.c |   8 +-
 arch/x86/kernel/step.c|   2 +-
 arch/x86/um/ptrace_32.c   |   2 +-
 arch/x86/um/ptrace_64.c   |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c |   8 +-
 drivers/gpu/drm/exynos/exynos_drm_g2d.c   |   2 +-
 drivers/infiniband/core/umem.c|   8 +-
 drivers/infiniband/hw/qib/qib_user_pages.c|   2 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c  |   9 +-
 drivers/infiniband/sw/siw/siw_mem.c   |   9 +-
 drivers/media/common/videobuf2/frame_vector.c |   2 +-
 drivers/media/pci/ivtv/ivtv-udma.c|   2 +-
 drivers/media/pci/ivtv/ivtv-yuv.c |   5 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c |  14 +-
 drivers/misc/habanalabs/common/memory.c   |   3 +-
 fs/exec.c |   

[PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread David Hildenbrand
Let's make it clearer that functionality provided by FOLL_FORCE is
really only for ptrace access. Prevent accidental re-use in drivers
by renaming FOLL_FORCE to FOLL_PTRACE:

  git grep -l 'FOLL_FORCE' | xargs sed -i 's/FOLL_FORCE/FOLL_PTRACE/g'

In the future, we might want to use a separate set of flags for the
access_vm interface: most FOLL_* flags don't apply and we mostly only
want to pass FOLL_PTRACE and FOLL_WRITE.

Suggested-by: Christoph Hellwig 
Cc: Oleg Nesterov 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
Cc: Nicholas Piggin 
Cc: Christophe Leroy 
Cc: "David S. Miller" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: Johannes Berg 
Cc: Eric Biederman 
Cc: Kees Cook 
Cc: Alexander Viro 
Cc: Peter Zijlstra 
Cc: Arnaldo Carvalho de Melo 
Cc: Mark Rutland 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Mike Kravetz 
Cc: Muchun Song 
Cc: Kentaro Takeda 
Cc: Tetsuo Handa 
Cc: Paul Moore 
Cc: James Morris 
Cc: "Serge E. Hallyn" 
Signed-off-by: David Hildenbrand 
---
 arch/alpha/kernel/ptrace.c|  6 +++---
 arch/arm64/kernel/mte.c   |  2 +-
 arch/ia64/kernel/ptrace.c | 10 +-
 arch/mips/kernel/ptrace32.c   |  4 ++--
 arch/mips/math-emu/dsemul.c   |  2 +-
 arch/powerpc/kernel/ptrace/ptrace32.c |  4 ++--
 arch/sparc/kernel/ptrace_32.c |  4 ++--
 arch/sparc/kernel/ptrace_64.c |  8 
 arch/x86/kernel/step.c|  2 +-
 arch/x86/um/ptrace_32.c   |  2 +-
 arch/x86/um/ptrace_64.c   |  2 +-
 fs/exec.c |  2 +-
 fs/proc/base.c|  2 +-
 include/linux/mm.h|  8 
 kernel/events/uprobes.c   |  4 ++--
 kernel/ptrace.c   | 12 ++--
 mm/gup.c  | 28 +--
 mm/huge_memory.c  |  8 
 mm/hugetlb.c  |  2 +-
 mm/memory.c   |  4 ++--
 mm/util.c |  4 ++--
 security/tomoyo/domain.c  |  2 +-
 22 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/arch/alpha/kernel/ptrace.c b/arch/alpha/kernel/ptrace.c
index a1a239ea002d..55def6479ff2 100644
--- a/arch/alpha/kernel/ptrace.c
+++ b/arch/alpha/kernel/ptrace.c
@@ -158,7 +158,7 @@ static inline int
 read_int(struct task_struct *task, unsigned long addr, int * data)
 {
int copied = access_process_vm(task, addr, data, sizeof(int),
-   FOLL_FORCE);
+   FOLL_PTRACE);
return (copied == sizeof(int)) ? 0 : -EIO;
 }
 
@@ -166,7 +166,7 @@ static inline int
 write_int(struct task_struct *task, unsigned long addr, int data)
 {
int copied = access_process_vm(task, addr, &data, sizeof(int),
-   FOLL_FORCE | FOLL_WRITE);
+   FOLL_PTRACE | FOLL_WRITE);
return (copied == sizeof(int)) ? 0 : -EIO;
 }
 
@@ -284,7 +284,7 @@ long arch_ptrace(struct task_struct *child, long request,
case PTRACE_PEEKTEXT: /* read word at location addr. */
case PTRACE_PEEKDATA:
copied = ptrace_access_vm(child, addr, &tmp, sizeof(tmp),
-   FOLL_FORCE);
+   FOLL_PTRACE);
ret = -EIO;
if (copied != sizeof(tmp))
break;
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index 7467217c1eaf..fa29fecaedbc 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -525,7 +525,7 @@ int mte_ptrace_copy_tags(struct task_struct *child, long 
request,
int ret;
struct iovec kiov;
struct iovec __user *uiov = (void __user *)data;
-   unsigned int gup_flags = FOLL_FORCE;
+   unsigned int gup_flags = FOLL_PTRACE;
 
if (!system_supports_mte())
return -EIO;
diff --git a/arch/ia64/kernel/ptrace.c b/arch/ia64/kernel/ptrace.c
index ab8aeb34d1d9..3781db1f506c 100644
--- a/arch/ia64/kernel/ptrace.c
+++ b/arch/ia64/kernel/ptrace.c
@@ -452,7 +452,7 @@ ia64_peek (struct task_struct *child, struct switch_stack 
*child_stack,
return 0;
}
}
-   copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_FORCE);
+   copied = access_process_vm(child, addr, &ret, sizeof(ret), FOLL_PTRACE);
if (copied != sizeof(ret))
return -EIO;
*val = ret;
@@ -489,7 +489,7 @@ ia64_poke (struct task_struct *child, struct switch_stack 
*child_stack,
}
}
} else if (access_process_vm(child, addr, &val, sizeof(val),
-   FOLL_FORCE | FOLL_WRITE)
+   F

Re: [PATCH v2 12/44] cpuidle,dt: Push RCU-idle into driver

2022-11-17 Thread Peter Zijlstra


Sorry; things keep getting in the way of finishing this :/

As such, I need a bit of time to get on-track again..

On Tue, Oct 04, 2022 at 01:03:57PM +0200, Ulf Hansson wrote:

> > --- a/drivers/acpi/processor_idle.c
> > +++ b/drivers/acpi/processor_idle.c
> > @@ -1200,6 +1200,8 @@ static int acpi_processor_setup_lpi_stat
> > state->target_residency = lpi->min_residency;
> > if (lpi->arch_flags)
> > state->flags |= CPUIDLE_FLAG_TIMER_STOP;
> > +   if (lpi->entry_method == ACPI_CSTATE_FFH)
> > +   state->flags |= CPUIDLE_FLAG_RCU_IDLE;
> 
> I assume the state index here will never be 0?
> 
> If not, it may lead to that acpi_processor_ffh_lpi_enter() may trigger
> CPU_PM_CPU_IDLE_ENTER_PARAM() to call ct_cpuidle_enter|exit() for an
> idle-state that doesn't have the CPUIDLE_FLAG_RCU_IDLE bit set.

I'm not quite sure I see how. AFAICT this condition above implies
acpi_processor_ffh_lpi_enter() gets called, no?

Which in turn is an unconditional __CPU_PM_CPU_IDLE_ENTER() user, so
even if idx==0, it ends up in ct_idle_{enter,exit}().

> 
> > state->enter = acpi_idle_lpi_enter;
> > drv->safe_state_index = i;
> > }
> > --- a/drivers/cpuidle/cpuidle-arm.c
> > +++ b/drivers/cpuidle/cpuidle-arm.c
> > @@ -53,6 +53,7 @@ static struct cpuidle_driver arm_idle_dr
> >  * handler for idle state index 0.
> >  */
> > .states[0] = {
> > +   .flags  = CPUIDLE_FLAG_RCU_IDLE,
> 
> Comparing arm64 and arm32 idle-states/idle-drivers, the $subject
> series ends up setting the CPUIDLE_FLAG_RCU_IDLE for the ARM WFI idle
> state (state zero), but only for the arm64 and psci cases (mostly
> arm64). For arm32 we would need to update the ARM_CPUIDLE_WFI_STATE
> too, as that is what most arm32 idle-drivers are using. My point is,
> the code becomes a bit inconsistent.

True.

> Perhaps it's easier to avoid setting the CPUIDLE_FLAG_RCU_IDLE bit for
> all of the ARM WFI idle states, for both arm64 and arm32?

As per the below?

> 
> > .enter  = arm_enter_idle_state,
> > .exit_latency   = 1,
> > .target_residency   = 1,

> > --- a/include/linux/cpuidle.h
> > +++ b/include/linux/cpuidle.h
> > @@ -282,14 +282,18 @@ extern s64 cpuidle_governor_latency_req(
> > int __ret = 0;  \
> > \
> > if (!idx) { \
> > +   ct_idle_enter();\
> 
> According to my comment above, we should then drop these calls to
> ct_idle_enter and ct_idle_exit() here. Right?

Yes, if we ensure idx==0 never has RCU_IDLE set then these must be
removed.

> > cpu_do_idle();  \
> > +   ct_idle_exit(); \
> > return idx; \
> > }   \
> > \
> > if (!is_retention)  \
> > __ret =  cpu_pm_enter();\
> > if (!__ret) {   \
> > +   ct_idle_enter();\
> > __ret = low_level_idle_enter(state);\
> > +   ct_idle_exit(); \
> > if (!is_retention)  \
> > cpu_pm_exit();  \
> > }   \
> >

So the basic premise is that everything that needs RCU inside the idle
callback must set CPUIDLE_FLAG_RCU_IDLE and, by doing that, promise to
call ct_idle_{enter,exit}() itself.

Setting RCU_IDLE is required when there is RCU usage; however, even if
there is no RCU usage, setting RCU_IDLE is fine, as long as
ct_idle_{enter,exit}() then get called.
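
To illustrate the contract (purely a hypothetical driver, not code from this
series): an idle state that sets CPUIDLE_FLAG_RCU_IDLE does the RCU-idle
transition itself around its low-level idle call.

	static int my_idle_enter(struct cpuidle_device *dev,
				 struct cpuidle_driver *drv, int index)
	{
		ct_idle_enter();		/* no RCU usage allowed past this point */
		my_hw_enter_low_power();	/* hypothetical low-level idle entry */
		ct_idle_exit();			/* RCU usable again */

		return index;
	}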


So does the below (delta) look better to you?

--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -1218,7 +1218,7 @@ static int acpi_processor_setup_lpi_stat
state->target_residency = lpi->min_residency;
if (lpi->arch_flags)
state->flags |= CPUIDLE_FLAG_TIMER_STOP;
-   if (lpi->entry_method == ACPI_CSTATE_FFH)
+   if (i != 0 && lpi->entry_method == ACPI_CSTATE_FFH)
state->flags |= CPUIDLE_FLAG_RCU_IDLE;
state->enter = acpi_idle_lpi_enter;
drv->safe_state_index =

Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread Linus Torvalds
On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand  wrote:
>
> Let's make it clearer that functionality provided by FOLL_FORCE is
> really only for ptrace access.

I'm not super-happy about this one.

I do understand the "let's rename the bit so that no new user shows up".

And it's true that the main traditional use is ptrace.

But from the patch itself it becomes obvious that no, it's not *just*
ptrace. At least not yet.

It's used for get_arg_page(), which uses it to basically look up (and
install) pages in the newly created VM.

Now, I'm not entirely sure why it even uses FOLL_FORCE - I think it
might be historical, because the target should always be the new stack
vma.

Following the history of it is a bit of a mess, because there's a
number of renamings and re-organizations, but it seems to go back to
2007 and commit b6a2fea39318 ("mm: variable length argument support").

Before that commit, we kept our own array of "this is the set of pages
that I will install in the new VM". That commit basically just inserts
the pages directly into the VM instead, getting rid of the array size
limitation.

So at a minimum, I think that FOLL_FORCE would need to be removed
before any renaming to FOLL_PTRACE, because that's not some kind of
small random case.

It *might* be as simple as just removing it, but maybe there's some
reason for having it that I don't immediately see.

There _are_ also small random cases too, like get_cmdline(). Maybe
that counts as ptrace, but the execve() case most definitely does not.

Linus


Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread David Hildenbrand

On 16.11.22 19:16, Linus Torvalds wrote:

On Wed, Nov 16, 2022 at 2:30 AM David Hildenbrand  wrote:


Let's make it clearer that functionality provided by FOLL_FORCE is
really only for ptrace access.


I'm not super-happy about this one.

I do understand the "let's rename the bit so that no new user shows up".

And it's true that the main traditional use is ptrace.

But from the patch itself it becomes obvious that no, it's not *just*
ptrace. At least not yet.

It's used for get_arg_page(), which uses it to basically look up (and
install) pages in the newly created VM.

Now, I'm not entirely sure why it even uses FOLL_FORCE - I think it
might be historical, because the target should always be the new stack
vma.

Following the history of it is a bit of a mess, because there's a
number of renamings and re-organizations, but it seems to go back to
2007 and commit b6a2fea39318 ("mm: variable length argument support").



Right.


Before that commit, we kept our own array of "this is the set of pages
that I will install in the new VM". That commit basically just inserts
the pages directly into the VM instead, getting rid of the array size
limitation.

So at a minimum, I think that FOLL_FORCE would need to be removed
before any renaming to FOLL_PTRACE, because that's not some kind of
small random case.

It *might* be as simple as just removing it, but maybe there's some
reason for having it that I don't immediately see.


Right, I have the same feeling. It might just be a copy-and-paste legacy 
leftover.




There _are_ also small random cases too, like get_cmdline(). Maybe
that counts as ptrace, but the execve() case most definitely does not.


I agree. I'd suggest moving forward without this (last) patch for now
and figuring out how to further clean up FOLL_FORCE usage on top.


@Andrew, if you intend to put this into mm-unstable, please drop the 
last patch for now.


--
Thanks,

David / dhildenb



Re: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread Kees Cook
On November 16, 2022 4:43:54 PM PST, "Jason A. Donenfeld"  
wrote:
>On Wed, Nov 16, 2022 at 04:31:18PM -0800, Kees Cook wrote:
>> On Thu, Nov 17, 2022 at 01:03:14AM +0100, Jason A. Donenfeld wrote:
>> > On Thu, Nov 17, 2022 at 12:55:47AM +0100, Jason A. Donenfeld wrote:
>> > > 2) What to call it:
>> > >- between I still like, because it mirrors "I'm thinking of a number
>> > >  between 1 and 10 and..." that everybody knows,
>> > >- inclusive I guess works, but it's not a preposition,
>> > >- bikeshed color #3?
>> > 
>> > - between
>> > - ranged
>> > - spanning
>> > 
>> > https://www.thefreedictionary.com/List-of-prepositions.htm
>> > - amid
>> > 
>> > Sigh, names.
>> 
>> I think "inclusive" is best.
>
>I find it not very descriptive of what the function does. Is there one
>you like second best? Or are you convinced they're all much much much
>worse than "inclusive" that they shouldn't be considered?

Right, I don't think any are sufficiently descriptive. "Inclusive" with two
arguments seems unambiguous and complete to me. :)


-- 
Kees Cook


Re: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread Russell King (Oracle)
On Thu, Nov 17, 2022 at 03:05:14AM +0100, Jason A. Donenfeld wrote:
> On Thu, Nov 17, 2022 at 12:55:47AM +0100, Jason A. Donenfeld wrote:
> > 1) How/whether to make f(0, U32_MAX) safe,
> >- without additional 64-bit arithmetic,
> >- minimizing the number of branches.
> >I have a few ideas I'll code golf for a bit.
> > I think I can make progress with (1) alone by fiddling around with
> > godbolt enough, like usual.
> 
> The code gen is definitely worse.
> 
> Original half-open interval:
> 
> return floor + get_random_u32_below(ceil - floor);
> 
> Suggested fully closed interval:
>   
> ceil = ceil - floor + 1;
> return likely(ceil) ? floor + get_random_u32_below(ceil) : get_random_u32();

How many of these uses are going to have ceil and floor as a variable?
If they're constants (e.g. due to being in an inline function with
constant arguments) then the compiler will optimise all of the above
and the assembly code will just be either:

1. a call to get_random_u32()
2. a call to get_random_u32_below() and an addition.

If one passes ceil or floor as a variable, then yes, the code gen is
going to be as complicated as you suggest above.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()

2022-11-17 Thread Thomas Gleixner
On Wed, Nov 16 2022 at 07:39, Ashok Raj wrote:
> On Fri, Nov 11, 2022 at 02:54:15PM +0100, Thomas Gleixner wrote:
>
> Can the pre-enabled checks for msi and msix be moved up before any vector
> range check?
>
> not that it matters for how it fails, does EBUSY sound better?

Does any caller care about the error code or about the ordering in which
the caller stupidity is detected?


Re: [PATCH net-next v2 00/11] net: pcs: Add support for devices probed in the "usual" manner

2022-11-17 Thread Rob Herring
On Mon, Nov 14, 2022 at 1:53 PM Vladimir Oltean  wrote:
>
> On Mon, Nov 14, 2022 at 01:08:03PM -0500, Sean Anderson wrote:
> > On 11/14/22 12:23, Vladimir Oltean wrote:
> > > On Thu, Nov 10, 2022 at 11:56:15AM -0500, Sean Anderson wrote:
> > >> these will probably be in device trees for a year before the kernel
> > >> starts using them. But once that is done, we are free to require them.
> > >
> > > Sorry, you need to propose something that is not "we can break 
> > > compatibility
> > > with today's device trees one year from now".
> >
> > But only if the kernel gets updated and not the device tree. When can
> > such a situation occur? Are we stuck with this for the next 10 years all
> > because someone may have a device tree which they compiled in 2017, and
> > *insist* on using the latest kernel with? Is this how you run your
> > systems?
>
> I'm a developer (and I work on other platforms than the ones you're
> planning to break), so the answer to this question doesn't mean a thing.
>
> > We don't get the device tree from firmware on this platform; usually it
> > is bundled with the kernel in a FIT or loaded from the same disk
> > partition as the kernel. I can imagine that they might not always be
> > updated at exactly the same time, but this is nuts.
>
> What does "this" platform mean exactly? There are many platforms to
> which you've added compatible strings to keep things working assuming a
> dtb update, many of them very old. And those to which you did are not by
> far all that exist. There is no requirement that all platform device
> trees are upstreamed to the Linux kernel.
>
> > The original device tree is broken because it doesn't include compatible
> > strings for devices on a generic bus. There's no way to fix that other
> > than hard-coding the driver. This can be done for some buses, but this
> > is an MDIO bus and we already assume devices without compatibles are
> > PHYs.
>
> Let's be clear about this. It's "broken" in the sense that you don't like
> the way in which it works, not in the sense that it results in a system
> that doesn't work. And not having a compatible string is just as broken
> as it is for other devices with detectable device IDs, like Ethernet
> PHYs in general, PCI devices, etc.
>
> The way in which that works here, specifically, is that a generic PHY driver
> is bound to the Lynx PCS devices, driver which does nothing since nobody
> calls phy_attach_direct() to it. Then, using fwnode_mdio_find_device(),
> you follow the pcsphy-handle and you get a reference to the mdio_device
> (parent class of phy_device) object that resulted from the generic PHY
> driver probing on the PCS, and you program the PCS to do what you want.
>
> The PHY core does assume that mdio_devices without compatible strings
> are phy_devices, but also makes exceptions (and warns about it) - see
> commit ae461131960b ("of: of_mdio: Add a whitelist of PHY compatibilities.").
> Maybe the reverse exception could also be made, and a warning for that
> be added as well.
>
> > In the next version of this series, I will include a compatibility
> > function which can bind a driver automatically if one is missing when
> > looking up a phy. But I would really like to have an exit strategy.
>
> You'll have to get agreement from higher level maintainers than me that
> the strategy "wait one year, break old device trees" is okay. Generally
> we wouldn't have answers to this kind of questions that depend on whom
> you ask. Otherwise.. we would all know whom to ask and whom not to ;)

A window of time can work, but only when there's other reasons
everyone must update the firmware/DT.

> Sadly I haven't found anything "official" in either 
> Documentation/devicetree/usage-model.rst
> or Documentation/process/submitting-patches.rst. Maybe I missed it?

Documentation/devicetree/bindings/ABI.rst

The exact policy depends on the platform (or family of platforms). In
short, if *anyone* cares, then compatibility should not be broken.
Vladimir uses platforms in question and cares, so don't break the
platforms.

Rob


Re: [patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()

2022-11-17 Thread Ashok Raj
On Thu, Nov 17, 2022 at 02:07:33PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 16 2022 at 07:39, Ashok Raj wrote:
> > On Fri, Nov 11, 2022 at 02:54:15PM +0100, Thomas Gleixner wrote:
> >
> > Can the pre-enabled checks for msi and msix be moved up before any vector
> > range check?
> >
> > not that it matters for how it fails, does EBUSY sound better?
> 
> Does any caller care about the error code or about the ordering in which
> the caller stupidity is detected?

No, I don't think so. That's why I prefixed it with "not that it matters" :-)

Just thought it would be good hygiene, but doesn't change anything functionally.


Re: [RFC PATCH v2 8/8] sched, smp: Trace smp callback causing an IPI

2022-11-17 Thread Peter Zijlstra
On Wed, Nov 02, 2022 at 06:33:36PM +, Valentin Schneider wrote:
> The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter
> which so far has only been fed with NULL.
> 
> While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing
> struct layout (meaning their callback func can be accessed without caring
> about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function
> attached to its struct. This means we need to check the type of a CSD
> before eventually dereferencing its associated callback.
> 
> This isn't as trivial as it sounds: the CSD type is stored in
> __call_single_node.u_flags, which get cleared right before the callback is
> executed via csd_unlock(). This implies checking the CSD type before it is
> enqueued on the call_single_queue, as the target CPU's queue can be flushed
> before we get to sending an IPI.
> 
> Furthermore, send_call_function_single_ipi() only has a CPU parameter, and
> would need to have an additional argument to trickle down the invoked
> function. This is somewhat silly, as the extra argument will always be
> pushed down to the function even when nothing is being traced, which is
> unnecessary overhead.
> 
> Two options present themselves:
> a) Create copies of send_call_function_{single_ipi, ipi_mask}() that take
>an extra argument used for tracing, so that codepaths remain unchanged
>when tracing isn't in effect (a sort of manual -fipa-sra).
> 
> b) Stash the CSD func in somewhere as a side effect that
>the portion of send_call_function_{single_ipi, ipi_mask}() under the
>tracepoint's static key can fetch.
> 
> a) creates redundant code, and b) is quite fragile due to requiring extra
> care for "reentrant" functions (async SMP calls).
> 
> This implements a).
> 
> Signed-off-by: Valentin Schneider 
> ---
>  kernel/irq_work.c   |  2 ++
>  kernel/sched/core.c | 35 ---
>  kernel/sched/smp.h  |  1 +
>  kernel/smp.c| 42 ++
>  4 files changed, 65 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index aec38c294ce68..fcfa75c4a5daf 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -24,6 +24,8 @@
>  
>  #include 
>  
> +#include "sched/smp.h"
> +
>  static DEFINE_PER_CPU(struct llist_head, raised_list);
>  static DEFINE_PER_CPU(struct llist_head, lazy_list);
>  static DEFINE_PER_CPU(struct task_struct *, irq_workd);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 02181f8072b5f..41196ca67e913 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3743,17 +3743,30 @@ void sched_ttwu_pending(void *arg)
>   rq_unlock_irqrestore(rq, &rf);
>  }
>  
> -void send_call_function_single_ipi(int cpu)
> -{
> - struct rq *rq = cpu_rq(cpu);
> -
> - if (!set_nr_if_polling(rq->idle)) {
> - trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
> - arch_send_call_function_single_ipi(cpu);
> - } else {
> - trace_sched_wake_idle_without_ipi(cpu);
> - }
> -}
> +/*
> + * We want a variant that traces the function causing the IPI to be sent, but
> + * we don't want the extra argument to cause unnecessary overhead when tracing
> + * isn't happening.
> + */
> +#define GEN_CFSI(suffix, IPI_EXP, ...)					\
> +void send_call_function_single_ipi##suffix(__VA_ARGS__)		\
> +{									\
> +	struct rq *rq = cpu_rq(cpu);					\
> +									\
> +	if (!set_nr_if_polling(rq->idle)) {				\
> +		IPI_EXP;						\
> +		arch_send_call_function_single_ipi(cpu);		\
> +	} else {							\
> +		trace_sched_wake_idle_without_ipi(cpu);			\
> +	}								\
> +}
> +
> +GEN_CFSI(/* nop */,
> +  /* nop */,
> +  int cpu)
> +GEN_CFSI(_trace,
> +  trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, func),
> +  int cpu, smp_call_func_t func)
>  

*yuck*

How about something like so?

---
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -24,6 +24,8 @@
 
 #include 
 
+#include "sched/smp.h"
+
 static DEFINE_PER_CPU(struct llist_head, raised_list);
 static DEFINE_PER_CPU(struct llist_head, lazy_list);
 static DEFINE_PER_CPU(struct task_struct *, irq_workd);
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3763,16 +3763,17 @@ void sched_ttwu_pending(void *arg)
rq_unlock_irqrestore(rq, &rf);
 }
 
-void send_call_function_single_ipi(int cpu)
+bool send_call_function_single_ipi(int cpu)

Re: [RFC PATCH v2 4/8] smp: Trace IPIs sent via arch_send_call_function_ipi_mask()

2022-11-17 Thread Valentin Schneider
On 17/11/22 10:08, Peter Zijlstra wrote:
> On Wed, Nov 02, 2022 at 06:33:32PM +, Valentin Schneider wrote:
>> This simply wraps around the arch function and prepends it with a
>> tracepoint, similar to send_call_function_single_ipi().
>>
>> Signed-off-by: Valentin Schneider 
>> ---
>>  kernel/smp.c | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index e2ca1e2f31274..c4d561cf50d45 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -160,6 +160,13 @@ void __init call_function_init(void)
>>  smpcfd_prepare_cpu(smp_processor_id());
>>  }
>>
>> +static inline void
>
> Given the use of _RET_IP_, I would strongly recommend you use
> __always_inline.
>

Noted, thanks

>> +send_call_function_ipi_mask(const struct cpumask *mask)
>> +{
>> +trace_ipi_send_cpumask(mask, _RET_IP_, func);
>
> What's func?
>

A rebase fail... That's only plugged in later.

>> +arch_send_call_function_ipi_mask(mask);
>> +}



Re: [RFC PATCH v2 6/8] treewide: Trace IPIs sent via smp_send_reschedule()

2022-11-17 Thread Valentin Schneider
On 17/11/22 10:12, Peter Zijlstra wrote:
> On Wed, Nov 02, 2022 at 06:33:34PM +, Valentin Schneider wrote:
>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index c4d561cf50d45..44fa4b9b1f46b 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -167,6 +167,14 @@ send_call_function_ipi_mask(const struct cpumask *mask)
>>  arch_send_call_function_ipi_mask(mask);
>>  }
>>
>> +void smp_send_reschedule(int cpu)
>> +{
>> +/* XXX scheduler_ipi is inline :/ */
>> +trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
>> +arch_smp_send_reschedule(cpu);
>> +}
>> +EXPORT_SYMBOL_GPL(smp_send_reschedule);
>
> Yeah, no.. I see some crazy archs do this, but no we're not exporting
> this in generic.

So the list is: ia64, powerpc, riscv
and they all seem to do it because of KVM:
  c4cb768f0277 ("[IA64] export smp_send_reschedule")
  de56a948b918 ("KVM: PPC: Add support for Book3S processors in hypervisor mode")
  d3d7a0ce020e ("RISC-V: Export kernel symbols for kvm")

Other archs get out of it either because their smp_send_reschedule() is
inline (e.g. x86), or because they don't allow building KVM as a module
(e.g. arm64).

If I can cobble the tracepoint+reschedule in an inline helper, then that
wouldn't require any new exports - I'm fighting a bit with the header maze
ATM, but hopefully I can get somewhere with this.
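
A rough sketch of that inline-helper direction (hypothetical shape only,
glossing over the header-dependency problems mentioned above):

	/* Hypothetical: keep the traced wrapper inline so that KVM et al. keep
	 * calling smp_send_reschedule() without needing a new generic export. */
	static __always_inline void smp_send_reschedule(int cpu)
	{
		trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
		arch_smp_send_reschedule(cpu);
	}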



Re: [RFC PATCH v2 8/8] sched, smp: Trace smp callback causing an IPI

2022-11-17 Thread Valentin Schneider
On 17/11/22 15:12, Peter Zijlstra wrote:
> On Wed, Nov 02, 2022 at 06:33:36PM +, Valentin Schneider wrote:
> *yuck*

:-)

>
> How about something like so?
>
> ---
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -24,6 +24,8 @@
>
>  #include 
>
> +#include "sched/smp.h"
> +
>  static DEFINE_PER_CPU(struct llist_head, raised_list);
>  static DEFINE_PER_CPU(struct llist_head, lazy_list);
>  static DEFINE_PER_CPU(struct task_struct *, irq_workd);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3763,16 +3763,17 @@ void sched_ttwu_pending(void *arg)
>   rq_unlock_irqrestore(rq, &rf);
>  }
>
> -void send_call_function_single_ipi(int cpu)
> +bool send_call_function_single_ipi(int cpu)
>  {
>   struct rq *rq = cpu_rq(cpu);
>
>   if (!set_nr_if_polling(rq->idle)) {
> - trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);
>   arch_send_call_function_single_ipi(cpu);
> - } else {
> - trace_sched_wake_idle_without_ipi(cpu);
> + return true;
>   }
> +
> + trace_sched_wake_idle_without_ipi(cpu);
> + return false;
>  }
>
>  /*
> --- a/kernel/sched/smp.h
> +++ b/kernel/sched/smp.h
> @@ -6,7 +6,7 @@
>
>  extern void sched_ttwu_pending(void *arg);
>
> -extern void send_call_function_single_ipi(int cpu);
> +extern bool send_call_function_single_ipi(int cpu);
>
>  #ifdef CONFIG_SMP
>  extern void flush_smp_call_function_queue(void);
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -163,7 +163,6 @@ void __init call_function_init(void)
>  static inline void
>  send_call_function_ipi_mask(const struct cpumask *mask)
>  {
> - trace_ipi_send_cpumask(mask, _RET_IP_, func);
>   arch_send_call_function_ipi_mask(mask);
>  }
>
> @@ -438,11 +437,16 @@ static void __smp_call_single_queue_debu
>   struct cfd_seq_local *seq = this_cpu_ptr(&cfd_seq_local);
>   struct call_function_data *cfd = this_cpu_ptr(&cfd_data);
>   struct cfd_percpu *pcpu = per_cpu_ptr(cfd->pcpu, cpu);
> + struct __call_single_data *csd;
> +
> + csd = container_of(node, call_single_data_t, node.llist);
> + WARN_ON_ONCE(!(CSD_TYPE(csd) & (CSD_TYPE_SYNC | CSD_TYPE_ASYNC)));
>
>   cfd_seq_store(pcpu->seq_queue, this_cpu, cpu, CFD_SEQ_QUEUE);
>   if (llist_add(node, &per_cpu(call_single_queue, cpu))) {
>   cfd_seq_store(pcpu->seq_ipi, this_cpu, cpu, CFD_SEQ_IPI);
>   cfd_seq_store(seq->ping, this_cpu, cpu, CFD_SEQ_PING);
> + trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, csd->func);
>   send_call_function_single_ipi(cpu);
>   cfd_seq_store(seq->pinged, this_cpu, cpu, CFD_SEQ_PINGED);
>   } else {
> @@ -487,6 +491,27 @@ static __always_inline void csd_unlock(s
>
>  static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>
> +static __always_inline
> +bool raw_smp_call_single_queue(int cpu, struct llist_node *node)
> +{
> + /*
> +  * The list addition should be visible to the target CPU when it pops
> +  * the head of the list to pull the entry off it in the IPI handler
> +  * because of normal cache coherency rules implied by the underlying
> +  * llist ops.
> +  *
> +  * If IPIs can go out of order to the cache coherency protocol
> +  * in an architecture, sufficient synchronisation should be added
> +  * to arch code to make it appear to obey cache coherency WRT
> +  * locking and barrier primitives. Generic code isn't really
> +  * equipped to do the right thing...
> +  */
> + if (llist_add(node, &per_cpu(call_single_queue, cpu)))
> + return send_call_function_single_ipi(cpu);
> +
> + return false;
> +}
> +
>  void __smp_call_single_queue(int cpu, struct llist_node *node)
>  {
>  #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG
> @@ -503,19 +528,28 @@ void __smp_call_single_queue(int cpu, st
>  #endif
>
>   /*
> -  * The list addition should be visible to the target CPU when it pops
> -  * the head of the list to pull the entry off it in the IPI handler
> -  * because of normal cache coherency rules implied by the underlying
> -  * llist ops.
> -  *
> -  * If IPIs can go out of order to the cache coherency protocol
> -  * in an architecture, sufficient synchronisation should be added
> -  * to arch code to make it appear to obey cache coherency WRT
> -  * locking and barrier primitives. Generic code isn't really
> -  * equipped to do the right thing...
> -  */
> - if (llist_add(node, &per_cpu(call_single_queue, cpu)))
> - send_call_function_single_ipi(cpu);
> +  * We have to check the type of the CSD before queueing it, because
> +  * once queued it can have its flags cleared by
> +  *   flush_smp_call_function_queue()
> +  * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> +  * executes migration_cpu_stop() on the remote CPU).
> +  */
> + if (trace_ipi_send_cpumask_enabled()) {
> +

Re: [patch 39/39] x86/apic: Remove X86_IRQ_ALLOC_CONTIGUOUS_VECTORS

2022-11-17 Thread Thomas Gleixner
Jason, Bjorn, Ashok!

On Wed, Nov 16 2022 at 14:05, Jason Gunthorpe wrote:
> On Fri, Nov 11, 2022 at 02:55:17PM +0100, Thomas Gleixner wrote:
>> Now that the PCI/MSI core code does early checking for multi-MSI support
>> X86_IRQ_ALLOC_CONTIGUOUS_VECTORS is not required anymore.
>
>> Remove the flag and rely on MSI_FLAG_MULTI_PCI_MSI.
>
> Reviewed-by: Jason Gunthorpe 

Thanks for taking the time to go through this pile!

   tglx


Re: [PATCH 38/44] KVM: Disable CPU hotplug during hardware enabling

2022-11-17 Thread Sean Christopherson
On Thu, Nov 17, 2022, Huang, Kai wrote:
> On Wed, 2022-11-16 at 17:11 +, Sean Christopherson wrote:
> > static int kvm_x86_check_processor_compatibility(void)
> > {
> > int cpu = smp_processor_id();
> > struct cpuinfo_x86 *c = &cpu_data(cpu);
> > 
> > /*
> >  * Compatibility checks are done when loading KVM and when enabling
> >  * hardware, e.g. during CPU hotplug, to ensure all online CPUs are
> >  * compatible, i.e. KVM should never perform a compatibility check 
> > on
> >  * an offline CPU.
> >  */
> > WARN_ON(!cpu_online(cpu));
> 
> Looks good to me.  Perhaps this also can be removed, though.

Hmm, it's a bit superfluous, but I think it could fire if KVM messed up CPU
hotplug again, e.g. if the for_each_online_cpu() => IPI raced with CPU unplug.

> And IMHO the removing of WARN_ON(!irq_disabled()) should be folded to the 
> patch
> "[PATCH 37/44] KVM: Rename and move CPUHP_AP_KVM_STARTING to ONLINE section". 
> Because moving from STARTING section to ONLINE section changes the IRQ status
> when the compatibility check is called.

Yep, that's what I have coded up, just smushed it all together here.


Re: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread Theodore Ts'o
On Wed, Nov 16, 2022 at 04:47:27PM -0800, Kees Cook wrote:
> >> > - between
> >> > - ranged
> >> > - spanning
> >> > 
> >> > https://www.thefreedictionary.com/List-of-prepositions.htm
> >> > - amid
> >> > 
> >> > Sigh, names.
> >> 
> >> I think "inclusive" is best.
> >
> >I find it not very descriptive of what the function does. Is there one
> >you like second best? Or are you convinced they're all much much much
> >worse than "inclusive" that they shouldn't be considered?
> 
> Right, I don't think any are sufficiently descriptive. "Inclusive"
> with two arguments seems unambiguous and complete to me. :)

The problem with "between", "ranged", "spanning" is that they don't
tell the reader whether we're dealing with an "open interval" or a
"closed interval".  They are just different ways of saying that it's a
range between, say, 0 and 20.  But it doesn't tell you whether it
includes 0 or 20 or not.

The only way I can see for making it unambiguous is either to use the
terminology "closed interval" or "inclusive".  And "open" and "closed"
can have other meanings, so get_random_u32_inclusive() is going to be
less confusing than get_random_u32_closed().
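
For instance (using the helper names under discussion):

	get_random_u32_below(20);         /* half-open: 0..19, never 20        */
	get_random_u32_inclusive(0, 20);  /* closed: 0..20, both ends possible */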

Cheers,

- Ted


Re: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread Jason A. Donenfeld
On Thu, Nov 17, 2022 at 10:42:37AM -0500, Theodore Ts'o wrote:
> On Wed, Nov 16, 2022 at 04:47:27PM -0800, Kees Cook wrote:
> > >> > - between
> > >> > - ranged
> > >> > - spanning
> > >> > 
> > >> > https://www.thefreedictionary.com/List-of-prepositions.htm
> > >> > - amid
> > >> > 
> > >> > Sigh, names.
> > >> 
> > >> I think "inclusive" is best.
> > >
> > >I find it not very descriptive of what the function does. Is there one
> > >you like second best? Or are you convinced they're all much much much
> > >worse than "inclusive" that they shouldn't be considered?
> > 
> > Right, I don't think any are sufficiently descriptive. "Inclusive"
> > with two arguments seems unambiguous and complete to me. :)
> 
> The problem with "between", "ranged", "spanning" is that they don't
> tell the reader whether we're dealing with an "open interval" or a
> "closed interval".  They are just different ways of saying that it's a
> range between, say, 0 and 20.  But it doesn't tell you whether it
> includes 0 or 20 or not.
> 
> The only way I can see for making it unambiguous is either to use the
> terminology "closed interval" or "inclusive".  And "open" and "closed"
> can have other meanings, so get_random_u32_inclusive() is going to be
> less confusing than get_random_u32_closed().

Alright, that about settles it then. I'll re-roll.

Jason


[PATCH v3 0/3] convert tree to get_random_u32_{below,above,inclusive}()

2022-11-17 Thread Jason A. Donenfeld
Hey everyone,

[Changes v2->v3: rename get_random_u32_between() to
 get_random_u32_inclusive(), and implement with closed interval.]

This series is the second tranche of tree-wide conversions to make random
integer handling a bit tamer. It's another Coccinelle-based patchset.

First we s/prandom_u32_max/get_random_u32_below/, since the former is
just a deprecated alias for the latter. Then later, we can remove
prandom_u32_max altogether. I'm quite happy about finally being able
to do that. It means that prandom.h is now only for deterministic and
repeatable randomness, not non-deterministic/cryptographic randomness.
That line is no longer blurred.

In order to clean up a bunch of inefficient patterns, we use two simple
helper functions built on top of get_random_u32_below:
get_random_u32_above and get_random_u32_inclusive. The next two patches
convert some gnarly open-coded number juggling to use these helpers.
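
For illustration, the kind of open-coded pattern these helpers replace
(hypothetical call sites, not taken from a specific driver):

	/* before: offset plus half-open range */
	delay = 10 + prandom_u32_max(10);               /* 10..19 */
	port  = 1024 + prandom_u32_max(65536 - 1024);   /* 1024..65535 */

	/* after: the interval is closed and the intent is explicit */
	delay = get_random_u32_inclusive(10, 19);
	port  = get_random_u32_inclusive(1024, 65535);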

I've used Coccinelle for these three treewide patches, so hopefully
review is rather uneventful. I didn't accept all of the changes that
Coccinelle proposed, though, as these tend to be somewhat
context-specific. I erred on the side of just going with the most
obvious cases, at least this time through. And then we can address more
complicated cases through actual maintainer trees.

Since get_random_u32_below() and others sit in my random.git tree,
these patches too will flow through that same tree.

Regards,
Jason

Cc: Kees Cook 
Cc: Greg Kroah-Hartman 
Cc: Jakub Kicinski 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Thomas Bogendoerfer 
Cc: Heiko Carstens 
Cc: Herbert Xu 
Cc: Christoph Böhmwalder 
Cc: Jani Nikula 
Cc: Jason Gunthorpe 
Cc: Sakari Ailus 
Cc: Martin K. Petersen 
Cc: Theodore Ts'o 
Cc: Andreas Dilger 
Cc: Jaegeuk Kim 
Cc: Richard Weinberger 
Cc: Darrick J. Wong 
Cc: SeongJae Park 
Cc: Thomas Gleixner 
Cc: Andrew Morton 
Cc: Michael Ellerman 
Cc: Helge Deller 
Cc: net...@vger.kernel.org
Cc: linux-cry...@vger.kernel.org
Cc: linux-bl...@vger.kernel.org
Cc: linux-fsde...@vger.kernel.org
Cc: linux-me...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: loonga...@lists.linux.dev
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-...@vger.kernel.org
Cc: linux-par...@vger.kernel.org

Jason A. Donenfeld (3):
  treewide: use get_random_u32_below() instead of deprecated function
  treewide: use get_random_u32_{above,below}() instead of manual loop
  treewide: use get_random_u32_inclusive() when possible

 arch/arm/kernel/process.c |  2 +-
 arch/arm64/kernel/process.c   |  2 +-
 arch/loongarch/kernel/process.c   |  2 +-
 arch/loongarch/kernel/vdso.c  |  2 +-
 arch/mips/kernel/process.c|  2 +-
 arch/mips/kernel/vdso.c   |  2 +-
 arch/parisc/kernel/vdso.c |  2 +-
 arch/powerpc/crypto/crc-vpmsum_test.c |  4 +-
 arch/powerpc/kernel/process.c |  2 +-
 arch/s390/kernel/process.c|  2 +-
 arch/s390/kernel/vdso.c   |  2 +-
 arch/sparc/vdso/vma.c |  2 +-
 arch/um/kernel/process.c  |  2 +-
 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/kernel/module.c  |  2 +-
 arch/x86/kernel/process.c |  2 +-
 arch/x86/mm/pat/cpa-test.c|  4 +-
 crypto/rsa-pkcs1pad.c |  2 +-
 crypto/testmgr.c  | 86 +--
 drivers/block/drbd/drbd_receiver.c|  4 +-
 drivers/bus/mhi/host/internal.h   |  2 +-
 drivers/dma-buf/st-dma-fence-chain.c  |  6 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  2 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
 drivers/gpu/drm/i915/intel_memory_region.c|  4 +-
 drivers/infiniband/core/cma.c |  2 +-
 drivers/infiniband/hw/cxgb4/id_table.c|  4 +-
 drivers/infiniband/hw/hns/hns_roce_ah.c   |  5 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c|  2 +-
 drivers/md/bcache/request.c   |  2 +-
 drivers/media/common/v4l2-tpg/v4l2-tpg-core.c |  8 +-
 .../media/test-drivers/vidtv/vidtv_demod.c|  8 +-
 .../test-drivers/vivid/vivid-kthread-cap.c|  2 +-
 .../test-drivers/vivid/vivid-kthread-out.c|  2 +-
 .../media/test-drivers/vivid/vivid-radio-rx.c |  4 +-
 .../media/test-drivers/vivid/vivid-sdr-cap.c  |  2 +-
 .../test-drivers/vivid/vivid-touch-cap.c  |  2 +-
 drivers/mmc/core/core.c   |  4 +-
 drivers/mmc/host/dw_mmc.c |  2 +-
 drivers/mtd/nand/raw/nandsim.c|  4 +-
 drivers/mtd/tests/mtd_nandecctest.c   | 10 +--
 drivers/mtd/tests/stresstest.c|  8 +-
 drivers/mtd/ubi/debug.c   |  2 +-
 drivers/mtd/ubi/debug.h   |  6 +-
 drivers/net/ethernet/broadcom/cnic.c  |  2 +-
 .../chelsi

[PATCH v3 1/3] treewide: use get_random_u32_below() instead of deprecated function

2022-11-17 Thread Jason A. Donenfeld
This is a simple mechanical transformation done by:

@@
expression E;
@@
- prandom_u32_max
+ get_random_u32_below
  (E)

Reviewed-by: Greg Kroah-Hartman 
Acked-by: Darrick J. Wong  # for xfs
Reviewed-by: SeongJae Park  # for damon
Reviewed-by: Jason Gunthorpe  # for infiniband
Reviewed-by: Russell King (Oracle)  # for arm
Acked-by: Ulf Hansson  # for mmc
Signed-off-by: Jason A. Donenfeld 
---
 arch/arm/kernel/process.c |  2 +-
 arch/arm64/kernel/process.c   |  2 +-
 arch/loongarch/kernel/process.c   |  2 +-
 arch/loongarch/kernel/vdso.c  |  2 +-
 arch/mips/kernel/process.c|  2 +-
 arch/mips/kernel/vdso.c   |  2 +-
 arch/parisc/kernel/vdso.c |  2 +-
 arch/powerpc/crypto/crc-vpmsum_test.c |  4 +-
 arch/powerpc/kernel/process.c |  2 +-
 arch/s390/kernel/process.c|  2 +-
 arch/s390/kernel/vdso.c   |  2 +-
 arch/sparc/vdso/vma.c |  2 +-
 arch/um/kernel/process.c  |  2 +-
 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/kernel/module.c  |  2 +-
 arch/x86/kernel/process.c |  2 +-
 arch/x86/mm/pat/cpa-test.c|  4 +-
 crypto/rsa-pkcs1pad.c |  2 +-
 crypto/testmgr.c  | 86 +--
 drivers/block/drbd/drbd_receiver.c|  4 +-
 drivers/bus/mhi/host/internal.h   |  2 +-
 drivers/dma-buf/st-dma-fence-chain.c  |  6 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  2 +-
 .../drm/i915/gt/intel_execlists_submission.c  |  2 +-
 drivers/gpu/drm/i915/intel_memory_region.c|  4 +-
 drivers/infiniband/core/cma.c |  2 +-
 drivers/infiniband/hw/cxgb4/id_table.c|  4 +-
 drivers/infiniband/hw/hns/hns_roce_ah.c   |  4 +-
 drivers/infiniband/ulp/rtrs/rtrs-clt.c|  2 +-
 drivers/md/bcache/request.c   |  2 +-
 drivers/media/common/v4l2-tpg/v4l2-tpg-core.c |  8 +-
 .../media/test-drivers/vidtv/vidtv_demod.c|  8 +-
 .../test-drivers/vivid/vivid-kthread-cap.c|  2 +-
 .../test-drivers/vivid/vivid-kthread-out.c|  2 +-
 .../media/test-drivers/vivid/vivid-radio-rx.c |  4 +-
 .../media/test-drivers/vivid/vivid-sdr-cap.c  |  2 +-
 .../test-drivers/vivid/vivid-touch-cap.c  |  2 +-
 drivers/mmc/core/core.c   |  4 +-
 drivers/mmc/host/dw_mmc.c |  2 +-
 drivers/mtd/nand/raw/nandsim.c|  4 +-
 drivers/mtd/tests/mtd_nandecctest.c   | 10 +--
 drivers/mtd/tests/stresstest.c|  8 +-
 drivers/mtd/ubi/debug.c   |  2 +-
 drivers/mtd/ubi/debug.h   |  6 +-
 drivers/net/ethernet/broadcom/cnic.c  |  2 +-
 .../chelsio/inline_crypto/chtls/chtls_io.c|  4 +-
 drivers/net/phy/at803x.c  |  2 +-
 drivers/net/team/team_mode_random.c   |  2 +-
 drivers/net/wireguard/selftest/allowedips.c   | 20 ++---
 drivers/net/wireguard/timers.c|  4 +-
 .../broadcom/brcm80211/brcmfmac/p2p.c |  2 +-
 .../net/wireless/intel/iwlwifi/mvm/mac-ctxt.c |  2 +-
 drivers/pci/p2pdma.c  |  2 +-
 drivers/s390/scsi/zfcp_fc.c   |  2 +-
 drivers/scsi/fcoe/fcoe_ctlr.c |  4 +-
 drivers/scsi/qedi/qedi_main.c |  2 +-
 drivers/scsi/scsi_debug.c |  6 +-
 fs/ceph/inode.c   |  2 +-
 fs/ceph/mdsmap.c  |  2 +-
 fs/ext2/ialloc.c  |  2 +-
 fs/ext4/ialloc.c  |  2 +-
 fs/ext4/super.c   |  5 +-
 fs/f2fs/gc.c  |  2 +-
 fs/f2fs/segment.c |  8 +-
 fs/ubifs/debug.c  |  8 +-
 fs/ubifs/lpt_commit.c | 14 +--
 fs/ubifs/tnc_commit.c |  2 +-
 fs/xfs/libxfs/xfs_alloc.c |  2 +-
 fs/xfs/libxfs/xfs_ialloc.c|  2 +-
 fs/xfs/xfs_error.c|  2 +-
 include/linux/damon.h |  2 +-
 include/linux/nodemask.h  |  2 +-
 kernel/bpf/core.c |  4 +-
 kernel/kcsan/selftest.c   |  4 +-
 kernel/locking/test-ww_mutex.c|  4 +-
 kernel/time/clocksource.c |  2 +-
 lib/fault-inject.c|  2 +-
 lib/find_bit_benchmark.c  |  4 +-
 lib/kobject.c |  2 +-
 lib/reed_solomon/test_rslib.c |  6 +-
 lib/sbitmap.c |  4 +-
 lib/test-string_helpers.c |  2 +-
 lib/test_hexdump.c| 10 +--
 lib/test_list_

[PATCH v3 2/3] treewide: use get_random_u32_{above,below}() instead of manual loop

2022-11-17 Thread Jason A. Donenfeld
These cases were done with this Coccinelle:

@@
expression E;
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (I > E);
+   I = get_random_u32_below(E + 1);

@@
expression E;
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (I >= E);
+   I = get_random_u32_below(E);

@@
expression E;
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (I < E);
+   I = get_random_u32_above(E - 1);

@@
expression E;
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (I <= E);
+   I = get_random_u32_above(E);

@@
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (!I);
+   I = get_random_u32_above(0);

@@
identifier I;
@@
-   do {
  ... when != I
- I = get_random_u32();
  ... when != I
-   } while (I == 0);
+   I = get_random_u32_above(0);

@@
expression E;
@@
- E + 1 + get_random_u32_below(U32_MAX - E)
+ get_random_u32_above(E)

Reviewed-by: Greg Kroah-Hartman 
Signed-off-by: Jason A. Donenfeld 
---
 fs/ext4/mmp.c| 8 +---
 lib/test_fprobe.c| 5 +
 lib/test_kprobes.c   | 5 +
 net/ipv6/output_core.c   | 8 +---
 net/vmw_vsock/af_vsock.c | 3 +--
 5 files changed, 5 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 588cb09c5291..4681fff6665f 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -262,13 +262,7 @@ void ext4_stop_mmpd(struct ext4_sb_info *sbi)
  */
 static unsigned int mmp_new_seq(void)
 {
-   u32 new_seq;
-
-   do {
-   new_seq = get_random_u32();
-   } while (new_seq > EXT4_MMP_SEQ_MAX);
-
-   return new_seq;
+   return get_random_u32_below(EXT4_MMP_SEQ_MAX + 1);
 }
 
 /*
diff --git a/lib/test_fprobe.c b/lib/test_fprobe.c
index e0381b3ec410..1fb56cf5e5ce 100644
--- a/lib/test_fprobe.c
+++ b/lib/test_fprobe.c
@@ -144,10 +144,7 @@ static unsigned long get_ftrace_location(void *func)
 
 static int fprobe_test_init(struct kunit *test)
 {
-   do {
-   rand1 = get_random_u32();
-   } while (rand1 <= div_factor);
-
+   rand1 = get_random_u32_above(div_factor);
target = fprobe_selftest_target;
target2 = fprobe_selftest_target2;
target_ip = get_ftrace_location(target);
diff --git a/lib/test_kprobes.c b/lib/test_kprobes.c
index eeb1d728d974..1c95e5719802 100644
--- a/lib/test_kprobes.c
+++ b/lib/test_kprobes.c
@@ -339,10 +339,7 @@ static int kprobes_test_init(struct kunit *test)
stacktrace_target = kprobe_stacktrace_target;
internal_target = kprobe_stacktrace_internal_target;
stacktrace_driver = kprobe_stacktrace_driver;
-
-   do {
-   rand1 = get_random_u32();
-   } while (rand1 <= div_factor);
+   rand1 = get_random_u32_above(div_factor);
return 0;
 }
 
diff --git a/net/ipv6/output_core.c b/net/ipv6/output_core.c
index 2685c3f15e9d..b5205311f372 100644
--- a/net/ipv6/output_core.c
+++ b/net/ipv6/output_core.c
@@ -15,13 +15,7 @@ static u32 __ipv6_select_ident(struct net *net,
   const struct in6_addr *dst,
   const struct in6_addr *src)
 {
-   u32 id;
-
-   do {
-   id = get_random_u32();
-   } while (!id);
-
-   return id;
+   return get_random_u32_above(0);
 }
 
 /* This function exists only for tap drivers that must support broken
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index ff38c5a4d174..d593d5b6d4b1 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -626,8 +626,7 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
struct sockaddr_vm new_addr;
 
if (!port)
-   port = LAST_RESERVED_PORT + 1 +
-   get_random_u32_below(U32_MAX - LAST_RESERVED_PORT);
+   port = get_random_u32_above(LAST_RESERVED_PORT);
 
vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
 
-- 
2.38.1



[PATCH v3 3/3] treewide: use get_random_u32_inclusive() when possible

2022-11-17 Thread Jason A. Donenfeld
These cases were done with this Coccinelle:

@@
expression H;
expression L;
@@
- (get_random_u32_below(H) + L)
+ get_random_u32_inclusive(L, H + L - 1)

@@
expression H;
expression L;
expression E;
@@
  get_random_u32_inclusive(L,
  H
- + E
- - E
  )

@@
expression H;
expression L;
expression E;
@@
  get_random_u32_inclusive(L,
  H
- - E
- + E
  )

@@
expression H;
expression L;
expression E;
expression F;
@@
  get_random_u32_inclusive(L,
  H
- - E
  + F
- + E
  )

@@
expression H;
expression L;
expression E;
expression F;
@@
  get_random_u32_inclusive(L,
  H
- + E
  + F
- - E
  )

And then subsequently cleaned up by hand, with several automatic cases
rejected if it didn't make sense contextually.

Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Jason Gunthorpe  # for infiniband
Signed-off-by: Jason A. Donenfeld 
---
 arch/x86/kernel/module.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  2 +-
 crypto/testmgr.c  | 10 
 drivers/bus/mhi/host/internal.h   |  2 +-
 drivers/dma-buf/st-dma-fence-chain.c  |  2 +-
 drivers/infiniband/core/cma.c |  2 +-
 drivers/infiniband/hw/hns/hns_roce_ah.c   |  5 ++--
 drivers/mtd/nand/raw/nandsim.c|  2 +-
 drivers/net/wireguard/selftest/allowedips.c   |  8 +++---
 .../broadcom/brcm80211/brcmfmac/p2p.c |  2 +-
 .../net/wireless/intel/iwlwifi/mvm/mac-ctxt.c |  2 +-
 fs/f2fs/segment.c |  6 ++---
 kernel/kcsan/selftest.c   |  2 +-
 lib/test_hexdump.c| 10 
 lib/test_printf.c |  2 +-
 lib/test_vmalloc.c|  6 ++---
 mm/kasan/kasan_test.c |  6 ++---
 mm/kfence/kfence_test.c   |  2 +-
 mm/swapfile.c |  5 ++--
 net/bluetooth/mgmt.c  |  5 ++--
 net/core/pktgen.c | 25 ---
 net/ipv4/tcp_input.c  |  2 +-
 net/ipv6/addrconf.c   |  6 ++---
 net/netfilter/nf_nat_helper.c |  2 +-
 net/xfrm/xfrm_state.c |  2 +-
 25 files changed, 56 insertions(+), 64 deletions(-)

diff --git a/arch/x86/kernel/module.c b/arch/x86/kernel/module.c
index c09ae279ef32..a98687642dd0 100644
--- a/arch/x86/kernel/module.c
+++ b/arch/x86/kernel/module.c
@@ -53,7 +53,7 @@ static unsigned long int get_module_load_offset(void)
 */
if (module_load_offset == 0)
module_load_offset =
-   (get_random_u32_below(1024) + 1) * PAGE_SIZE;
+   get_random_u32_inclusive(1, 1024) * PAGE_SIZE;
mutex_unlock(&module_kaslr_mutex);
}
return module_load_offset;
diff --git a/crypto/rsa-pkcs1pad.c b/crypto/rsa-pkcs1pad.c
index 0f722f8f779b..e75728f87ce5 100644
--- a/crypto/rsa-pkcs1pad.c
+++ b/crypto/rsa-pkcs1pad.c
@@ -253,7 +253,7 @@ static int pkcs1pad_encrypt(struct akcipher_request *req)
ps_end = ctx->key_size - req->src_len - 2;
req_ctx->in_buf[0] = 0x02;
for (i = 1; i < ps_end; i++)
-   req_ctx->in_buf[i] = 1 + get_random_u32_below(255);
+   req_ctx->in_buf[i] = get_random_u32_inclusive(1, 255);
req_ctx->in_buf[ps_end] = 0x00;
 
pkcs1pad_sg_set_buf(req_ctx->in_sg, req_ctx->in_buf,
diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index 079923d43ce2..e669acd2ebdd 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -962,11 +962,11 @@ static char *generate_random_sgl_divisions(struct 
test_sg_division *divs,
if (div == &divs[max_divs - 1] || get_random_u32_below(2) == 0)
this_len = remaining;
else
-   this_len = 1 + get_random_u32_below(remaining);
+   this_len = get_random_u32_inclusive(1, remaining);
div->proportion_of_total = this_len;
 
if (get_random_u32_below(4) == 0)
-   div->offset = (PAGE_SIZE - 128) + 
get_random_u32_below(128);
+   div->offset = get_random_u32_inclusive(PAGE_SIZE - 128, 
PAGE_SIZE - 1);
else if (get_random_u32_below(2) == 0)
div->offset = get_random_u32_below(32);
else
@@ -1094,12 +1094,12 @@ static void generate_random_testvec_config(struct 
testvec_config *cfg,
}
 
if (get_random_u32_below(2) == 0) {
-   cfg->iv_offset = 1 + get_random_u32_below(MAX_ALGAPI_ALIGNMASK);
+   cfg->iv_offset = get_random_u32_inclusive(1, 
MAX_ALGAPI_ALIGNMASK);
p += scnprintf(p, end - p, " iv_offset=%u", cfg->iv_offset);
}
 
if (get_random_u32_below(2) == 0) {
-   cfg->key_offset = 1 + 
get_random_u32_below(MAX_ALGAPI_ALIGNMASK);
+   

Re: [PATCH 2/4] fs: define a firmware security filesystem named fwsecurityfs

2022-11-17 Thread Greg Kroah-Hartman
On Mon, Nov 14, 2022 at 06:03:43PM -0500, Nayna wrote:
> 
> On 11/10/22 04:58, Greg Kroah-Hartman wrote:
> > On Wed, Nov 09, 2022 at 03:10:37PM -0500, Nayna wrote:
> > > On 11/9/22 08:46, Greg Kroah-Hartman wrote:
> > > > On Sun, Nov 06, 2022 at 04:07:42PM -0500, Nayna Jain wrote:
> > > > > securityfs is meant for Linux security subsystems to expose 
> > > > > policies/logs
> > > > > or any other information. However, there are various firmware security
> > > > > features which expose their variables for user management via the 
> > > > > kernel.
> > > > > There is currently no single place to expose these variables. 
> > > > > Different
> > > > > platforms use sysfs/platform specific filesystem(efivarfs)/securityfs
> > > > > interface as they find it appropriate. Thus, there is a gap in kernel
> > > > > interfaces to expose variables for security features.
> > > > > 
> > > > > Define a firmware security filesystem (fwsecurityfs) to be used by
> > > > > security features enabled by the firmware. These variables are 
> > > > > platform
> > > > > specific. This filesystem provides platforms a way to implement their
> > > > >own underlying semantics by defining own inode and file operations.
> > > > > 
> > > > > Similar to securityfs, the firmware security filesystem is recommended
> > > > > to be exposed on a well known mount point /sys/firmware/security.
> > > > > Platforms can define their own directory or file structure under this 
> > > > > path.
> > > > > 
> > > > > Example:
> > > > > 
> > > > > # mount -t fwsecurityfs fwsecurityfs /sys/firmware/security
> > > > Why not just use securityfs in /sys/security/firmware/ instead?  Then
> > > > you don't have to create a new filesystem and convince userspace to
> > > > mount it in a specific location?
> > >  From man 5 sysfs page:
> > > 
> > > /sys/firmware: This subdirectory contains interfaces for viewing and
> > > manipulating firmware-specific objects and attributes.
> > > 
> > > /sys/kernel: This subdirectory contains various files and subdirectories
> > > that provide information about the running kernel.
> > > 
> > > The security variables which are being exposed via fwsecurityfs are 
> > > managed
> > > by firmware, stored in firmware managed space and also often consumed by
> > > firmware for enabling various security features.
> > Ok, then just use the normal sysfs interface for /sys/firmware, why do
> > you need a whole new filesystem type?
> > 
> > >  From git commit b67dbf9d4c1987c370fd18fdc4cf9d8aaea604c2, the purpose of
> > > securityfs(/sys/kernel/security) is to provide a common place for all 
> > > kernel
> > > LSMs. The idea of
> > > fwsecurityfs(/sys/firmware/security) is to similarly provide a common 
> > > place
> > > for all firmware security objects.
> > > 
> > > /sys/firmware already exists. The patch now defines a new /security
> > > directory in it for firmware security features. Using /sys/kernel/security
> > > would mean scattering firmware objects in multiple places and confusing 
> > > the
> > > purpose of /sys/kernel and /sys/firmware.
> > sysfs is confusing already, no problem with making it more confusing :)
> > 
> > Just document where you add things and all should be fine.
> > 
> > > Even though fwsecurityfs code is based on securityfs, since the two
> > > filesystems expose different types of objects and have different
> > > requirements, there are distinctions:
> > > 
> > > 1. fwsecurityfs lets users create files in userspace, securityfs only 
> > > allows
> > > kernel subsystems to create files.
> > Wait, why would a user ever create a file in this filesystem?  If you
> > need that, why not use configfs?  That's what that is for, right?
> 
> The purpose of fwsecurityfs is not to expose configuration items but rather
> security objects used for firmware security features. I think these are more
> comparable to EFI variables, which are exposed via an EFI-specific
> filesystem, efivarfs, rather than configfs.
> 
> > 
> > > 2. firmware and kernel objects may have different requirements. For 
> > > example,
> > > consideration of namespacing. As per my understanding, namespacing is
> > > applied to kernel resources and not firmware resources. That's why it 
> > > makes
> > > sense to add support for namespacing in securityfs, but we concluded that
> > > fwsecurityfs currently doesn't need it. Another but similar example of it
> > > is: TPM space, which is exposed from hardware. For containers, the TPM 
> > > would
> > > be made as virtual/software TPM. Similarly for firmware space for
> > > containers, it would have to be something virtualized/software version of
> > > it.
> > I do not understand, sorry.  What does namespaces have to do with this?
> > sysfs can already handle namespaces just fine, why not use that?
> 
> Firmware objects are not namespaced. I mentioned it here as an example of
> the difference between firmware and kernel objects. It is also in response
> to the feedback from James Bottomley in RFC v2 
> [

Re: [PATCH v3 3/3] treewide: use get_random_u32_inclusive() when possible

2022-11-17 Thread Kees Cook
On Thu, Nov 17, 2022 at 09:29:06PM +0100, Jason A. Donenfeld wrote:
> These cases were done with this Coccinelle:
> 
> @@
> expression H;
> expression L;
> @@
> - (get_random_u32_below(H) + L)
> + get_random_u32_inclusive(L, H + L - 1)
> 
> @@
> expression H;
> expression L;
> expression E;
> @@
>   get_random_u32_inclusive(L,
>   H
> - + E
> - - E
>   )
> 
> @@
> expression H;
> expression L;
> expression E;
> @@
>   get_random_u32_inclusive(L,
>   H
> - - E
> - + E
>   )
> 
> @@
> expression H;
> expression L;
> expression E;
> expression F;
> @@
>   get_random_u32_inclusive(L,
>   H
> - - E
>   + F
> - + E
>   )
> 
> @@
> expression H;
> expression L;
> expression E;
> expression F;
> @@
>   get_random_u32_inclusive(L,
>   H
> - + E
>   + F
> - - E
>   )
> 
> And then subsequently cleaned up by hand, with several automatic cases
> rejected if it didn't make sense contextually.
> 
> Reviewed-by: Greg Kroah-Hartman 
> Reviewed-by: Jason Gunthorpe  # for infiniband
> Signed-off-by: Jason A. Donenfeld 
> ---
>  arch/x86/kernel/module.c  |  2 +-
>  crypto/rsa-pkcs1pad.c |  2 +-
>  crypto/testmgr.c  | 10 
>  drivers/bus/mhi/host/internal.h   |  2 +-
>  drivers/dma-buf/st-dma-fence-chain.c  |  2 +-
>  drivers/infiniband/core/cma.c |  2 +-
>  drivers/infiniband/hw/hns/hns_roce_ah.c   |  5 ++--
>  drivers/mtd/nand/raw/nandsim.c|  2 +-
>  drivers/net/wireguard/selftest/allowedips.c   |  8 +++---
>  .../broadcom/brcm80211/brcmfmac/p2p.c |  2 +-
>  .../net/wireless/intel/iwlwifi/mvm/mac-ctxt.c |  2 +-
>  fs/f2fs/segment.c |  6 ++---
>  kernel/kcsan/selftest.c   |  2 +-
>  lib/test_hexdump.c| 10 
>  lib/test_printf.c |  2 +-
>  lib/test_vmalloc.c|  6 ++---
>  mm/kasan/kasan_test.c |  6 ++---
>  mm/kfence/kfence_test.c   |  2 +-
>  mm/swapfile.c |  5 ++--
>  net/bluetooth/mgmt.c  |  5 ++--
>  net/core/pktgen.c | 25 ---
>  net/ipv4/tcp_input.c  |  2 +-
>  net/ipv6/addrconf.c   |  6 ++---
>  net/netfilter/nf_nat_helper.c |  2 +-
>  net/xfrm/xfrm_state.c |  2 +-
>  25 files changed, 56 insertions(+), 64 deletions(-)

Even the diffstat agrees this is a nice clean-up. :)

Reviewed-by: Kees Cook 

The only comment I have is that maybe these cases can just be left as-is
with _below()?

> - size_t len = get_random_u32_below(rs) + gs;
> + size_t len = get_random_u32_inclusive(gs, rs + gs - 1);

It seems like writing it in the form of base plus [0, limit) is clearer?

size_t len = gs + get_random_u32_below(rs);

But there is only a handful, so *shrug*

All the others are much cleaner rewritten as _inclusive().

-- 
Kees Cook


Re: [PATCH v3 2/3] treewide: use get_random_u32_{above,below}() instead of manual loop

2022-11-17 Thread Kees Cook
On Thu, Nov 17, 2022 at 09:29:05PM +0100, Jason A. Donenfeld wrote:
> These cases were done with this Coccinelle:
> 
> @@
> expression E;
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (I > E);
> +   I = get_random_u32_below(E + 1);
> 
> @@
> expression E;
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (I >= E);
> +   I = get_random_u32_below(E);
> 
> @@
> expression E;
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (I < E);
> +   I = get_random_u32_above(E - 1);
> 
> @@
> expression E;
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (I <= E);
> +   I = get_random_u32_above(E);
> 
> @@
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (!I);
> +   I = get_random_u32_above(0);
> 
> @@
> identifier I;
> @@
> -   do {
>   ... when != I
> - I = get_random_u32();
>   ... when != I
> -   } while (I == 0);
> +   I = get_random_u32_above(0);
> 
> @@
> expression E;
> @@
> - E + 1 + get_random_u32_below(U32_MAX - E)
> + get_random_u32_above(E)
> 
> Reviewed-by: Greg Kroah-Hartman 
> Signed-off-by: Jason A. Donenfeld 

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH v3 1/3] treewide: use get_random_u32_below() instead of deprecated function

2022-11-17 Thread Kees Cook
On Thu, Nov 17, 2022 at 09:29:04PM +0100, Jason A. Donenfeld wrote:
> This is a simple mechanical transformation done by:
> 
> @@
> expression E;
> @@
> - prandom_u32_max
> + get_random_u32_below
>   (E)
> 
> Reviewed-by: Greg Kroah-Hartman 
> Acked-by: Darrick J. Wong  # for xfs
> Reviewed-by: SeongJae Park  # for damon
> Reviewed-by: Jason Gunthorpe  # for infiniband
> Reviewed-by: Russell King (Oracle)  # for arm
> Acked-by: Ulf Hansson  # for mmc
> Signed-off-by: Jason A. Donenfeld 

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH v6] livepatch: Clear relocation targets on a module removal

2022-11-17 Thread Song Liu
Hi folks,

It seems we forgot about this work... What shall we do to move forward?

Thanks,
Song

On Thu, Sep 1, 2022 at 10:16 AM Song Liu  wrote:
>
> From: Miroslav Benes 
>
> Josh reported a bug:
>
>   When the object to be patched is a module, and that module is
>   rmmod'ed and reloaded, it fails to load with:
>
>   module: x86/modules: Skipping invalid relocation target, existing value is 
> nonzero for type 2, loc ba0302e9, val a03e293c
>   livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' 
> (-8)
>   livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to 
> load module 'nfsd'
>
>   The livepatch module has a relocation which references a symbol
>   in the _previous_ loading of nfsd. When apply_relocate_add()
>   tries to replace the old relocation with a new one, it sees that
>   the previous one is nonzero and it errors out.
>
>   On ppc64le, we have a similar issue:
>
>   module_64: livepatch_nfsd: Expected nop after call, got e8410018 at 
> e_show+0x60/0x548 [livepatch_nfsd]
>   livepatch: failed to initialize patch 'livepatch_nfsd' for module 'nfsd' 
> (-8)
>   livepatch: patch 'livepatch_nfsd' failed for module 'nfsd', refusing to 
> load module 'nfsd'
>
> He also proposed three different solutions. We could remove the error
> check in apply_relocate_add() introduced by commit eda9cec4c9a1
> ("x86/module: Detect and skip invalid relocations"). However the check
> is useful for detecting corrupted modules.
>
> We could also deny the patched modules to be removed. If it proved to be
> a major drawback for users, we could still implement a different
> approach. The solution would also complicate the existing code a lot.
>
> We thus decided to reverse the relocation patching (clear all relocation
> targets on x86_64). The solution is not
> universal and is too arch-specific, but it may prove to be simpler
> in the end.
>
> Reported-by: Josh Poimboeuf 
> Signed-off-by: Miroslav Benes 
> Signed-off-by: Song Liu 
>
> ---
>
> NOTE: powerpc32 code is only compile tested.
>
> Changes v5 => v6:
> 1. Fix powerpc64.
> 2. Fix compile for powerpc32.
>
> Changes v4 => v5:
> 1. Fix compile with powerpc.
>
> Changes v3 => v4:
> 1. Reuse __apply_relocate_add to make it more reliable in the long term.
>(Josh Poimboeuf)
> 2. Add back ppc64 logic from v2, with changes to match current code.
>(Josh Poimboeuf)
>
> Changes v2 => v3:
> 1. Rewrite x86 changes to match current code style.
> 2. Remove powerpc changes as there is no test coverage in v3.
> 3. Only keep 1/3 of v2.
>
> v2: https://lore.kernel.org/all/20190905124514.8944-1-mbe...@suse.cz/T/#u
> ---
>  arch/powerpc/kernel/module_32.c |  10 
>  arch/powerpc/kernel/module_64.c |  49 +++
>  arch/s390/kernel/module.c   |   8 +++
>  arch/x86/kernel/module.c| 102 +++-
>  include/linux/moduleloader.h|   7 +++
>  kernel/livepatch/core.c |  41 -
>  6 files changed, 189 insertions(+), 28 deletions(-)
>
> diff --git a/arch/powerpc/kernel/module_32.c b/arch/powerpc/kernel/module_32.c
> index ea6536171778..e3c312770453 100644
> --- a/arch/powerpc/kernel/module_32.c
> +++ b/arch/powerpc/kernel/module_32.c
> @@ -285,6 +285,16 @@ int apply_relocate_add(Elf32_Shdr *sechdrs,
> return 0;
>  }
>
> +#ifdef CONFIG_LIVEPATCH
> +void clear_relocate_add(Elf32_Shdr *sechdrs,
> +  const char *strtab,
> +  unsigned int symindex,
> +  unsigned int relsec,
> +  struct module *me)
> +{
> +}
> +#endif
> +
>  #ifdef CONFIG_DYNAMIC_FTRACE
>  notrace int module_trampoline_target(struct module *mod, unsigned long addr,
>  unsigned long *target)
> diff --git a/arch/powerpc/kernel/module_64.c b/arch/powerpc/kernel/module_64.c
> index 7e45dc98df8a..514951f97391 100644
> --- a/arch/powerpc/kernel/module_64.c
> +++ b/arch/powerpc/kernel/module_64.c
> @@ -739,6 +739,55 @@ int apply_relocate_add(Elf64_Shdr *sechdrs,
> return 0;
>  }
>
> +#ifdef CONFIG_LIVEPATCH
> +void clear_relocate_add(Elf64_Shdr *sechdrs,
> +  const char *strtab,
> +  unsigned int symindex,
> +  unsigned int relsec,
> +  struct module *me)
> +{
> +   unsigned int i;
> +   Elf64_Rela *rela = (void *)sechdrs[relsec].sh_addr;
> +   Elf64_Sym *sym;
> +   unsigned long *location;
> +   const char *symname;
> +   u32 *instruction;
> +
> +   pr_debug("Clearing ADD relocate section %u to %u\n", relsec,
> +sechdrs[relsec].sh_info);
> +
> +   for (i = 0; i < sechdrs[relsec].sh_size / sizeof(*rela); i++) {
> +   location = (void *)sechdrs[sechdrs[relsec].sh_info].sh_addr
> +   + rela[i].r_offset;
> +   sym = (Elf64_Sym *)sechdrs[symindex].sh_addr
> +   + ELF64_R_SYM(rela[i].r_info);
> +   sym

Re: [PATCH v3 3/3] treewide: use get_random_u32_inclusive() when possible

2022-11-17 Thread Jason A. Donenfeld
On Thu, Nov 17, 2022 at 01:57:13PM -0800, Kees Cook wrote:
> The only comment I have is that maybe these cases can just be left as-is
> with _below()?
> 
> > - size_t len = get_random_u32_below(rs) + gs;
> > + size_t len = get_random_u32_inclusive(gs, rs + gs - 1);
> 
> It seems like writing it in the form of base plus [0, limit) is clearer?
> 
>   size_t len = gs + get_random_u32_below(rs);
> 
> But there is only a handful, so *shrug*

Okay, I'll drop that one.

Jason


RE: [PATCH v2 3/3] treewide: use get_random_u32_between() when possible

2022-11-17 Thread David Laight
From: Theodore Ts'o
> Sent: 17 November 2022 15:43
...
> The problem with "between", "ranged", "spanning" is that they don't
> tell the reader whether we're dealing with an "open interval" or a
> "closed interval".  They are just different ways of saying that it's a
> range between, say, 0 and 20.  But it doesn't tell you whether it
> includes 0 or 20 or not.
> 
> The only way I can see for making it unambiguous is either to use the
> terminology "closed interval" or "inclusive".  And "open" and "closed"
> can have other meanings, so get_random_u32_inclusive() is going to be
> less confusing than get_random_u32_closed().

It has to be said that removing the extra function and requiring
the callers use 'base + get_random_below(high [+1] - base)' is
likely to be the only way to succinctly make the code readable
and understandable.

Otherwise readers either have to look up another function to see
what it does or waste variable brain cells on more trivia.

David




[PATCH v5 1/5] powerpc: mm: Replace p{u,m,4}d_is_leaf with p{u,m,4}d_leaf

2022-11-17 Thread Rohan McLure
Replace occurrences of p{u,m,4}d_is_leaf with p{u,m,4}d_leaf, as the
latter is the name given to checking that a higher-level entry in
multi-level paging contains a page translation entry (pte) throughout
all other archs.

A future patch will implement p{u,m,4}d_leaf stubs on all platforms so
that they may be referenced in generic code.

Signed-off-by: Rohan McLure 
---
V4: New patch
V5: Previously replaced stub definition for *_is_leaf with *_leaf. Do
that in a later patch
---
 arch/powerpc/kvm/book3s_64_mmu_radix.c   | 12 ++--
 arch/powerpc/mm/book3s64/radix_pgtable.c | 14 +++---
 arch/powerpc/mm/pgtable.c|  6 +++---
 arch/powerpc/mm/pgtable_64.c |  6 +++---
 arch/powerpc/xmon/xmon.c |  6 +++---
 5 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 5d5e12f3bf86..d29f8d1d97a6 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -497,7 +497,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t 
*pmd, bool full,
for (im = 0; im < PTRS_PER_PMD; ++im, ++p) {
if (!pmd_present(*p))
continue;
-   if (pmd_is_leaf(*p)) {
+   if (pmd_leaf(*p)) {
if (full) {
pmd_clear(p);
} else {
@@ -526,7 +526,7 @@ static void kvmppc_unmap_free_pud(struct kvm *kvm, pud_t 
*pud,
for (iu = 0; iu < PTRS_PER_PUD; ++iu, ++p) {
if (!pud_present(*p))
continue;
-   if (pud_is_leaf(*p)) {
+   if (pud_leaf(*p)) {
pud_clear(p);
} else {
pmd_t *pmd;
@@ -629,12 +629,12 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
pte_t pte,
new_pud = pud_alloc_one(kvm->mm, gpa);
 
pmd = NULL;
-   if (pud && pud_present(*pud) && !pud_is_leaf(*pud))
+   if (pud && pud_present(*pud) && !pud_leaf(*pud))
pmd = pmd_offset(pud, gpa);
else if (level <= 1)
new_pmd = kvmppc_pmd_alloc();
 
-   if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_is_leaf(*pmd)))
+   if (level == 0 && !(pmd && pmd_present(*pmd) && !pmd_leaf(*pmd)))
new_ptep = kvmppc_pte_alloc();
 
/* Check if we might have been invalidated; let the guest retry if so */
@@ -652,7 +652,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
pte_t pte,
new_pud = NULL;
}
pud = pud_offset(p4d, gpa);
-   if (pud_is_leaf(*pud)) {
+   if (pud_leaf(*pud)) {
unsigned long hgpa = gpa & PUD_MASK;
 
/* Check if we raced and someone else has set the same thing */
@@ -703,7 +703,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, 
pte_t pte,
new_pmd = NULL;
}
pmd = pmd_offset(pud, gpa);
-   if (pmd_is_leaf(*pmd)) {
+   if (pmd_leaf(*pmd)) {
unsigned long lgpa = gpa & PMD_MASK;
 
/* Check if we raced and someone else has set the same thing */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index cac727b01799..8ac27e031ff4 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -205,14 +205,14 @@ static void radix__change_memory_range(unsigned long 
start, unsigned long end,
pudp = pud_alloc(&init_mm, p4dp, idx);
if (!pudp)
continue;
-   if (pud_is_leaf(*pudp)) {
+   if (pud_leaf(*pudp)) {
ptep = (pte_t *)pudp;
goto update_the_pte;
}
pmdp = pmd_alloc(&init_mm, pudp, idx);
if (!pmdp)
continue;
-   if (pmd_is_leaf(*pmdp)) {
+   if (pmd_leaf(*pmdp)) {
ptep = pmdp_ptep(pmdp);
goto update_the_pte;
}
@@ -762,7 +762,7 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, 
unsigned long addr,
if (!pmd_present(*pmd))
continue;
 
-   if (pmd_is_leaf(*pmd)) {
+   if (pmd_leaf(*pmd)) {
if (!IS_ALIGNED(addr, PMD_SIZE) ||
!IS_ALIGNED(next, PMD_SIZE)) {
WARN_ONCE(1, "%s: unaligned range\n", __func__);
@@ -792,7 +792,7 @@ static void __meminit remove_pud_table(pud_t *pud_start, 
unsigned long addr,
if (!pud_present(*pud))
continue;
 
-   if (pud_is_leaf(*pud)) {
+   if (pud_leaf(*pud)) {
if (!IS_ALIGNED(addr, PUD_SIZE) ||
!IS_ALIGNED(n

[PATCH v5 2/5] powerpc: mm: Implement p{m,u,4}d_leaf on all platforms

2022-11-17 Thread Rohan McLure
The check that a higher-level entry in multi-level pages contains a page
translation entry (pte) is performed by p{m,u,4}d_leaf stubs, which may
be specialised for each choice of mmu. In a prior commit, we replaced
uses of the catch-all stubs p{m,u,4}d_is_leaf with p{m,u,4}d_leaf.

Replace the catch-all stub definitions for p{m,u,4}d_is_leaf with
definitions for p{m,u,4}d_leaf. A future patch will assume that
p{m,u,4}d_leaf is defined on all platforms.

In particular, implement pud_leaf for Book3E-64, pmd_leaf for all Book3E
and Book3S-64 platforms, with a catch-all definition for p4d_leaf.

Signed-off-by: Rohan McLure 
---
v5: Split patch that replaces p{m,u,4}d_is_leaf into two patches, first
replacing callsites and afterward providing generic definition.
Remove ifndef-defines implementing p{m,u}d_leaf in favour of
implementing stubs in headers belonging to the particular platforms
needing them.
---
 arch/powerpc/include/asm/book3s/32/pgtable.h |  4 
 arch/powerpc/include/asm/book3s/64/pgtable.h |  8 ++-
 arch/powerpc/include/asm/nohash/64/pgtable.h |  5 +
 arch/powerpc/include/asm/nohash/pgtable.h|  5 +
 arch/powerpc/include/asm/pgtable.h   | 22 ++--
 5 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 75823f39e042..921ae95cd939 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -242,6 +242,10 @@ static inline void pmd_clear(pmd_t *pmdp)
*pmdp = __pmd(0);
 }
 
+static inline bool pmd_leaf(pmd_t pmd)
+{
+   return false;
+}
 
 /*
  * When flushing the tlb entry for a page, we also need to flush the hash
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index c436d8422654..9a6db4ad3228 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1426,16 +1426,12 @@ static inline bool is_pte_rw_upgrade(unsigned long 
old_val, unsigned long new_va
 /*
  * Like pmd_huge() and pmd_large(), but works regardless of config options
  */
-#define pmd_is_leaf pmd_is_leaf
-#define pmd_leaf pmd_is_leaf
-static inline bool pmd_is_leaf(pmd_t pmd)
+static inline bool pmd_leaf(pmd_t pmd)
 {
return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
-#define pud_is_leaf pud_is_leaf
-#define pud_leaf pud_is_leaf
-static inline bool pud_is_leaf(pud_t pud)
+static inline bool pud_leaf(pud_t pud)
 {
return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
 }
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 879e9a6e5a87..34e618bb9140 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -141,6 +141,11 @@ static inline void pud_clear(pud_t *pudp)
*pudp = __pud(0);
 }
 
+static inline bool pud_leaf(pud_t pud)
+{
+   return false;
+}
+
 #define pud_none(pud)  (!pud_val(pud))
 #definepud_bad(pud)(!is_kernel_addr(pud_val(pud)) \
 || (pud_val(pud) & PUD_BAD_BITS))
diff --git a/arch/powerpc/include/asm/nohash/pgtable.h 
b/arch/powerpc/include/asm/nohash/pgtable.h
index d9067dfc531c..455ae13822ee 100644
--- a/arch/powerpc/include/asm/nohash/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/pgtable.h
@@ -60,6 +60,11 @@ static inline bool pte_hw_valid(pte_t pte)
return pte_val(pte) & _PAGE_PRESENT;
 }
 
+static inline bool pmd_leaf(pmd_t pmd)
+{
+   return false;
+}
+
 /*
  * Don't just check for any non zero bits in __PAGE_USER, since for book3e
  * and PTE_64BIT, PAGE_KERNEL_X contains _PAGE_BAP_SR which is also in
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 283f40d05a4d..68a3f271aebe 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -134,29 +134,11 @@ static inline void pte_frag_set(mm_context_t *ctx, void 
*p)
 }
 #endif
 
-#ifndef pmd_is_leaf
-#define pmd_is_leaf pmd_is_leaf
-static inline bool pmd_is_leaf(pmd_t pmd)
+#define p4d_leaf p4d_leaf
+static inline bool p4d_leaf(p4d_t p4d)
 {
return false;
 }
-#endif
-
-#ifndef pud_is_leaf
-#define pud_is_leaf pud_is_leaf
-static inline bool pud_is_leaf(pud_t pud)
-{
-   return false;
-}
-#endif
-
-#ifndef p4d_is_leaf
-#define p4d_is_leaf p4d_is_leaf
-static inline bool p4d_is_leaf(p4d_t p4d)
-{
-   return false;
-}
-#endif
 
 #define pmd_pgtable pmd_pgtable
 static inline pgtable_t pmd_pgtable(pmd_t pmd)
-- 
2.37.2



[PATCH v5 4/5] powerpc: mm: add p{te,md,ud}_user_accessible_page helpers

2022-11-17 Thread Rohan McLure
Add the following helpers for detecting whether a page table entry
is a leaf and is accessible to user space.

 * pte_user_accessible_page
 * pmd_user_accessible_page
 * pud_user_accessible_page

Also implement missing pud_user definitions for both Book3S/nohash 64-bit
systems, and pmd_user for Book3S/nohash 32-bit systems.

Signed-off-by: Rohan McLure 
---
V2: Provide missing pud_user implementations, use p{u,m}d_is_leaf.
V3: Provide missing pmd_user implementations as stubs in 32-bit.
V4: Use pmd_leaf, pud_leaf, and define pmd_user for 32 Book3E with
static inline method rather than macro.
---
 arch/powerpc/include/asm/book3s/32/pgtable.h |  4 
 arch/powerpc/include/asm/book3s/64/pgtable.h | 10 ++
 arch/powerpc/include/asm/nohash/32/pgtable.h |  5 +
 arch/powerpc/include/asm/nohash/64/pgtable.h | 10 ++
 arch/powerpc/include/asm/pgtable.h   | 15 +++
 5 files changed, 44 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index 921ae95cd939..b2fdd3cc81de 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -515,6 +515,10 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
 }
 
+static inline bool pmd_user(pmd_t pmd)
+{
+   return 0;
+}
 
 
 /* This low level function performs the actual PTE insertion
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 5fb910ab250d..e04e3cd378a6 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -602,6 +602,16 @@ static inline bool pte_user(pte_t pte)
return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED));
 }
 
+static inline bool pmd_user(pmd_t pmd)
+{
+   return !(pmd_raw(pmd) & cpu_to_be64(_PAGE_PRIVILEGED));
+}
+
+static inline bool pud_user(pud_t pud)
+{
+   return !(pud_raw(pud) & cpu_to_be64(_PAGE_PRIVILEGED));
+}
+
 #define pte_access_permitted pte_access_permitted
 static inline bool pte_access_permitted(pte_t pte, bool write)
 {
diff --git a/arch/powerpc/include/asm/nohash/32/pgtable.h 
b/arch/powerpc/include/asm/nohash/32/pgtable.h
index 0d40b33184eb..94b2a53f73d5 100644
--- a/arch/powerpc/include/asm/nohash/32/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/32/pgtable.h
@@ -209,6 +209,11 @@ static inline void pmd_clear(pmd_t *pmdp)
*pmdp = __pmd(0);
 }
 
+static inline bool pmd_user(pmd_t pmd)
+{
+   return false;
+}
+
 /*
  * PTE updates. This function is called whenever an existing
  * valid PTE is updated. This does -not- include set_pte_at()
diff --git a/arch/powerpc/include/asm/nohash/64/pgtable.h 
b/arch/powerpc/include/asm/nohash/64/pgtable.h
index 34e618bb9140..69304159aafc 100644
--- a/arch/powerpc/include/asm/nohash/64/pgtable.h
+++ b/arch/powerpc/include/asm/nohash/64/pgtable.h
@@ -123,6 +123,11 @@ static inline pte_t pmd_pte(pmd_t pmd)
return __pte(pmd_val(pmd));
 }
 
+static inline bool pmd_user(pmd_t pmd)
+{
+   return (pmd_val(pmd) & _PAGE_USER) == _PAGE_USER;
+}
+
 #define pmd_none(pmd)  (!pmd_val(pmd))
 #definepmd_bad(pmd)(!is_kernel_addr(pmd_val(pmd)) \
 || (pmd_val(pmd) & PMD_BAD_BITS))
@@ -163,6 +168,11 @@ static inline pte_t pud_pte(pud_t pud)
return __pte(pud_val(pud));
 }
 
+static inline bool pud_user(pud_t pud)
+{
+   return (pud_val(pud) & _PAGE_USER) == _PAGE_USER;
+}
+
 static inline pud_t pte_pud(pte_t pte)
 {
return __pud(pte_val(pte));
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 5db9fa157685..5227bbaa7acf 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -173,6 +173,21 @@ static inline int pud_pfn(pud_t pud)
 }
 #endif
 
+static inline bool pte_user_accessible_page(pte_t pte)
+{
+   return pte_present(pte) && pte_user(pte);
+}
+
+static inline bool pmd_user_accessible_page(pmd_t pmd)
+{
+   return pmd_leaf(pmd) && pmd_present(pmd) && pmd_user(pmd);
+}
+
+static inline bool pud_user_accessible_page(pud_t pud)
+{
+   return pud_leaf(pud) && pud_present(pud) && pud_user(pud);
+}
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_H */
-- 
2.37.2



[PATCH v5 5/5] powerpc: mm: support page table check

2022-11-17 Thread Rohan McLure
On creation and clearing of a page table mapping, instrument such calls
by invoking page_table_check_pte_set and page_table_check_pte_clear
respectively. These calls serve as a sanity check against illegal
mappings.

Enable ARCH_SUPPORTS_PAGE_TABLE_CHECK for all ppc64, and 32-bit
platforms implementing Book3S.

Change pud_pfn to be a runtime bug rather than a build bug as it is
consumed by page_table_check_pud_{clear,set} which are not called.

See also:

riscv support in commit 3fee229a8eb9 ("riscv/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
arm64 in commit 42b2547137f5 ("arm64/mm: enable
ARCH_SUPPORTS_PAGE_TABLE_CHECK")
x86_64 in commit d283d422c6c4 ("x86: mm: add x86_64 support for page table
check")

Reviewed-by: Russell Currey 
Reviewed-by: Christophe Leroy 
Signed-off-by: Rohan McLure 
---
V2: Update spacing and types assigned to pte_update calls.
V3: Update one last pte_update call to remove __pte invocation.
---
 arch/powerpc/Kconfig |  1 +
 arch/powerpc/include/asm/book3s/32/pgtable.h |  9 -
 arch/powerpc/include/asm/book3s/64/pgtable.h | 18 +++---
 arch/powerpc/include/asm/nohash/32/pgtable.h |  7 ++-
 arch/powerpc/include/asm/nohash/64/pgtable.h |  8 ++--
 arch/powerpc/include/asm/nohash/pgtable.h|  1 +
 6 files changed, 37 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 699df27b0e2f..1d943a13a204 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -150,6 +150,7 @@ config PPC
select ARCH_STACKWALK
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_DEBUG_PAGEALLOCif PPC_BOOK3S || PPC_8xx || 40x
+   select ARCH_SUPPORTS_PAGE_TABLE_CHECK
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_MEMTEST
diff --git a/arch/powerpc/include/asm/book3s/32/pgtable.h 
b/arch/powerpc/include/asm/book3s/32/pgtable.h
index b2fdd3cc81de..44703c8c590c 100644
--- a/arch/powerpc/include/asm/book3s/32/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/32/pgtable.h
@@ -53,6 +53,8 @@
 
 #ifndef __ASSEMBLY__
 
+#include 
+
 static inline bool pte_user(pte_t pte)
 {
return pte_val(pte) & _PAGE_USER;
@@ -337,7 +339,11 @@ static inline int __ptep_test_and_clear_young(struct 
mm_struct *mm,
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm, unsigned long 
addr,
   pte_t *ptep)
 {
-   return __pte(pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, 0, 0));
+   pte_t old_pte = __pte(pte_update(mm, addr, ptep, ~_PAGE_HASHPTE, 0, 0));
+
+   page_table_check_pte_clear(mm, addr, old_pte);
+
+   return old_pte;
 }
 
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
@@ -529,6 +535,7 @@ static inline bool pmd_user(pmd_t pmd)
 static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pte, int percpu)
 {
+   page_table_check_pte_set(mm, addr, ptep, pte);
 #if defined(CONFIG_SMP) && !defined(CONFIG_PTE_64BIT)
/* First case is 32-bit Hash MMU in SMP mode with 32-bit PTEs. We use 
the
 * helper pte_update() which does an atomic update. We need to do that
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e04e3cd378a6..436632d04304 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -162,6 +162,8 @@
 #define PAGE_KERNEL_ROX__pgprot(_PAGE_BASE | _PAGE_KERNEL_ROX)
 
 #ifndef __ASSEMBLY__
+#include 
+
 /*
  * page table defines
  */
@@ -465,8 +467,11 @@ static inline void huge_ptep_set_wrprotect(struct 
mm_struct *mm,
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
   unsigned long addr, pte_t *ptep)
 {
-   unsigned long old = pte_update(mm, addr, ptep, ~0UL, 0, 0);
-   return __pte(old);
+   pte_t old_pte = __pte(pte_update(mm, addr, ptep, ~0UL, 0, 0));
+
+   page_table_check_pte_clear(mm, addr, old_pte);
+
+   return old_pte;
 }
 
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
@@ -475,11 +480,16 @@ static inline pte_t ptep_get_and_clear_full(struct 
mm_struct *mm,
pte_t *ptep, int full)
 {
if (full && radix_enabled()) {
+   pte_t old_pte;
+
/*
 * We know that this is a full mm pte clear and
 * hence can be sure there is no parallel set_pte.
 */
-   return radix__ptep_get_and_clear_full(mm, addr, ptep, full);
+   old_pte = radix__ptep_get_and_clear_full(mm, addr, ptep, full);
+   page_table_check_pte_clear(mm, addr, old_pte);
+
+   return old_pte;
}
return ptep_get_and_clear(mm, addr, ptep);
 }
@@ -865,6 +875,8 @@ static inline void __set_pte_at(struct mm_struct *mm, 
unsigned long addr,
 */
pte = __pt

[PATCH v5 3/5] powerpc: mm: Add common pud_pfn stub for all platforms

2022-11-17 Thread Rohan McLure
Prior to this commit, pud_pfn was implemented with BUILD_BUG as the inline
function for 64-bit Book3S systems but is never included, as its
invocations in generic code are guarded by calls to pud_devmap which return
zero on such systems. A future patch will provide support for page table
checks, the generic code for which depends on a pud_pfn stub being
implemented, even while the patch will not interact with puds directly.

Remove the 64-bit Book3S stub and define pud_pfn to warn on all
platforms. pud_pfn may be defined properly on a per-platform basis
should it grow real usages in future.

Signed-off-by: Rohan McLure 
---
V2: Remove conditional BUILD_BUG and BUG. Instead warn on usage.
V3: Replace WARN with WARN_ONCE, which should suffice to demonstrate
misuse of puds.
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 10 --
 arch/powerpc/include/asm/pgtable.h   | 14 ++
 2 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 9a6db4ad3228..5fb910ab250d 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1394,16 +1394,6 @@ static inline int pgd_devmap(pgd_t pgd)
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-static inline int pud_pfn(pud_t pud)
-{
-   /*
-* Currently all calls to pud_pfn() are gated around a pud_devmap()
-* check so this should never be used. If it grows another user we
-* want to know about it.
-*/
-   BUILD_BUG();
-   return 0;
-}
 #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
 pte_t ptep_modify_prot_start(struct vm_area_struct *, unsigned long, pte_t *);
 void ptep_modify_prot_commit(struct vm_area_struct *, unsigned long,
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 68a3f271aebe..5db9fa157685 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -159,6 +159,20 @@ struct seq_file;
 void arch_report_meminfo(struct seq_file *m);
 #endif /* CONFIG_PPC64 */
 
+/*
+ * Currently only consumed by page_table_check_pud_{set,clear}. Since clears
+ * and sets to page table entries at any level are done through
+ * page_table_check_pte_{set,clear}, provide stub implementation.
+ */
+#ifndef pud_pfn
+#define pud_pfn pud_pfn
+static inline int pud_pfn(pud_t pud)
+{
+   WARN_ONCE(1, "pud: platform does not use pud entries directly");
+   return 0;
+}
+#endif
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_POWERPC_PGTABLE_H */
-- 
2.37.2



Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread Kees Cook
On Wed, Nov 16, 2022 at 10:16:34AM -0800, Linus Torvalds wrote:
> There _are_ also small random cases too, like get_cmdline(). Maybe
> that counts as ptrace, but the execve() case most definitely does not.

Oh, er, why does get_arg_page() even need FOLL_FORCE? This is writing the
new stack contents to the nascent bprm->vma, which was newly allocated
with VM_STACK_FLAGS, which an arch can override, but they all appear to include
VM_WRITE | VM_MAYWRITE.

-- 
Kees Cook


Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread Linus Torvalds
On Thu, Nov 17, 2022 at 2:58 PM Kees Cook  wrote:
>
> Oh, er, why does get_arg_page() even need FOLL_FORCE? This is writing the
> new stack contents to the nascent bprm->vma, which was newly allocated
> with VM_STACK_FLAGS, which an arch can override, but they all appear to 
> include
> VM_WRITE | VM_MAYWRITE.

Yeah, it does seem entirely superfluous.

It's been there since the very beginning (although in that original
commit b6a2fea39318 it was there as a '1' to the 'force' argument to
get_user_pages()).

I *think* it can be just removed. But as long as it exists, it should
most definitely not be renamed to FOLL_PTRACE.

There's a slight worry that it currently hides some other setup issue
that makes it matter, since it's been that way so long, but I can't
see what it is.

 Linus


Re: [PATCH mm-unstable v1 20/20] mm: rename FOLL_FORCE to FOLL_PTRACE

2022-11-17 Thread Kees Cook
On Thu, Nov 17, 2022 at 03:20:01PM -0800, Linus Torvalds wrote:
> On Thu, Nov 17, 2022 at 2:58 PM Kees Cook  wrote:
> >
> > Oh, er, why does get_arg_page() even need FOLL_FORCE? This is writing the
> > new stack contents to the nascent bprm->vma, which was newly allocated
> > with VM_STACK_FLAGS, which an arch can override, but they all appear to 
> > include
> > VM_WRITE | VM_MAYWRITE.
> 
> Yeah, it does seem entirely superfluous.
> 
> It's been there since the very beginning (although in that original
> commit b6a2fea39318 it was there as a '1' to the 'force' argument to
> get_user_pages()).
> 
> I *think* it can be just removed. But as long as it exists, it should
> most definitely not be renamed to FOLL_PTRACE.
> 
> There's a slight worry that it currently hides some other setup issue
> that makes it matter, since it's been that way so long, but I can't
> see what it is.

My test system boots happily with it removed. I'll throw it into -next
and see if anything melts...

-- 
Kees Cook


RE: [patch 01/39] PCI/MSI: Check for MSI enabled in __pci_msix_enable()

2022-11-17 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, November 11, 2022 9:54 PM
> 
> PCI/MSI and PCI/MSI-X are mutually exclusive, but the MSI-X enable code
> lacks a check for already enabled MSI.
> 
> Signed-off-by: Thomas Gleixner 
> ---
>  drivers/pci/msi/msi.c |5 +
>  1 file changed, 5 insertions(+)
> 
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -935,6 +935,11 @@ static int __pci_enable_msix_range(struc
>   if (maxvec < minvec)
>   return -ERANGE;
> 
> + if (dev->msi_enabled) {
> + pci_info(dev, "can't enable MSI-X (MSI already enabled)\n");
> + return -EINVAL;
> + }
> +
>   if (WARN_ON_ONCE(dev->msix_enabled))
>   return -EINVAL;
> 

The same check remains in __pci_enable_msix():

        /* Check whether driver already requested for MSI IRQ */
        if (dev->msi_enabled) {
                pci_info(dev, "can't enable MSI-X (MSI IRQ already assigned)\n");
                return -EINVAL;
        }
        return msix_capability_init(dev, entries, nvec, affd);

It's removed later in patch 33 when sanitizing MSI-X checks. But logically
the removal can come with this patch.


RE: [patch 33/39] PCI/MSI: Sanitize MSI-X checks

2022-11-17 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, November 11, 2022 9:55 PM
> 
> @@ -785,7 +786,7 @@ int __pci_enable_msix_range(struct pci_d
>   return -ENOSPC;
>   }
> 
> - rc = __pci_enable_msix(dev, entries, nvec, affd, flags);
> + rc = msix_capability_init(dev, entries, nvec, affd);
>   if (rc == 0)
>   return nvec;
> 

The check in the following lines doesn't make sense now:

        if (rc < minvec)
                return -ENOSPC;


RE: [patch 00/39] genirq, PCI/MSI: Support for per device MSI and PCI/IMS - Part 1 cleanups

2022-11-17 Thread Tian, Kevin
> From: Thomas Gleixner 
> Sent: Friday, November 11, 2022 9:54 PM
> 
> Enough of history and theory. Here comes part 1:
> 
> This is just a cleanup and a reorganisation of the PCI/MSI code which
> became quite an unreadable mess over time. There is no intentional
> functional change in this series.
> 
> It's just a separate step to make the subsequent changes in the
> infrastructure easier both to implement and to review.
> 
> Thanks,
> 
>   tglx

The entire series looks good to me except for a couple of nits noted in
replies to the individual patches:

Reviewed-by: Kevin Tian