Re: [PATCH v5 11/17] x86: pgtable: convert __tlb_remove_table() to use struct ptdesc

2025-01-08 Thread Kevin Brodsky
On 08/01/2025 07:57, Qi Zheng wrote:
> Convert __tlb_remove_table() to use struct ptdesc, which will help to move
> pagetable_dtor() to __tlb_remove_table().
>
> And page tables shouldn't have swap cache, so use pagetable_free() instead
> of free_page_and_swap_cache() to free page table pages.
>
> Signed-off-by: Qi Zheng 

Definitely a good idea to have split patch 11 from v4.

Reviewed-by: Kevin Brodsky 

- Kevin



[PATCH v3 20/28] powerpc/ftrace: Use RCU in all users of __module_text_address().

2025-01-08 Thread Sebastian Andrzej Siewior
__module_text_address() can be invoked within an RCU section, so there is
no requirement to have preemption disabled.

Replace the preempt_disable() section around __module_text_address()
with RCU.

Cc: Christophe Leroy 
Cc: Madhavan Srinivasan 
Cc: Mark Rutland 
Cc: Masami Hiramatsu 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
Cc: Nicholas Piggin 
Cc: Steven Rostedt 
Cc: linux-trace-ker...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Sebastian Andrzej Siewior 
---
 arch/powerpc/kernel/trace/ftrace.c   | 6 ++
 arch/powerpc/kernel/trace/ftrace_64_pg.c | 6 ++
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index 5ccd791761e8f..558d7f4e4bea6 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -115,10 +115,8 @@ static unsigned long ftrace_lookup_module_stub(unsigned long ip, unsigned long a
 {
struct module *mod = NULL;
 
-   preempt_disable();
-   mod = __module_text_address(ip);
-   preempt_enable();
-
+   scoped_guard(rcu)
+   mod = __module_text_address(ip);
if (!mod)
pr_err("No module loaded at addr=%lx\n", ip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.c b/arch/powerpc/kernel/trace/ftrace_64_pg.c
index 98787376eb87c..531d40f10c8a1 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.c
+++ b/arch/powerpc/kernel/trace/ftrace_64_pg.c
@@ -120,10 +120,8 @@ static struct module *ftrace_lookup_module(struct dyn_ftrace *rec)
 {
struct module *mod;
 
-   preempt_disable();
-   mod = __module_text_address(rec->ip);
-   preempt_enable();
-
+   scoped_guard(rcu)
+   mod = __module_text_address(rec->ip);
if (!mod)
pr_err("No module loaded at addr=%lx\n", rec->ip);
 
-- 
2.47.1
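For readers unfamiliar with the guard infrastructure used above: scoped_guard(rcu) (from <linux/cleanup.h>, with the rcu guard declared alongside the RCU API) holds the RCU read lock for exactly the statement that follows it and drops it automatically on scope exit. A rough, illustrative (not compilable on its own) equivalent of the converted hunk:

```c
/* Illustrative only: what scoped_guard(rcu) achieves in the hunk above.
 * The real macro wraps the statement in a guard object whose cleanup
 * handler calls rcu_read_unlock() automatically, so early returns
 * cannot leak the lock.
 */
rcu_read_lock();
mod = __module_text_address(ip);  /* module can't be freed inside the RCU section */
rcu_read_unlock();
```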




[PATCH v3 01/28] module: Extend the preempt disabled section in dereference_symbol_descriptor().

2025-01-08 Thread Sebastian Andrzej Siewior
dereference_symbol_descriptor() needs to obtain the module pointer
belonging to the passed pointer in order to resolve that pointer.
The returned mod pointer is obtained under RCU-sched/preempt_disable()
guarantees and needs to be used within this section to ensure that the
module is not removed in the meantime.

Extend the preempt_disable() section to also cover
dereference_module_function_descriptor().

Fixes: 04b8eb7a4ccd9 ("symbol lookup: introduce dereference_symbol_descriptor()")
Cc: James E.J. Bottomley 
Cc: Christophe Leroy 
Cc: Helge Deller 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
Cc: Nicholas Piggin 
Cc: Sergey Senozhatsky 
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Sergey Senozhatsky 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Sebastian Andrzej Siewior 
---
 include/linux/kallsyms.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60cb..1c6a6c1704d8d 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -57,10 +57,10 @@ static inline void *dereference_symbol_descriptor(void *ptr)
 
preempt_disable();
mod = __module_address((unsigned long)ptr);
-   preempt_enable();
 
if (mod)
ptr = dereference_module_function_descriptor(mod, ptr);
+   preempt_enable();
 #endif
return ptr;
 }
-- 
2.47.1




[PATCH] PCI/AER: Add kernel.aer_print_skip_mask to control aer log

2025-01-08 Thread Bijie Xu
Certain PCIe devices installed on some servers occasionally produce a
large number of AER correctable error logs, which is quite noisy. Add
the sysctl parameter kernel.aer_print_skip_mask to skip printing AER
errors of certain severities.

The AER severity can be 0 (NONFATAL), 1 (FATAL), or 2 (CORRECTABLE).
The three low bits of the mask map to these three severities: setting
bit 0 skips printing NONFATAL AER errors, setting bit 1 skips FATAL
errors, and setting bit 2 skips CORRECTABLE errors. Multiple bits can
be set to skip multiple severities.

Signed-off-by: Bijie Xu 
---
 drivers/pci/pcie/aer.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 80c5ba8d8296..b46973526bcf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev,
pci_dev_aer_stats_incr(dev, info);
 }
 
+unsigned int aer_print_skip_mask __read_mostly;
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
int layer, agent;
@@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
goto out;
}
 
+   if ((1 << info->severity) & aer_print_skip_mask)
+   goto out;
+
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
@@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void)
return -ENXIO;
return pcie_port_service_register(&aerdriver);
 }
+
+static const struct ctl_table aer_print_skip_mask_sysctls[] = {
+   {
+   .procname   = "aer_print_skip_mask",
+   .data   = &aer_print_skip_mask,
+   .maxlen = sizeof(unsigned int),
+   .mode   = 0644,
+   .proc_handler   = &proc_douintvec,
+   },
+   {}
+};
+
+static int __init aer_print_skip_mask_sysctl_init(void)
+{
+   register_sysctl_init("kernel", aer_print_skip_mask_sysctls);
+   return 0;
+}
+
+late_initcall(aer_print_skip_mask_sysctl_init);
-- 
2.25.1




[PATCH] perf machine: Don't ignore _etext when not a text symbol

2025-01-08 Thread Christophe Leroy
Depending on how vmlinux.lds is written, _etext might be the very
first data symbol instead of the very last text symbol.

Don't require it to be a text symbol, accept any symbol type.

Fixes: ed9adb2035b5 ("perf machine: Read also the end of the kernel")
Signed-off-by: Christophe Leroy 
---
 tools/perf/util/machine.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 27d5345d2b30..9be2f4479f52 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -1003,7 +1003,7 @@ static int machine__get_running_kernel_start(struct machine *machine,
 
err = kallsyms__get_symbol_start(filename, "_edata", &addr);
if (err)
-   err = kallsyms__get_function_start(filename, "_etext", &addr);
+   err = kallsyms__get_symbol_start(filename, "_etext", &addr);
if (!err)
*end = addr;
 
-- 
2.47.0




Re: [PATCH RESEND v1 1/5] crash: remove an unused argument from reserve_crashkernel_generic()

2025-01-08 Thread Baoquan he
On 01/08/25 at 03:44pm, Sourabh Jain wrote:
> The cmdline argument is not used in reserve_crashkernel_generic(), so
> remove it. Correspondingly, all the callers have been updated as well.
> 
> No functional change intended.
> 
> Cc: Andrew Morton 
> Cc: Baoquan he 
> Cc: Hari Bathini 
> CC: Madhavan Srinivasan 
> Cc: Michael Ellerman 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Sourabh Jain 
> ---
>  arch/arm64/mm/init.c  |  6 ++
>  arch/loongarch/kernel/setup.c |  5 ++---
>  arch/riscv/mm/init.c  |  6 ++
>  arch/x86/kernel/setup.c   |  6 ++
>  include/linux/crash_reserve.h | 11 +--
>  kernel/crash_reserve.c|  9 -
>  6 files changed, 17 insertions(+), 26 deletions(-)

Acked-by: Baoquan He 




Re: [PATCH RESEND v1 5/5] crash: option to let arch decide mem range is usable

2025-01-08 Thread Sourabh Jain





On 08/01/25 17:08, Baoquan he wrote:

On 01/08/25 at 03:44pm, Sourabh Jain wrote:
...snip...

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..407f8b0346aa 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -205,6 +205,15 @@ static inline int arch_kimage_file_post_load_cleanup(struct kimage *image)
  }
  #endif
  
+#ifndef arch_check_excluded_range

+static inline int arch_check_excluded_range(struct kimage *image,
+   unsigned long start,
+   unsigned long end)
+{
+   return 0;
+}
+#endif
+
  #ifdef CONFIG_KEXEC_SIG
  #ifdef CONFIG_SIGNED_PE_FILE_VERIFICATION
  int kexec_kernel_verify_pe_sig(const char *kernel, unsigned long kernel_len);
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 3eedb8c226ad..52e1480dbfa1 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -464,6 +464,12 @@ static int locate_mem_hole_top_down(unsigned long start, unsigned long end,
continue;
}
  
+		/* Make sure this does not conflict exclude range */

^
 Make sure this doesn't conflict with excluded range?


+   if (arch_check_excluded_range(image, temp_start, temp_end)) {
+   temp_start = temp_start - PAGE_SIZE;
+   continue;
+   }
+
/* We found a suitable memory range */
break;
} while (1);
@@ -498,6 +504,12 @@ static int locate_mem_hole_bottom_up(unsigned long start, unsigned long end,
continue;
}
  
+		/* Make sure this does not conflict exclude range */

^
  Ditto.


I will update both comments.

Thanks,
Sourabh Jain




Re: [PATCH RESEND v1 2/5] crash: let arch decide crash memory export to iomem_resource

2025-01-08 Thread Sourabh Jain

Hello Baoquan

On 08/01/25 16:55, Baoquan he wrote:

Hi,

On 01/08/25 at 03:44pm, Sourabh Jain wrote:

insert_crashkernel_resources() adds crash memory to iomem_resource if
generic crashkernel reservation is enabled on an architecture.

On PowerPC, system RAM is added to iomem_resource. See commit
c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem").

Enabling generic crashkernel reservation on PowerPC leads to a conflict
when system RAM is added to iomem_resource because a part of the system
RAM, the crashkernel memory, has already been added to iomem_resource.

The next commit in the series "powerpc/crash: use generic crashkernel
reservation" enables generic crashkernel reservation on PowerPC. If the
crashkernel is added to iomem_resource, the kernel fails to add
system RAM to /proc/iomem and prints the following traces:

CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.13.0-rc2+
snip...
NIP [c2016b3c] add_system_ram_resources+0xf0/0x15c
LR [c2016b34] add_system_ram_resources+0xe8/0x15c
Call Trace:
[c484bbc0] [c2016b34] add_system_ram_resources+0xe8/0x15c
[c484bc20] [c0010a4c] do_one_initcall+0x7c/0x39c
[c484bd00] [c2005418] do_initcalls+0x144/0x18c
[c484bd90] [c2005714] kernel_init_freeable+0x21c/0x290
[c484bdf0] [c00110f4] kernel_init+0x2c/0x1b8
[c484be50] [c000dd3c] ret_from_kernel_user_thread+0x14/0x1c

To avoid this, an architecture hook is added in
insert_crashkernel_resources(), allowing the architecture to decide
whether crashkernel memory should be added to iomem_resource.

Have you tried defining HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY in ppc to
add crashkernel region to iomem early?



I didn’t try, but I think even if I do, the kernel will run into the
same issue, because as soon as the crashkernel is added to iomem there
will be a resource conflict when the PowerPC code tries to add system
RAM to iomem, which happens during subsys_initcall.

Regardless, I will give it a try.


Now there are two branches in the
existing code; adding a hook will make it three.


I agree. I can try updating the powerpc code to not consider crashkernel
memory an iomem conflict.

Thanks for the review.

- Sourabh Jain



Re: [PATCH v3 5/6] s390/crash: Use note name macros

2025-01-08 Thread Dave Martin
On Wed, Jan 08, 2025 at 01:53:51PM +0900, Akihiko Odaki wrote:
> On 2025/01/08 1:17, Dave Martin wrote:
> > Hi,
> > 
> > On Tue, Jan 07, 2025 at 09:45:56PM +0900, Akihiko Odaki wrote:
> > > Use note name macros to match with the userspace's expectation.
> > > 
> > > Signed-off-by: Akihiko Odaki 
> > > ---
> > >   arch/s390/kernel/crash_dump.c | 62 
> > > ---
> > >   1 file changed, 23 insertions(+), 39 deletions(-)
> > > 
> > > diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
> > 
> > [...]

> > > +#define NT_INIT(buf, type, desc) \
> > > + (nt_init_name((buf), NT_ ## type, &(desc), sizeof(desc), NN_ ## type))

[...]

> > (Note also, the outer parentheses and the parentheses around (buf)
> > appear redundant -- although harmless?)
> 
> They only make a difference in trivial corner cases and may look needlessly
> verbose.

(In case there was a misunderstanding here, I meant that some
parentheses can be removed without affecting correctness:

#define NT_INIT(buf, type, desc) \
nt_init_name(buf, NT_ ## type, &(desc), sizeof(desc), NN_ ## type)

It still doesn't matter though -- and some people do prefer to be
defensive anyway and err on the side of having too many parentheses
rather than too few.)

[...]

Cheers
---Dave



Re: [PATCH v8 0/7] PCI: Consolidate TLP Log reading and printing

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:40PM +0200, Ilpo Järvinen wrote:
> This series has the remaining patches of the AER & DPC TLP Log handling
> consolidation and now includes a few minor improvements to the earlier
> accepted TLP Logging code.
> 
> v8:
> - Added missing parameter to kerneldoc.
> - Dropped last patch due to conflict with the pci_printk() cleanup
>   series (will move the patch into that series).
> 
> v7:
> - Explain in commit message reasoning why eetlp_prefix_max stores Max
>   End-End TLP Prefixes value instead of limiting it by the bridge/RP
>   imposed limits
> - Take into account the TLP Prefix Log Present flag.
> - Align PCI_ERR_CAP_* flags in pci_regs.h
> - Add EE_PREFIX_STR define to be able to take its sizeof() for output
>   char[] sizing.
> 
> v6:
> - Preserve "AER:"/"DPC:" prefix on the printed TLP line
> - New patch to also add "AER:" on other lines of the AER error dump
> 
> v5:
> - Fix build with AER=y and DPC=n
> - Match kerneldoc and function parameter name
> 
> v4:
> - Added patches:
>   - Remove EXPORT of pcie_read_tlp_log()
>   - Moved code to pcie/tlp.c and build only with AER enabled
>   - Match variables in prototype and function
>   - int -> unsigned int conversion
>   - eetlp_prefix_max into own patch
> - struct pcie_tlp_log param consistently called "log" within tlp.c
> - Moved function prototypes into drivers/pci/pci.h
> - Describe AER/DPC differences more clearly in one commit message
> 
> v3:
> - Small rewording in a commit message
> 
> v2:
> - Don't add EXPORT()s
> - Don't include igxbe changes
> - Don't use pr_cont() as it's incompatible with pci_err() and according
>   to Andy Shevchenko should not be used in the first place
> 
> 
> Ilpo Järvinen (7):
>   PCI: Don't expose pcie_read_tlp_log() outside of PCI subsystem
>   PCI: Move TLP Log handling to own file
>   PCI: Make pcie_read_tlp_log() signature same
>   PCI: Use unsigned int i in pcie_read_tlp_log()
>   PCI: Store # of supported End-End TLP Prefixes
>   PCI: Add TLP Prefix reading into pcie_read_tlp_log()
>   PCI: Create helper to print TLP Header and Prefix Log
> 
>  drivers/pci/ats.c |   2 +-
>  drivers/pci/pci.c |  28 -
>  drivers/pci/pci.h |   9 +++
>  drivers/pci/pcie/Makefile |   2 +-
>  drivers/pci/pcie/aer.c|  15 ++---
>  drivers/pci/pcie/dpc.c|  14 ++---
>  drivers/pci/pcie/tlp.c| 113 ++
>  drivers/pci/probe.c   |  14 +++--
>  include/linux/aer.h   |   3 +-
>  include/linux/pci.h   |   2 +-
>  include/uapi/linux/pci_regs.h |  11 ++--
>  11 files changed, 153 insertions(+), 60 deletions(-)
>  create mode 100644 drivers/pci/pcie/tlp.c
> 
> -- 

Can you please include a base commit in future revisions?

I was able to apply the set to v6.13-rc6, but I tried a couple of the
PCI repo branches before that, which didn't apply.

Thanks,
Yazen



Re: [PATCH RESEND v1 4/5] powerpc/crash: use generic crashkernel reservation

2025-01-08 Thread Mahesh J Salgaonkar
On 2025-01-08 15:44:57 Wed, Sourabh Jain wrote:
> Commit 0ab97169aa05 ("crash_core: add generic function to do
> reservation") added a generic function to reserve crashkernel memory.
> So let's use the same function on powerpc and remove the
> architecture-specific code that essentially does the same thing.
> 
> The generic crashkernel reservation also provides a way to split the
> crashkernel reservation into high and low memory reservations, which can
> be enabled for powerpc in the future.
> 
> Along with moving to the generic crashkernel reservation, the code
> related to finding the base address for the crashkernel has been
> separated into its own function named get_crash_base() for better
> readability and maintainability.
> 
> To prevent crashkernel memory from being added to iomem_resource, the
> function arch_add_crash_res_to_iomem() has been introduced. For further
> details on why this should not be done for the PowerPC architecture,
> please refer to the previous commit titled "crash: let arch decide crash
> memory export to iomem_resource".
> 
> Cc: Andrew Morton 
> Cc: Baoquan he 
> Cc: Hari Bathini 
> CC: Madhavan Srinivasan 
> Cc: Michael Ellerman 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Sourabh Jain 
> ---
>  arch/powerpc/Kconfig |  3 +
>  arch/powerpc/include/asm/crash_reserve.h | 18 +
>  arch/powerpc/include/asm/kexec.h |  4 +-
>  arch/powerpc/kernel/prom.c   |  2 +-
>  arch/powerpc/kexec/core.c| 90 ++--
>  5 files changed, 63 insertions(+), 54 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/crash_reserve.h
[...]
> @@ -113,9 +113,9 @@ int setup_new_fdt_ppc64(const struct kimage *image, void *fdt, struct crash_mem
>  
>  #ifdef CONFIG_CRASH_RESERVE
>  int __init overlaps_crashkernel(unsigned long start, unsigned long size);
> -extern void reserve_crashkernel(void);
> +extern void arch_reserve_crashkernel(void);

Do we really need to rename this? It is still called from the powerpc
arch and not from the common code.

>  #else
> -static inline void reserve_crashkernel(void) {}
> +static inline void arch_reserve_crashkernel(void) {}
>  static inline int overlaps_crashkernel(unsigned long start, unsigned long size) { return 0; }
>  #endif
>  
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index e0059842a1c6..9ed9dde7d231 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -860,7 +860,7 @@ void __init early_init_devtree(void *params)
>*/
>   if (fadump_reserve_mem() == 0)
>  #endif
> - reserve_crashkernel();
> + arch_reserve_crashkernel();
>   early_reserve_mem();
>  
>   if (memory_limit > memblock_phys_mem_size())

Rest looks good to me.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.



Re: [PATCH RESEND v1 3/5] powerpc/kdump: preserve user-specified memory limit

2025-01-08 Thread Mahesh J Salgaonkar
On 2025-01-08 15:44:56 Wed, Sourabh Jain wrote:
> Commit 59d58189f3d9 ("crash: fix crash memory reserve exceed system
> memory bug") fails crashkernel parsing if the crash size is found to be
> higher than system RAM, which makes the memory_limit adjustment code
> ineffective due to an early exit from reserve_crashkernel().
> 
> Regardless, let's not violate the user-specified memory limit by
> adjusting it. Remove this adjustment to ensure all reservations stay
> within the limit. Commit f94f5ac07983 ("powerpc/fadump: Don't update
> the user-specified memory limit") did the same for fadump.

Agreed.

Reviewed-by: Mahesh Salgaonkar 

Thanks,
-Mahesh.



Re: [PATCH v2] perf: Fix display of kernel symbols

2025-01-08 Thread Arnaldo Carvalho de Melo
On Wed, Jan 08, 2025 at 10:54:20AM +0100, Christophe Leroy wrote:
> Since commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily
> sorted array for addresses"), perf no longer displays kernel
> symbols on powerpc, although it still detects them as kernel addresses.
> 
>   # Overhead  Command Shared Object  Symbol
>   #   ..  . 
> ..
>   #
>   80.49%  Coeur main  [unknown]  [k] 0xc005f0f8
>3.91%  Coeur main  gau[.] engine_loop.constprop.0.isra.0
>1.72%  Coeur main  [unknown]  [k] 0xc005f11c
>1.09%  Coeur main  [unknown]  [k] 0xc01f82c8
>0.44%  Coeur main  libc.so.6  [.] epoll_wait
>0.38%  Coeur main  [unknown]  [k] 0xc0011718
>0.36%  Coeur main  [unknown]  [k] 0xc01f45c0
> 
> This is because function maps__find_next_entry() now returns current
> entry instead of next entry, leading to kernel map end address
> getting mis-configured with its own start address instead of the
> start address of the following map.
> 
> Fix it by really taking the next entry, and also make sure that entry
> follows the current one by making sure the entries are sorted.
> 
> Fixes: 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array for addresses")
> Signed-off-by: Christophe Leroy 
> Reviewed-by: Arnaldo Carvalho de Melo 
> ---
> v2: Make sure the entries are sorted, if not sort them.

Since you have changed what I reviewed I'll have to re-review :-) Will
try to do it after some calls.

- Arnaldo

> ---
>  tools/perf/util/maps.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/maps.c b/tools/perf/util/maps.c
> index 432399cbe5dd..09c9cc326c08 100644
> --- a/tools/perf/util/maps.c
> +++ b/tools/perf/util/maps.c
> @@ -1136,8 +1136,13 @@ struct map *maps__find_next_entry(struct maps *maps, struct map *map)
>   struct map *result = NULL;
>  
>   down_read(maps__lock(maps));
> + while (!maps__maps_by_address_sorted(maps)) {
> + up_read(maps__lock(maps));
> + maps__sort_by_address(maps);
> + down_read(maps__lock(maps));
> + }
>   i = maps__by_address_index(maps, map);
> - if (i < maps__nr_maps(maps))
> + if (++i < maps__nr_maps(maps))
>   result = map__get(maps__maps_by_address(maps)[i]);
>  
>   up_read(maps__lock(maps));
> -- 
> 2.47.0



[PATCH] powerpc/32: Stop printing Kernel virtual memory layout

2025-01-08 Thread Christophe Leroy
Printing of the kernel virtual memory layout was added for debugging
purposes by commit f637a49e507c ("powerpc: Minor cleanups of kernel virt
address space definitions").

For security reasons, don't display the kernel's virtual memory layout.

Other architectures have removed it through the following commits:

071929dbdd86 ("arm64: Stop printing the virtual memory layout")
1c31d4e96b8c ("ARM: 8820/1: mm: Stop printing the virtual memory layout")
3182f798 ("m68k/mm: Stop printing the virtual memory layout")
fd8d0ca25631 ("parisc: Hide virtual kernel memory layout")
681ff0181bbf ("x86/mm/init/32: Stop printing the virtual memory layout")

Commit 681ff0181bbf ("x86/mm/init/32: Stop printing the virtual memory
layout") thought x86 was the last one, but in reality powerpc/32 still
had it.

So remove it now on powerpc/32 as well.

Cc: Arvind Sankar 
Cc: Kees Cook 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/mm/mem.c | 22 --
 1 file changed, 22 deletions(-)

diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index c7708c8fad29..34806c858e54 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -319,28 +319,6 @@ void __init mem_init(void)
per_cpu(next_tlbcam_idx, smp_processor_id()) =
(mfspr(SPRN_TLB1CFG) & TLBnCFG_N_ENTRY) - 1;
 #endif
-
-#ifdef CONFIG_PPC32
-   pr_info("Kernel virtual memory layout:\n");
-#ifdef CONFIG_KASAN
-   pr_info("  * 0x%08lx..0x%08lx  : kasan shadow mem\n",
-   KASAN_SHADOW_START, KASAN_SHADOW_END);
-#endif
-   pr_info("  * 0x%08lx..0x%08lx  : fixmap\n", FIXADDR_START, FIXADDR_TOP);
-#ifdef CONFIG_HIGHMEM
-   pr_info("  * 0x%08lx..0x%08lx  : highmem PTEs\n",
-   PKMAP_BASE, PKMAP_ADDR(LAST_PKMAP));
-#endif /* CONFIG_HIGHMEM */
-   if (ioremap_bot != IOREMAP_TOP)
-   pr_info("  * 0x%08lx..0x%08lx  : early ioremap\n",
-   ioremap_bot, IOREMAP_TOP);
-   pr_info("  * 0x%08lx..0x%08lx  : vmalloc & ioremap\n",
-   VMALLOC_START, VMALLOC_END);
-#ifdef MODULES_VADDR
-   pr_info("  * 0x%08lx..0x%08lx  : modules\n",
-   MODULES_VADDR, MODULES_END);
-#endif
-#endif /* CONFIG_PPC32 */
 }
 
 void free_initmem(void)
-- 
2.47.0




Re: [PATCH v2] perf: Fix display of kernel symbols

2025-01-08 Thread Arnaldo Carvalho de Melo
On Wed, Jan 08, 2025 at 06:06:03PM +0100, Christophe Leroy wrote:
> 
> 
> Le 08/01/2025 à 15:53, Arnaldo Carvalho de Melo a écrit :
> > On Wed, Jan 08, 2025 at 10:54:20AM +0100, Christophe Leroy wrote:
> > > Since commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily
> > > sorted array for addresses"), perf no longer displays kernel
> > > symbols on powerpc, although it still detects them as kernel addresses.
> > > 
> > >   # Overhead  Command Shared Object  Symbol
> > >   #   ..  . 
> > > ..
> > >   #
> > >   80.49%  Coeur main  [unknown]  [k] 0xc005f0f8
> > >3.91%  Coeur main  gau[.] engine_loop.constprop.0.isra.0
> > >1.72%  Coeur main  [unknown]  [k] 0xc005f11c
> > >1.09%  Coeur main  [unknown]  [k] 0xc01f82c8
> > >0.44%  Coeur main  libc.so.6  [.] epoll_wait
> > >0.38%  Coeur main  [unknown]  [k] 0xc0011718
> > >0.36%  Coeur main  [unknown]  [k] 0xc01f45c0
> > > 
> > > This is because function maps__find_next_entry() now returns current
> > > entry instead of next entry, leading to kernel map end address
> > > getting mis-configured with its own start address instead of the
> > > start address of the following map.
> > > 
> > > Fix it by really taking the next entry, and also make sure that entry
> > > follows the current one by making sure the entries are sorted.
> > > 
> > > Fixes: 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array for addresses")
> > > Signed-off-by: Christophe Leroy 
> > > Reviewed-by: Arnaldo Carvalho de Melo 
> > > ---
> > > v2: Make sure the entries are sorted, if not sort them.
> > 
> > Since you have changed what I reviewed I'll have to re-review :-) Will
> > try to do it after some calls.
> 
> Ah yes sorry, should have removed your Reviewed-by.
> 
> Based on Ian's feedback "Using the next entry in this way won't work if the
> entries aren't sorted", I added the following block in front of the initial
> change:
> 
> + while (!maps__maps_by_address_sorted(maps)) {
> + up_read(maps__lock(maps));
> + maps__sort_by_address(maps);
> + down_read(maps__lock(maps));
> + }

Its ok, I'll keep it now that I looked at it.

Thanks!

- Arnaldo



Re: [PATCH v5 00/25] fs/dax: Fix ZONE_DEVICE page reference counts

2025-01-08 Thread Alison Schofield
On Tue, Jan 07, 2025 at 02:42:16PM +1100, Alistair Popple wrote:
> Main updates since v4:
> 
>  - Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
>means smaps/pagemap may contain DAX pages.
> 
>  - Fixed rmap accounting of PUD mapped pages.
> 
>  - Minor code clean-ups.
> 
> Main updates since v3:
> 
>  - Rebased onto next-20241216.

Hi Alistair-

This set passes the ndctl/dax unit tests when applied to next-20241216

Tested-by: Alison Schofield 

-- snip





Re: [PATCH v5 01/25] fuse: Fix dax truncate/punch_hole fault path

2025-01-08 Thread Alistair Popple
On Wed, Jan 08, 2025 at 02:30:24PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > FS DAX requires file systems to call into the DAX layout prior to
> > unlinking inodes to ensure there is no ongoing DMA or other remote
> > access to the direct mapped page. The fuse file system implements
> > fuse_dax_break_layouts() to do this which includes a comment
> > indicating that passing dmap_end == 0 leads to unmapping of the whole
> > file.
> > 
> > However this is not true - passing dmap_end == 0 will not unmap
> > anything before dmap_start, and further more
> > dax_layout_busy_page_range() will not scan any of the range to see if
> > there maybe ongoing DMA access to the range.
> 
> It would be useful to clarify that this is bug was found by inspection
> and that there are no known end user reports of trouble but that the
> failure more would look like random fs corruption. The window is hard to
> hit because a block needs to be truncated, reallocated to
> a file, and written to before stale DMA could corrupt it. So that may
> contribute to the fact that fuse-dax users have not reported an issue
> since v5.10.
> 
> > Fix this by checking for dmap_end == 0 in fuse_dax_break_layouts() and
> > pass the entire file range to dax_layout_busy_page_range().
> 
> That's not what this patch does, maybe a rebase error that pushed the
> @dmap_end fixup after the call to dax_layout_busy_page_range?

Ha. Yep. I spotted this when doing the conversion to
dax_layout_busy_page_range() and then had to rebase it into a stand-alone patch
for easy review. Obviously I put the check in the wrong spot, although it ends
up in the right spot at the end of the series.

> However, I don't think this is quite the right fix, more below...

Yeah, I like your version better so will respin with that.

> > Signed-off-by: Alistair Popple 
> > Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path")
> > Cc: Vivek Goyal 
> > 
> > ---
> > 
> > I am not at all familiar with the fuse file system driver so I have no
> > idea if the comment is relevant or not and whether the documented
> > behaviour for dmap_end == 0 is ever relied upon. However this seemed
> > like the safest fix unless someone more familiar with fuse can confirm
> > that dmap_end == 0 is never used.
> 
> It is used in several places and has been wrong since day one. I believe
> the original commit simply misunderstood that
> dax_layout_busy_page_range() semantics are analogous to
> invalidate_inode_pages2_range() semantics in terms of what @start and
> @end mean.
> 
> You can add:
> 
> Co-developed-by: Dan Williams 
> Signed-off-by: Dan Williams 
> 
> ...if you end up doing a resend, or I will add it on applying to
> nvdimm.git if the rebase does not end up being too prickly.

Looks like I accidentally dropped a PPC fix so will do a respin for that. And
the kernel build bot was complaining about incorrect documentation so will fix
that while I'm at it.

> -- 8< --
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index c5d1feaa239c..455c4a16080b 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
>   0, 0, fuse_wait_dax_page(inode));
>  }
>  
> -/* dmap_end == 0 leads to unmapping of whole file */
>  int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
> u64 dmap_end)
>  {
> @@ -693,10 +692,6 @@ int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
>   ret = __fuse_dax_break_layouts(inode, &retry, dmap_start,
>  dmap_end);
>   } while (ret == 0 && retry);
> - if (!dmap_end) {
> - dmap_start = 0;
> - dmap_end = LLONG_MAX;
> - }
>  
>   return ret;
>  }
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 0b2f8567ca30..bc6c8936c529 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
>   if (FUSE_IS_DAX(inode) && is_truncate) {
>   filemap_invalidate_lock(mapping);
>   fault_blocked = true;
> - err = fuse_dax_break_layouts(inode, 0, 0);
> + err = fuse_dax_break_layouts(inode, 0, -1);
>   if (err) {
>   filemap_invalidate_unlock(mapping);
>   return err;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 082ee374f694..cef7a8f75821 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file)
>  
>   if (dax_truncate) {
>   filemap_invalidate_lock(inode->i_mapping);
> - err = fuse_dax_break_layouts(inode, 0, 0);
> + err = fuse_dax_break_layouts(inode, 0, -1);
>   if (err)
>   goto out_inode_unlock;
>   }
> @@ -2890,7 +2890,7 @@ static long fus

Re: [PATCH RESEND v1 4/5] powerpc/crash: use generic crashkernel reservation

2025-01-08 Thread Sourabh Jain

Hello Mahesh,

On 08/01/25 22:35, Mahesh J Salgaonkar wrote:

On 2025-01-08 15:44:57 Wed, Sourabh Jain wrote:

Commit 0ab97169aa05 ("crash_core: add generic function to do
reservation") added a generic function to reserve crashkernel memory.
So let's use the same function on powerpc and remove the
architecture-specific code that essentially does the same thing.

The generic crashkernel reservation also provides a way to split the
crashkernel reservation into high and low memory reservations, which can
be enabled for powerpc in the future.

Along with moving to the generic crashkernel reservation, the code
related to finding the base address for the crashkernel has been
separated into its own function name get_crash_base() for better
readability and maintainability.

To prevent crashkernel memory from being added to iomem_resource, the
function arch_add_crash_res_to_iomem() has been introduced. For further
details on why this should not be done for the PowerPC architecture,
please refer to the previous commit titled "crash: let arch decide crash
memory export to iomem_resource".

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
  arch/powerpc/Kconfig |  3 +
  arch/powerpc/include/asm/crash_reserve.h | 18 +
  arch/powerpc/include/asm/kexec.h |  4 +-
  arch/powerpc/kernel/prom.c   |  2 +-
  arch/powerpc/kexec/core.c| 90 ++--
  5 files changed, 63 insertions(+), 54 deletions(-)
  create mode 100644 arch/powerpc/include/asm/crash_reserve.h

[...]

@@ -113,9 +113,9 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt, struct crash_mem
  
  #ifdef CONFIG_CRASH_RESERVE

  int __init overlaps_crashkernel(unsigned long start, unsigned long size);
-extern void reserve_crashkernel(void);
+extern void arch_reserve_crashkernel(void);

Do we really need to rename this? It is still called from the powerpc arch
and not from the common code.


You are right, we don't need to. However, all architectures (x86, RISC-V,
LoongArch, ARM64) that use the generic crashkernel reservation have named
their architecture-specific function `arch_reserve_crashkernel()`.

So I did the same for PowerPC; keeping the name consistent across
architectures makes the code easier to cross-reference.

Maybe I should justify the name change in the commit message to avoid
confusion.

Please let me know your opinion.




  #else
-static inline void reserve_crashkernel(void) {}
+static inline void arch_reserve_crashkernel(void) {}
  static inline int overlaps_crashkernel(unsigned long start, unsigned long 
size) { return 0; }
  #endif
  
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c

index e0059842a1c6..9ed9dde7d231 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -860,7 +860,7 @@ void __init early_init_devtree(void *params)
 */
if (fadump_reserve_mem() == 0)
  #endif
-   reserve_crashkernel();
+   arch_reserve_crashkernel();
early_reserve_mem();
  
  	if (memory_limit > memblock_phys_mem_size())

Rest looks good to me.

Reviewed-by: Mahesh Salgaonkar 


Thank you for reviewing this.

- Sourabh Jain



Re: [PATCH] kexec: Initialize ELF lowest address to ULONG_MAX

2025-01-08 Thread Sourabh Jain

Hello Baoquan and Eric,


On 12/12/24 08:25, Baoquan he wrote:

On 12/10/24 at 02:43pm, Sourabh Jain wrote:

kexec_elf_load() loads an ELF executable and sets the address of the
lowest PT_LOAD section to the address held by the lowest_load_addr
function argument.

To determine the lowest PT_LOAD address, a local variable lowest_addr
(type unsigned long) is initialized to UINT_MAX. After loading each
PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD
address is lower, lowest_addr is updated. However, setting lowest_addr
to UINT_MAX won't work when the kernel image is loaded above 4G, as the
returned lowest PT_LOAD address would be invalid. This is resolved by
initializing lowest_addr to ULONG_MAX instead.

This issue was discovered while implementing crashkernel high/low
reservation on the PowerPC architecture.

Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()")
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
  kernel/kexec_elf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index d3689632e8b9..3a5c25b2adc9 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -390,7 +390,7 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
*ehdr,
 struct kexec_buf *kbuf,
 unsigned long *lowest_load_addr)
  {
-   unsigned long lowest_addr = UINT_MAX;
+   unsigned long lowest_addr = ULONG_MAX;

Great catch.

Acked-by: Baoquan He 
Thank you for the Ack! The upcoming two patch series, which aim to
enable generic crashkernel reservation, depend on this fix. One of them
is already posted for upstream review:
https://lore.kernel.org/all/20250108101458.406806-1-sourabhj...@linux.ibm.com/

I was wondering if you could guide us on how to get this fix pushed to
the mainline tree.
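To illustrate the bug being fixed, a standalone model (not kernel code) of
the min-accumulation in kexec_elf_load(): seeded with UINT_MAX, a PT_LOAD
segment loaded above 4 GiB never updates the accumulator, so the bogus
32-bit sentinel is returned as the "lowest" address.

```c
#include <assert.h>
#include <limits.h>

typedef unsigned long long u64;	/* stands in for a 64-bit unsigned long */

/* Minimal model of the lowest-PT_LOAD-address scan in kexec_elf_load(). */
static u64 lowest_load_addr(u64 seed, const u64 *addrs, int n)
{
	u64 lowest = seed;

	for (int i = 0; i < n; i++)
		if (addrs[i] < lowest)
			lowest = addrs[i];
	return lowest;
}
```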


Thanks,
Sourabh Jain



Re: [PATCH v5 03/25] fs/dax: Don't skip locked entries when scanning entries

2025-01-08 Thread Alistair Popple
On Wed, Jan 08, 2025 at 02:50:36PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > Several functions internal to FS DAX use the following pattern when
> > trying to obtain an unlocked entry:
> > 
> > xas_for_each(&xas, entry, end_idx) {
> > if (dax_is_locked(entry))
> > entry = get_unlocked_entry(&xas, 0);
> > 
> > This is problematic because get_unlocked_entry() will get the next
> > present entry in the range, and the next entry may not be
> > locked. Therefore any processing of the original locked entry will be
> > skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy
> > pages in the range, leading file systems to free blocks whilst DMA
> > operations are ongoing which can lead to file system corruption.
> > 
> > Instead callers from within a xas_for_each() loop should be waiting
> > for the current entry to be unlocked without advancing the XArray
> > state so a new function is introduced to wait.
> 
> Oh wow, good eye!
> 
> Did this trip up an xfstest, or did you see this purely by inspection?

Oh this was a "fun" one to track down :-)

The other half of the story is in "fs/dax: Always remove DAX page-cache entries
when breaking layouts".

With just that patch applied xfstest triggered the new WARN_ON_ONCE in
truncate_folio_batch_exceptionals(). That made no sense, because that patch
makes breaking layouts also remove the DAX page-cache entries. Therefore no DAX
page-cache entries should be found in truncate_folio_batch_exceptionals() which
is now more of a sanity check.

However due to the bug addressed by this patch DAX page-cache entries which
should have been deleted as part of breaking layouts were being observed in
truncate_folio_batch_exceptionals().

Prior to this series nothing would have noticed these being skipped because
dax_delete_mapping_entry() doesn't check if the page is DMA idle. I believe
this could lead to filesystem corruption if the locked entry was DMA-busy,
because the filesystem would assume the page was DMA-idle and therefore
that the underlying block was free to be reallocated.

However writing a test to actually prove this is tricky, and I didn't get time
to do so.
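The failure mode reduces to a toy model (hypothetical helpers, not the real
XArray API): a "get next unlocked entry" call advances the iterator, so the
locked slot the loop is currently positioned on is never revisited, whereas
the fix waits on that slot in place.

```c
#include <assert.h>

#define NENTRIES 4

/* Toy model: locked[i] != 0 means entry i is locked, e.g. while it is
 * undergoing DMA. */

/* Models get_unlocked_entry()/get_next_unlocked_entry(): advances past
 * the locked entry to the next unlocked one. */
static int get_next_unlocked(const int *locked, int i)
{
	while (i < NENTRIES && locked[i])
		i++;
	return i;
}

/* Models the fixed behaviour: wait in place for entry i itself to be
 * unlocked, without advancing the iterator. */
static int wait_unlocked_in_place(int *locked, int i)
{
	locked[i] = 0;	/* pretend the lock holder eventually unlocks */
	return i;
}
```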

> > Also while we are here rename get_unlocked_entry() to
> > get_next_unlocked_entry() to make it clear that it may advance the
> > iterator state.
> 
> Outside of the above clarification of how found / end user effect you
> can add:
> 
> Reviewed-by: Dan Williams 



[PATCH] spi: fsl-spi: Remove display of virtual address

2025-01-08 Thread Christophe Leroy
The following appears in kernel log at boot:

fsl_spi b01004c0.spi: at 0x(ptrval) (irq = 51), QE mode

This is useless, so remove the display of that virtual address and
display the MMIO address instead, just like serial core does.

Signed-off-by: Christophe Leroy 
---
 drivers/spi/spi-fsl-spi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/spi/spi-fsl-spi.c b/drivers/spi/spi-fsl-spi.c
index 856a4a9def66..2f2082652a1a 100644
--- a/drivers/spi/spi-fsl-spi.c
+++ b/drivers/spi/spi-fsl-spi.c
@@ -618,7 +618,7 @@ static struct spi_controller *fsl_spi_probe(struct device 
*dev,
if (ret < 0)
goto err_probe;
 
-   dev_info(dev, "at 0x%p (irq = %d), %s mode\n", reg_base,
+   dev_info(dev, "at MMIO %pa (irq = %d), %s mode\n", &mem->start,
 mpc8xxx_spi->irq, mpc8xxx_spi_strmode(mpc8xxx_spi->flags));
 
return host;
-- 
2.47.0




[PATCH] powerpc/ipic: Stop printing address of registers

2025-01-08 Thread Christophe Leroy
The following line appears at boot:

IPIC (128 IRQ sources) at (ptrval)

This is pointless, so remove the printing of the virtual address and
replace it with the matching physical address.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/sysdev/ipic.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/ipic.c b/arch/powerpc/sysdev/ipic.c
index 5f69e2d50f26..037b04bf9a9f 100644
--- a/arch/powerpc/sysdev/ipic.c
+++ b/arch/powerpc/sysdev/ipic.c
@@ -762,8 +762,7 @@ struct ipic * __init ipic_init(struct device_node *node, 
unsigned int flags)
ipic_write(ipic->regs, IPIC_SIMSR_H, 0);
ipic_write(ipic->regs, IPIC_SIMSR_L, 0);
 
-   printk ("IPIC (%d IRQ sources) at %p\n", NR_IPIC_INTS,
-   primary_ipic->regs);
+   pr_info("IPIC (%d IRQ sources) at MMIO %pa\n", NR_IPIC_INTS, 
&res.start);
 
return ipic;
 }
-- 
2.47.0




Re: [PATCH v5 05/25] fs/dax: Create a common implementation to break DAX layouts

2025-01-08 Thread Alistair Popple
On Wed, Jan 08, 2025 at 04:14:20PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > Prior to freeing a block file systems supporting FS DAX must check
> > that the associated pages are both unmapped from user-space and not
> > undergoing DMA or other access from eg. get_user_pages(). This is
> > achieved by unmapping the file range and scanning the FS DAX
> > page-cache to see if any pages within the mapping have an elevated
> > refcount.
> > 
> > This is done using two functions - dax_layout_busy_page_range() which
> > returns a page to wait for the refcount to become idle on. Rather than
> > open-code this introduce a common implementation to both unmap and
> > wait for the page to become idle.
> > 
> > Signed-off-by: Alistair Popple 
> > 
> > ---
> > 
> > Changes for v5:
> > 
> >  - Don't wait for idle pages on non-DAX mappings
> > 
> > Changes for v4:
> > 
> >  - Fixed some build breakage due to missing symbol exports reported by
> >John Hubbard (thanks!).
> > ---
> >  fs/dax.c| 33 +
> >  fs/ext4/inode.c | 10 +-
> >  fs/fuse/dax.c   | 29 +
> >  fs/xfs/xfs_inode.c  | 23 +--
> >  fs/xfs/xfs_inode.h  |  2 +-
> >  include/linux/dax.h | 21 +
> >  mm/madvise.c|  8 
> >  7 files changed, 70 insertions(+), 56 deletions(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d010c10..9c3bd07 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space 
> > *mapping, pgoff_t index)
> > return ret;
> >  }
> >  
> > +static int wait_page_idle(struct page *page,
> > +   void (cb)(struct inode *),
> > +   struct inode *inode)
> > +{
> > +   return ___wait_var_event(page, page_ref_count(page) == 1,
> > +   TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > +}
> > +
> > +/*
> > + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > + * DAX mapping entries for the range.
> > + */
> > +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > +   void (cb)(struct inode *))
> > +{
> > +   struct page *page;
> > +   int error;
> > +
> > +   if (!dax_mapping(inode->i_mapping))
> > +   return 0;
> > +
> > +   do {
> > +   page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> > +   if (!page)
> > +   break;
> > +
> > +   error = wait_page_idle(page, cb, inode);
> 
> This implementations removes logic around @retry found in the XFS and
> FUSE implementations, I think that is a mistake, and EXT4 has
> apparently been broken in this regard.

I think both implementations are equivalent though, just that the XFS/FUSE
ones are spread across two functions with the retry happening in the outer
function whilst the EXT4 implementation is a single function with a do/while
loop.

Both exit early if dax_layout_busy_page() doesn't find a DMA-busy page, and
both call dax_layout_busy_page() a second time after waiting on a page to become
idle. So I don't think anything is broken here, unless I've missed something.
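Stripped of all locking and filesystem detail, the two shapes can be
modelled side by side (hypothetical stand-ins, a busy-page counter instead
of real pages) and shown to perform the same scan/wait rounds:

```c
#include <assert.h>

static int busy_pages;	/* stand-in for DMA-busy pages in the mapping */

static int has_busy_page(void) { return busy_pages > 0; }	/* scan */
static int wait_one_idle(void) { busy_pages--; return 0; }	/* wait */

/* EXT4 shape: one function with a do/while loop. */
static int break_layouts_do_while(void)
{
	int error = 0, retry;

	do {
		retry = 0;
		if (has_busy_page()) {
			error = wait_one_idle();
			retry = 1;
		}
	} while (error == 0 && retry);
	return error;
}

/* XFS/FUSE shape: inner helper reports retry, outer loop re-calls it. */
static int __break_layouts(int *retry)
{
	*retry = 0;
	if (!has_busy_page())
		return 0;
	*retry = 1;
	return wait_one_idle();
}

static int break_layouts_two_funcs(void)
{
	int error, retry;

	do {
		error = __break_layouts(&retry);
	} while (error == 0 && retry);
	return error;
}
```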

> wait_page_idle() returns after @page is idle, but that does not mean
> @inode is DMA idle. After one found page from
> dax_layout_busy_page_range() is waited upon a new call to
> dax_break_mapping() needs to made to check if another DMA started, or if
> there were originally more pages active.
> 
> > +   } while (error == 0);
> > +
> > +   return error;
> 
> Surprised that the compiler does not warn about an uninitialized
> variable here?

So am I. Turns out this is built with -Wno-maybe-uninitialized, and it's not
certain 'error' is used uninitialized because we may bail out early if this
is not a dax_mapping. So it's only *maybe* used uninitialized, which isn't
warned about.



Re: [PATCH 4/6] powerpc/pseries: Add ibm,set-dynamic-indicator RTAS call support

2025-01-08 Thread Dan Carpenter
Hi Haren,

kernel test robot noticed the following build warnings:

[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Haren-Myneni/powerpc-pseries-Define-common-functions-for-RTAS-sequence-HCALLs/20250105-045010
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:
https://lore.kernel.org/r/20250104204652.388720-5-haren%40linux.ibm.com
patch subject: [PATCH 4/6] powerpc/pseries: Add ibm,set-dynamic-indicator RTAS 
call support
config: powerpc64-randconfig-r071-20250108 
(https://download.01.org/0day-ci/archive/20250109/202501090337.xkcgrblc-...@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 
096551537b2a747a3387726ca618ceeb3950e9bc)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Reported-by: Dan Carpenter 
| Closes: https://lore.kernel.org/r/202501090337.xkcgrblc-...@intel.com/

smatch warnings:
arch/powerpc/platforms/pseries/papr-indices.c:438 
papr_dynamic_indicator_ioc_set() warn: inconsistent returns 'global 
&rtas_ibm_set_dynamic_indicator_lock'.

vim +438 arch/powerpc/platforms/pseries/papr-indices.c

3f48afd07934e4 Haren Myneni 2025-01-04  397  static long 
papr_dynamic_indicator_ioc_set(struct papr_indices_io_block __user *ubuf)
3f48afd07934e4 Haren Myneni 2025-01-04  398  {
3f48afd07934e4 Haren Myneni 2025-01-04  399 struct papr_indices_io_block 
kbuf;
3f48afd07934e4 Haren Myneni 2025-01-04  400 struct rtas_work_area 
*work_area;
3f48afd07934e4 Haren Myneni 2025-01-04  401 s32 fwrc, token, ret;
3f48afd07934e4 Haren Myneni 2025-01-04  402  
3f48afd07934e4 Haren Myneni 2025-01-04  403 token = 
rtas_function_token(RTAS_FN_IBM_SET_DYNAMIC_INDICATOR);
3f48afd07934e4 Haren Myneni 2025-01-04  404 if (token == 
RTAS_UNKNOWN_SERVICE)
3f48afd07934e4 Haren Myneni 2025-01-04  405 return -ENOENT;
3f48afd07934e4 Haren Myneni 2025-01-04  406  
3f48afd07934e4 Haren Myneni 2025-01-04  407 
mutex_lock(&rtas_ibm_set_dynamic_indicator_lock);
3f48afd07934e4 Haren Myneni 2025-01-04  408 work_area = 
papr_dynamic_indice_buf_from_user(ubuf, &kbuf);
3f48afd07934e4 Haren Myneni 2025-01-04  409 if (IS_ERR(work_area))
3f48afd07934e4 Haren Myneni 2025-01-04  410 return 
PTR_ERR(work_area);

Call mutex_unlock(&rtas_ibm_set_dynamic_indicator_lock) before returning here.

3f48afd07934e4 Haren Myneni 2025-01-04  411  
3f48afd07934e4 Haren Myneni 2025-01-04  412 do {
3f48afd07934e4 Haren Myneni 2025-01-04  413 fwrc = rtas_call(token, 
3, 1, NULL,
3f48afd07934e4 Haren Myneni 2025-01-04  414 
kbuf.dynamic_param.token,
3f48afd07934e4 Haren Myneni 2025-01-04  415 
kbuf.dynamic_param.state,
3f48afd07934e4 Haren Myneni 2025-01-04  416 
rtas_work_area_phys(work_area));
3f48afd07934e4 Haren Myneni 2025-01-04  417 } while (rtas_busy_delay(fwrc));
3f48afd07934e4 Haren Myneni 2025-01-04  418  
3f48afd07934e4 Haren Myneni 2025-01-04  419 rtas_work_area_free(work_area);
3f48afd07934e4 Haren Myneni 2025-01-04  420 
mutex_unlock(&rtas_ibm_set_dynamic_indicator_lock);
3f48afd07934e4 Haren Myneni 2025-01-04  421  
3f48afd07934e4 Haren Myneni 2025-01-04  422 switch (fwrc) {
3f48afd07934e4 Haren Myneni 2025-01-04  423 case 
RTAS_IBM_DYNAMIC_INDICE_SUCCESS:
3f48afd07934e4 Haren Myneni 2025-01-04  424 ret = 0;
3f48afd07934e4 Haren Myneni 2025-01-04  425 break;
3f48afd07934e4 Haren Myneni 2025-01-04  426 case 
RTAS_IBM_DYNAMIC_INDICE_NO_INDICATOR:  /* No such indicator */
3f48afd07934e4 Haren Myneni 2025-01-04  427 ret = -EOPNOTSUPP;
3f48afd07934e4 Haren Myneni 2025-01-04  428 break;
3f48afd07934e4 Haren Myneni 2025-01-04  429 default:
3f48afd07934e4 Haren Myneni 2025-01-04  430 pr_err("unexpected 
ibm,set-dynamic-indicator result %d\n",
3f48afd07934e4 Haren Myneni 2025-01-04  431 fwrc);
3f48afd07934e4 Haren Myneni 2025-01-04  432 fallthrough;
3f48afd07934e4 Haren Myneni 2025-01-04  433 case 
RTAS_IBM_DYNAMIC_INDICE_HW_ERROR:  /* Hardware/platform error */
3f48afd07934e4 Haren Myneni 2025-01-04  434 ret = -EIO;
3f48afd07934e4 Haren Myneni 2025-01-04  435 break;
3f48afd07934e4 Haren Myneni 2025-01-04  436 }
3f48afd07934e4 Haren Myneni 2025-01-04  437  
3f48afd07934e4 Haren Myneni 2025-01-04 @438 return ret;
3f48afd07934e4 Haren Myneni 2025-01-04  439  }
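The fix smatch is asking for is the standard single-exit unlock pattern; a
toy sketch follows (the lock and the papr helper below are stand-ins, not
the real RTAS code):

```c
#include <assert.h>

static int lock_held;	/* toy stand-in for rtas_ibm_set_dynamic_indicator_lock */

static void lock(void)   { assert(!lock_held); lock_held = 1; }
static void unlock(void) { assert(lock_held);  lock_held = 0; }

static int get_work_area(int fail)
{
	/* stand-in for papr_dynamic_indice_buf_from_user() */
	return fail ? -1 : 0;
}

static long ioc_set(int fail)
{
	long ret;

	lock();
	if (get_work_area(fail) < 0) {
		ret = -14;		/* would be PTR_ERR(work_area) */
		goto out_unlock;	/* error path drops the lock too */
	}
	ret = 0;			/* ... RTAS call would go here ... */
out_unlock:
	unlock();
	return ret;
}
```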

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




Re: [PATCH 5/6] powerpc/pseries: Add ibm,get-dynamic-sensor-state RTAS call support

2025-01-08 Thread Dan Carpenter
Hi Haren,

kernel test robot noticed the following build warnings:

[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Haren-Myneni/powerpc-pseries-Define-common-functions-for-RTAS-sequence-HCALLs/20250105-045010
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:
https://lore.kernel.org/r/20250104204652.388720-6-haren%40linux.ibm.com
patch subject: [PATCH 5/6] powerpc/pseries: Add ibm,get-dynamic-sensor-state 
RTAS call support
config: powerpc64-randconfig-r071-20250108 
(https://download.01.org/0day-ci/archive/20250109/202501090552.uzefb4qu-...@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 
096551537b2a747a3387726ca618ceeb3950e9bc)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Reported-by: Dan Carpenter 
| Closes: https://lore.kernel.org/r/202501090552.uzefb4qu-...@intel.com/

New smatch warnings:
arch/powerpc/platforms/pseries/papr-indices.c:496 papr_dynamic_sensor_ioc_get() 
warn: inconsistent returns 'global &rtas_ibm_get_dynamic_sensor_state_lock'.

Old smatch warnings:
arch/powerpc/platforms/pseries/papr-indices.c:438 
papr_dynamic_indicator_ioc_set() warn: inconsistent returns 'global 
&rtas_ibm_set_dynamic_indicator_lock'.

vim +496 arch/powerpc/platforms/pseries/papr-indices.c

e44fb25ad9fa03 Haren Myneni 2025-01-04  452  static long 
papr_dynamic_sensor_ioc_get(struct papr_indices_io_block __user *ubuf)
e44fb25ad9fa03 Haren Myneni 2025-01-04  453  {
e44fb25ad9fa03 Haren Myneni 2025-01-04  454 struct papr_indices_io_block 
kbuf;
e44fb25ad9fa03 Haren Myneni 2025-01-04  455 struct rtas_work_area 
*work_area;
e44fb25ad9fa03 Haren Myneni 2025-01-04  456 s32 fwrc, token, ret;
e44fb25ad9fa03 Haren Myneni 2025-01-04  457 u32 rets;
e44fb25ad9fa03 Haren Myneni 2025-01-04  458  
e44fb25ad9fa03 Haren Myneni 2025-01-04  459 token = 
rtas_function_token(RTAS_FN_IBM_GET_DYNAMIC_SENSOR_STATE);
e44fb25ad9fa03 Haren Myneni 2025-01-04  460 if (token == 
RTAS_UNKNOWN_SERVICE)
e44fb25ad9fa03 Haren Myneni 2025-01-04  461 return -ENOENT;
e44fb25ad9fa03 Haren Myneni 2025-01-04  462  
e44fb25ad9fa03 Haren Myneni 2025-01-04  463 
mutex_lock(&rtas_ibm_get_dynamic_sensor_state_lock);
e44fb25ad9fa03 Haren Myneni 2025-01-04  464 work_area = 
papr_dynamic_indice_buf_from_user(ubuf, &kbuf);
e44fb25ad9fa03 Haren Myneni 2025-01-04  465 if (IS_ERR(work_area))
e44fb25ad9fa03 Haren Myneni 2025-01-04  466 return 
PTR_ERR(work_area);

Add an unlock, same as with the _set() function.

e44fb25ad9fa03 Haren Myneni 2025-01-04  467  
e44fb25ad9fa03 Haren Myneni 2025-01-04  468 do {
e44fb25ad9fa03 Haren Myneni 2025-01-04  469 fwrc = rtas_call(token, 
2, 2, &rets,
e44fb25ad9fa03 Haren Myneni 2025-01-04  470 
kbuf.dynamic_param.token,
e44fb25ad9fa03 Haren Myneni 2025-01-04  471 
rtas_work_area_phys(work_area));
e44fb25ad9fa03 Haren Myneni 2025-01-04  472 } while (rtas_busy_delay(fwrc));
e44fb25ad9fa03 Haren Myneni 2025-01-04  473  
e44fb25ad9fa03 Haren Myneni 2025-01-04  474 rtas_work_area_free(work_area);
e44fb25ad9fa03 Haren Myneni 2025-01-04  475 
mutex_unlock(&rtas_ibm_get_dynamic_sensor_state_lock);
e44fb25ad9fa03 Haren Myneni 2025-01-04  476  
e44fb25ad9fa03 Haren Myneni 2025-01-04  477 switch (fwrc) {
e44fb25ad9fa03 Haren Myneni 2025-01-04  478 case 
RTAS_IBM_DYNAMIC_INDICE_SUCCESS:
e44fb25ad9fa03 Haren Myneni 2025-01-04  479 if (put_user(rets, 
&ubuf->dynamic_param.state))
e44fb25ad9fa03 Haren Myneni 2025-01-04  480 ret = -EFAULT;
e44fb25ad9fa03 Haren Myneni 2025-01-04  481 else
e44fb25ad9fa03 Haren Myneni 2025-01-04  482 ret = 0;
e44fb25ad9fa03 Haren Myneni 2025-01-04  483 break;
e44fb25ad9fa03 Haren Myneni 2025-01-04  484 case 
RTAS_IBM_DYNAMIC_INDICE_NO_INDICATOR:  /* No such indicator */
e44fb25ad9fa03 Haren Myneni 2025-01-04  485 ret = -EOPNOTSUPP;
e44fb25ad9fa03 Haren Myneni 2025-01-04  486 break;
e44fb25ad9fa03 Haren Myneni 2025-01-04  487 default:
e44fb25ad9fa03 Haren Myneni 2025-01-04  488 pr_err("unexpected 
ibm,get-dynamic-sensor result %d\n",
e44fb25ad9fa03 Haren Myneni 2025-01-04  489 fwrc);
e44fb25ad9fa03 Haren Myneni 2025-01-04  490 fallthrough;
e44fb25ad9fa03 Haren Myneni 2025-01-04  491 case 
RTAS_IBM_DYNAMIC_INDICE_HW_ERROR:  /* Hardware/platform error */
e44fb25ad9fa03 Haren Myneni 2025-01-04  492 ret = -EIO;
e44fb25ad9fa03 Haren Myneni 2025-01-04  493 break;
e44fb25ad9fa03 Haren Myneni 2025-01-04  494 }
e44fb25ad9fa03 Haren Myneni 

Re: [PATCH v2 0/3] sysfs: constify bin_attribute argument of sysfs_bin_attr_simple_read()

2025-01-08 Thread Greg Kroah-Hartman
On Wed, Jan 08, 2025 at 09:45:45AM -0800, Alexei Starovoitov wrote:
> On Tue, Dec 31, 2024 at 2:30 AM Thomas Weißschuh  wrote:
> >
> > On 2024-12-30 16:50:41-0800, Alexei Starovoitov wrote:
> > > On Sat, Dec 28, 2024 at 12:43 AM Thomas Weißschuh  
> > > wrote:
> > > >
> > > > Most users use this function through the BIN_ATTR_SIMPLE* macros,
> > > > they can handle the switch transparently.
> > > >
> > > > This series is meant to be merged through the driver core tree.
> > >
> > > hmm. why?
> >
> > Patch 1 changes the signature of sysfs_bin_attr_simple_read().
> > Before patch 1 sysfs_bin_attr_simple_read() needs to be assigned to the
> > callback member .read, after patch 1 it's .read_new.
> > (Both callbacks work exactly the same, except for their signature,
> > .read_new is only a transition mechanism and will go away again)
> >
> > > I'd rather take patches 2 and 3 into bpf-next to avoid
> > > potential conflicts.
> > > Patch 1 looks orthogonal and independent.
> >
> > If you pick up 2 and 3 through bpf-next you would need to adapt these
> > assignments. As soon as both patch 1 and the modified 2 and 3 hit
> > Linus' tree, the build would break due to mismatches function pointers.
> > (Casting function pointers to avoid the mismatch will blow up with KCFI)
> 
> I see. All these steps to constify is frankly a mess.
> You're wasting cpu and memory for this read vs read_new
> when const is not much more than syntactic sugar in C.
> You should have done one tree wide patch without doing this _new() hack.
> 
> Anyway, rant over. Carry patches 2,3. Hopefully they won't conflict.
> But I don't want to see any constification patches in bpf land
> that come with such pointless runtime penalty.

The "pointless" penalty will go away once we convert all instances, and
really, it's just one pointer check, sysfs files should NOT be a hot
path for anything real, and one more pointer check should be cached and
not measurable compared to the real logic behind the binary data coming
from the hardware/kernel, right?

sysfs is NOT tuned for speed at all, so adding more checks like this
should be fine.

thanks,

greg k-h



Re: [PATCH] perf machine: Don't ignore _etext when not a text symbol

2025-01-08 Thread Christophe Leroy




Le 08/01/2025 à 21:14, Arnaldo Carvalho de Melo a écrit :

On Wed, Jan 08, 2025 at 10:15:24AM +0100, Christophe Leroy wrote:

Depending on how vmlinux.lds is written, _etext might be the very
first data symbol instead of the very last text symbol.

Don't require it to be a text symbol, accept any symbol type.


I'm adding a Link:

Link: https://lore.kernel.org/all/752a31b0-4370-4f52-b7cc-45f0078c1d6c@csgroup.eu

To give more context as where this has been observed, and also add a
snippet of your explanation there, this:


# grep -e _stext -e _etext -e _edata /proc/kallsyms
c000 T _stext
c08b8000 D _etext

So there is no _edata and _etext is not text


For the absence of _edata, I sent another patch; will you take it as
well?


https://lore.kernel.org/linux-perf-users/2fec8c50c271dff59f0177ff0884b6c374486ba5.1736327770.git.christophe.le...@csgroup.eu/T/#u

Thanks
Christophe




Re: [PATCH v3 5/6] s390/crash: Use note name macros

2025-01-08 Thread Akihiko Odaki

On 2025/01/08 22:50, Dave Martin wrote:

On Wed, Jan 08, 2025 at 01:53:51PM +0900, Akihiko Odaki wrote:

On 2025/01/08 1:17, Dave Martin wrote:

Hi,

On Tue, Jan 07, 2025 at 09:45:56PM +0900, Akihiko Odaki wrote:

Use note name macros to match with the userspace's expectation.

Signed-off-by: Akihiko Odaki 
---
   arch/s390/kernel/crash_dump.c | 62 
---
   1 file changed, 23 insertions(+), 39 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c


[...]



+#define NT_INIT(buf, type, desc) \
+   (nt_init_name((buf), NT_ ## type, &(desc), sizeof(desc), NN_ ## type))


[...]


(Note also, the outer parentheses and the parentheses around (buf)
appear redundant -- although harmless?)


They only make a difference in trivial corner cases and may look needlessly
verbose.


(In case there was a misunderstanding here, I meant that some
parentheses can be removed without affecting correctness:

#define NT_INIT(buf, type, desc) \
nt_init_name(buf, NT_ ## type, &(desc), sizeof(desc), NN_ ## type))

It still doesn't matter though -- and some people do prefer to be
defensive anyway and err on the side of having too many parentheses
rather than too few.)


Well, being very pedantic, there are some cases where these parentheses 
have some effect.


If you omit the outer parentheses, the following code will have 
different consequences:

a->NT_INIT(buf, PRSTATUS, desc)

The parentheses around buf will make difference for the following code:
#define COMMA ,
NT_INIT(NULL COMMA buf, PRSTATUS, desc)

But nobody will write such code.
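The second, contrived case can actually be compiled and run; one() and two()
below are made-up callees whose arities happen to match the two expansions:

```c
#include <assert.h>

static int one(int a)        { return a; }
static int two(int a, int b) { return a + b; }

#define COMMA ,
#define WITH_PARENS(x)    one((x))	/* argument parenthesised, as in NT_INIT */
#define WITHOUT_PARENS(x) two(x)	/* bare argument */
```
COMMA is not expanded while macro arguments are collected, so `1 COMMA 2`
arrives as a single argument; only the later rescan turns it into a real
comma, and the inner parentheses decide whether that comma becomes a comma
expression or an argument separator.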

Regards,
Akihiko Odaki



Re: [PATCH] kexec: Initialize ELF lowest address to ULONG_MAX

2025-01-08 Thread Andrew Morton
On Thu, 9 Jan 2025 09:42:14 +0530 Sourabh Jain  
wrote:

> Hello Baoquan and Eric,
> 
> 
> On 12/12/24 08:25, Baoquan he wrote:
> > On 12/10/24 at 02:43pm, Sourabh Jain wrote:
> >> kexec_elf_load() loads an ELF executable and sets the address of the
> >> lowest PT_LOAD section to the address held by the lowest_load_addr
> >> function argument.
> >>
> >> To determine the lowest PT_LOAD address, a local variable lowest_addr
> >> (type unsigned long) is initialized to UINT_MAX. After loading each
> >> PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD
> >> address is lower, lowest_addr is updated. However, setting lowest_addr
> >> to UINT_MAX won't work when the kernel image is loaded above 4G, as the
> >> returned lowest PT_LOAD address would be invalid. This is resolved by
> >> initializing lowest_addr to ULONG_MAX instead.
> >>
> >> This issue was discovered while implementing crashkernel high/low
> >> reservation on the PowerPC architecture.
> >>
> >> Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()")
> >> Cc: Baoquan he 
> >> Cc: Hari Bathini 
> >> CC: Madhavan Srinivasan 
> >> Cc: Michael Ellerman 
> >> Cc: ke...@lists.infradead.org
> >> Cc: linuxppc-dev@lists.ozlabs.org
> >> Cc: linux-ker...@vger.kernel.org
> >> Signed-off-by: Sourabh Jain 
> >> ---
> >>   kernel/kexec_elf.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
> >> index d3689632e8b9..3a5c25b2adc9 100644
> >> --- a/kernel/kexec_elf.c
> >> +++ b/kernel/kexec_elf.c
> >> @@ -390,7 +390,7 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
> >> *ehdr,
> >> struct kexec_buf *kbuf,
> >> unsigned long *lowest_load_addr)
> >>   {
> >> -  unsigned long lowest_addr = UINT_MAX;
> >> +  unsigned long lowest_addr = ULONG_MAX;
> > Great catch.
> >
> > Acked-by: Baoquan He 
> Thank you for the Ack! The upcoming two patch series, which aim to 
> enable generic crashkernel reservation, depends on this fix. One of them 
> is already posted for upstream review: 
> https://lore.kernel.org/all/20250108101458.406806-1-sourabhj...@linux.ibm.com/
>  
> I was wondering if you could guide us on how to get this fix pushed to 
> the mainline tree.

Please include this patch (with Baoquan's ack) in whichever tree
contains the powerpc patches which depend upon it.



Re: [PATCH] kexec: Initialize ELF lowest address to ULONG_MAX

2025-01-08 Thread Sourabh Jain

Hello Andrew,

On 09/01/25 10:58, Andrew Morton wrote:

On Thu, 9 Jan 2025 09:42:14 +0530 Sourabh Jain  
wrote:


Hello Baoquan and Eric,


On 12/12/24 08:25, Baoquan he wrote:

On 12/10/24 at 02:43pm, Sourabh Jain wrote:

kexec_elf_load() loads an ELF executable and sets the address of the
lowest PT_LOAD section to the address held by the lowest_load_addr
function argument.

To determine the lowest PT_LOAD address, a local variable lowest_addr
(type unsigned long) is initialized to UINT_MAX. After loading each
PT_LOAD, its address is compared to lowest_addr. If a loaded PT_LOAD
address is lower, lowest_addr is updated. However, setting lowest_addr
to UINT_MAX won't work when the kernel image is loaded above 4G, as the
returned lowest PT_LOAD address would be invalid. This is resolved by
initializing lowest_addr to ULONG_MAX instead.

This issue was discovered while implementing crashkernel high/low
reservation on the PowerPC architecture.

Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()")
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
   kernel/kexec_elf.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kexec_elf.c b/kernel/kexec_elf.c
index d3689632e8b9..3a5c25b2adc9 100644
--- a/kernel/kexec_elf.c
+++ b/kernel/kexec_elf.c
@@ -390,7 +390,7 @@ int kexec_elf_load(struct kimage *image, struct elfhdr 
*ehdr,
 struct kexec_buf *kbuf,
 unsigned long *lowest_load_addr)
   {
-   unsigned long lowest_addr = UINT_MAX;
+   unsigned long lowest_addr = ULONG_MAX;

Great catch.

Acked-by: Baoquan He 

Thank you for the Ack! The upcoming two patch series, which aim to
enable generic crashkernel reservation, depends on this fix. One of them
is already posted for upstream review:
https://lore.kernel.org/all/20250108101458.406806-1-sourabhj...@linux.ibm.com/
I was wondering if you could guide us on how to get this fix pushed to
the mainline tree.

Please include this patch (with Baoquan's ack) in whichever tree
contains the powerpc patches which depend upon it.


Sure, I will include this patch in the respective patch series.

Thanks,
Sourabh Jain



Re: [PATCH RESEND v1 1/5] crash: remove an unused argument from reserve_crashkernel_generic()

2025-01-08 Thread Sourabh Jain

Hello Baoquan,


On 08/01/25 16:46, Baoquan he wrote:

On 01/08/25 at 03:44pm, Sourabh Jain wrote:

cmdline argument is not used in reserve_crashkernel_generic() so remove
it. Correspondingly, all the callers have been updated as well.

No functional change intended.

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
  arch/arm64/mm/init.c  |  6 ++
  arch/loongarch/kernel/setup.c |  5 ++---
  arch/riscv/mm/init.c  |  6 ++
  arch/x86/kernel/setup.c   |  6 ++
  include/linux/crash_reserve.h | 11 +--
  kernel/crash_reserve.c|  9 -
  6 files changed, 17 insertions(+), 26 deletions(-)

Acked-by: Baoquan He 

Thanks for the Ack!

- Sourabh Jain



Re: [PATCH v2] perf: Fix display of kernel symbols

2025-01-08 Thread Christophe Leroy




On 08/01/2025 at 15:53, Arnaldo Carvalho de Melo wrote:

On Wed, Jan 08, 2025 at 10:54:20AM +0100, Christophe Leroy wrote:

Since commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily
sorted array for addresses"), perf no longer displays kernel
symbols on powerpc, although it still detects them as kernel addresses.

# Overhead  Command Shared Object  Symbol
#   ..  . 
..
#
80.49%  Coeur main  [unknown]  [k] 0xc005f0f8
 3.91%  Coeur main  gau[.] 
engine_loop.constprop.0.isra.0
 1.72%  Coeur main  [unknown]  [k] 0xc005f11c
 1.09%  Coeur main  [unknown]  [k] 0xc01f82c8
 0.44%  Coeur main  libc.so.6  [.] epoll_wait
 0.38%  Coeur main  [unknown]  [k] 0xc0011718
 0.36%  Coeur main  [unknown]  [k] 0xc01f45c0

This is because function maps__find_next_entry() now returns current
entry instead of next entry, leading to kernel map end address
getting mis-configured with its own start address instead of the
start address of the following map.

Fix it by really taking the next entry; also make sure that entry
follows the current one by ensuring the entries are sorted.

Fixes: 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array for 
addresses")
Signed-off-by: Christophe Leroy 
Reviewed-by: Arnaldo Carvalho de Melo 
---
v2: Make sure the entries are sorted, if not sort them.


Since you have changed what I reviewed I'll have to re-review :-) Will
try to do it after some calls.


Ah yes sorry, should have removed your Reviewed-by.

Based on Ian's feedback "Using the next entry in this way won't work if 
the entries aren't sorted", I added the following block in front of the 
initial change:


+   while (!maps__maps_by_address_sorted(maps)) {
+   up_read(maps__lock(maps));
+   maps__sort_by_address(maps);
+   down_read(maps__lock(maps));
+   }

Christophe




Re: Raptor Engineering dedicating resources to KVM on PowerNV + KVM CI/CD

2025-01-08 Thread Shawn Anastasio
Hi Alex,

On 1/7/25 5:45 AM, Alex Williamson wrote:
> Hi,
> 
> What are you supposing the value to the community is for a CI pipeline
> that always fails?  Are you hoping the community will address the
> failing tests or monitor the failures to try to make them not become
> worse?

The failing tests are all isolated to issues with the specific AMD
graphics hardware that the test machine is using for the VFIO and host
GPU tests, and are likely isolated to the amdgpu driver itself. We have
filed bugs with amdgpu folks.

The non-failing tests, however, provide value for regression monitoring
including VM boot smoke tests for both little endian and big endian
ppc64/pseries targets, as well as the vfio-*-attach tests that ensure
hardware can be successfully bound to the vfio-pci driver on a PowerNV
host. The test artifacts also include full dmesg output from the host
and guest machine (when applicable) to assist with debugging.

The data could definitely be presented in an easier to digest way to
make it more obvious which failures are regressions and which are due to
the aforementioned amdgpu issues, so that's an area for improvement.

> 
> I would imagine that CI against key developer branches or linux-next
> would be more useful than finding problems after we've merged with
> mainline, but it's not clear there's any useful baseline here to
> monitor for regressions.  Thanks,
>

That's a good point -- I'll definitely look into adding at least
linux-next, as well as any other branch requests from developers.

> Alex

Thanks,
Shawn



Re: [PATCH v2] perf: Fix display of kernel symbols

2025-01-08 Thread Ian Rogers
On Wed, Jan 8, 2025 at 1:54 AM Christophe Leroy
 wrote:
>
> Since commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily
> sorted array for addresses"), perf no longer displays kernel
> symbols on powerpc, although it still detects them as kernel addresses.
>
> # Overhead  Command Shared Object  Symbol
> #   ..  . 
> ..
> #
> 80.49%  Coeur main  [unknown]  [k] 0xc005f0f8
>  3.91%  Coeur main  gau[.] 
> engine_loop.constprop.0.isra.0
>  1.72%  Coeur main  [unknown]  [k] 0xc005f11c
>  1.09%  Coeur main  [unknown]  [k] 0xc01f82c8
>  0.44%  Coeur main  libc.so.6  [.] epoll_wait
>  0.38%  Coeur main  [unknown]  [k] 0xc0011718
>  0.36%  Coeur main  [unknown]  [k] 0xc01f45c0
>
> This is because function maps__find_next_entry() now returns current
> entry instead of next entry, leading to kernel map end address
> getting mis-configured with its own start address instead of the
> start address of the following map.
>
> Fix it by really taking the next entry; also make sure that entry
> follows the current one by ensuring the entries are sorted.
>
> Fixes: 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array 
> for addresses")
> Signed-off-by: Christophe Leroy 
> Reviewed-by: Arnaldo Carvalho de Melo 

Reviewed-by: Ian Rogers 

Thanks!
Ian

> ---
> v2: Make sure the entries are sorted, if not sort them.
> ---
>  tools/perf/util/maps.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/tools/perf/util/maps.c b/tools/perf/util/maps.c
> index 432399cbe5dd..09c9cc326c08 100644
> --- a/tools/perf/util/maps.c
> +++ b/tools/perf/util/maps.c
> @@ -1136,8 +1136,13 @@ struct map *maps__find_next_entry(struct maps *maps, 
> struct map *map)
> struct map *result = NULL;
>
> down_read(maps__lock(maps));
> +   while (!maps__maps_by_address_sorted(maps)) {
> +   up_read(maps__lock(maps));
> +   maps__sort_by_address(maps);
> +   down_read(maps__lock(maps));
> +   }
> i = maps__by_address_index(maps, map);
> -   if (i < maps__nr_maps(maps))
> +   if (++i < maps__nr_maps(maps))
> result = map__get(maps__maps_by_address(maps)[i]);
>
> up_read(maps__lock(maps));
> --
> 2.47.0
>



Re: [PATCH v5 07/17] mm: pgtable: introduce pagetable_dtor()

2025-01-08 Thread Alexander Gordeev
On Wed, Jan 08, 2025 at 02:57:23PM +0800, Qi Zheng wrote:
> The pagetable_p*_dtor() are exactly the same except for the handling of
> ptlock. If we make ptlock_free() handle the case where ptdesc->ptl is
> NULL and remove VM_BUG_ON_PAGE() from pmd_ptlock_free(), we can unify
> pagetable_p*_dtor() into one function. Let's introduce pagetable_dtor()
> to do this.
> 
> Later, pagetable_dtor() will be moved to tlb_remove_ptdesc(), so that
> ptlock and page table pages can be freed together (regardless of whether
> RCU is used). This prevents the use-after-free problem where the ptlock
> is freed immediately but the page table pages are freed later via RCU.
> 
> Signed-off-by: Qi Zheng 
> Originally-by: Peter Zijlstra (Intel) 
> Reviewed-by: Kevin Brodsky 
> ---
...
>  arch/s390/include/asm/pgalloc.h|  6 +--
>  arch/s390/include/asm/tlb.h|  6 +--
>  arch/s390/mm/pgalloc.c |  2 +-
...
>  include/asm-generic/pgalloc.h  |  8 ++--
>  include/linux/mm.h | 52 --
>  mm/memory.c|  3 +-
>  28 files changed, 62 insertions(+), 95 deletions(-)
...
For s390:

Acked-by: Alexander Gordeev 

Thanks!



Re: [PATCH v5 13/17] s390: pgtable: consolidate PxD and PTE TLB free paths

2025-01-08 Thread Alexander Gordeev
On Wed, Jan 08, 2025 at 02:57:29PM +0800, Qi Zheng wrote:
> Call pagetable_dtor() for PMD|PUD|P4D tables just before ptdesc is
> freed - same as it is done for PTE tables. That allows consolidating
> TLB free paths for all table types.
> 
> Signed-off-by: Qi Zheng 
> Suggested-by: Peter Zijlstra (Intel) 
> Reviewed-by: Kevin Brodsky 
> Cc: linux-s...@vger.kernel.org
> ---
>  arch/s390/include/asm/tlb.h |  3 ---
>  arch/s390/mm/pgalloc.c  | 14 --
>  2 files changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index dde847a5be545..d5b27a2445c96 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -102,7 +102,6 @@ static inline void pmd_free_tlb(struct mmu_gather *tlb, 
> pmd_t *pmd,
>  {
>   if (mm_pmd_folded(tlb->mm))
>   return;
> - pagetable_dtor(virt_to_ptdesc(pmd));
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);
>   tlb->mm->context.flush_mm = 1;
>   tlb->freed_tables = 1;
> @@ -122,7 +121,6 @@ static inline void p4d_free_tlb(struct mmu_gather *tlb, 
> p4d_t *p4d,
>  {
>   if (mm_p4d_folded(tlb->mm))
>   return;
> - pagetable_dtor(virt_to_ptdesc(p4d));
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);
>   tlb->mm->context.flush_mm = 1;
>   tlb->freed_tables = 1;
> @@ -141,7 +139,6 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, 
> pud_t *pud,
>  {
>   if (mm_pud_folded(tlb->mm))
>   return;
> - pagetable_dtor(virt_to_ptdesc(pud));
>   tlb->mm->context.flush_mm = 1;
>   tlb->freed_tables = 1;
>   tlb->cleared_p4ds = 1;
> diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
> index 569de24d33761..c73b89811a264 100644
> --- a/arch/s390/mm/pgalloc.c
> +++ b/arch/s390/mm/pgalloc.c
> @@ -180,7 +180,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
>   return table;
>  }
>  
> -static void pagetable_pte_dtor_free(struct ptdesc *ptdesc)
> +static void pagetable_dtor_free(struct ptdesc *ptdesc)
>  {
>   pagetable_dtor(ptdesc);
>   pagetable_free(ptdesc);
> @@ -190,20 +190,14 @@ void page_table_free(struct mm_struct *mm, unsigned 
> long *table)
>  {
>   struct ptdesc *ptdesc = virt_to_ptdesc(table);
>  
> - pagetable_pte_dtor_free(ptdesc);
> + pagetable_dtor_free(ptdesc);
>  }
>  
>  void __tlb_remove_table(void *table)
>  {
>   struct ptdesc *ptdesc = virt_to_ptdesc(table);
> - struct page *page = ptdesc_page(ptdesc);
>  
> - if (compound_order(page) == CRST_ALLOC_ORDER) {
> - /* pmd, pud, or p4d */
> - pagetable_free(ptdesc);
> - return;
> - }
> - pagetable_pte_dtor_free(ptdesc);
> + pagetable_dtor_free(ptdesc);
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -211,7 +205,7 @@ static void pte_free_now(struct rcu_head *head)
>  {
>   struct ptdesc *ptdesc = container_of(head, struct ptdesc, pt_rcu_head);
>  
> - pagetable_pte_dtor_free(ptdesc);
> + pagetable_dtor_free(ptdesc);
>  }
>  
>  void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)

Acked-by: Alexander Gordeev 

Thanks!



Re: [PATCH v8 1/7] PCI: Don't expose pcie_read_tlp_log() outside of PCI subsystem

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:41PM +0200, Ilpo Järvinen wrote:
> pcie_read_tlp_log() was exposed by the commit 0a5a46a6a61b ("PCI/AER:
> Generalize TLP Header Log reading") but this is now considered a
> mistake. No drivers outside of the PCI subsystem should build their own
> diagnostic logging but should rely on the PCI core doing it for them.
> 
> There's currently one driver (ixgbe) doing it independently, which was
> the initial reason why the export was added, but the PCI maintainer
> decided it's something that should be eliminated.
> 
> Remove the unwanted EXPORT of pcie_read_tlp_log() and remove it from
> include/linux/aer.h.
> 
> Link: https://lore.kernel.org/all/20240322193011.GA701027@bhelgaas/
> Signed-off-by: Ilpo Järvinen 
> Reviewed-by: Jonathan Cameron 

Reviewed-by: Yazen Ghannam 

Thanks,
Yazen



Re: [PATCH v5 06/17] s390: pgtable: add statistics for PUD and P4D level page table

2025-01-08 Thread Alexander Gordeev
On Wed, Jan 08, 2025 at 02:57:22PM +0800, Qi Zheng wrote:
> Like PMD and PTE level page table, also add statistics for PUD and P4D
> page table.
> 
> Signed-off-by: Qi Zheng 
> Suggested-by: Peter Zijlstra (Intel) 
> Reviewed-by: Kevin Brodsky 
> Cc: linux-s...@vger.kernel.org
> ---
>  arch/s390/include/asm/pgalloc.h | 29 +
>  arch/s390/include/asm/tlb.h |  2 ++
>  2 files changed, 23 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
> index 7b84ef6dc4b6d..a0c1ca5d8423c 100644
> --- a/arch/s390/include/asm/pgalloc.h
> +++ b/arch/s390/include/asm/pgalloc.h
> @@ -53,29 +53,42 @@ static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, 
> unsigned long address)
>  {
>   unsigned long *table = crst_table_alloc(mm);
>  
> - if (table)
> - crst_table_init(table, _REGION2_ENTRY_EMPTY);
> + if (!table)
> + return NULL;
> + crst_table_init(table, _REGION2_ENTRY_EMPTY);
> + pagetable_p4d_ctor(virt_to_ptdesc(table));
> +
>   return (p4d_t *) table;
>  }
>  
>  static inline void p4d_free(struct mm_struct *mm, p4d_t *p4d)
>  {
> - if (!mm_p4d_folded(mm))
> - crst_table_free(mm, (unsigned long *) p4d);
> + if (mm_p4d_folded(mm))
> + return;
> +
> + pagetable_p4d_dtor(virt_to_ptdesc(p4d));
> + crst_table_free(mm, (unsigned long *) p4d);
>  }
>  
>  static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long 
> address)
>  {
>   unsigned long *table = crst_table_alloc(mm);
> - if (table)
> - crst_table_init(table, _REGION3_ENTRY_EMPTY);
> +
> + if (!table)
> + return NULL;
> + crst_table_init(table, _REGION3_ENTRY_EMPTY);
> + pagetable_pud_ctor(virt_to_ptdesc(table));
> +
>   return (pud_t *) table;
>  }
>  
>  static inline void pud_free(struct mm_struct *mm, pud_t *pud)
>  {
> - if (!mm_pud_folded(mm))
> - crst_table_free(mm, (unsigned long *) pud);
> + if (mm_pud_folded(mm))
> + return;
> +
> + pagetable_pud_dtor(virt_to_ptdesc(pud));
> + crst_table_free(mm, (unsigned long *) pud);
>  }
>  
>  static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long 
> vmaddr)
> diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
> index e95b2c8081eb8..907d57a68145c 100644
> --- a/arch/s390/include/asm/tlb.h
> +++ b/arch/s390/include/asm/tlb.h
> @@ -122,6 +122,7 @@ static inline void p4d_free_tlb(struct mmu_gather *tlb, 
> p4d_t *p4d,
>  {
>   if (mm_p4d_folded(tlb->mm))
>   return;
> + pagetable_p4d_dtor(virt_to_ptdesc(p4d));
>   __tlb_adjust_range(tlb, address, PAGE_SIZE);
>   tlb->mm->context.flush_mm = 1;
>   tlb->freed_tables = 1;
> @@ -140,6 +141,7 @@ static inline void pud_free_tlb(struct mmu_gather *tlb, 
> pud_t *pud,
>  {
>   if (mm_pud_folded(tlb->mm))
>   return;
> + pagetable_pud_dtor(virt_to_ptdesc(pud));
>   tlb->mm->context.flush_mm = 1;
>   tlb->freed_tables = 1;
>   tlb->cleared_p4ds = 1;

Acked-by: Alexander Gordeev 

Thanks!



Re: [PATCH v8 2/7] PCI: Move TLP Log handling to own file

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:42PM +0200, Ilpo Järvinen wrote:
> TLP Log is a PCIe feature and is processed only by AER and DPC.
> Config-wise, DPC depends on AER being enabled. In the absence of a
> better place, the TLP Log handling code was initially placed in pci.c,
> but it can easily be moved to a separate file.
> 
> Move TLP Log handling code to own file under pcie/ subdirectory and
> include it only when AER is enabled.
> 
> Signed-off-by: Ilpo Järvinen 
> Reviewed-by: Jonathan Cameron 
> ---

Reviewed-by: Yazen Ghannam 

Overall, looks good to me, but I have one idea below.

>  drivers/pci/pci.c | 27 ---
>  drivers/pci/pci.h |  2 +-
>  drivers/pci/pcie/Makefile |  2 +-
>  drivers/pci/pcie/tlp.c| 39 +++
>  4 files changed, 41 insertions(+), 29 deletions(-)
>  create mode 100644 drivers/pci/pcie/tlp.c
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e0fdc9d10f91..02cd4c7eb80b 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1099,33 +1099,6 @@ static void pci_enable_acs(struct pci_dev *dev)
>   pci_write_config_word(dev, pos + PCI_ACS_CTRL, caps.ctrl);
>  }
>  
> -/**
> - * pcie_read_tlp_log - read TLP Header Log
> - * @dev: PCIe device
> - * @where: PCI Config offset of TLP Header Log
> - * @tlp_log: TLP Log structure to fill
> - *
> - * Fill @tlp_log from TLP Header Log registers, e.g., AER or DPC.
> - *
> - * Return: 0 on success and filled TLP Log structure, <0 on error.
> - */
> -int pcie_read_tlp_log(struct pci_dev *dev, int where,
> -   struct pcie_tlp_log *tlp_log)
> -{
> - int i, ret;
> -
> - memset(tlp_log, 0, sizeof(*tlp_log));
> -
> - for (i = 0; i < 4; i++) {
> - ret = pci_read_config_dword(dev, where + i * 4,
> - &tlp_log->dw[i]);
> - if (ret)
> - return pcibios_err_to_errno(ret);
> - }
> -
> - return 0;
> -}
> -
>  /**
>   * pci_restore_bars - restore a device's BAR values (e.g. after wake-up)
>   * @dev: PCI device to have its BARs restored
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 8a60fc9e7786..55fcf3bac4f7 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -549,9 +549,9 @@ struct aer_err_info {
>  
>  int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info 
> *info);
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
> -#endif   /* CONFIG_PCIEAER */
>  
>  int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log 
> *log);
> +#endif   /* CONFIG_PCIEAER */
>  
>  #ifdef CONFIG_PCIEPORTBUS
>  /* Cached RCEC Endpoint Association */
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index 53ccab62314d..173829aa02e6 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -7,7 +7,7 @@ pcieportdrv-y := portdrv.o rcec.o
>  obj-$(CONFIG_PCIEPORTBUS)+= pcieportdrv.o bwctrl.o
>  
>  obj-y+= aspm.o
> -obj-$(CONFIG_PCIEAER)+= aer.o err.o
> +obj-$(CONFIG_PCIEAER)+= aer.o err.o tlp.o
>  obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
>  obj-$(CONFIG_PCIE_PME)   += pme.o
>  obj-$(CONFIG_PCIE_DPC)   += dpc.o
> diff --git a/drivers/pci/pcie/tlp.c b/drivers/pci/pcie/tlp.c
> new file mode 100644
> index ..3f053cc62290
> --- /dev/null
> +++ b/drivers/pci/pcie/tlp.c
> @@ -0,0 +1,39 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCIe TLP Log handling
> + *
> + * Copyright (C) 2024 Intel Corporation
> + */
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "../pci.h"
> +
> +/**
> + * pcie_read_tlp_log - read TLP Header Log
> + * @dev: PCIe device
> + * @where: PCI Config offset of TLP Header Log
> + * @tlp_log: TLP Log structure to fill
> + *
> + * Fill @tlp_log from TLP Header Log registers, e.g., AER or DPC.
> + *
> + * Return: 0 on success and filled TLP Log structure, <0 on error.
> + */
> +int pcie_read_tlp_log(struct pci_dev *dev, int where,
> +   struct pcie_tlp_log *tlp_log)
> +{
> + int i, ret;
> +
> + memset(tlp_log, 0, sizeof(*tlp_log));
> +

Can we include a define for the number of registers? 

> + for (i = 0; i < 4; i++) {

This '4' is "MIN_TLP_REGS" or something similar.

> + ret = pci_read_config_dword(dev, where + i * 4,

This '4' is the register offset factor.

Another thought is to make the offset a variable and adjust it in the
for-loop conditions.

int i, ret, offset = where;

for (i = 0; i < MIN_TLP_REGS; i++, offset += 4) {
ret = pci_read_config_dword(dev, offset, &tlp_log->dw[i]);

I think this will help as variable-size TLP logs are added in later
patches.
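The suggested loop shape can be sketched in userspace (TLP_MIN_REGS, read_cfg_dword() and the byte-array config space are illustrative mocks, not the PCI core's API):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLP_MIN_REGS 4	/* four mandatory Header Log dwords */

/* Mock of a config-space dword read: one 32-bit register every 4 bytes. */
static int read_cfg_dword(const uint8_t *cfg, int offset, uint32_t *val)
{
	memcpy(val, cfg + offset, sizeof(*val));
	return 0;
}

/* Read the Header Log with a named register count and an explicit
 * offset that advances per register, as suggested above. */
static int read_tlp_log(const uint8_t *cfg, int where,
			uint32_t dw[TLP_MIN_REGS])
{
	int i, ret, offset = where;

	for (i = 0; i < TLP_MIN_REGS; i++, offset += 4) {
		ret = read_cfg_dword(cfg, offset, &dw[i]);
		if (ret)
			return ret;
	}
	return 0;
}
```

Keeping the offset in its own variable means a later variable-size log only needs the loop bound changed, not the indexing arithmetic.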

> + &tlp_log->dw[i]);
> + if (ret)
> + return pcibios_err_to_errno(ret);
> + }
> +
> + return 0;

Re: Perf doesn't display kernel symbols anymore (bisected commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array for addresses"))

2025-01-08 Thread Christophe Leroy




On 06/01/2025 at 22:46, Namhyung Kim wrote:


And in System.map I have:

c136a000 D __start___bug_table
c1377c14 D __stop___bug_table
c1378000 B __bss_start
c1378000 B _edata
c1378000 B initcall_debug
c1378004 B reset_devices

Should perf try to locate the very last symbol when it doesn't find _edata?
Or should architecture's link script be modified ? Otherwise commit
69a87a32f5cd ("perf machine: Include data symbols in the kernel map") is
just pointless.


Let's go with kallsyms__get_symbol_start().  I think it's the most
straight-forward and simplest fix.



Ok, I did that, see patch 
https://lore.kernel.org/linux-perf-users/b3ee1994d95257cb7f2de037c5030ba7d1bed404.1736327613.git.christophe.le...@csgroup.eu/T/#u


And for the _edata which is sometimes missing, I send patch 
https://lore.kernel.org/linux-perf-users/2fec8c50c271dff59f0177ff0884b6c374486ba5.1736327770.git.christophe.le...@csgroup.eu/T/#u


Christophe



Re: [PATCH v3 5/6] s390/crash: Use note name macros

2025-01-08 Thread Heiko Carstens
On Wed, Jan 08, 2025 at 01:53:51PM +0900, Akihiko Odaki wrote:
> On 2025/01/08 1:17, Dave Martin wrote:
> > > +#define NT_INIT(buf, type, desc) \
> > > + (nt_init_name((buf), NT_ ## type, &(desc), sizeof(desc), NN_ ## type))
> > 
> > Nit: this macro name clashes with the naming scheme in elf.h.
> > 
> > I think that there is a (weak) convention that macros with upper-case
> > names don't expand to a C function call; thus, a macro with an upper-
> > case name can be invoked in places where a C function call would not be
> > allowed.  (This convention is not followed everywhere, though -- it's
> > up to the maintainer what they prefer here.)
> 
> I wanted to clarify it is a macro as it concatenates tokens with ##, but I
> also find there are many macros that are named lower-case and perform
> token concatenation.
> 
> S390 maintainers, please tell us your opinion.

Just make the new macros lower case to avoid the naming scheme
clashes, please. Otherwise it doesn't matter too much.

> > > +#define NT_SIZE(type, desc) (nt_size_name(sizeof(desc), NN_ ## type))
> > 
> > Nit: name prefix clash (again); possibly redundant parentheses.

Same here.

> > > - size =  nt_size(NT_PRSTATUS, sizeof(struct elf_prstatus));
> > > - size +=  nt_size(NT_PRFPREG, sizeof(elf_fpregset_t));
> > > - size +=  nt_size(NT_S390_TIMER, sizeof(sa->timer));
> > > - size +=  nt_size(NT_S390_TODCMP, sizeof(sa->todcmp));
> > > - size +=  nt_size(NT_S390_TODPREG, sizeof(sa->todpreg));
> > > - size +=  nt_size(NT_S390_CTRS, sizeof(sa->ctrs));
> > > - size +=  nt_size(NT_S390_PREFIX, sizeof(sa->prefix));
> > > + size =  NT_SIZE(PRSTATUS, struct elf_prstatus);
> > > + size +=  NT_SIZE(PRFPREG, elf_fpregset_t);
> > > + size +=  NT_SIZE(S390_TIMER, sa->timer);
> > > + size +=  NT_SIZE(S390_TODCMP, sa->todcmp);
> > > + size +=  NT_SIZE(S390_TODPREG, sa->todpreg);
> > > + size +=  NT_SIZE(S390_CTRS, sa->ctrs);
> > > + size +=  NT_SIZE(S390_PREFIX, sa->prefix);
> > 
> > It might be worth fixing the funny spacing on these lines, since all
> > the affected lines are being replaced.

Yes, please!

Besides that this looks good:
Acked-by: Heiko Carstens 



Re: [PATCH] KVM: allow NULL writable argument to __kvm_faultin_pfn

2025-01-08 Thread Sean Christopherson
On Mon, Jan 06, 2025, Sean Christopherson wrote:
> On Wed, Jan 01, 2025, Paolo Bonzini wrote:
> > kvm_follow_pfn() is able to work with NULL in the .map_writable field
> > of the homonymous struct.  But __kvm_faultin_pfn() rejects the combo
> > despite KVM for e500 trying to use it.  Indeed .map_writable is not
> > particularly useful if the flags include FOLL_WRITE and readonly
> > guest memory is not supported, so add support to __kvm_faultin_pfn()
> > for this case.
> 
> I would prefer to keep the sanity check to minimize the risk of a page fault
> handler not supporting opportunistic write mappings.  e500 is definitely the
> odd one out here.

Per a quick chat at PUCK, Paolo is going to try and fix the e500 code to
actually use the @writable param as it's intended.



Re: [PATCH v3 2/6] binfmt_elf: Use note name macros

2025-01-08 Thread Dave Martin
Hi,

On Wed, Jan 08, 2025 at 01:34:24PM +0900, Akihiko Odaki wrote:
> On 2025/01/08 1:18, Dave Martin wrote:
> > On Tue, Jan 07, 2025 at 09:45:53PM +0900, Akihiko Odaki wrote:
> > > Use note name macros to match with the userspace's expectation.
> > 
> > Also (and more importantly) get rid of duplicated knowledge about the
> > mapping of note types to note names, so that elf.h is the authoritative
> > source of this information?
> > 
> > > 
> > > Signed-off-by: Akihiko Odaki 
> > > Acked-by: Baoquan He 
> > > ---
> > >   fs/binfmt_elf.c   | 21 ++---
> > >   fs/binfmt_elf_fdpic.c |  8 
> > >   2 files changed, 14 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> > > index 106f0e8af177..5b4a92e5e508 100644
> > > --- a/fs/binfmt_elf.c
> > > +++ b/fs/binfmt_elf.c
> > 
> > [...]
> > 
> > > @@ -1538,7 +1538,7 @@ static int elf_fdpic_core_dump(struct 
> > > coredump_params *cprm)
> > >   do
> > >   i += 2;
> > >   while (auxv[i - 2] != AT_NULL);
> > > - fill_note(&auxv_note, "CORE", NT_AUXV, i * sizeof(elf_addr_t), auxv);
> > > + fill_note(&auxv_note, NN_AUXV, NT_AUXV, i * sizeof(elf_addr_t), auxv);
> > >   thread_status_size += notesize(&auxv_note);
> > >   offset = sizeof(*elf);  /* ELF header */
> > 
> > Looking at this code, it appears that the right name is explicitly
> > taken from elf.h for a few specific notes, but for those that are
> > specified by the arch code (e.g., in struct user_regset entries) the
> > name is still guessed locally:
> > 
> > static int fill_thread_core_info(...) {
> > 
> > ...
> > 
> > fill_note(&t->notes[note_iter], is_fpreg ? "CORE" : "LINUX",
> > note_type, ret, data);
> > 
> > 
> > It would be preferable to clean this up if we want elf.h to be the
> > authoritative source for the names.
> 
> If we want elf.h to be the authoritative source, yes, but I like the current
> form as it ensures nobody adds a note with a name different from "LINUX" and
> it is also simpler. There is a trade-off so I'd like to keep the current
> form unless anyone has a strong preference for one option.
> 
> Regards,
> Akihiko Odaki

I can see where you're coming from here.

It would be nice to at least be able to check that elf.h is consistent
with the behaviour here, but you're right -- there is a tradeoff.

Maybe add a comment in elf.h at the end of the block of #defines saying
that new Linux-specific entries should use the name "LINUX"?

Either way, I don't think it's a huge deal.  If people are happy with
this code as-is, then I don't have an issue with it.


I might follow up with a separate patch if this series is merged, and
people can consider it on its own merits (or lack thereof).

Cheers
---Dave



Re: [PATCH 03/14] ibmvnic: simplify ibmvnic_set_queue_affinity()y

2025-01-08 Thread Nick Child
On Tue, Jan 07, 2025 at 03:04:40PM -0800, Yury Norov wrote:
> On Tue, Jan 07, 2025 at 02:43:01PM -0800, Yury Norov wrote:
> > On Tue, Jan 07, 2025 at 04:37:17PM -0600, Nick Child wrote:
> > > On Sat, Dec 28, 2024 at 10:49:35AM -0800, Yury Norov wrote:
> > > > A loop based on cpumask_next_wrap() opencodes the dedicated macro
> > > > for_each_online_cpu_wrap(). Using the macro allows to avoid setting
> > > > bits affinity mask more than once when stride >= num_online_cpus.
> > > > 
> > > > This also helps to drop cpumask handling code in the caller function.
> > > > 
> > > > Signed-off-by: Yury Norov 
> > > > ---
> > > >  drivers/net/ethernet/ibm/ibmvnic.c | 17 ++---
> > > >  1 file changed, 10 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
> > > > b/drivers/net/ethernet/ibm/ibmvnic.c
> > > > index e95ae0d39948..4cfd90fb206b 100644
> > > > --- a/drivers/net/ethernet/ibm/ibmvnic.c
> > > > +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> > > > @@ -234,11 +234,16 @@ static int ibmvnic_set_queue_affinity(struct 
> > > > ibmvnic_sub_crq_queue *queue,
> > > > (*stragglers)--;
> > > > }
> > > > /* atomic write is safer than writing bit by bit directly */
> > > > -   for (i = 0; i < stride; i++) {
> > > > -   cpumask_set_cpu(*cpu, mask);
> > > > -   *cpu = cpumask_next_wrap(*cpu, cpu_online_mask,
> > > > -nr_cpu_ids, false);
> > > > +   for_each_online_cpu_wrap(i, *cpu) {
> > > > +   if (!stride--)
> > > > +   break;
> > > > +   cpumask_set_cpu(i, mask);
> > > > }
> > > > +
> > > > +   /* For the next queue we start from the first unused CPU in 
> > > > this queue */
> > > > +   if (i < nr_cpu_ids)
> > > > +   *cpu = i + 1;
> > > > +
> > > This should read '*cpu = i'. Since the loop breaks after incrementing i.
> > > Thanks!
> > 
> > cpumask_next_wrap() makes '+ 1' for you. The for_each_cpu_wrap() starts
> > exactly where you point. So, this '+1' needs to be explicit now.
> > 
> > Does that make sense?
> 
> Ah, I think I see what you mean. It should be like this, right?
> 
>   for_each_online_cpu_wrap(i, *cpu) {
>   if (!stride--) {
>   *cpu = i + 1;
>   break;
> }
>   cpumask_set_cpu(i, mask);
>   }
Not quite: for_each_online_cpu_wrap() will increment i to point to the
next online cpu, then enter the body of the loop. When we break (because
stride is zero), we exit the loop early before i is added to any mask, so
i is the next unassigned online cpu.
I tested this to make sure; we see unused cpus (#7, #23) with the patch as is:
  IRQ : 256 -> ibmvnic-3003-tx0
/proc/irq/256/smp_affinity_list:0-6
  IRQ : 257 -> ibmvnic-3003-tx1
/proc/irq/257/smp_affinity_list:16-22
  IRQ : 258 -> ibmvnic-3003-rx0
/proc/irq/258/smp_affinity_list:8-14
  IRQ : 259 -> ibmvnic-3003-rx1
/proc/irq/259/smp_affinity_list:24-30
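The semantics under discussion can be sketched in userspace (plain arrays stand in for cpumasks, assign_queue() and all names are illustrative, and the driver's straggler handling is omitted):

```c
#include <assert.h>

/* Assign `stride` CPUs from the online list (length n) to `mask`,
 * starting at index *cpu and wrapping around like
 * for_each_online_cpu_wrap(); leave *cpu at the first unassigned
 * CPU, which is where the next queue starts. */
static void assign_queue(const int *online, int n, int *cpu,
			 int stride, int *mask)
{
	int i = *cpu;

	while (stride--) {
		mask[online[i]] = 1;
		i = (i + 1) % n;	/* wrap to the first online cpu */
	}
	*cpu = i;
}
```

Because the wrap iteration already stops on the first CPU *not* consumed by this queue, no extra "+ 1" is needed when recording the next queue's starting point.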




Re: watchdog: BUG: soft lockup

2025-01-08 Thread Doug Anderson
Hi,

On Sun, Dec 22, 2024 at 10:32 PM wzs  wrote:
>
> Hello,
> when fuzzing the Linux kernel,
> I triggered many "watchdog: BUG: soft lockup" warnings.
> I am not sure whether this is an issue with the kernel or with the
> fuzzing program I ran.
> (The same fuzzing program, when tested on kernel versions from
> Linux-6.7.0 to 6.12.0, triggers the 'watchdog: BUG: soft lockup'
> warning on some versions, while others do not. Linux 6.12.0 is the
> latest stable release where this error occurs.)
>
> The bug information I provided below is from the Linux-6.12.0 kernel.
> If you need bug information from other versions, I would be happy to provide 
> it.
>
> kernel config :https://pastebin.com/i4LPXNAN
> console output :https://pastebin.com/uKVpvJ78

IMO it's nearly always a bug if userspace can cause the kernel to soft
lockup. I'd expect this isn't a bug in the soft lockup detector but a
problem in whatever part of the kernel you're fuzzing. For some
details of the soft lockup detector, see
`Documentation/admin-guide/lockup-watchdogs.rst`.

Presumably you're fuzzing the kernel in a way that causes it to enter
a big loop while preemption is disabled, or something like that.
Presumably the kernel should be detecting something invalid that
userspace did and that would keep it from looping so long.

I tried looking at your pastebin and probably what's going on is
somewhere hidden in there, but unfortunately the beginning of the logs
are a bit jumbled since it looks like the RCU warning and the soft
lockup warning happened at about the same time and their stuff is
jumbled. There's also a lot of tasks to go through. Honestly, it's
probably less work just to look at whatever you were trying to fuzz to
help you pinpoint the problem.

I'll also note that you seem to be using KASAN and are running in a
virtual machine. It's not inconceivable that's contributing to your
problems. KASAN makes things _a lot_ slower and a VM may be getting
its time stolen by the host.

-Doug



[PATCH] powerpc/pseries/iommu: create DDW for devices with DMA mask less than 64-bits

2025-01-08 Thread Gaurav Batra
Starting with PAPR level 2.13, the platform supports placing the PHB in
limited address mode. Devices that support DMA masks smaller than 64 bits
but greater than 32 bits are placed in limited address mode. In this
mode, the starting DMA address returned by the DDW is 4GB.

When the device driver calls dma_supported() with a mask of less than
64 bits, the PowerPC IOMMU driver places the PHB in limited addressing
mode before creating the DDW.

Signed-off-by: Gaurav Batra 
---
 arch/powerpc/platforms/pseries/iommu.c | 110 +
 1 file changed, 94 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 534cd159e9ab..551e9ca4dcc2 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -52,7 +52,8 @@ enum {
 enum {
DDW_EXT_SIZE = 0,
DDW_EXT_RESET_DMA_WIN = 1,
-   DDW_EXT_QUERY_OUT_SIZE = 2
+   DDW_EXT_QUERY_OUT_SIZE = 2,
+   DDW_EXT_LIMITED_ADDR_MODE = 3
 };
 
 static struct iommu_table *iommu_pseries_alloc_table(int node)
@@ -1331,6 +1332,54 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+/*
+ * Platforms support placing PHB in limited address mode starting with LoPAR
+ * level 2.13 implement. In this mode, the DMA address returned by DDW is over
+ * 4GB but, less than 64-bits. This benefits IO adapters that don't support
+ * 64-bits for DMA addresses.
+ */
+static int limited_dma_window(struct pci_dev *dev, struct device_node *par_dn)
+{
+   int ret;
+   u32 cfg_addr, reset_dma_win, las_supported;
+   u64 buid;
+   struct device_node *dn;
+   struct pci_dn *pdn;
+
+   ret = ddw_read_ext(par_dn, DDW_EXT_RESET_DMA_WIN, &reset_dma_win);
+   if (ret)
+   goto out;
+
+   ret = ddw_read_ext(par_dn, DDW_EXT_LIMITED_ADDR_MODE, &las_supported);
+
+   /* Limited Address Space extension available on the platform but DDW in
+* limited addressing mode not supported
+*/
+   if (!ret && !las_supported)
+   ret = -EPROTO;
+
+   if (ret) {
+   dev_info(&dev->dev, "Limited Address Space for DDW not 
Supported, err: %d", ret);
+   goto out;
+   }
+
+   dn = pci_device_to_OF_node(dev);
+   pdn = PCI_DN(dn);
+   buid = pdn->phb->buid;
+   cfg_addr = (pdn->busno << 16) | (pdn->devfn << 8);
+
+   ret = rtas_call(reset_dma_win, 4, 1, NULL, cfg_addr, BUID_HI(buid),
+   BUID_LO(buid), 1);
+   if (ret)
+   dev_info(&dev->dev,
+"ibm,reset-pe-dma-windows(%x) for Limited Addr 
Support: %x %x %x returned %d ",
+reset_dma_win, cfg_addr, BUID_HI(buid), BUID_LO(buid),
+ret);
+
+out:
+   return ret;
+}
+
 /* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
 static int iommu_get_page_shift(u32 query_page_size)
 {
@@ -1398,7 +1447,7 @@ static struct property *ddw_property_create(const char 
*propname, u32 liobn, u64
  *
  * returns true if can map all pages (direct mapping), false otherwise..
  */
-static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn, u64 
dma_mask)
 {
int len = 0, ret;
int max_ram_len = order_base_2(ddw_memory_hotplug_max());
@@ -1417,6 +1466,9 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
bool pmem_present;
struct pci_dn *pci = PCI_DN(pdn);
struct property *default_win = NULL;
+   bool limited_addr_req = false, limited_addr_enabled = false;
+   int dev_max_ddw;
+   int ddw_sz;
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1443,7 +1495,6 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 * the ibm,ddw-applicable property holds the tokens for:
 * ibm,query-pe-dma-window
 * ibm,create-pe-dma-window
-* ibm,remove-pe-dma-window
 * for the given node in that order.
 * the property is actually in the parent, not the PE
 */
@@ -1463,6 +1514,20 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (ret != 0)
goto out_failed;
 
+   /* DMA Limited Addressing required? This is when the driver has
+* requested to create DDW but supports mask which is less than 64-bits
+*/
+   limited_addr_req = (dma_mask != DMA_BIT_MASK(64));
+
+   /* place the PHB in Limited Addressing mode */
+   if (limited_addr_req) {
+   if (limited_dma_window(dev, pdn))
+   goto out_failed;
+
+   /* PHB is in Limited address mode */
+   limited_addr_enabled = true;
+   }
+
/*
 * If there is no window available, remove the default DMA w

Re: [PATCH v2 0/3] sysfs: constify bin_attribute argument of sysfs_bin_attr_simple_read()

2025-01-08 Thread Alexei Starovoitov
On Tue, Dec 31, 2024 at 2:30 AM Thomas Weißschuh  wrote:
>
> On 2024-12-30 16:50:41-0800, Alexei Starovoitov wrote:
> > On Sat, Dec 28, 2024 at 12:43 AM Thomas Weißschuh  
> > wrote:
> > >
> > > Most users use this function through the BIN_ATTR_SIMPLE* macros,
> > > they can handle the switch transparently.
> > >
> > > This series is meant to be merged through the driver core tree.
> >
> > hmm. why?
>
> Patch 1 changes the signature of sysfs_bin_attr_simple_read().
> Before patch 1 sysfs_bin_attr_simple_read() needs to be assigned to the
> callback member .read, after patch 1 it's .read_new.
> (Both callbacks work exactly the same, except for their signature,
> .read_new is only a transition mechanism and will go away again)
>
> > I'd rather take patches 2 and 3 into bpf-next to avoid
> > potential conflicts.
> > Patch 1 looks orthogonal and independent.
>
> If you pick up 2 and 3 through bpf-next you would need to adapt these
> assignments. As soon as both patch 1 and the modified 2 and 3 hit
> Linus' tree, the build would break due to mismatched function pointers.
> (Casting function pointers to avoid the mismatch will blow up with KCFI)

I see. All these steps to constify are frankly a mess.
You're wasting CPU and memory for this read vs read_new
when const is not much more than syntactic sugar in C.
You should have done one tree-wide patch without doing this _new() hack.

Anyway, rant over. Carry patches 2,3. Hopefully they won't conflict.
But I don't want to see any constification patches in bpf land
that come with such pointless runtime penalty.



Re: [PATCH] perf machine: Don't ignore _etext when not a text symbol

2025-01-08 Thread Arnaldo Carvalho de Melo
On Wed, Jan 08, 2025 at 10:15:24AM +0100, Christophe Leroy wrote:
> Depending on how vmlinux.lds is written, _etext might be the very
> first data symbol instead of the very last text symbol.
> 
> Don't require it to be a text symbol, accept any symbol type.

I'm adding a Link:

Link: 
https://lore.kernel.org/all/752a31b0-4370-4f52-b7cc-45f0078c1...@csgroup.eu

To give more context as where this has been observed, and also add a
snippet of your explanation there, this:


# grep -e _stext -e _etext -e _edata /proc/kallsyms
c000 T _stext
c08b8000 D _etext

So there is no _edata and _etext is not text

$ ppc-linux-objdump -x vmlinux | grep -e _stext -e _etext -e _edata
c000 g   .head.text  _stext
c08b8000 g   .rodata _etext
c1378000 g   .sbss   _edata


Thanks,

- Arnaldo

> Fixes: ed9adb2035b5 ("perf machine: Read also the end of the kernel")
> Signed-off-by: Christophe Leroy 
> ---
>  tools/perf/util/machine.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
> index 27d5345d2b30..9be2f4479f52 100644
> --- a/tools/perf/util/machine.c
> +++ b/tools/perf/util/machine.c
> @@ -1003,7 +1003,7 @@ static int machine__get_running_kernel_start(struct 
> machine *machine,
>  
>   err = kallsyms__get_symbol_start(filename, "_edata", &addr);
>   if (err)
> - err = kallsyms__get_function_start(filename, "_etext", &addr);
> + err = kallsyms__get_symbol_start(filename, "_etext", &addr);
>   if (!err)
>   *end = addr;
>  
> -- 
> 2.47.0



Re: [PATCH v8 3/7] PCI: Make pcie_read_tlp_log() signature same

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:43PM +0200, Ilpo Järvinen wrote:
> pcie_read_tlp_log()'s prototype and function signature diverged due to
> changes made while applying.
> 
> Make the parameters of pcie_read_tlp_log() named identically.
> 
> Signed-off-by: Ilpo Järvinen 
> Reviewed-by: Jonathan Cameron 

Reviewed-by: Yazen Ghannam 

Just wondering, could this be squashed into the previous patch? Or is
the goal to be strict with one logical change per patch?

Thanks,
Yazen



Re: [PATCH v8 6/7] PCI: Add TLP Prefix reading into pcie_read_tlp_log()

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:46PM +0200, Ilpo Järvinen wrote:
> pcie_read_tlp_log() handles only 4 Header Log DWORDs but TLP Prefix Log
> (PCIe r6.1 secs 7.8.4.12 & 7.9.14.13) may also be present.
> 
> Generalize pcie_read_tlp_log() and struct pcie_tlp_log to handle also
> TLP Prefix Log. The relevant registers are formatted identically in AER
> and DPC Capability, but has these variations:
> 
> a) The offsets of TLP Prefix Log registers vary.
> b) DPC RP PIO TLP Prefix Log register can be < 4 DWORDs.
> c) AER TLP Prefix Log Present (PCIe r6.1 sec 7.8.4.7) can indicate
>Prefix Log is not present.
> 
> Therefore callers must pass the offset of the TLP Prefix Log register
> and the entire length to pcie_read_tlp_log() to be able to read the
> correct number of TLP Prefix DWORDs from the correct offset.
> 
> Signed-off-by: Ilpo Järvinen 
> Reviewed-by: Jonathan Cameron 
> ---
>  drivers/pci/pci.h |  5 +++-
>  drivers/pci/pcie/aer.c|  5 +++-
>  drivers/pci/pcie/dpc.c| 13 +
>  drivers/pci/pcie/tlp.c| 51 +++
>  include/linux/aer.h   |  1 +
>  include/uapi/linux/pci_regs.h | 10 ---
>  6 files changed, 67 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 55fcf3bac4f7..7797b2544d00 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -550,7 +550,9 @@ struct aer_err_info {
>  int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info 
> *info);
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
>  
> -int pcie_read_tlp_log(struct pci_dev *dev, int where, struct pcie_tlp_log 
> *log);
> +int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
> +   unsigned int tlp_len, struct pcie_tlp_log *log);
> +unsigned int aer_tlp_log_len(struct pci_dev *dev, u32 aercc);
>  #endif   /* CONFIG_PCIEAER */
>  
>  #ifdef CONFIG_PCIEPORTBUS
> @@ -569,6 +571,7 @@ void pci_dpc_init(struct pci_dev *pdev);
>  void dpc_process_error(struct pci_dev *pdev);
>  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>  bool pci_dpc_recovered(struct pci_dev *pdev);
> +unsigned int dpc_tlp_log_len(struct pci_dev *dev);
>  #else
>  static inline void pci_save_dpc_state(struct pci_dev *dev) { }
>  static inline void pci_restore_dpc_state(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 80c5ba8d8296..656dbf1ac45b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1248,7 +1248,10 @@ int aer_get_device_error_info(struct pci_dev *dev, 
> struct aer_err_info *info)
>  
>   if (info->status & AER_LOG_TLP_MASKS) {
>   info->tlp_header_valid = 1;
> - pcie_read_tlp_log(dev, aer + PCI_ERR_HEADER_LOG, 
> &info->tlp);
> + pcie_read_tlp_log(dev, aer + PCI_ERR_HEADER_LOG,
> +   aer + PCI_ERR_PREFIX_LOG,
> +   aer_tlp_log_len(dev, aercc),
> +   &info->tlp);
>   }
>   }
>  
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 2b6ef7efa3c1..7933b3cedb59 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -190,7 +190,7 @@ pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  static void dpc_process_rp_pio_error(struct pci_dev *pdev)
>  {
>   u16 cap = pdev->dpc_cap, dpc_status, first_error;
> - u32 status, mask, sev, syserr, exc, log, prefix;
> + u32 status, mask, sev, syserr, exc, log;
>   struct pcie_tlp_log tlp_log;
>   int i;
>  
> @@ -217,20 +217,19 @@ static void dpc_process_rp_pio_error(struct pci_dev 
> *pdev)
>  
>   if (pdev->dpc_rp_log_size < 4)
>   goto clear_status;
> - pcie_read_tlp_log(pdev, cap + PCI_EXP_DPC_RP_PIO_HEADER_LOG, &tlp_log);
> + pcie_read_tlp_log(pdev, cap + PCI_EXP_DPC_RP_PIO_HEADER_LOG,
> +   cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG,
> +   dpc_tlp_log_len(pdev), &tlp_log);
>   pci_err(pdev, "TLP Header: %#010x %#010x %#010x %#010x\n",
>   tlp_log.dw[0], tlp_log.dw[1], tlp_log.dw[2], tlp_log.dw[3]);
> + for (i = 0; i < pdev->dpc_rp_log_size - 5; i++)
> + pci_err(pdev, "TLP Prefix Header: dw%d, %#010x\n", i, 
> tlp_log.prefix[i]);
>  
>   if (pdev->dpc_rp_log_size < 5)
>   goto clear_status;
>   pci_read_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_IMPSPEC_LOG, &log);
>   pci_err(pdev, "RP PIO ImpSpec Log %#010x\n", log);
>  
> - for (i = 0; i < pdev->dpc_rp_log_size - 5; i++) {
> - pci_read_config_dword(pdev,
> - cap + PCI_EXP_DPC_RP_PIO_TLPPREFIX_LOG + i * 4, 
> &prefix);
> - pci_err(pdev, "TLP Prefix Header: dw%d, %#010x\n", i, prefix);
> - }
>   clear_status:
>   pci_write_config_dword(pdev, cap + PCI_EXP_DPC_RP_PIO_STATUS, stat

Re: [PATCH v8 5/7] PCI: Store # of supported End-End TLP Prefixes

2025-01-08 Thread Yazen Ghannam
On Wed, Dec 18, 2024 at 04:37:45PM +0200, Ilpo Järvinen wrote:
> eetlp_prefix_path in struct pci_dev tells whether End-End TLP Prefixes
> are supported by the path or not; the value is only calculated if
> CONFIG_PCI_PASID is set.
> 
> The Max End-End TLP Prefixes field in the Device Capabilities Register
> 2 also tells how many (1-4) End-End TLP Prefixes are supported (PCIe
> r6.2 sec 7.5.3.15). The number of supported End-End Prefixes is useful
> for reading correct number of DWORDs from TLP Prefix Log register in AER
> capability (PCIe r6.2 sec 7.8.4.12).
> 
> Replace eetlp_prefix_path with eetlp_prefix_max and determine the
> number of supported End-End Prefixes regardless of CONFIG_PCI_PASID so
> that an upcoming commit generalizing TLP Prefix Log register reading
> does not have to read extra DWORDs for End-End Prefixes that never will
> be there.
> 
> The value stored into eetlp_prefix_max is directly derived from
> device's Max End-End TLP Prefixes and does not consider limitations
> imposed by bridges or the Root Port beyond supported/not supported
> flags. This is intentional for two reasons:
> 
>   1) PCIe r6.2 spec sections r6.1 2.2.10.4 & 6.2.4.4 indicate that TLP
>   is handled malformed only if the number of prefixes exceed the number
>   of Max End-End TLP Prefixes, which seems to be the case even if the
>   device could never receive that many prefixes due to smaller maximum
>   imposed by a bridge or the Root Port. If TLP parsing is later added,
>   this distinction is significant in interpreting what is logged by the
>   TLP Prefix Log registers and the value matching to the Malformed TLP
>   threshold is going to be more useful.
> 
>   2) TLP Prefix handling happens autonomously on a low layer and the
>   value in eetlp_prefix_max is not programmed anywhere by the kernel
>   (i.e., there is no limiter OS can control to prevent sending more
>   than n TLP Prefixes).
> 
> Signed-off-by: Ilpo Järvinen 

Reviewed-by: Yazen Ghannam 

Thanks,
Yazen



Re: [PATCH] of: address: Unify resource bounds overflow checking

2025-01-08 Thread Basharath Hussain Khaja
Hi,

>> Thomas Weißschuh  writes:
>> > The members "start" and "end" of struct resource are of type
>> > "resource_size_t" which can be 32bit wide.
>> > Values read from OF however are always 64bit wide.
>> >
>> > Refactor the diff overflow checks into a helper function.
>> > Also extend the checks to validate each calculation step.
>> >
>> > Signed-off-by: Thomas Weißschuh 
>> > ---
>> >  drivers/of/address.c | 45 ++---
>> >  1 file changed, 26 insertions(+), 19 deletions(-)
>> >
>> > diff --git a/drivers/of/address.c b/drivers/of/address.c
>> > index 7e59283a4472..df854bb427ce 100644
>> > --- a/drivers/of/address.c
>> > +++ b/drivers/of/address.c
>> > @@ -198,6 +198,25 @@ static u64 of_bus_pci_map(__be32 *addr, const __be32 
>> > *range, int na, int ns,
>> >
>> >  #endif /* CONFIG_PCI */
>> >
>> > +static int __of_address_resource_bounds(struct resource *r, u64 start, 
>> > u64 size)
>> > +{
>> > + u64 end = start;
>> > +
>> > + if (overflows_type(start, r->start))
>> > + return -EOVERFLOW;
>> > + if (size == 0)
>> > + return -EOVERFLOW;
>> > + if (check_add_overflow(end, size - 1, &end))
>> > + return -EOVERFLOW;
>> > + if (overflows_type(end, r->end))
>> > + return -EOVERFLOW;
>>
>> This breaks PCI on powerpc qemu. Part of the PCI probe reads a resource
>> that's zero sized, which used to succeed but now fails due to the size
>> check above.
>>
>> The diff below fixes it for me.
>
> I fixed it up with your change.


This commit is breaking Ethernet functionality on the TI AM57xx platform due to 
a zero-byte SRAM block size allocation during initialization. Prior to this 
patch, zero-byte block sizes were handled properly.

The issue is with the following line of code:

if (size && check_add_overflow(end, size - 1, &end))

where check_add_overflow() is not called when size is zero. We feel
check_add_overflow() should be invoked even when the size is zero to
ensure correct block size allocation.

Thanks & Best Regards,
Basharath



Re: [PATCH v5 05/25] fs/dax: Create a common implementation to break DAX layouts

2025-01-08 Thread Dan Williams
Alistair Popple wrote:
> Prior to freeing a block file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from eg. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
> 
> This is done using two functions - dax_layout_busy_page_range() which
> returns a page to wait for the refcount to become idle on. Rather than
> open-code this introduce a common implementation to both unmap and
> wait for the page to become idle.
> 
> Signed-off-by: Alistair Popple 
> 
> ---
> 
> Changes for v5:
> 
>  - Don't wait for idle pages on non-DAX mappings
> 
> Changes for v4:
> 
>  - Fixed some build breakage due to missing symbol exports reported by
>John Hubbard (thanks!).
> ---
>  fs/dax.c| 33 +
>  fs/ext4/inode.c | 10 +-
>  fs/fuse/dax.c   | 29 +
>  fs/xfs/xfs_inode.c  | 23 +--
>  fs/xfs/xfs_inode.h  |  2 +-
>  include/linux/dax.h | 21 +
>  mm/madvise.c|  8 
>  7 files changed, 70 insertions(+), 56 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index d010c10..9c3bd07 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space 
> *mapping, pgoff_t index)
>   return ret;
>  }
>  
> +static int wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + return ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> +/*
> + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> + * DAX mapping entries for the range.
> + */
> +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> + int error;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return 0;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + break;
> +
> + error = wait_page_idle(page, cb, inode);

This implementation removes the logic around @retry found in the XFS and
FUSE implementations. I think that is a mistake, and EXT4 has
apparently been broken in this regard.

wait_page_idle() returns after @page is idle, but that does not mean
@inode is DMA idle. After one found page from
dax_layout_busy_page_range() is waited upon a new call to
dax_break_mapping() needs to made to check if another DMA started, or if
there were originally more pages active.

> + } while (error == 0);
> +
> + return error;

Surprised that the compiler does not warn about an uninitialized
variable here?



Re: [PATCH] powerpc/32: Stop printing Kernel virtual memory layout

2025-01-08 Thread Kees Cook
On Wed, Jan 08, 2025 at 07:40:38PM +0100, Christophe Leroy wrote:
> Printing of Kernel virtual memory layout was added for debug purpose
> by commit f637a49e507c ("powerpc: Minor cleanups of kernel virt
> address space definitions")
> 
> For security reasons, don't display the kernel's virtual memory layout.
> 
> Other architectures have removed it through following commits.
> 
> 071929dbdd86 ("arm64: Stop printing the virtual memory layout")
> 1c31d4e96b8c ("ARM: 8820/1: mm: Stop printing the virtual memory layout")
> 3182f798 ("m68k/mm: Stop printing the virtual memory layout")
> fd8d0ca25631 ("parisc: Hide virtual kernel memory layout")
> 681ff0181bbf ("x86/mm/init/32: Stop printing the virtual memory layout")
> 
> Commit 681ff0181bbf ("x86/mm/init/32: Stop printing the virtual memory
> layout") thought x86 was the last one, but in reality powerpc/32 still
> had it.
> 
> So remove it now on powerpc/32 as well.
> 
> Cc: Arvind Sankar 
> Cc: Kees Cook 
> Signed-off-by: Christophe Leroy 

Reviewed-by: Kees Cook 

-- 
Kees Cook



Re: [PATCH v5 00/25] fs/dax: Fix ZONE_DEVICE page reference counts

2025-01-08 Thread Dan Williams
Andrew Morton wrote:
> On Tue,  7 Jan 2025 14:42:16 +1100 Alistair Popple  wrote:
> 
> > Device and FS DAX pages have always maintained their own page
> > reference counts without following the normal rules for page reference
> > counting. In particular pages are considered free when the refcount
> > hits one rather than zero and refcounts are not added when mapping the
> > page.
> > 
> > Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
> > mechanism for allowing GUP to hold references on the page (see
> > get_dev_pagemap). However there doesn't seem to be any reason why FS
> > DAX pages need their own reference counting scheme.
> > 
> > By treating the refcounts on these pages the same way as normal pages
> > we can remove a lot of special checks. In particular pXd_trans_huge()
> > becomes the same as pXd_leaf(), although I haven't made that change
> > here. It also frees up a valuable SW define PTE bit on architectures
> > that have devmap PTE bits defined.
> > 
> > It also almost certainly allows further clean-up of the devmap managed
> > functions, but I have left that as a future improvement. It also
> > enables support for compound ZONE_DEVICE pages which is one of my
> > primary motivators for doing this work.
> > 
> 
> https://lkml.kernel.org/r/wysuus23bqmjtwkfu3zutqtmkse3ki3erf45x32yezlrl24qto@xlqt7qducyld
> made me expect merge/build/runtime issues, however this series merges
> and builds OK on mm-unstable.  Did something change?  What's the story
> here?
> 
> Oh well, it built so I'll ship it!

So my plan is to review this latest set on top of -next as is and then
rebase (or ask Alistair to rebase) on a mainline tag so I can identify
the merge conflicts with -mm and communicate those to Linus.

I will double check that you have pulled these back out of mm-unstable
before doing that to avoid a double-commit conflicts in -next, but for
now exposure in mm-unstable is good to flush out issues.



linux-next: manual merge of the powerpc tree with the mm tree

2025-01-08 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the powerpc tree got a conflict in:

  arch/powerpc/Kconfig

between commit:

  c0c3319917db ("mm: remove devmap related functions and page table bits")

from the mm-unstable branch of the mm tree and commit:

  00199ed6f2ca ("powerpc: Add preempt lazy support")

from the powerpc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/Kconfig
index 85409ec1fd83,db9f7b2d07bf..
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@@ -145,6 -145,8 +145,7 @@@ config PP
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
+   select ARCH_HAS_PREEMPT_LAZY
 -  select ARCH_HAS_PTE_DEVMAP  if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SCALED_CPUTIME  if VIRT_CPU_ACCOUNTING_NATIVE 
&& PPC_BOOK3S_64
select ARCH_HAS_SET_MEMORY




Re: [PATCH RESEND v1 3/5] powerpc/kdump: preserve user-specified memory limit

2025-01-08 Thread Sourabh Jain

Hello Mahesh,


On 08/01/25 22:36, Mahesh J Salgaonkar wrote:
> On 2025-01-08 15:44:56 Wed, Sourabh Jain wrote:
>> Commit 59d58189f3d9 ("crash: fix crash memory reserve exceed system
>> memory bug") fails crashkernel parsing if the crash size is found to be
>> higher than system RAM, which makes the memory_limit adjustment code
>> ineffective due to an early exit from reserve_crashkernel().
>>
>> Regardless, let's not violate the user-specified memory limit by
>> adjusting it. Remove this adjustment to ensure all reservations stay
>> within the limit. Commit f94f5ac07983 ("powerpc/fadump: Don't update
>> the user-specified memory limit") did the same for fadump.
>
> Agreed.
>
> Reviewed-by: Mahesh Salgaonkar 

Thanks for the review.

- Sourabh Jain



Re: [PATCH v3 01/28] module: Extend the preempt disabled section in dereference_symbol_descriptor().

2025-01-08 Thread Sebastian Andrzej Siewior
On 2025-01-08 10:55:04 [+0100], Helge Deller wrote:
> Nice catch.
> 
> Acked-by: Helge Deller 
> 
> This patch really should be backported.
> Can you add a Cc: stable tag?

It should be picked due to the fixes tag. I add it if I am going to
repost it.

> Helge

Sebastian



[PATCH] powerpc/vmlinux: Remove etext, edata and end

2025-01-08 Thread Christophe Leroy
etext is not used anymore since commit 843a1ffaf6f2 ("powerpc/mm: use
core_kernel_text() helper")

edata and end have not been used since the merge of arch/ppc/ and
arch/ppc64/.

Remove all three, along with the PROVIDE32 macro.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vmlinux.lds.S | 9 -
 arch/powerpc/kvm/book3s_32_mmu_host.c | 2 --
 2 files changed, 11 deletions(-)

diff --git a/arch/powerpc/kernel/vmlinux.lds.S 
b/arch/powerpc/kernel/vmlinux.lds.S
index b4c9decc7a75..de6ee7d35cff 100644
--- a/arch/powerpc/kernel/vmlinux.lds.S
+++ b/arch/powerpc/kernel/vmlinux.lds.S
@@ -1,10 +1,4 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifdef CONFIG_PPC64
-#define PROVIDE32(x)   PROVIDE(__unused__##x)
-#else
-#define PROVIDE32(x)   PROVIDE(x)
-#endif
-
 #define BSS_FIRST_SECTIONS *(.bss.prominit)
 #define EMITS_PT_NOTE
 #define RO_EXCEPTION_TABLE_ALIGN   0
@@ -127,7 +121,6 @@ SECTIONS
 
. = ALIGN(PAGE_SIZE);
_etext = .;
-   PROVIDE32 (etext = .);
 
/* Read-only data */
RO_DATA(PAGE_SIZE)
@@ -394,7 +387,6 @@ SECTIONS
 
. = ALIGN(PAGE_SIZE);
_edata  =  .;
-   PROVIDE32 (edata = .);
 
 /*
  * And finally the bss
@@ -404,7 +396,6 @@ SECTIONS
 
. = ALIGN(PAGE_SIZE);
_end = . ;
-   PROVIDE32 (end = .);
 
DWARF_DEBUG
ELF_DETAILS
diff --git a/arch/powerpc/kvm/book3s_32_mmu_host.c 
b/arch/powerpc/kvm/book3s_32_mmu_host.c
index 5b7212edbb13..c7e4b62642ea 100644
--- a/arch/powerpc/kvm/book3s_32_mmu_host.c
+++ b/arch/powerpc/kvm/book3s_32_mmu_host.c
@@ -125,8 +125,6 @@ static u32 *kvmppc_mmu_get_pteg(struct kvm_vcpu *vcpu, u32 
vsid, u32 eaddr,
return (u32*)pteg;
 }
 
-extern char etext[];
-
 int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte,
bool iswrite)
 {
-- 
2.47.0




[PATCH] powerpc/44x: Declare primary_uic static in uic.c

2025-01-08 Thread Christophe Leroy
primary_uic is not used outside uic.c, declare it static.

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202411101746.ld8ydvzy-...@intel.com/
Fixes: e58923ed1437 ("[POWERPC] Add arch/powerpc driver for UIC, PPC4xx 
interrupt controller")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/platforms/44x/uic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/44x/uic.c b/arch/powerpc/platforms/44x/uic.c
index e3e148b9dd18..8b03ae4cb3f6 100644
--- a/arch/powerpc/platforms/44x/uic.c
+++ b/arch/powerpc/platforms/44x/uic.c
@@ -37,7 +37,7 @@
 #define UIC_VR 0x7
 #define UIC_VCR0x8
 
-struct uic *primary_uic;
+static struct uic *primary_uic;
 
 struct uic {
int index;
-- 
2.47.0




Re: [PATCH v5 03/17] asm-generic: pgalloc: Provide generic p4d_{alloc_one,free}

2025-01-08 Thread Arnd Bergmann
On Wed, Jan 8, 2025, at 07:57, Qi Zheng wrote:
> From: Kevin Brodsky 
>
> Four architectures currently implement 5-level pgtables: arm64,
> riscv, x86 and s390. The first three have essentially the same
> implementation for p4d_alloc_one() and p4d_free(), so we've got an
> opportunity to reduce duplication like at the lower levels.
>
> Provide a generic version of p4d_alloc_one() and p4d_free(), and
> make use of it on those architectures.
>
> Their implementation is the same as at PUD level, except that
> p4d_free() performs a runtime check by calling mm_p4d_folded().
> 5-level pgtables depend on a runtime-detected hardware feature on
> all supported architectures, so we might as well include this check
> in the generic implementation. No runtime check is required in
> p4d_alloc_one() as the top-level p4d_alloc() already does the
> required check.
>
> Signed-off-by: Kevin Brodsky 
> Acked-by: Dave Hansen 
> Signed-off-by: Qi Zheng 
> ---
>  arch/arm64/include/asm/pgalloc.h | 17 
>  arch/riscv/include/asm/pgalloc.h | 23 
>  arch/x86/include/asm/pgalloc.h   | 18 -
>  include/asm-generic/pgalloc.h| 45 
>  4 files changed, 45 insertions(+), 58 deletions(-)

Acked-by: Arnd Bergmann  # asm-generic



[PATCH] kallsyms: Always include _edata

2025-01-08 Thread Christophe Leroy
Since commit 69a87a32f5cd ("perf machine: Include data symbols in the
kernel map"), perf needs _edata symbol.

Make sure it is always included in /proc/kallsyms and not only when
CONFIG_KALLSYMS_ALL is selected.

Signed-off-by: Christophe Leroy 
Fixes: 69a87a32f5cd ("perf machine: Include data symbols in the kernel map")
---
 scripts/kallsyms.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/scripts/kallsyms.c b/scripts/kallsyms.c
index 03852da3d249..391ab7ebce7f 100644
--- a/scripts/kallsyms.c
+++ b/scripts/kallsyms.c
@@ -215,6 +215,8 @@ static int symbol_valid(const struct sym_entry *s)
if (string_starts_with(name, "__start_") ||
string_starts_with(name, "__stop_"))
return 1;
+   if (!strcmp(name, "_edata"))
+   return 1;
 
if (symbol_in_range(s, text_ranges,
ARRAY_SIZE(text_ranges)) == 0)
-- 
2.47.0




[PATCH RESEND v1 3/5] powerpc/kdump: preserve user-specified memory limit

2025-01-08 Thread Sourabh Jain
Commit 59d58189f3d9 ("crash: fix crash memory reserve exceed system
memory bug") fails crashkernel parsing if the crash size is found to be
higher than system RAM, which makes the memory_limit adjustment code
ineffective due to an early exit from reserve_crashkernel().

Regardless, let's not violate the user-specified memory limit by
adjusting it. Remove this adjustment to ensure all reservations stay
within the limit. Commit f94f5ac07983 ("powerpc/fadump: Don't update
the user-specified memory limit") did the same for fadump.

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
 arch/powerpc/kexec/core.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index b8333a49ea5d..4945b33322ae 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -150,14 +150,6 @@ void __init reserve_crashkernel(void)
return;
}
 
-   /* Crash kernel trumps memory limit */
-   if (memory_limit && memory_limit <= crashk_res.end) {
-   memory_limit = crashk_res.end + 1;
-   total_mem_sz = memory_limit;
-   printk("Adjusted memory limit for crashkernel, now 0x%llx\n",
-  memory_limit);
-   }
-
printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
"for crashkernel (System RAM: %ldMB)\n",
(unsigned long)(crash_size >> 20),
-- 
2.47.1




[PATCH RESEND v1 0/5] powerpc/crash: use generic crashkernel reservation

2025-01-08 Thread Sourabh Jain
Commit 0ab97169aa05 ("crash_core: add generic function to do
reservation") added a generic function to reserve crashkernel memory.
So let's use the same function on powerpc and remove the
architecture-specific code that essentially does the same thing.

The generic crashkernel reservation also provides a way to split the
crashkernel reservation into high and low memory reservations, which can
be enabled for powerpc in the future.

Additionally move powerpc to use generic APIs to locate memory hole for
kexec segments while loading kdump kernel.

Patch series summary:
=
Patch 1-2: generic changes
Patch 3-4: powerpc changes
Patch 5:   generic + powerpc changes

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Cc: sourabhj...@linux.ibm.com

Changelog:

v1 Resend:
 - Rebased on top of 6.13-rc6

Sourabh Jain (5):
  crash: remove an unused argument from reserve_crashkernel_generic()
  crash: let arch decide crash memory export to iomem_resource
  powerpc/kdump: preserve user-specified memory limit
  powerpc/crash: use generic crashkernel reservation
  crash: option to let arch decide mem range is usable

 arch/arm64/mm/init.c |   6 +-
 arch/loongarch/kernel/setup.c|   5 +-
 arch/powerpc/Kconfig |   3 +
 arch/powerpc/include/asm/crash_reserve.h |  18 ++
 arch/powerpc/include/asm/kexec.h |  10 +-
 arch/powerpc/kernel/prom.c   |   2 +-
 arch/powerpc/kexec/core.c|  96 -
 arch/powerpc/kexec/file_load_64.c| 259 +--
 arch/riscv/mm/init.c |   6 +-
 arch/x86/kernel/setup.c  |   6 +-
 include/linux/crash_reserve.h|  22 +-
 include/linux/kexec.h|   9 +
 kernel/crash_reserve.c   |  12 +-
 kernel/kexec_file.c  |  12 ++
 14 files changed, 127 insertions(+), 339 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crash_reserve.h

-- 
2.47.1




[PATCH RESEND v1 2/5] crash: let arch decide crash memory export to iomem_resource

2025-01-08 Thread Sourabh Jain
insert_crashkernel_resources() adds crash memory to iomem_resource if
generic crashkernel reservation is enabled on an architecture.

On PowerPC, system RAM is added to iomem_resource. See commit
c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem").

Enabling generic crashkernel reservation on PowerPC leads to a conflict
when system RAM is added to iomem_resource because a part of the system
RAM, the crashkernel memory, has already been added to iomem_resource.

The next commit in the series "powerpc/crash: use generic crashkernel
reservation" enables generic crashkernel reservation on PowerPC. If the
crashkernel is added to iomem_resource, the kernel fails to add
system RAM to /proc/iomem and prints the following traces:

CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.13.0-rc2+
snip...
NIP [c2016b3c] add_system_ram_resources+0xf0/0x15c
LR [c2016b34] add_system_ram_resources+0xe8/0x15c
Call Trace:
[c484bbc0] [c2016b34] add_system_ram_resources+0xe8/0x15c
[c484bc20] [c0010a4c] do_one_initcall+0x7c/0x39c
[c484bd00] [c2005418] do_initcalls+0x144/0x18c
[c484bd90] [c2005714] kernel_init_freeable+0x21c/0x290
[c484bdf0] [c00110f4] kernel_init+0x2c/0x1b8
[c484be50] [c000dd3c] ret_from_kernel_user_thread+0x14/0x1c

To avoid this, an architecture hook is added in
insert_crashkernel_resources(), allowing the architecture to decide
whether crashkernel memory should be added to iomem_resource.

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
 include/linux/crash_reserve.h | 11 +++
 kernel/crash_reserve.c|  3 +++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index 1fe7e7d1b214..f1205d044dae 100644
--- a/include/linux/crash_reserve.h
+++ b/include/linux/crash_reserve.h
@@ -18,6 +18,17 @@ int __init parse_crashkernel(char *cmdline, unsigned long 
long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
unsigned long long *low_size, bool *high);
 
+#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
+
+#ifndef arch_add_crash_res_to_iomem
+static inline bool arch_add_crash_res_to_iomem(void)
+{
+   return true;
+}
+#endif
+
+#endif
+
 #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
 #ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
 #define DEFAULT_CRASH_KERNEL_LOW_SIZE  (128UL << 20)
diff --git a/kernel/crash_reserve.c b/kernel/crash_reserve.c
index aff7c0fdbefa..190104f32fe1 100644
--- a/kernel/crash_reserve.c
+++ b/kernel/crash_reserve.c
@@ -460,6 +460,9 @@ void __init reserve_crashkernel_generic(unsigned long long 
crash_size,
 #ifndef HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY
 static __init int insert_crashkernel_resources(void)
 {
+   if (!arch_add_crash_res_to_iomem())
+   return 0;
+
if (crashk_res.start < crashk_res.end)
insert_resource(&iomem_resource, &crashk_res);
 
-- 
2.47.1




[PATCH RESEND v1 4/5] powerpc/crash: use generic crashkernel reservation

2025-01-08 Thread Sourabh Jain
Commit 0ab97169aa05 ("crash_core: add generic function to do
reservation") added a generic function to reserve crashkernel memory.
So let's use the same function on powerpc and remove the
architecture-specific code that essentially does the same thing.

The generic crashkernel reservation also provides a way to split the
crashkernel reservation into high and low memory reservations, which can
be enabled for powerpc in the future.

Along with moving to the generic crashkernel reservation, the code
related to finding the base address for the crashkernel has been
separated into its own function, get_crash_base(), for better
readability and maintainability.

To prevent crashkernel memory from being added to iomem_resource, the
function arch_add_crash_res_to_iomem() has been introduced. For further
details on why this should not be done for the PowerPC architecture,
please refer to the previous commit titled "crash: let arch decide crash
memory export to iomem_resource".

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
 arch/powerpc/Kconfig |  3 +
 arch/powerpc/include/asm/crash_reserve.h | 18 +
 arch/powerpc/include/asm/kexec.h |  4 +-
 arch/powerpc/kernel/prom.c   |  2 +-
 arch/powerpc/kexec/core.c| 90 ++--
 5 files changed, 63 insertions(+), 54 deletions(-)
 create mode 100644 arch/powerpc/include/asm/crash_reserve.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a0ce777f9706..8df8871215f8 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -717,6 +717,9 @@ config ARCH_SUPPORTS_CRASH_HOTPLUG
def_bool y
depends on PPC64
 
+config ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
+   def_bool CRASH_RESERVE
+
 config FA_DUMP
bool "Firmware-assisted dump"
depends on CRASH_DUMP && PPC64 && (PPC_RTAS || PPC_POWERNV)
diff --git a/arch/powerpc/include/asm/crash_reserve.h 
b/arch/powerpc/include/asm/crash_reserve.h
new file mode 100644
index ..f5e60721de41
--- /dev/null
+++ b/arch/powerpc/include/asm/crash_reserve.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_CRASH_RESERVE_H
+#define _ASM_POWERPC_CRASH_RESERVE_H
+
+/* crash kernel regions are page size aligned */
+#define CRASH_ALIGN PAGE_SIZE
+
+#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
+
+static inline bool arch_add_crash_res_to_iomem(void)
+{
+   return false;
+}
+#define arch_add_crash_res_to_iomem arch_add_crash_res_to_iomem
+#endif
+
+#endif /* _ASM_POWERPC_CRASH_RESERVE_H */
+
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 270ee93a0f7d..64741558071f 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -113,9 +113,9 @@ int setup_new_fdt_ppc64(const struct kimage *image, void 
*fdt, struct crash_mem
 
 #ifdef CONFIG_CRASH_RESERVE
 int __init overlaps_crashkernel(unsigned long start, unsigned long size);
-extern void reserve_crashkernel(void);
+extern void arch_reserve_crashkernel(void);
 #else
-static inline void reserve_crashkernel(void) {}
+static inline void arch_reserve_crashkernel(void) {}
 static inline int overlaps_crashkernel(unsigned long start, unsigned long 
size) { return 0; }
 #endif
 
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index e0059842a1c6..9ed9dde7d231 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -860,7 +860,7 @@ void __init early_init_devtree(void *params)
 */
if (fadump_reserve_mem() == 0)
 #endif
-   reserve_crashkernel();
+   arch_reserve_crashkernel();
early_reserve_mem();
 
if (memory_limit > memblock_phys_mem_size())
diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 4945b33322ae..b21cfa814492 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -80,38 +80,20 @@ void machine_kexec(struct kimage *image)
 }
 
 #ifdef CONFIG_CRASH_RESERVE
-void __init reserve_crashkernel(void)
-{
-   unsigned long long crash_size, crash_base, total_mem_sz;
-   int ret;
 
-   total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
-   /* use common parsing */
-   ret = parse_crashkernel(boot_command_line, total_mem_sz,
-   &crash_size, &crash_base, NULL, NULL);
-   if (ret == 0 && crash_size > 0) {
-   crashk_res.start = crash_base;
-   crashk_res.end = crash_base + crash_size - 1;
-   }
-
-   if (crashk_res.end == crashk_res.start) {
-   crashk_res.start = crashk_res.end = 0;
-   return;
-   }
-
-   /* We might have got these values via the command line or the
-* device tree, either way sanitise them now. */
-
-   crash_size = resource_size

[PATCH RESEND v1 1/5] crash: remove an unused argument from reserve_crashkernel_generic()

2025-01-08 Thread Sourabh Jain
cmdline argument is not used in reserve_crashkernel_generic() so remove
it. Correspondingly, all the callers have been updated as well.

No functional change intended.

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
 arch/arm64/mm/init.c  |  6 ++
 arch/loongarch/kernel/setup.c |  5 ++---
 arch/riscv/mm/init.c  |  6 ++
 arch/x86/kernel/setup.c   |  6 ++
 include/linux/crash_reserve.h | 11 +--
 kernel/crash_reserve.c|  9 -
 6 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index ccdef53872a0..2e496209af80 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -98,21 +98,19 @@ static void __init arch_reserve_crashkernel(void)
 {
unsigned long long low_size = 0;
unsigned long long crash_base, crash_size;
-   char *cmdline = boot_command_line;
bool high = false;
int ret;
 
if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
return;
 
-   ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
&crash_size, &crash_base,
&low_size, &high);
if (ret)
return;
 
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base,
-   low_size, high);
+   reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
 }
 
 static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index 56934fe58170..ece9c4266c3f 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -259,18 +259,17 @@ static void __init arch_reserve_crashkernel(void)
int ret;
unsigned long long low_size = 0;
unsigned long long crash_base, crash_size;
-   char *cmdline = boot_command_line;
bool high = false;
 
if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
return;
 
-   ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
&crash_size, &crash_base, &low_size, &high);
if (ret)
return;
 
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base, low_size, 
high);
+   reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
 }
 
 static void __init fdt_setup(void)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index fc53ce748c80..750991df9e52 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1377,21 +1377,19 @@ static void __init arch_reserve_crashkernel(void)
 {
unsigned long long low_size = 0;
unsigned long long crash_base, crash_size;
-   char *cmdline = boot_command_line;
bool high = false;
int ret;
 
if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
return;
 
-   ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
&crash_size, &crash_base,
&low_size, &high);
if (ret)
return;
 
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base,
-   low_size, high);
+   reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
 }
 
 void __init paging_init(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f1fea506e20f..15b6823556c8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -469,14 +469,13 @@ static void __init 
memblock_x86_reserve_range_setup_data(void)
 static void __init arch_reserve_crashkernel(void)
 {
unsigned long long crash_base, crash_size, low_size = 0;
-   char *cmdline = boot_command_line;
bool high = false;
int ret;
 
if (!IS_ENABLED(CONFIG_CRASH_RESERVE))
return;
 
-   ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
+   ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
&crash_size, &crash_base,
&low_size, &high);
if (ret)
@@ -487,8 +486,7 @@ static void __init arch_reserve_crashkernel(void)
return;
}
 
-   reserve_crashkernel_generic(cmdline, crash_size, crash_base,
-   low_size, high);
+   reserve_crashkernel_generic(crash_size, crash_base, low_size, high);
 }
 
 static struct resource standard_io_resources[] = {
diff --git a/include/linux/crash_reserve.h b/include/linux/crash_reserve.h
index 5a9df944fb80..1fe7e7d1b214 100644
-

[PATCH RESEND v1 5/5] crash: option to let arch decide mem range is usable

2025-01-08 Thread Sourabh Jain
On PowerPC, the memory reserved for the crashkernel can contain
components like RTAS, TCE, OPAL, etc., which should be avoided when
loading kexec segments into crashkernel memory. Due to these special
components, PowerPC has its own set of functions to locate holes in the
crashkernel memory for loading kexec segments for kdump. However, for
loading kexec segments in the kexec case, PowerPC uses generic functions
to locate holes.

So, let's use generic functions to locate memory holes for kdump on
PowerPC by adding an arch hook to handle such special regions while
loading kexec segments, and remove the PowerPC functions to locate
holes.

Cc: Andrew Morton 
Cc: Baoquan he 
Cc: Hari Bathini 
CC: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Sourabh Jain 
---
 arch/powerpc/include/asm/kexec.h  |   6 +-
 arch/powerpc/kexec/file_load_64.c | 259 ++
 include/linux/kexec.h |   9 ++
 kernel/kexec_file.c   |  12 ++
 4 files changed, 34 insertions(+), 252 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 64741558071f..5e4680f9ff35 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -95,8 +95,10 @@ int arch_kexec_kernel_image_probe(struct kimage *image, void 
*buf, unsigned long
 int arch_kimage_file_post_load_cleanup(struct kimage *image);
 #define arch_kimage_file_post_load_cleanup arch_kimage_file_post_load_cleanup
 
-int arch_kexec_locate_mem_hole(struct kexec_buf *kbuf);
-#define arch_kexec_locate_mem_hole arch_kexec_locate_mem_hole
+int arch_check_excluded_range(struct kimage *image, unsigned long start,
+ unsigned long end);
+#define arch_check_excluded_range  arch_check_excluded_range
+
 
 int load_crashdump_segments_ppc64(struct kimage *image,
  struct kexec_buf *kbuf);
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index dc65c1391157..e7ef8b2a2554 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -49,201 +49,18 @@ const struct kexec_file_ops * const kexec_file_loaders[] = 
{
NULL
 };
 
-/**
- * __locate_mem_hole_top_down - Looks top down for a large enough memory hole
- *  in the memory regions between buf_min & buf_max
- *  for the buffer. If found, sets kbuf->mem.
- * @kbuf:   Buffer contents and memory parameters.
- * @buf_min:Minimum address for the buffer.
- * @buf_max:Maximum address for the buffer.
- *
- * Returns 0 on success, negative errno on error.
- */
-static int __locate_mem_hole_top_down(struct kexec_buf *kbuf,
- u64 buf_min, u64 buf_max)
-{
-   int ret = -EADDRNOTAVAIL;
-   phys_addr_t start, end;
-   u64 i;
-
-   for_each_mem_range_rev(i, &start, &end) {
-   /*
-* memblock uses [start, end) convention while it is
-* [start, end] here. Fix the off-by-one to have the
-* same convention.
-*/
-   end -= 1;
-
-   if (start > buf_max)
-   continue;
-
-   /* Memory hole not found */
-   if (end < buf_min)
-   break;
-
-   /* Adjust memory region based on the given range */
-   if (start < buf_min)
-   start = buf_min;
-   if (end > buf_max)
-   end = buf_max;
-
-   start = ALIGN(start, kbuf->buf_align);
-   if (start < end && (end - start + 1) >= kbuf->memsz) {
-   /* Suitable memory range found. Set kbuf->mem */
-   kbuf->mem = ALIGN_DOWN(end - kbuf->memsz + 1,
-  kbuf->buf_align);
-   ret = 0;
-   break;
-   }
-   }
-
-   return ret;
-}
-
-/**
- * locate_mem_hole_top_down_ppc64 - Skip special memory regions to find a
- *  suitable buffer with top down approach.
- * @kbuf:   Buffer contents and memory parameters.
- * @buf_min:Minimum address for the buffer.
- * @buf_max:Maximum address for the buffer.
- * @emem:   Exclude memory ranges.
- *
- * Returns 0 on success, negative errno on error.
- */
-static int locate_mem_hole_top_down_ppc64(struct kexec_buf *kbuf,
- u64 buf_min, u64 buf_max,
- const struct crash_mem *emem)
+int arch_check_excluded_range(struct kimage *image, unsigned long start,
+ unsigned long end)
 {
-   int i, ret = 0, err = -EAD

Re: [PATCH v5 14/17] mm: pgtable: introduce generic __tlb_remove_table()

2025-01-08 Thread Arnd Bergmann
On Wed, Jan 8, 2025, at 07:57, Qi Zheng wrote:
> Several architectures (arm, arm64, riscv and x86) define exactly the
> same __tlb_remove_table(), just introduce generic __tlb_remove_table() to
> eliminate these duplications.
>
> The s390 __tlb_remove_table() is nearly the same, so also make s390
> __tlb_remove_table() version generic.
>
> Signed-off-by: Qi Zheng 
> Reviewed-by: Kevin Brodsky 
> Acked-by: Andreas Larsson  # sparc
> Acked-by: Alexander Gordeev  # s390

Acked-by: Arnd Bergmann  # asm-generic



Re: [PATCH RESEND v1 5/5] crash: option to let arch decide mem range is usable

2025-01-08 Thread Baoquan he
On 01/08/25 at 03:44pm, Sourabh Jain wrote:
...snip...
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index f0e9f8eda7a3..407f8b0346aa 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -205,6 +205,15 @@ static inline int 
> arch_kimage_file_post_load_cleanup(struct kimage *image)
>  }
>  #endif
>  
> +#ifndef arch_check_excluded_range
> +static inline int arch_check_excluded_range(struct kimage *image,
> + unsigned long start,
> + unsigned long end)
> +{
> + return 0;
> +}
> +#endif
> +
>  #ifdef CONFIG_KEXEC_SIG
>  #ifdef CONFIG_SIGNED_PE_FILE_VERIFICATION
>  int kexec_kernel_verify_pe_sig(const char *kernel, unsigned long kernel_len);
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index 3eedb8c226ad..52e1480dbfa1 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -464,6 +464,12 @@ static int locate_mem_hole_top_down(unsigned long start, 
> unsigned long end,
>   continue;
>   }
>  
> + /* Make sure this does not conflict exclude range */
   ^
Make sure this doesn't conflict with excluded range?

> + if (arch_check_excluded_range(image, temp_start, temp_end)) {
> + temp_start = temp_start - PAGE_SIZE;
> + continue;
> + }
> +
>   /* We found a suitable memory range */
>   break;
>   } while (1);
> @@ -498,6 +504,12 @@ static int locate_mem_hole_bottom_up(unsigned long 
> start, unsigned long end,
>   continue;
>   }
>  
> + /* Make sure this does not conflict exclude range */
   ^
 Ditto.

> + if (arch_check_excluded_range(image, temp_start, temp_end)) {
> + temp_start = temp_start + PAGE_SIZE;
> + continue;
> + }
> +
>   /* We found a suitable memory range */
>   break;
>   } while (1);
> -- 
> 2.47.1
> 




[PATCH] powerpc/64s: Rewrite __real_pte() as a static inline

2025-01-08 Thread Christophe Leroy
Rewrite __real_pte() as a static inline in order to avoid
following warning/error when building with 4k page size:

  CC  arch/powerpc/mm/book3s64/hash_tlb.o
arch/powerpc/mm/book3s64/hash_tlb.c: In function 'hpte_need_flush':
arch/powerpc/mm/book3s64/hash_tlb.c:49:16: error: variable 'offset' set 
but not used [-Werror=unused-but-set-variable]
   49 | int i, offset;
  |^~
cc1: all warnings being treated as errors

Reported-by: kernel test robot 
Closes: 
https://lore.kernel.org/oe-kbuild-all/202501081741.ayfwybsq-...@intel.com/
Fixes: ff31e105464d ("powerpc/mm/hash64: Store the slot information at the 
right offset for hugetlb")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/64/hash-4k.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h 
b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index c3efacab4b94..a7a68ba9c71b 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -77,7 +77,10 @@
 /*
  * With 4K page size the real_pte machinery is all nops.
  */
-#define __real_pte(e, p, o)((real_pte_t){(e)})
+static inline real_pte_t __real_pte(pte_t pte, pte_t *ptep, int offset)
+{
+   return (real_pte_t){pte};
+}
 #define __rpte_to_pte(r)   ((r).pte)
 #define __rpte_to_hidx(r,index)(pte_val(__rpte_to_pte(r)) >> 
H_PAGE_F_GIX_SHIFT)
 
-- 
2.47.0




[PATCH v2] perf: Fix display of kernel symbols

2025-01-08 Thread Christophe Leroy
Since commit 659ad3492b91 ("perf maps: Switch from rbtree to lazily
sorted array for addresses"), perf no longer displays kernel
symbols on powerpc, although it still detects them as kernel addresses.

# Overhead  Command Shared Object  Symbol
#   ..  . 
..
#
80.49%  Coeur main  [unknown]  [k] 0xc005f0f8
 3.91%  Coeur main  gau[.] 
engine_loop.constprop.0.isra.0
 1.72%  Coeur main  [unknown]  [k] 0xc005f11c
 1.09%  Coeur main  [unknown]  [k] 0xc01f82c8
 0.44%  Coeur main  libc.so.6  [.] epoll_wait
 0.38%  Coeur main  [unknown]  [k] 0xc0011718
 0.36%  Coeur main  [unknown]  [k] 0xc01f45c0

This is because maps__find_next_entry() now returns the current
entry instead of the next one, leading to the kernel map's end address
being mis-configured to its own start address instead of the
start address of the following map.

Fix it by really taking the next entry, and make sure that entry
follows the current one by keeping the entries sorted.

Fixes: 659ad3492b91 ("perf maps: Switch from rbtree to lazily sorted array for 
addresses")
Signed-off-by: Christophe Leroy 
Reviewed-by: Arnaldo Carvalho de Melo 
---
v2: Make sure the entries are sorted, if not sort them.
---
 tools/perf/util/maps.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/maps.c b/tools/perf/util/maps.c
index 432399cbe5dd..09c9cc326c08 100644
--- a/tools/perf/util/maps.c
+++ b/tools/perf/util/maps.c
@@ -1136,8 +1136,13 @@ struct map *maps__find_next_entry(struct maps *maps, 
struct map *map)
struct map *result = NULL;
 
down_read(maps__lock(maps));
+   while (!maps__maps_by_address_sorted(maps)) {
+   up_read(maps__lock(maps));
+   maps__sort_by_address(maps);
+   down_read(maps__lock(maps));
+   }
i = maps__by_address_index(maps, map);
-   if (i < maps__nr_maps(maps))
+   if (++i < maps__nr_maps(maps))
result = map__get(maps__maps_by_address(maps)[i]);
 
up_read(maps__lock(maps));
-- 
2.47.0




Re: [PATCH v3 01/28] module: Extend the preempt disabled section in dereference_symbol_descriptor().

2025-01-08 Thread Helge Deller

On 1/8/25 10:04, Sebastian Andrzej Siewior wrote:

dereference_symbol_descriptor() needs to obtain the module pointer
belonging to pointer in order to resolve that pointer.
The returned mod pointer is obtained under RCU-sched/ preempt_disable()
guarantees and needs to be used within this section to ensure that the
module is not removed in the meantime.

Extend the preempt_disable() section to also cover
dereference_module_function_descriptor().

Fixes: 04b8eb7a4ccd9 ("symbol lookup: introduce 
dereference_symbol_descriptor()")
Cc: James E.J. Bottomley 
Cc: Christophe Leroy 
Cc: Helge Deller 
Cc: Madhavan Srinivasan 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
Cc: Nicholas Piggin 
Cc: Sergey Senozhatsky 
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Sergey Senozhatsky 
Acked-by: Peter Zijlstra (Intel) 
Signed-off-by: Sebastian Andrzej Siewior 


Nice catch.

Acked-by: Helge Deller 

This patch really should be backported.
Can you add a Cc: stable tag?

Helge



---
  include/linux/kallsyms.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/kallsyms.h b/include/linux/kallsyms.h
index c3f075e8f60cb..1c6a6c1704d8d 100644
--- a/include/linux/kallsyms.h
+++ b/include/linux/kallsyms.h
@@ -57,10 +57,10 @@ static inline void *dereference_symbol_descriptor(void *ptr)

preempt_disable();
mod = __module_address((unsigned long)ptr);
-   preempt_enable();

if (mod)
ptr = dereference_module_function_descriptor(mod, ptr);
+   preempt_enable();
  #endif
return ptr;
  }





Re: [PATCH RESEND v1 2/5] crash: let arch decide crash memory export to iomem_resource

2025-01-08 Thread Baoquan he
Hi,

On 01/08/25 at 03:44pm, Sourabh Jain wrote:
> insert_crashkernel_resources() adds crash memory to iomem_resource if
> generic crashkernel reservation is enabled on an architecture.
> 
> On PowerPC, system RAM is added to iomem_resource. See commit
> c40dd2f766440 ("powerpc: Add System RAM to /proc/iomem").
> 
> Enabling generic crashkernel reservation on PowerPC leads to a conflict
> when system RAM is added to iomem_resource because a part of the system
> RAM, the crashkernel memory, has already been added to iomem_resource.
> 
> The next commit in the series "powerpc/crash: use generic crashkernel
> reservation" enables generic crashkernel reservation on PowerPC. If the
> crashkernel is added to iomem_resource, the kernel fails to add
> system RAM to /proc/iomem and prints the following traces:
> 
> CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.13.0-rc2+
> snip...
> NIP [c2016b3c] add_system_ram_resources+0xf0/0x15c
> LR [c2016b34] add_system_ram_resources+0xe8/0x15c
> Call Trace:
> [c484bbc0] [c2016b34] add_system_ram_resources+0xe8/0x15c
> [c484bc20] [c0010a4c] do_one_initcall+0x7c/0x39c
> [c484bd00] [c2005418] do_initcalls+0x144/0x18c
> [c484bd90] [c2005714] kernel_init_freeable+0x21c/0x290
> [c484bdf0] [c00110f4] kernel_init+0x2c/0x1b8
> [c484be50] [c000dd3c] ret_from_kernel_user_thread+0x14/0x1c
> 
> To avoid this, an architecture hook is added in
> insert_crashkernel_resources(), allowing the architecture to decide
> whether crashkernel memory should be added to iomem_resource.

Have you tried defining HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY in ppc to
add the crashkernel region to iomem early? Now there are two branches in the
existing code; adding a hook would make a third.




Re: [PATCH v5 03/25] fs/dax: Don't skip locked entries when scanning entries

2025-01-08 Thread Dan Williams
Alistair Popple wrote:
> Several functions internal to FS DAX use the following pattern when
> trying to obtain an unlocked entry:
> 
> xas_for_each(&xas, entry, end_idx) {
>   if (dax_is_locked(entry))
>   entry = get_unlocked_entry(&xas, 0);
> 
> This is problematic because get_unlocked_entry() will get the next
> present entry in the range, and the next entry may not be
> locked. Therefore any processing of the original locked entry will be
> skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy
> pages in the range, leading file systems to free blocks whilst DMA
> operations are ongoing which can lead to file system corruption.
> 
> Instead callers from within a xas_for_each() loop should be waiting
> for the current entry to be unlocked without advancing the XArray
> state so a new function is introduced to wait.

Oh wow, good eye!

Did this trip up an xfstest, or did you see this purely by inspection?

> Also while we are here rename get_unlocked_entry() to
> get_next_unlocked_entry() to make it clear that it may advance the
> iterator state.

Outside of the above clarification of how found / end user effect you
can add:

Reviewed-by: Dan Williams 



Re: [PATCH] of: address: Unify resource bounds overflow checking

2025-01-08 Thread Rob Herring
On Wed, Jan 8, 2025 at 8:04 AM Basharath Hussain Khaja
 wrote:
>
> Hi,
>
> >> Thomas Weißschuh  writes:
> >> > The members "start" and "end" of struct resource are of type
> >> > "resource_size_t" which can be 32bit wide.
> >> > Values read from OF however are always 64bit wide.
> >> >
> >> > Refactor the diff overflow checks into a helper function.
> >> > Also extend the checks to validate each calculation step.
> >> >
> >> > Signed-off-by: Thomas Weißschuh 
> >> > ---
> >> >  drivers/of/address.c | 45 ++---
> >> >  1 file changed, 26 insertions(+), 19 deletions(-)
> >> >
> >> > diff --git a/drivers/of/address.c b/drivers/of/address.c
> >> > index 7e59283a4472..df854bb427ce 100644
> >> > --- a/drivers/of/address.c
> >> > +++ b/drivers/of/address.c
> >> > @@ -198,6 +198,25 @@ static u64 of_bus_pci_map(__be32 *addr, const 
> >> > __be32 *range, int na, int ns,
> >> >
> >> >  #endif /* CONFIG_PCI */
> >> >
> >> > +static int __of_address_resource_bounds(struct resource *r, u64 start, 
> >> > u64 size)
> >> > +{
> >> > + u64 end = start;
> >> > +
> >> > + if (overflows_type(start, r->start))
> >> > + return -EOVERFLOW;
> >> > + if (size == 0)
> >> > + return -EOVERFLOW;
> >> > + if (check_add_overflow(end, size - 1, &end))
> >> > + return -EOVERFLOW;
> >> > + if (overflows_type(end, r->end))
> >> > + return -EOVERFLOW;
> >>
> >> This breaks PCI on powerpc qemu. Part of the PCI probe reads a resource
> >> that's zero sized, which used to succeed but now fails due to the size
> >> check above.
> >>
> >> The diff below fixes it for me.
> >
> > I fixed it up with your change.
>
>
> This commit is breaking Ethernet functionality on the TI AM57xx platform due 
> to zero byte SRAM block size allocation during initialization. Prior to this 
> patch, zero byte block sizes were handled properly.

What driver and where exactly?

Rob



Re: [PATCH v8 3/7] PCI: Make pcie_read_tlp_log() signature same

2025-01-08 Thread Bjorn Helgaas
On Wed, Jan 08, 2025 at 03:40:11PM -0500, Yazen Ghannam wrote:
> On Wed, Dec 18, 2024 at 04:37:43PM +0200, Ilpo Järvinen wrote:
> > pcie_read_tlp_log()'s prototype and function signature diverged due to
> > changes made while applying.
> > 
> > Make the parameters of pcie_read_tlp_log() named identically.
> > 
> > Signed-off-by: Ilpo Järvinen 
> > Reviewed-by: Jonathan Cameron 
> 
> Reviewed-by: Yazen Ghannam 
> 
> Just wondering, could this be squashed into the previous patch? Or is
> the goal to be strict with one logical change per patch?

I haven't looked closely enough to opine on whether this should be
squashed, but I generally do prefer one logical change per patch.

It's much easier for me to squash things when merging than it is to
untangle things that should be separate patches.

Bjorn



Re: [PATCH v5 01/25] fuse: Fix dax truncate/punch_hole fault path

2025-01-08 Thread Dan Williams
Alistair Popple wrote:
> FS DAX requires file systems to call into the DAX layout prior to
> unlinking inodes to ensure there is no ongoing DMA or other remote
> access to the direct mapped page. The fuse file system implements
> fuse_dax_break_layouts() to do this which includes a comment
> indicating that passing dmap_end == 0 leads to unmapping of the whole
> file.
> 
> However this is not true - passing dmap_end == 0 will not unmap
> anything before dmap_start, and furthermore
> dax_layout_busy_page_range() will not scan any of the range to see if
> there maybe ongoing DMA access to the range.

It would be useful to clarify that this bug was found by inspection
and that there are no known end-user reports of trouble, but that the
failure mode would look like random fs corruption. The window is hard to
hit because a block needs to be truncated, reallocated to
a file, and written to before stale DMA could corrupt it. So that may
contribute to the fact that fuse-dax users have not reported an issue
since v5.10.

> Fix this by checking for dmap_end == 0 in fuse_dax_break_layouts() and
> pass the entire file range to dax_layout_busy_page_range().

That's not what this patch does, maybe a rebase error that pushed the
@dmap_end fixup after the call to dax_layout_busy_page_range?

However, I don't think this is quite the right fix, more below...

> Signed-off-by: Alistair Popple 
> Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path")
> Cc: Vivek Goyal 
> 
> ---
> 
> I am not at all familiar with the fuse file system driver so I have no
> idea if the comment is relevant or not and whether the documented
> behaviour for dmap_end == 0 is ever relied upon. However this seemed
> like the safest fix unless someone more familiar with fuse can confirm
> that dmap_end == 0 is never used.

It is used in several places and has been wrong since day one. I believe
the original commit simply misunderstood that
dax_layout_busy_page_range() semantics are analogous to
invalidate_inode_pages2_range() semantics in terms of what @start and
@end mean.

You can add:

Co-developed-by: Dan Williams 
Signed-off-by: Dan Williams 

...if you end up doing a resend, or I will add it on applying to
nvdimm.git if the rebase does not end up being too prickly.

-- 8< --
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index c5d1feaa239c..455c4a16080b 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
0, 0, fuse_wait_dax_page(inode));
 }
 
-/* dmap_end == 0 leads to unmapping of whole file */
 int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
  u64 dmap_end)
 {
@@ -693,10 +692,6 @@ int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
ret = __fuse_dax_break_layouts(inode, &retry, dmap_start,
   dmap_end);
} while (ret == 0 && retry);
-   if (!dmap_end) {
-   dmap_start = 0;
-   dmap_end = LLONG_MAX;
-   }
 
return ret;
 }
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 0b2f8567ca30..bc6c8936c529 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
if (FUSE_IS_DAX(inode) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err) {
filemap_invalidate_unlock(mapping);
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 082ee374f694..cef7a8f75821 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file)
 
if (dax_truncate) {
filemap_invalidate_lock(inode->i_mapping);
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out_inode_unlock;
}
@@ -2890,7 +2890,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
inode_lock(inode);
if (block_faults) {
filemap_invalidate_lock(inode->i_mapping);
-   err = fuse_dax_break_layouts(inode, 0, 0);
+   err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out;
}



Re: [PATCH v5 02/25] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()

2025-01-08 Thread Dan Williams
Alistair Popple wrote:
> dax_layout_busy_page_range() is used by file systems to scan the DAX
> page-cache to unmap mapping pages from user-space and to determine if
> any pages in the given range are busy, either due to ongoing DMA or
> other get_user_pages() usage.
> 
> Currently it checks to see whether the file mapping is mapped into user-space
> with mapping_mapped() and returns early if not, skipping the check for
> DMA busy pages. This is wrong as pages may still be undergoing DMA
> access even if they have subsequently been unmapped from
> user-space. Fix this by dropping the check for mapping_mapped().
> 
> Signed-off-by: Alistair Popple 
> Suggested-by: Dan Williams 

Reviewed-by: Dan Williams