Re: BUG: KASAN: vmalloc-out-of-bounds in copy_to_kernel_nofault+0xd8/0x1c8 (v6.13-rc6, PowerMac G4)
Le 22/01/2025 à 00:21, Erhard Furtner a écrit :

On Tue, 21 Jan 2025 23:07:25 +0100 Christophe Leroy wrote:

Meanwhile I bisected the bug. Offending commit is:

# git bisect good
32913f348229c9f72dda45fc2c08c6d9dfcd3d6d is the first bad commit
commit 32913f348229c9f72dda45fc2c08c6d9dfcd3d6d
Author: Linus Torvalds
Date: Mon Dec 9 10:00:25 2024 -0800

    futex: fix user access on powerpc

    The powerpc user access code is special, and unlike other architectures
    distinguishes between user access for reading and writing. And commit
    43a43faf5376 ("futex: improve user space accesses") messed that up. It
    went undetected elsewhere, but caused ppc32 to fail early during boot,
    because the user access had been started with user_read_access_begin(),
    but then finished off with just a plain "user_access_end()".

    Note that the address-masking user access helpers don't even have that
    read-vs-write distinction, so if powerpc ever wants to do address
    masking tricks, we'll have to do some extra work for it.

    [ Make sure to also do it for the EFAULT case, as pointed out by
      Christophe Leroy ]

    Reported-by: Andreas Schwab
    Cc: Christophe Leroy
    Link: https://lore.kernel.org/all/87bjxl6b0i.fsf@igel.home/
    Signed-off-by: Linus Torvalds

 kernel/futex/futex.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Indeed, reverting 32913f348229c9f72dda45fc2c08c6d9dfcd3d6d on top of v6.13 makes the KASAN hit disappear.

That looks terribly odd.

On G4, user_read_access_begin() and user_read_access_end() are no-ops because book3s/32 can only protect user access by the kernel against write. Read is always granted.
So the bug must be an indirect side effect of what user_access_end() does. user_access_end() does a sync. Would the lack of sync (once user_access_end() is replaced by user_read_access_end()) lead to some odd re-ordering?

Or another possibility is that user_access_end() is called on some kernel address (I see in the description of commit 43a43faf5376 ("futex: improve user space accesses") that the replaced __get_user() was expected to work on kernel addresses). Calling user_access_begin() and user_access_end() on a kernel address is unexpected and there is no guard, so it could lead to strange segment settings which hide a KASAN hit. But once we fix the issue, the KASAN hit resurfaces. Could this be the problem?

Do you have a way to reproduce the bug on QEMU? It would enable me to investigate it further.

Attached v6.13 .config plays nicely with qemu ttyS0 (forgot to disable SERIAL_8250 and set SERIAL_PMACZILOG + SERIAL_PMACZILOG_CONSOLE instead as I prefer the PCI Serial card in my G4). The KASAN hit also shows up on qemu 8.2.7 via:

qemu-system-ppc -machine mac99,via=pmu -cpu 7450 -m 2G -nographic -append console=ttyS0 -kernel vmlinux-6.13.0-PMacG4 -hda Debian-VM_g4.img

I was able to reproduce it with v6.13 with QEMU when loading the test_bpf module. On my side, the problem doesn't disappear when reverting commit 32913f348229 ("futex: fix user access on powerpc"). I bisected it to commit e4137f08816b ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault"), which makes a lot more sense to me. It might be a problem in the way patch_instruction() is implemented on powerpc, to be investigated.

Christophe
Re: Re: [PATCH v2] treewide: const qualify ctl_tables where applicable
On Tue, Jan 21, 2025 at 02:40:16PM +0100, Alexander Gordeev wrote:
> On Fri, Jan 10, 2025 at 03:16:08PM +0100, Joel Granados wrote:
>
> Hi Joel,
>
> > Add the const qualifier to all the ctl_tables in the tree except for
> > watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
> > loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
> > drivers/infiniband dirs). These are special cases as they use a
> > registration function with a non-const qualified ctl_table argument or
> > modify the arrays before passing them on to the registration function.
> >
> > Constifying ctl_table structs will prevent the modification of
> > proc_handler function pointers as the arrays would reside in .rodata.
> > This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
> > constify the ctl_table argument of proc_handlers") constified all the
> > proc_handlers.
>
> I could identify at least these occurrences in s390 code as well:

Hey Alexander

Thx for bringing these to my attention. I had completely missed them as the spatch only deals with ctl_tables outside functions.

Short answer: These should not be included in the current patch because they are a different pattern from how sysctl tables are usually used. So I will not include them.

With that said, I think it might be interesting to look closer at them as they seem to be complicating the proc_handler (I have to look at them closer). I see that they are defining a ctl_table struct within the functions and just using the data (from the incoming ctl_table) to forward things down to proc_do{u,}intvec_* functions. This is very odd and I have only seen it done in order to change the incoming ctl_table (which is not what is being done here).

I will take a closer look after the merge window and circle back with more info. Might take me a while as I'm not very familiar with s390 code; any additional information on why those are being used inside the functions would be helpful.
Best

> diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
> index dd7ba7587dd5..9b83c318f919 100644
> --- a/arch/s390/appldata/appldata_base.c
> +++ b/arch/s390/appldata/appldata_base.c
> @@ -204,7 +204,7 @@ appldata_timer_handler(const struct ctl_table *ctl, int write,
>  {
>  	int timer_active = appldata_timer_active;
>  	int rc;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.procname = ctl->procname,
>  		.data = &timer_active,
>  		.maxlen = sizeof(int),
> @@ -237,7 +237,7 @@ appldata_interval_handler(const struct ctl_table *ctl, int write,
>  {
>  	int interval = appldata_interval;
>  	int rc;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.procname = ctl->procname,
>  		.data = &interval,
>  		.maxlen = sizeof(int),
> @@ -269,7 +269,7 @@ appldata_generic_handler(const struct ctl_table *ctl, int write,
>  	struct list_head *lh;
>  	int rc, found;
>  	int active;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.data = &active,
>  		.maxlen = sizeof(int),
>  		.extra1 = SYSCTL_ZERO,
> diff --git a/arch/s390/kernel/hiperdispatch.c b/arch/s390/kernel/hiperdispatch.c
> index 7857a7e8e56c..7d0ba16085c1 100644
> --- a/arch/s390/kernel/hiperdispatch.c
> +++ b/arch/s390/kernel/hiperdispatch.c
> @@ -273,7 +273,7 @@ static int hiperdispatch_ctl_handler(const struct ctl_table *ctl, int write,
>  {
>  	int hiperdispatch;
>  	int rc;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.procname = ctl->procname,
>  		.data = &hiperdispatch,
>  		.maxlen = sizeof(int),
> diff --git a/arch/s390/kernel/topology.c b/arch/s390/kernel/topology.c
> index 6691808bf50a..26e50de83d80 100644
> --- a/arch/s390/kernel/topology.c
> +++ b/arch/s390/kernel/topology.c
> @@ -629,7 +629,7 @@ static int topology_ctl_handler(const struct ctl_table *ctl, int write,
>  	int enabled = topology_is_enabled();
>  	int new_mode;
>  	int rc;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.procname = ctl->procname,
>  		.data = &enabled,
>  		.maxlen = sizeof(int),
> @@ -658,7 +658,7 @@ static int polarization_ctl_handler(const struct ctl_table *ctl, int write,
>  {
>  	int polarization;
>  	int rc;
> -	struct ctl_table ctl_entry = {
> +	const struct ctl_table ctl_entry = {
>  		.procname = ctl->procname,
>  		.data = &polarization,
>  		.maxlen = sizeof(int),
> diff --git a/arch/s390/mm/cmm.c b/arch/s390/mm/cmm.c
> index 939e3bec2db7..8
Re: [PATCH v2 0/2] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
Hi, all,

Gentle ping.

Best Regards,
Shuai

在 2024/12/24 19:03, Shuai Xue 写道:

在 2024/11/12 21:54, Shuai Xue 写道:

changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and downstream ports per Keith

The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error occurs, the AER and DPC drivers do not report specific error types, resulting in logs like:

pcieport :30:03.0: EDR: EDR event received
pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
pcieport :30:03.0: DPC: ERR_FATAL detected
pcieport :30:03.0: AER: broadcast error_detected message
nvme nvme0: frozen state error detected, reset controller
nvme :34:00.0: ready 0ms after DPC
pcieport :30:03.0: AER: broadcast slot_reset message

AER status registers are sticky and Write-1-to-clear. If the link recovered after hot reset, we can still safely access the AER status of the error device. In such a case, report fatal errors, which helps to figure out the error root cause.

- Patch 1/2 identifies the error device by the SOURCE ID register
- Patch 2/2 reports the AER status if the link recovered.
After this patch set, the logs look like:

pcieport :30:03.0: EDR: EDR event received
pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
pcieport :30:03.0: DPC: ERR_FATAL detected
pcieport :30:03.0: AER: broadcast error_detected message
nvme nvme0: frozen state error detected, reset controller
pcieport :30:03.0: waiting 100 ms for downstream link, after activation
nvme :34:00.0: ready 0ms after DPC
nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID)
nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0: [ 4] DLP (First)
pcieport :30:03.0: AER: broadcast slot_reset message

Shuai Xue (2):
  PCI/DPC: Run recovery on device that detected the error
  PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

 drivers/pci/pci.h      |  5 +++--
 drivers/pci/pcie/aer.c | 11 +++
 drivers/pci/pcie/dpc.c | 32 +---
 drivers/pci/pcie/edr.c | 35 ++-
 drivers/pci/pcie/err.c |  9 +
 5 files changed, 62 insertions(+), 30 deletions(-)

Hi, all,

Gentle ping.

Best Regards,
Shuai
Re: Re: [PATCH v2] treewide: const qualify ctl_tables where applicable
On Wed, 22 Jan 2025 at 13:25, Joel Granados wrote:
>
> On Tue, Jan 21, 2025 at 02:40:16PM +0100, Alexander Gordeev wrote:
> > On Fri, Jan 10, 2025 at 03:16:08PM +0100, Joel Granados wrote:
> >
> > Hi Joel,
> >
> > > Add the const qualifier to all the ctl_tables in the tree except for
> > > watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
> > > loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
> > > drivers/infiniband dirs). These are special cases as they use a
> > > registration function with a non-const qualified ctl_table argument or
> > > modify the arrays before passing them on to the registration function.
> > >
> > > Constifying ctl_table structs will prevent the modification of
> > > proc_handler function pointers as the arrays would reside in .rodata.
> > > This is made possible after commit 78eb4ea25cd5 ("sysctl: treewide:
> > > constify the ctl_table argument of proc_handlers") constified all the
> > > proc_handlers.
> >
> > I could identify at least these occurrences in s390 code as well:
>
> Hey Alexander
>
> Thx for bringing these to my attention. I had completely missed them as
> the spatch only deals with ctl_tables outside functions.
>
> Short answer:
> These should not be included in the current patch because they are a
> different pattern from how sysctl tables are usually used. So I will not
> include them.
>
> With that said, I think it might be interesting to look closer at them
> as they seem to be complicating the proc_handler (I have to look at them
> closer).
>
> I see that they are defining a ctl_table struct within the functions and
> just using the data (from the incoming ctl_table) to forward things down
> to proc_do{u,}intvec_* functions. This is very odd and I have only seen
> it done in order to change the incoming ctl_table (which is not what is
> being done here).
>
> I will take a closer look after the merge window and circle back with
> more info.
> Might take me a while as I'm not very familiar with s390
> code; any additional information on why those are being used inside the
> functions would be helpful.
>

Using const data on the stack is not as useful, because the stack is always mapped writable. Global data structures marked 'const' will be moved into an ELF section that is typically mapped read-only in its entirety, and so the data cannot be modified by writing to it directly. No such protection is possible for the stack, and so the constness there is only enforced at compile time.
Re: [PATCH] powerpc/pseries/iommu: Don't unset window if it was never set
On Mon, 13 Jan 2025 03:48:55 +, Shivaprasad G Bhat wrote: > On pSeries, when user attempts to use the same vfio container used by > different iommu group, the spapr_tce_set_window() returns -EPERM > and the subsequent cleanup leads to the below crash. > >Kernel attempted to read user page (308) - exploit attempt? >BUG: Kernel NULL pointer dereference on read at 0x0308 >Faulting instruction address: 0xc01ce358 >Oops: Kernel access of bad area, sig: 11 [#1] >NIP: c01ce358 LR: c01ce05c CTR: c005add0 > >NIP [c01ce358] spapr_tce_unset_window+0x3b8/0x510 >LR [c01ce05c] spapr_tce_unset_window+0xbc/0x510 >Call Trace: > spapr_tce_unset_window+0xbc/0x510 (unreliable) > tce_iommu_attach_group+0x24c/0x340 [vfio_iommu_spapr_tce] > vfio_container_attach_group+0xec/0x240 [vfio] > vfio_group_fops_unl_ioctl+0x548/0xb00 [vfio] > sys_ioctl+0x754/0x1580 > system_call_exception+0x13c/0x330 > system_call_vectored_common+0x15c/0x2ec > >--- interrupt: 3000 > > [...] Applied to powerpc/next. [1/1] powerpc/pseries/iommu: Don't unset window if it was never set https://git.kernel.org/powerpc/c/17391cb2613b82f8c405570fea605af3255ff8d2 Thanks
Re: [PATCH v4 RESEND] powerpc/pseries/eeh: Fix get PE state translation
On Thu, 16 Jan 2025 04:39:54 -0600, Narayana Murty N wrote: > The PE Reset State "0" returned by RTAS calls > "ibm_read_slot_reset_[state|state2]" indicates that the reset is > deactivated and the PE is in a state where MMIO and DMA are allowed. > However, the current implementation of "pseries_eeh_get_state()" does > not reflect this, causing drivers to incorrectly assume that MMIO and > DMA operations cannot be resumed. > > [...] Applied to powerpc/next. [1/1] powerpc/pseries/eeh: Fix get PE state translation https://git.kernel.org/powerpc/c/11b93559000c686ad7e5ab0547e76f21cc143844 Thanks
Re: [PATCH v2] powerpc: increase MIN RMA size for CAS negotiation
On Fri, 06 Dec 2024 12:25:45 +0530, Avnish Chouhan wrote:
> Change RMA size from 512 MB to 768 MB which will result
> in more RMA at boot time for PowerPC. When a PowerPC LPAR uses vTPM,
> Secure Boot or FADump, the 512 MB RMA memory is not sufficient for
> booting. With this 512 MB RMA, GRUB2 runs out of memory and is unable
> to load the necessary modules. Sometimes even usage of a CDROM, which
> requires more memory for installation along with the options mentioned
> above, strains the boot memory and results in boot failures. Increasing
> the RMA size resolves multiple out-of-memory issues observed on PowerPC.
>
> [...]

Applied to powerpc/next.

[1/1] powerpc: increase MIN RMA size for CAS negotiation
      https://git.kernel.org/powerpc/c/ae908b87b6bb32c170e9baf5858f2a7553cacc06

Thanks
[PATCH v2] fs: introduce getfsxattrat and setfsxattrat syscalls
From: Andrey Albershteyn

Introduce getfsxattrat and setfsxattrat syscalls to manipulate inode extended attributes/flags. The syscalls take a parent directory FD and the path to the child, together with struct fsxattr.

This is an alternative to the FS_IOC_FSSETXATTR ioctl, with the difference that the file doesn't need to be opened. By having this we can manipulate inode extended attributes not only on regular files but also on special ones. This is not possible with the FS_IOC_FSSETXATTR ioctl, as opening special files returns the VFS special inode instead of the underlying filesystem one.

This patch adds two new syscalls which allow userspace to set extended inode attributes on special files by using the parent directory to open the FS inode. Also, as vfs_fileattr_set() will now be called on special files too, let's forbid any other attributes except projid and nextents (a symlink can have an extent).

CC: linux-...@vger.kernel.org
CC: linux-fsde...@vger.kernel.org
CC: linux-...@vger.kernel.org
Signed-off-by: Andrey Albershteyn
---
v1: https://lore.kernel.org/linuxppc-dev/20250109174540.893098-1-aalbe...@kernel.org/
Previous discussion: https://lore.kernel.org/linux-xfs/20240520164624.665269-2-aalbe...@redhat.com/

XFS has project quotas which can be attached to a directory. All new inodes in these directories inherit the project ID set on the parent directory. The project is created from userspace by opening and calling FS_IOC_FSSETXATTR on each inode. This is not possible for special files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left with an empty project ID. Those inodes then are not shown in the quota accounting but still exist in the directory. Moreover, in the case when special files are created in a directory with an already existing project quota, these inodes inherit the extended attributes. This then leaves them with these attributes without the possibility to clear them out. This, in turn, prevents userspace from re-creating the quota project on these existing files.
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  2 +
 arch/arm/tools/syscall.tbl                  |  2 +
 arch/arm64/tools/syscall_32.tbl             |  2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  2 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  2 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  2 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  2 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  2 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  2 +
 arch/s390/kernel/syscalls/syscall.tbl       |  2 +
 arch/sh/kernel/syscalls/syscall.tbl         |  2 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  2 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  2 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  2 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  2 +
 fs/inode.c                                  | 99 +
 fs/ioctl.c                                  | 16 -
 include/linux/fileattr.h                    |  1 +
 include/linux/syscalls.h                    |  4 ++
 include/uapi/asm-generic/unistd.h           |  8 ++-
 21 files changed, 157 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index c59d53d6d3f3490f976ca179ddfe02e69265ae4d..4b9e687494c16b60c6fd6ca1dc4d6564706a7e25 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -506,3 +506,5 @@
 574	common	getxattrat	sys_getxattrat
 575	common	listxattrat	sys_listxattrat
 576	common	removexattrat	sys_removexattrat
+577	common	getfsxattrat	sys_getfsxattrat
+578	common	setfsxattrat	sys_setfsxattrat
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 49eeb2ad8dbd8e074c6240417693f23fb328afa8..66466257f3c2debb3e2299f0b608c6740c98cab2 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -481,3 +481,5 @@
 464	common	getxattrat	sys_getxattrat
 465	common	listxattrat	sys_listxattrat
 466	common	removexattrat	sys_removexattrat
+467	common	getfsxattrat	sys_getfsxattrat
+468	common	setfsxattrat	sys_setfsxattrat
diff --git a/arch/arm64/tools/syscall_32.tbl b/arch/arm64/tools/syscall_32.tbl
index 69a829912a05eb8a3e21ed701d1030e31c0148bc..9c516118b154811d8d11d5696f32817430320dbf 100644
--- a/arch/arm64/tools/syscall_32.tbl
+++ b/arch/arm64/tools/syscall_32.tbl
@@ -478,3 +478,5 @@
 464	common	getxattrat	sys_getxattrat
 465	common	listxattrat	sys_listxattrat
 466	common	removexattrat	sys_removexattrat
+467	common	getfsxattrat	sys_getfsxattrat
+468	common	setfsxattrat	sys_setfsxattrat
diff --git a/arch/m68k/kernel/syscalls/
Re: [PATCH] mm/hugetlb: bring gigantic page allocation under hugepages_supported()
On Tue, 21 Jan 2025 20:34:19 +0530 Sourabh Jain wrote: > Despite having kernel arguments to enable gigantic hugepages, this > provides a way for the architecture to disable gigantic hugepages on the > fly, similar to what we do for hugepages. > > Components like fadump (PowerPC-specific) need this functionality to > disable gigantic hugepages when the kernel is booted solely to collect > the kernel core dump. > > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Heiko Carstens > Cc: Vasily Gorbik > Cc: Muchun Song > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: linux...@kvack.org > Cc: linux-ker...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Signed-off-by: Sourabh Jain > --- > > To evaluate the impact of this change on architectures other than > PowerPC, I did the following analysis: > > For architectures where hugepages_supported() is not redefined, it > depends on HPAGE_SHIFT, which is found to be a constant. It is mostly > initialized to PMD_SHIFT. > > Architecture : HPAGE_SHIFT initialized with > > ARC: PMD_SHIFT (constant) > ARM: PMD_SHIFT (constant) > ARM64: PMD_SHIFT (constant) > Hexagon: 22 (constant) > LoongArch: (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant) > MIPS: (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant) > PARISC: PMD_SHIFT (appears to be constant) > RISC-V: PMD_SHIFT (constant) > SH: 16 | 18 | 20 | 22 | 26 (constant) > SPARC: 23 (constant) > > So seems like this change shouldn't have any impact on above > architectures. > > On the S390 and X86 architectures, hugepages_supported() is redefined, > and I am uncertain at what point it is safe to call > hugepages_supported(). For s390, hugepages_supported() checks EDAT1 machine flag, which is initialized long before any initcalls. So it is safe to be called here. My common code hugetlb skills got a little rusty, but shouldn't arch_hugetlb_valid_size() already prevent getting here for gigantic hugepages, in case they are not supported? 
And could you not use that for your purpose?

BTW, please also add the arch mailing list for such questions.

> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index cec4b121193f..48b42b8d26b4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4629,7 +4629,7 @@ static int __init hugepages_setup(char *s)
>  	 * But we need to allocate gigantic hstates here early to still
>  	 * use the bootmem allocator.
>  	 */
> -	if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
> +	if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate) &&
> +	    hugepages_supported())
>  		hugetlb_hstate_alloc_pages(parsed_hstate);
>
>  	last_mhp = mhp;
Re: [PATCH 5.10] ibmvnic: Add tx check to prevent skb leak
On 1/20/2025 4:46 AM, Denis Arefev wrote:
> From: Nick Child
>
> commit 0983d288caf984de0202c66641577b739caad561 upstream.
>
> Below is a summary of how the driver stores a reference to an skb during
> transmit:
>   tx_buff[free_map[consumer_index]]->skb = new_skb;
>   free_map[consumer_index] = IBMVNIC_INVALID_MAP;
>   consumer_index ++;
> Where variable data looks like this:
>   free_map == [4, IBMVNIC_INVALID_MAP, IBMVNIC_INVALID_MAP, 0, 3]
>                   consumer_index^
>   tx_buff == [skb=null, skb=, skb=, skb=null, skb=null]
>
> The driver has checks to ensure that free_map[consumer_index] pointed to
> a valid index but there was no check to ensure that this index pointed
> to an unused/null skb address. So, if, by some chance, our free_map and
> tx_buff lists become out of sync then we were previously risking an
> skb memory leak. This could then cause tcp congestion control to stop
> sending packets, eventually leading to ETIMEDOUT.
>
> Therefore, add a conditional to ensure that the skb address is null. If
> not then warn the user (because this is still a bug that should be
> patched) and free the old pointer to prevent memleak/tcp problems.
>
> Signed-off-by: Nick Child
> Signed-off-by: Paolo Abeni
> [Denis: minor fix to resolve merge conflict.]
> Signed-off-by: Denis Arefev
> ---

I thought the process asked to have the stable tag, i.e. Cc: # 5.10.x

Anyways, this looks good to me, and seems like a good candidate for backporting.
Reviewed-by: Jacob Keller

Thanks,
Jake

> Backport fix for CVE-2024-41066
> Link: https://nvd.nist.gov/vuln/detail/CVE-2024-41066
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 12 
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index 84da6ccaf339..439796975cbf 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -1625,6 +1625,18 @@ static netdev_tx_t ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
>  		(tx_pool->consumer_index + 1) % tx_pool->num_buffers;
>
>  	tx_buff = &tx_pool->tx_buff[index];
> +
> +	/* Sanity checks on our free map to make sure it points to an index
> +	 * that is not being occupied by another skb. If skb memory is
> +	 * not freed then we see congestion control kick in and halt tx.
> +	 */
> +	if (unlikely(tx_buff->skb)) {
> +		dev_warn_ratelimited(dev, "TX free map points to untracked skb (%s %d idx=%d)\n",
> +				     skb_is_gso(skb) ? "tso_pool" : "tx_pool",
> +				     queue_num, bufidx);
> +		dev_kfree_skb_any(tx_buff->skb);
> +	}
> +
>  	tx_buff->skb = skb;
>  	tx_buff->data_dma[0] = data_dma_addr;
>  	tx_buff->data_len[0] = skb->len;
Re: BUG: KASAN: vmalloc-out-of-bounds in copy_to_kernel_nofault+0xd8/0x1c8 (v6.13-rc6, PowerMac G4)
Le 22/01/2025 à 16:32, Christophe Leroy a écrit :

Le 22/01/2025 à 00:21, Erhard Furtner a écrit :

[...]

I was able to reproduce it with v6.13 with QEMU when loading the test_bpf module. On my side, the problem doesn't disappear when reverting commit 32913f348229 ("futex: fix user access on powerpc"). I bisected it to commit e4137f08816b ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault"), which makes a lot more sense to me. It might be a problem in the way patch_instruction() is implemented on powerpc, to be investigated.
I think the problem is commit 37bc3e5fd764 ("powerpc/lib/code-patching: Use alternate map for patch_instruction()") Can you try the change below: diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c index af97fbb3c257..8a378fc19074 100644 --- a/arch/powerpc/lib/code-patching.c +++ b/arch/powerpc/lib/code-patching.c @@ -108,7 +108,7 @@ static int text_area_cpu_up(unsigned int cpu) unsigned long addr; int err; - area = get_vm_area(PAGE_SIZE, VM_ALLOC); + area = get_vm_area(PAGE_SIZE, 0); if (!area) { WARN_ONCE(1, "Failed to create text area for cpu %d\n", cpu);
Re: [PATCH v2] powerpc: increase MIN RMA size for CAS negotiation
Hello Maddy,

On 22/01/25 18:13, Madhavan Srinivasan wrote:

On Fri, 06 Dec 2024 12:25:45 +0530, Avnish Chouhan wrote:

[...]

Applied to powerpc/next.

[1/1] powerpc: increase MIN RMA size for CAS negotiation
      https://git.kernel.org/powerpc/c/ae908b87b6bb32c170e9baf5858f2a7553cacc06

The above changes break the fadump additional parameter feature on the HASH MMU. Here is the proposed fix:

https://lore.kernel.org/all/20250120173501.1147236-1-sourabhj...@linux.ibm.com/

The proposed fix ensures that the memory reserved for the additional parameter for fadump on the HASH MMU does not overlap with GRUB. Please share your opinion on the fix.

Thanks,
Sourabh Jain
Re: [PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error
On 11/12/24 5:54 AM, Shuai Xue wrote:
> The current implementation of pcie_do_recovery() assumes that the
> recovery process is executed on the device that detected the error.
> However, the DPC driver currently passes the error port that
> experienced the DPC event to pcie_do_recovery(). Use the SOURCE ID
> register to correctly identify the device that detected the error. By
> passing this error device to pcie_do_recovery(), subsequent patches
> will be able to accurately access AER status of the error device.

When passing the error device, I assume pcie_do_recovery() will find
the upstream bridge and run the recovery logic.

> Signed-off-by: Shuai Xue
> ---

IMO, moving the "err_port" rename to a separate patch will make this
change more clear. But it is up to you.

>  drivers/pci/pci.h      |  2 +-
>  drivers/pci/pcie/dpc.c | 30 --
>  drivers/pci/pcie/edr.c | 35 ++-
>  3 files changed, 43 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 14d00ce45bfa..0866f79aec54 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -521,7 +521,7 @@ struct rcec_ea {
>  void pci_save_dpc_state(struct pci_dev *dev);
>  void pci_restore_dpc_state(struct pci_dev *dev);
>  void pci_dpc_init(struct pci_dev *pdev);
> -void dpc_process_error(struct pci_dev *pdev);
> +struct pci_dev *dpc_process_error(struct pci_dev *pdev);
>  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>  bool pci_dpc_recovered(struct pci_dev *pdev);
>  #else
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 2b6ef7efa3c1..62a68cde4364 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -257,10 +257,17 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>      return 1;
>  }
>
> -void dpc_process_error(struct pci_dev *pdev)
> +/**
> + * dpc_process_error - handle the DPC error status

Handling the DPC error status has nothing to do with finding the error
source. Why not add a new helper function?

> + * @pdev: the port that experienced the containment event
> + *
> + * Return the device that experienced the error.

detected the error?

> + */
> +struct pci_dev *dpc_process_error(struct pci_dev *pdev)
>  {
>      u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
>      struct aer_err_info info;
> +    struct pci_dev *err_dev = NULL;

I don't think you need NULL initialization here.

>      pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
>      pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> @@ -283,6 +290,13 @@ void dpc_process_error(struct pci_dev *pdev)
>               "software trigger" : "reserved error");
>
> +    if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE ||
> +        reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE)
> +        err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
> +                                              PCI_BUS_NUM(source), source & 0xff);
> +    else
> +        err_dev = pci_dev_get(pdev);
> +
>      /* show RP PIO error detail information */
>      if (pdev->dpc_rp_extensions &&
>          reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT &&
> @@ -295,6 +309,8 @@ void dpc_process_error(struct pci_dev *pdev)
>          pci_aer_clear_nonfatal_status(pdev);
>          pci_aer_clear_fatal_status(pdev);
>      }
> +
> +    return err_dev;
>  }
>
>  static void pci_clear_surpdn_errors(struct pci_dev *pdev)
> @@ -350,21 +366,23 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>  static irqreturn_t dpc_handler(int irq, void *context)
>  {
> -    struct pci_dev *pdev = context;
> +    struct pci_dev *err_port = context, *err_dev = NULL;

NULL initialization is not needed.

>      /*
>       * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
>       * of async removal and should be ignored by software.
>       */
> -    if (dpc_is_surprise_removal(pdev)) {
> -        dpc_handle_surprise_removal(pdev);
> +    if (dpc_is_surprise_removal(err_port)) {
> +        dpc_handle_surprise_removal(err_port);
>          return IRQ_HANDLED;
>      }
>
> -    dpc_process_error(pdev);
> +    err_dev = dpc_process_error(err_port);
>
>      /* We configure DPC so it only triggers on ERR_FATAL */
> -    pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> +    pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
> +
> +    pci_dev_put(err_dev);
>
>      return IRQ_HANDLED;
>  }
> diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
> index e86298dbbcff..6ac95e5e001b 100644
> --- a/drivers/pci/pcie/edr.c
> +++ b/drivers/pci/pcie/edr.c
> @@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev,
>  static void edr_handle_event(acpi_handle handle, u32 event, void *data)
>  {
> -    struct pci_dev *pdev = data, *edev;
> +    struct pci_dev *pdev = data, *err_port, *err_dev = NULL;
>      pci_ers_result_t estate = PCI_ERS_RESULT_D
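The lookup in the dpc.c hunk above splits the 16-bit DPC Error Source ID (a Requester ID) into a bus number and devfn before calling pci_get_domain_bus_and_slot(). A standalone sketch of that decomposition (the helper names here are illustrative, not the kernel's, though the bit layout matches PCI_BUS_NUM/PCI_SLOT/PCI_FUNC):

```c
#include <assert.h>
#include <stdint.h>

/* A 16-bit PCI Requester ID carries the bus number in the upper byte
 * and devfn (5-bit device, 3-bit function) in the lower byte. */
static inline uint8_t rid_bus(uint16_t source)   { return source >> 8; }   /* PCI_BUS_NUM() */
static inline uint8_t rid_devfn(uint16_t source) { return source & 0xff; }
static inline uint8_t devfn_slot(uint8_t devfn)  { return devfn >> 3; }    /* PCI_SLOT() */
static inline uint8_t devfn_func(uint8_t devfn)  { return devfn & 0x07; }  /* PCI_FUNC() */
```

pci_get_domain_bus_and_slot() takes exactly the (bus, devfn) pair produced here, with the domain number supplied separately from the port's bus.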
Re: [PATCH] mm/hugetlb: bring gigantic page allocation under hugepages_supported()
Hello Gerald,

On 22/01/25 19:36, Gerald Schaefer wrote:
> On Tue, 21 Jan 2025 20:34:19 +0530 Sourabh Jain wrote:
>> Despite having kernel arguments to enable gigantic hugepages, this
>> provides a way for the architecture to disable gigantic hugepages on
>> the fly, similar to what we do for hugepages. Components like fadump
>> (PowerPC-specific) need this functionality to disable gigantic
>> hugepages when the kernel is booted solely to collect the kernel core
>> dump.
>>
>> Cc: Thomas Gleixner
>> Cc: Ingo Molnar
>> Cc: Borislav Petkov
>> Cc: Heiko Carstens
>> Cc: Vasily Gorbik
>> Cc: Muchun Song
>> Cc: Madhavan Srinivasan
>> Cc: Michael Ellerman
>> Cc: linux...@kvack.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Signed-off-by: Sourabh Jain
>> ---
>> To evaluate the impact of this change on architectures other than
>> PowerPC, I did the following analysis:
>>
>> For architectures where hugepages_supported() is not redefined, it
>> depends on HPAGE_SHIFT, which is found to be a constant. It is mostly
>> initialized to PMD_SHIFT.
>>
>> Architecture: HPAGE_SHIFT initialized with
>>   ARC:       PMD_SHIFT (constant)
>>   ARM:       PMD_SHIFT (constant)
>>   ARM64:     PMD_SHIFT (constant)
>>   Hexagon:   22 (constant)
>>   LoongArch: (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant)
>>   MIPS:      (PAGE_SHIFT + PAGE_SHIFT - 3) (appears to be constant)
>>   PARISC:    PMD_SHIFT (appears to be constant)
>>   RISC-V:    PMD_SHIFT (constant)
>>   SH:        16 | 18 | 20 | 22 | 26 (constant)
>>   SPARC:     23 (constant)
>>
>> So it seems this change shouldn't have any impact on the above
>> architectures.
>>
>> On the s390 and x86 architectures, hugepages_supported() is redefined,
>> and I am uncertain at what point it is safe to call
>> hugepages_supported().
>
> For s390, hugepages_supported() checks the EDAT1 machine flag, which is
> initialized long before any initcalls. So it is safe to be called here.

Thanks for the info.

> My common code hugetlb skills got a little rusty, but shouldn't
> arch_hugetlb_valid_size() already prevent getting here for gigantic
> hugepages, in case they are not supported?
> And could you not use that for your purpose?

Yes, handling this in arch_hugetlb_valid_size() is even better. That
way, we can avoid initializing data structures to hold the hstate,
which is not required anyway.

Thanks for the review and suggestion. I will handle this in the
architecture-specific code.

- Sourabh Jain
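The approach agreed on above can be sketched in isolation: rejecting a gigantic size in arch_hugetlb_valid_size() means the generic hugetlb code never sets up an hstate for it. Everything below is a hypothetical model for illustration (2 MB PMD hugepage, 1 GB gigantic page, and the disable flag are assumptions), not the real powerpc implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PMD_HUGE_SIZE  (2ULL << 20)   /* 2 MB, typical PMD-level hugepage */
#define GIGANTIC_SIZE  (1ULL << 30)   /* 1 GB gigantic page */

/* Would be set very early, e.g. when booting a fadump capture kernel. */
static bool gigantic_pages_disabled;

/* Model of the arch hook: a size is valid only if the platform supports
 * it; gigantic pages are refused outright when disabled, so no hstate
 * is ever allocated for them. */
static bool arch_hugetlb_valid_size(uint64_t size)
{
	if (size == PMD_HUGE_SIZE)
		return true;
	if (size == GIGANTIC_SIZE)
		return !gigantic_pages_disabled;
	return false;
}
```

The benefit over patching the generic allocator is that the filtering happens at command-line parsing time, before any per-hstate bookkeeping exists.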
Re: [PATCH 2/2] powerpc/fadump: fix additional param memory reservation for HASH MMU
On 2025-01-20 23:05:00 Mon, Sourabh Jain wrote:
> Commit 683eab94da75bc ("powerpc/fadump: setup additional parameters for
> dump capture kernel") introduced the additional parameter feature in
> fadump for HASH MMU with the understanding that GRUB does not use the
> memory area between 640MB and 768MB for its operation.
>
> However, the patch ("powerpc: increase MIN RMA size for CAS
> negotiation") changes the MIN RMA size to 768MB, allowing GRUB to use
> memory up to 768MB. This makes the fadump reservation for the additional
> parameter feature for HASH MMU unreliable.

I see. Please include the above patch in a single patch series, with
this change as the first patch.

> To address this, adjust the memory range for the additional parameter in
> fadump for HASH MMU. This will ensure that GRUB does not overwrite the
> memory reserved for fadump's additional parameter in HASH MMU.
>
> The new policy for the memory range for the additional parameter in HASH
> MMU is that the first memory block must be larger than the MIN_RMA size,
> as the bootloader can use memory up to the MIN_RMA size. The range
> should be between MIN_RMA and the RMA size (ppc64_rma_size), and it must
> not overlap with the fadump reserved area.
>
> Cc: Avnish Chouhan
> Cc: Brian King
> Cc: Hari Bathini
> Cc: Madhavan Srinivasan
> Cc: Michael Ellerman
> Cc: Mahesh Salgaonkar
> Signed-off-by: Sourabh Jain
> ---
>  arch/powerpc/kernel/fadump.c | 21 +++--
>  1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 4b371c738213..5831f3ec8561 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -33,6 +33,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * The CPU who acquired the lock to trigger the fadump crash should
> @@ -1764,19 +1765,19 @@ void __init fadump_setup_param_area(void)
>          range_end = memblock_end_of_DRAM();
>      } else {
>          /*
> -         * Passing additional parameters is supported for hash MMU only
> -         * if the first memory block size is 768MB or higher.
> +         * Memory range for passing additional parameters for HASH MMU
> +         * must meet the following conditions:
> +         * 1. The first memory block size must be higher than the
> +         *    minimum RMA (MIN_RMA) size. Bootloader can use memory
> +         *    up to RMA size. So it should be avoided.
> +         * 2. The range should be between MIN_RMA and RMA size (ppc64_rma_size)
> +         * 3. It must not overlap with the fadump reserved area.
>           */
> -        if (ppc64_rma_size < 0x3000)
> +        if (ppc64_rma_size < MIN_RMA*1024*1024)
>              return;
>
> -        /*
> -         * 640 MB to 768 MB is not used by PFW/bootloader. So, try reserving
> -         * memory for passing additional parameters in this range to avoid
> -         * being stomped on by PFW/bootloader.
> -         */
> -        range_start = 0x2A00;
> -        range_end = range_start + 0x400;
> +        range_start = MIN_RMA * 1024 * 1024;
> +        range_end = min(ppc64_rma_size, fw_dump.boot_mem_top);

Please update the fadump documentation to make this restriction clear.

Rest looks good to me.

Reviewed-by: Mahesh Salgaonkar

Thanks,
-Mahesh
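The range policy in the hunk above can be sketched as a standalone function. This is an illustrative model, not the kernel code: MIN_RMA is assumed to be expressed in MB (768, per the CAS negotiation patch discussed in this thread), and the function names and error convention are mine:

```c
#include <assert.h>
#include <stdint.h>

#define MIN_RMA 768ULL   /* MB; assumed value, from the CAS negotiation patch */

static inline uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Compute the additional-parameter range for HASH MMU.
 * Returns 0 on success, -1 when the RMA is too small to leave room
 * above what the bootloader may use. */
static int fadump_param_range(uint64_t ppc64_rma_size,
			      uint64_t boot_mem_top,
			      uint64_t *range_start, uint64_t *range_end)
{
	/* The bootloader can use memory up to MIN_RMA, so skip it. */
	if (ppc64_rma_size < MIN_RMA * 1024 * 1024)
		return -1;

	*range_start = MIN_RMA * 1024 * 1024;
	/* Stay within both the RMA and the fadump boot memory. */
	*range_end = min_u64(ppc64_rma_size, boot_mem_top);
	return 0;
}
```

With a 1 GB RMA this yields the 768 MB..1 GB window (capped by boot_mem_top), and a 512 MB RMA is rejected, matching the policy described in the commit message.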
Re: [PATCH v2 1/2] PCI/DPC: Run recovery on device that detected the error
On 2025/1/23 12:53, Sathyanarayanan Kuppuswamy wrote:
> On 11/12/24 5:54 AM, Shuai Xue wrote:
>> The current implementation of pcie_do_recovery() assumes that the
>> recovery process is executed on the device that detected the error.
>> However, the DPC driver currently passes the error port that
>> experienced the DPC event to pcie_do_recovery(). Use the SOURCE ID
>> register to correctly identify the device that detected the error. By
>> passing this error device to pcie_do_recovery(), subsequent patches
>> will be able to accurately access AER status of the error device.
>
> When passing the error device, I assume pcie_do_recovery() will find
> the upstream bridge and run the recovery logic.

Yes, pcie_do_recovery() will find the upstream bridge and walk the
bridges potentially affected by AER.

>> Signed-off-by: Shuai Xue
>> ---
>
> IMO, moving the "err_port" rename to a separate patch will make this
> change more clear. But it is up to you.

I see, I will add a separate patch.

>>  drivers/pci/pci.h      |  2 +-
>>  drivers/pci/pcie/dpc.c | 30 --
>>  drivers/pci/pcie/edr.c | 35 ++-
>>  3 files changed, 43 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 14d00ce45bfa..0866f79aec54 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -521,7 +521,7 @@ struct rcec_ea {
>>  void pci_save_dpc_state(struct pci_dev *dev);
>>  void pci_restore_dpc_state(struct pci_dev *dev);
>>  void pci_dpc_init(struct pci_dev *pdev);
>> -void dpc_process_error(struct pci_dev *pdev);
>> +struct pci_dev *dpc_process_error(struct pci_dev *pdev);
>>  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
>>  bool pci_dpc_recovered(struct pci_dev *pdev);
>>  #else
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index 2b6ef7efa3c1..62a68cde4364 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -257,10 +257,17 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>>      return 1;
>>  }
>>
>> -void dpc_process_error(struct pci_dev *pdev)
>> +/**
>> + * dpc_process_error - handle the DPC error status
>
> Handling the DPC error status has nothing to do with finding the error
> source. Why not add a new helper function?

Per the PCIe spec, DPC Error Source ID: when the DPC Trigger Reason
field indicates that DPC was triggered due to the reception of an
ERR_NONFATAL or ERR_FATAL, this register contains the Requester ID of
the received Message. Otherwise, the value of this register is
undefined.

To find the error source, we need to:
- check the error reason from PCI_EXP_DPC_STATUS,
- identify the error device via PCI_EXP_DPC_SOURCE_ID for the
  ERR_NONFATAL and ERR_FATAL reasons.

That code would duplicate dpc_process_error(), therefore I directly
reused dpc_process_error().

>> + * @pdev: the port that experienced the containment event
>> + *
>> + * Return the device that experienced the error.
>
> detected the error?

Will change it.

>> + */
>> +struct pci_dev *dpc_process_error(struct pci_dev *pdev)
>>  {
>>      u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
>>      struct aer_err_info info;
>> +    struct pci_dev *err_dev = NULL;
>
> I don't think you need NULL initialization here.

Will remove it.

>>      pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
>>      pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
>> @@ -283,6 +290,13 @@ void dpc_process_error(struct pci_dev *pdev)
>>               "software trigger" : "reserved error");
>>
>> +    if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE ||
>> +        reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE)
>> +        err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
>> +                                              PCI_BUS_NUM(source), source & 0xff);
>> +    else
>> +        err_dev = pci_dev_get(pdev);
>> +
>>      /* show RP PIO error detail information */
>>      if (pdev->dpc_rp_extensions &&
>>          reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT &&
>> @@ -295,6 +309,8 @@ void dpc_process_error(struct pci_dev *pdev)
>>          pci_aer_clear_nonfatal_status(pdev);
>>          pci_aer_clear_fatal_status(pdev);
>>      }
>> +
>> +    return err_dev;
>>  }
>>
>>  static void pci_clear_surpdn_errors(struct pci_dev *pdev)
>> @@ -350,21 +366,23 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
>>  static irqreturn_t dpc_handler(int irq, void *context)
>>  {
>> -    struct pci_dev *pdev = context;
>> +    struct pci_dev *err_port = context, *err_dev = NULL;
>
> NULL initialization is not needed.

Will remove it.

Thanks for the valuable comments.

Best Regards,
Shuai
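The reason-based selection Shuai describes above — the Source ID is only meaningful for ERR_NONFATAL/ERR_FATAL triggers, otherwise fall back to the port itself — can be modeled in a few lines. This is a sketch under stated assumptions: the enum values merely mirror the kernel's PCI_EXP_DPC_STATUS_TRIGGER_RSN_* names and are not the real register encodings:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed trigger reasons, named after the kernel constants. */
enum dpc_trigger_reason {
	RSN_UNCOR  = 0,   /* unmasked uncorrectable error */
	RSN_NFE    = 1,   /* ERR_NONFATAL message received */
	RSN_FE     = 2,   /* ERR_FATAL message received */
	RSN_IN_EXT = 3,   /* RP PIO / extensions */
};

/* Pick the Requester ID to run recovery on: the Error Source ID is
 * defined only for received ERR_NONFATAL/ERR_FATAL messages; for every
 * other reason it is undefined, so use the port's own ID. */
static uint16_t dpc_error_requester(enum dpc_trigger_reason reason,
				    uint16_t port_id, uint16_t source_id)
{
	if (reason == RSN_NFE || reason == RSN_FE)
		return source_id;
	return port_id;
}
```

This mirrors why the patch takes the pci_get_domain_bus_and_slot() branch only for the NFE/FE reasons and pins the port itself otherwise.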