Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue




在 2025/3/3 11:43, Sathyanarayanan Kuppuswamy 写道:


On 2/16/25 6:42 PM, Shuai Xue wrote:

The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error occurs, the AER
and DPC drivers do not report specific error types, resulting in logs like:

   pcieport :30:03.0: EDR: EDR event received
   pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
   pcieport :30:03.0: DPC: ERR_FATAL detected
   pcieport :30:03.0: AER: broadcast error_detected message
   nvme nvme0: frozen state error detected, reset controller
   nvme :34:00.0: ready 0ms after DPC
   pcieport :30:03.0: AER: broadcast slot_reset message

AER status registers are sticky and Write-1-to-clear. If the link recovered
after hot reset, we can still safely access AER status of the error device.
In such case, report fatal errors which helps to figure out the error root
case.

After this patch, the logs like:

   pcieport :30:03.0: EDR: EDR event received
   pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
   pcieport :30:03.0: DPC: ERR_FATAL detected
   pcieport :30:03.0: AER: broadcast error_detected message
   nvme nvme0: frozen state error detected, reset controller
   pcieport :30:03.0: waiting 100 ms for downstream link, after activation
   nvme :34:00.0: ready 0ms after DPC
   nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data 
Link Layer, (Receiver ID)
   nvme :34:00.0:   device [144d:a804] error status/mask=0010/00504000
   nvme :34:00.0:    [ 4] DLP    (First)
   pcieport :30:03.0: AER: broadcast slot_reset message


IMO, above info about device error details is more of a debug info. Since the
main use of this info use to understand more details about the recovered
DPC error. So I think is better to print with debug tag. Lets see what others
think.

Code wise, looks fine to me.


thanks, looking forward to more feedback.





Signed-off-by: Shuai Xue 
---
  drivers/pci/pci.h  |  3 ++-
  drivers/pci/pcie/aer.c | 11 +++
  drivers/pci/pcie/dpc.c |  2 +-
  drivers/pci/pcie/err.c |  9 +
  4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 870d2fbd6ff2..e852fa58b250 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -549,7 +549,8 @@ struct aer_err_info {
  struct pcie_tlp_log tlp;    /* TLP Header */
  };
-int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info,
+  bool link_healthy);
  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
  int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 508474e17183..bfb67db074f0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1197,12 +1197,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
   * aer_get_device_error_info - read error status from dev and store it to info
   * @dev: pointer to the device expected to have a error record
   * @info: pointer to structure to store the error record
+ * @link_healthy: link is healthy or not
   *
   * Return 1 on success, 0 on error.
   *
   * Note that @info is reused among all error devices. Clear fields properly.
   */
-int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info,
+  bool link_healthy)
  {
  int type = pci_pcie_type(dev);
  int aer = dev->aer_cap;
@@ -1226,7 +1228,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct 
aer_err_info *info)
  } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 type == PCI_EXP_TYPE_RC_EC ||
 type == PCI_EXP_TYPE_DOWNSTREAM ||
-   info->severity == AER_NONFATAL) {
+   info->severity == AER_NONFATAL ||
+   (info->severity == AER_FATAL && link_healthy)) {
  /* Link is still healthy for IO reads */
  pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
@@ -1258,11 +1261,11 @@ static inline void aer_process_err_devices(struct 
aer_err_info *e_info)
  /* Report all before handle them, not to lost records by reset etc. */
  for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-    if (aer_get_device_error_info(e_info->dev[i], e_info))
+    if (aer_get_device_error_info(e_info->dev[i], e_info, false))
  aer_print_error(e_info->dev[i], e_info);
  }
  for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-    if (aer_get_device_error_info(e_info->dev[i], e_info))
+    if (aer_get_device_error_info(e_info->dev[i], e_info, false))
  handle_error_source(e_info->dev[i], e

[powerpc:merge] BUILD SUCCESS 1304f486dbf1f455c9abe284c155747be90043b3

2025-03-02 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
merge
branch HEAD: 1304f486dbf1f455c9abe284c155747be90043b3  Automatic merge of 
'next' into merge (2025-03-02 10:01)

elapsed time: 1441m

configs tested: 135
configs skipped: 4

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha allnoconfiggcc-14.2.0
alphaallyesconfiggcc-14.2.0
alpha   defconfiggcc-14.2.0
arc  allmodconfiggcc-13.2.0
arc   allnoconfiggcc-13.2.0
arc  allyesconfiggcc-13.2.0
arc  axs103_defconfiggcc-13.2.0
arc defconfiggcc-13.2.0
archsdk_defconfiggcc-13.2.0
arc nsimosci_hs_smp_defconfiggcc-13.2.0
arc   randconfig-001-20250302gcc-13.2.0
arc   randconfig-002-20250302gcc-13.2.0
arm  allmodconfiggcc-14.2.0
arm   allnoconfigclang-17
arm  allyesconfiggcc-14.2.0
arm defconfigclang-21
armmps2_defconfigclang-15
arm   randconfig-001-20250302gcc-14.2.0
arm   randconfig-002-20250302clang-21
arm   randconfig-003-20250302gcc-14.2.0
arm   randconfig-004-20250302clang-21
arm64allmodconfigclang-18
arm64 allnoconfiggcc-14.2.0
arm64 randconfig-001-20250302clang-18
arm64 randconfig-002-20250302gcc-14.2.0
arm64 randconfig-003-20250302gcc-14.2.0
arm64 randconfig-004-20250302clang-16
csky  allnoconfiggcc-14.2.0
cskydefconfiggcc-14.2.0
csky  randconfig-001-20250302gcc-14.2.0
csky  randconfig-002-20250302gcc-14.2.0
hexagon  allmodconfigclang-21
hexagon   allnoconfigclang-21
hexagon  allyesconfigclang-18
hexagon   randconfig-001-20250302clang-21
hexagon   randconfig-002-20250302clang-21
i386 allmodconfiggcc-12
i386  allnoconfiggcc-12
i386 allyesconfiggcc-12
i386buildonly-randconfig-001-20250302gcc-12
i386buildonly-randconfig-002-20250302clang-19
i386buildonly-randconfig-003-20250302gcc-12
i386buildonly-randconfig-004-20250302gcc-12
i386buildonly-randconfig-005-20250302gcc-12
i386buildonly-randconfig-006-20250302gcc-12
i386defconfigclang-19
loongarchalldefconfiggcc-14.2.0
loongarchallmodconfiggcc-14.2.0
loongarch allnoconfiggcc-14.2.0
loongarch randconfig-001-20250302gcc-14.2.0
loongarch randconfig-002-20250302gcc-14.2.0
m68k allmodconfiggcc-14.2.0
m68k  allnoconfiggcc-14.2.0
m68k allyesconfiggcc-14.2.0
m68k  multi_defconfiggcc-14.2.0
microblaze   allmodconfiggcc-14.2.0
microblazeallnoconfiggcc-14.2.0
microblaze   allyesconfiggcc-14.2.0
mips  allnoconfiggcc-14.2.0
mips  ath25_defconfigclang-16
mips  eyeq5_defconfiggcc-14.2.0
nios2 allnoconfiggcc-14.2.0
nios2 randconfig-001-20250302gcc-14.2.0
nios2 randconfig-002-20250302gcc-14.2.0
openrisc  allnoconfiggcc-14.2.0
openrisc allyesconfiggcc-14.2.0
openriscdefconfiggcc-14.2.0
parisc   allmodconfiggcc-14.2.0
pariscallnoconfiggcc-14.2.0
parisc   allyesconfiggcc-14.2.0
parisc  defconfiggcc-14.2.0
pariscrandconfig-001-20250302gcc-14.2.0
pariscrandconfig-002-20250302gcc-14.2.0
powerpc  allmodconfiggcc-14.2.0
powerpc   allnoconfiggcc-14.2.0
powerpc  allyesconfigclang-16
powerpc   ebony_defconfigclang-18
powerpcgamecube_defconfigclang-

[PATCH v2 RESEND] sched/membarrier: Fix redundant load of membarrier_state

2025-03-02 Thread Nysal Jan K.A.
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc
prior to this change:

1b786c:   ld  r26,2624(r30)   # mm = rq->prev_mm;
...
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq cr7,1b78e4 
1b78d0:   ld  r9,2312(r13)# current
1b78d4:   ld  r9,1888(r9) # current->mm
1b78d8:   cmpdcr7,r26,r9
1b78dc:   beq cr7,1b7a70 
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
...
1b7a70:   lwz r9,176(r26) # atomic_read(&mm->membarrier_state)
1b7a74:   b   1b78e0 

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter for ppc64le:

GCC 7.5.0
-
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function old new   delta
finish_task_switch   884 852 -32

GCC 12.2.1
--
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function old new   delta
finish_task_switch.isra  852 820 -32

LLVM 17.0.6
---
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function old new   delta
rt_mutex_schedule120 104 -16
finish_task_switch   792 772 -20

Results on aarch64:

GCC 14.1.1
--
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function old new   delta
get_nohz_timer_target352 356  +4
e843419@0b02_d7e7_408  8   -  -8
e843419@01bb_21d2_868  8   -  -8
finish_task_switch.isra  592 548 -44

Signed-off-by: Nysal Jan K.A. 
Reviewed-by: Mathieu Desnoyers 
Reviewed-by: Michael Ellerman 
Reviewed-by: Segher Boessenkool 
---
V1 -> V2:
- Add results for aarch64
- Add a comment describing the changes
---
 include/linux/sched/mm.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 928a626725e6..b13474825130 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct 
*mm)
 {
+   /*
+* The atomic_read() below prevents CSE. The following should
+* help the compiler generate more efficient code on architectures
+* where sync_core_before_usermode() is a no-op.
+*/
+   if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+   return;
if (current->mm != mm)
return;
if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.48.1




Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Sathyanarayanan Kuppuswamy



On 2/16/25 6:42 PM, Shuai Xue wrote:

The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error occurs, the AER
and DPC drivers do not report specific error types, resulting in logs like:

   pcieport :30:03.0: EDR: EDR event received
   pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
   pcieport :30:03.0: DPC: ERR_FATAL detected
   pcieport :30:03.0: AER: broadcast error_detected message
   nvme nvme0: frozen state error detected, reset controller
   nvme :34:00.0: ready 0ms after DPC
   pcieport :30:03.0: AER: broadcast slot_reset message

AER status registers are sticky and Write-1-to-clear. If the link recovered
after hot reset, we can still safely access AER status of the error device.
In such case, report fatal errors which helps to figure out the error root
case.

After this patch, the logs like:

   pcieport :30:03.0: EDR: EDR event received
   pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
   pcieport :30:03.0: DPC: ERR_FATAL detected
   pcieport :30:03.0: AER: broadcast error_detected message
   nvme nvme0: frozen state error detected, reset controller
   pcieport :30:03.0: waiting 100 ms for downstream link, after activation
   nvme :34:00.0: ready 0ms after DPC
   nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data 
Link Layer, (Receiver ID)
   nvme :34:00.0:   device [144d:a804] error status/mask=0010/00504000
   nvme :34:00.0:[ 4] DLP(First)
   pcieport :30:03.0: AER: broadcast slot_reset message


IMO, above info about device error details is more of a debug info. 
Since the

main use of this info use to understand more details about the recovered
DPC error. So I think is better to print with debug tag. Lets see what 
others

think.

Code wise, looks fine to me.




Signed-off-by: Shuai Xue 
---
  drivers/pci/pci.h  |  3 ++-
  drivers/pci/pcie/aer.c | 11 +++
  drivers/pci/pcie/dpc.c |  2 +-
  drivers/pci/pcie/err.c |  9 +
  4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 870d2fbd6ff2..e852fa58b250 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -549,7 +549,8 @@ struct aer_err_info {
struct pcie_tlp_log tlp;/* TLP Header */
  };
  
-int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);

+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info,
+ bool link_healthy);
  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
  
  int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 508474e17183..bfb67db074f0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1197,12 +1197,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
   * aer_get_device_error_info - read error status from dev and store it to info
   * @dev: pointer to the device expected to have a error record
   * @info: pointer to structure to store the error record
+ * @link_healthy: link is healthy or not
   *
   * Return 1 on success, 0 on error.
   *
   * Note that @info is reused among all error devices. Clear fields properly.
   */
-int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info,
+ bool link_healthy)
  {
int type = pci_pcie_type(dev);
int aer = dev->aer_cap;
@@ -1226,7 +1228,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct 
aer_err_info *info)
} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
   type == PCI_EXP_TYPE_RC_EC ||
   type == PCI_EXP_TYPE_DOWNSTREAM ||
-  info->severity == AER_NONFATAL) {
+  info->severity == AER_NONFATAL ||
+  (info->severity == AER_FATAL && link_healthy)) {
  
  		/* Link is still healthy for IO reads */

pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
@@ -1258,11 +1261,11 @@ static inline void aer_process_err_devices(struct 
aer_err_info *e_info)
  
  	/* Report all before handle them, not to lost records by reset etc. */

for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-   if (aer_get_device_error_info(e_info->dev[i], e_info))
+   if (aer_get_device_error_info(e_info->dev[i], e_info, false))
aer_print_error(e_info->dev[i], e_info);
}
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-   if (aer_get_device_error_info(e_info->dev[i], e_info))
+   if (aer_get_device_error_info(e_info->dev[i], e_info, false))

Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-03-02 Thread Shuai Xue




在 2025/3/3 11:36, Sathyanarayanan Kuppuswamy 写道:


On 2/16/25 6:42 PM, Shuai Xue wrote:

The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().

Use the SOURCE ID register to correctly identify the device that
detected the error. When passing the error device, the
pcie_do_recovery() will find the upstream bridge and walk bridges
potentially AER affected. And subsequent patches will be able to
accurately access AER status of the error device.

Should not observe any functional changes.

Signed-off-by: Shuai Xue 
---


Looks good to me

Reviewed-by: Kuppuswamy Sathyanarayanan 




Thanks.
Shuai



Re: [PATCH v2 2/4] powerpc/perf/hv-24x7: Avoid loading hv-24x7 during dump kernel

2025-03-02 Thread kernel test robot
Hi Madhavan,

kernel test robot noticed the following build errors:

[auto build test ERROR on powerpc/next]
[also build test ERROR on powerpc/fixes linus/master v6.14-rc4 next-20250228]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Madhavan-Srinivasan/powerpc-perf-hv-24x7-Avoid-loading-hv-24x7-during-dump-kernel/20250302-022531
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:
https://lore.kernel.org/r/20250301182310.6832-2-maddy%40linux.ibm.com
patch subject: [PATCH v2 2/4] powerpc/perf/hv-24x7: Avoid loading hv-24x7 
during dump kernel
config: powerpc64-randconfig-001-20250302 
(https://download.01.org/0day-ci/archive/20250302/202503021908.ed1go0gi-...@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20250302/202503021908.ed1go0gi-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202503021908.ed1go0gi-...@intel.com/

All errors (new ones prefixed by >>):

   arch/powerpc/perf/hv-24x7.c: In function 'hv_24x7_init':
>> arch/powerpc/perf/hv-24x7.c:1701:13: error: implicit declaration of function 
>> 'is_kdump_kernel' [-Wimplicit-function-declaration]
1701 | if (is_kdump_kernel() || is_fadump_active())
 | ^~~


vim +/is_kdump_kernel +1701 arch/powerpc/perf/hv-24x7.c

  1693  
  1694  static int hv_24x7_init(void)
  1695  {
  1696  int r;
  1697  unsigned long hret;
  1698  unsigned int pvr = mfspr(SPRN_PVR);
  1699  struct hv_perf_caps caps;
  1700  
> 1701  if (is_kdump_kernel() || is_fadump_active())
  1702  return 0;
  1703  
  1704  if (!firmware_has_feature(FW_FEATURE_LPAR)) {
  1705  pr_debug("not a virtualized system, not enabling\n");
  1706  return -ENODEV;
  1707  }
  1708  
  1709  /* POWER8 only supports v1, while POWER9 only supports v2. */
  1710  if (PVR_VER(pvr) == PVR_POWER8 || PVR_VER(pvr) == PVR_POWER8E ||
  1711  PVR_VER(pvr) == PVR_POWER8NVL)
  1712  interface_version = 1;
  1713  else {
  1714  interface_version = 2;
  1715  
  1716  /* SMT8 in POWER9 needs to aggregate result elements. */
  1717  if (threads_per_core == 8)
  1718  aggregate_result_elements = true;
  1719  }
  1720  
  1721  hret = hv_perf_caps_get(&caps);
  1722  if (hret) {
  1723  pr_debug("could not obtain capabilities, not enabling, 
rc=%ld\n",
  1724  hret);
  1725  return -ENODEV;
  1726  }
  1727  
  1728  hv_page_cache = kmem_cache_create("hv-page-4096", 4096, 4096, 
0, NULL);
  1729  if (!hv_page_cache)
  1730  return -ENOMEM;
  1731  
  1732  /* sampling not supported */
  1733  h_24x7_pmu.capabilities |= PERF_PMU_CAP_NO_INTERRUPT;
  1734  
  1735  r = create_events_from_catalog(&event_group.attrs,
  1736 &event_desc_group.attrs,
  1737 &event_long_desc_group.attrs);
  1738  
  1739  if (r)
  1740  return r;
  1741  
  1742  /* init cpuhotplug */
  1743  r = hv_24x7_cpu_hotplug_init();
  1744  if (r)
  1745  return r;
  1746  
  1747  r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1);
  1748  if (r)
  1749  return r;
  1750  
  1751  read_24x7_sys_info();
  1752  
  1753  return 0;
  1754  }
  1755  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH 0/5] Microwatt updates

2025-03-02 Thread Gabriel Paubert


[Sorry, I wanted to reply earlier, but it stayed in my drafts folder for a 
month]

On Sat, Feb 01, 2025 at 12:22:51PM +1100, Paul Mackerras wrote:
[snipped]
> 
> 603 was a looong time ago, I don't recall the details.
> 
> Regarding broadcast TLBIEs, the protocols and mechanisms for doing
> that are known to be complex and slow in the IBM Power processors (ask
> Derek Williams about that :).  Anton found that in fact doing only
> local TLBIEs and using IPIs gave *better* performance on IBM Power
> systems than using hardware broadcast TLBIEs in many cases (the reason
> being that software knows which other CPUs might have a given TLB
> entry, often quite a small set, whereas hardware doesn't, and has to
> send the invalidation to every CPU and wait for a response from every
> CPU).  Add to that, that most other SMP-capable CPU architectures
> don't do broadcast TLB invalidations, Intel x86 for example.

Actually it's coming to x86, at least on the AMD side:

https://lore.kernel.org/all/20250206044346.3810242-1-r...@surriel.com/

with performance numbers which look rather good.

I don't know how it looks like at the level of the hardware protocol,
but implementing it on a single chip/socket is likely relatively simple.

Gabriel

> 
> > > the kernel already has code to deal with this.  One of the patches in
> > > this series provides a config option to allow platforms to select
> > > unconditionally the behaviour where cross-CPU TLB invalidations are
> > > handled using inter-processor interrupts.
> > 
> > Are there plans to broadcast the (SMP cache invalidation) messages?
> 
> Cache (i.e. instruction and data cache) - yes, they *are* coherent.
> More precisely, the D caches are write-through, and all I and D caches
> snoop writes to memory (including DMA writes) and invalidate any cache
> lines being written to.
> 
> > Will uwatt support some real bus protocol, for example?
> 
> "Real" meaning using tri-state bus drivers, like we did in the 90s? :)
> 
> > Again, congrats on this great milestone!  Does this floating point
> > support do square roots as well (aka "gpopt"; does it do "gfxopt" for
> > that matter, fsel?)  fsqrt is kinda tricky to get to work fully
> > correctly :-)
> 
> Yes, fsqrt and fsel are implemented in hardware, and are accurate to
> the last bit.  Also, the FPU handles denormalized values in hardware
> (both input and output) and implements all exception handling as per
> the ISA, including the trap-enabled overflow cases.  Feel free to run
> whatever tests you like and report bugs.  But we're getting a bit
> off-topic from the kernel patches. :)
> 
> Paul.
> 
 




Re: [PATCH v2 1/4] powerpc/perf/core-book3s: Avoid loading platform pmu driver during dump kernel

2025-03-02 Thread kernel test robot
Hi Madhavan,

kernel test robot noticed the following build errors:

[auto build test ERROR on powerpc/next]
[also build test ERROR on powerpc/fixes linus/master v6.14-rc4 next-20250228]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Madhavan-Srinivasan/powerpc-perf-hv-24x7-Avoid-loading-hv-24x7-during-dump-kernel/20250302-022531
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:
https://lore.kernel.org/r/20250301182310.6832-1-maddy%40linux.ibm.com
patch subject: [PATCH v2 1/4] powerpc/perf/core-book3s: Avoid loading platform 
pmu driver during dump kernel
config: powerpc64-randconfig-001-20250302 
(https://download.01.org/0day-ci/archive/20250302/202503021712.evdjymst-...@intel.com/config)
compiler: powerpc64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20250302/202503021712.evdjymst-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202503021712.evdjymst-...@intel.com/

All errors (new ones prefixed by >>):

   arch/powerpc/perf/core-book3s.c: In function 'init_ppc64_pmu':
>> arch/powerpc/perf/core-book3s.c:2599:13: error: implicit declaration of 
>> function 'is_kdump_kernel' [-Wimplicit-function-declaration]
2599 | if (is_kdump_kernel() || is_fadump_active())
 | ^~~


vim +/is_kdump_kernel +2599 arch/powerpc/perf/core-book3s.c

  2587  
  2588  static int __init init_ppc64_pmu(void)
  2589  {
  2590  if (cpu_has_feature(CPU_FTR_HVMODE) && pmu_override) {
  2591  pr_warn("disabling perf due to pmu_override= command 
line option.\n");
  2592  on_each_cpu(do_pmu_override, NULL, 1);
  2593  return 0;
  2594  }
  2595  
  2596  /*
  2597   * If the dump kernel is active, skip loading these drivers
  2598   */
> 2599  if (is_kdump_kernel() || is_fadump_active())
  2600  return 0;
  2601  
  2602  /* run through all the pmu drivers one at a time */
  2603  if (!init_power5_pmu())
  2604  return 0;
  2605  else if (!init_power5p_pmu())
  2606  return 0;
  2607  else if (!init_power6_pmu())
  2608  return 0;
  2609  else if (!init_power7_pmu())
  2610  return 0;
  2611  else if (!init_power8_pmu())
  2612  return 0;
  2613  else if (!init_power9_pmu())
  2614  return 0;
  2615  else if (!init_power10_pmu())
  2616  return 0;
  2617  else if (!init_power11_pmu())
  2618  return 0;
  2619  else if (!init_ppc970_pmu())
  2620  return 0;
  2621  else
  2622  return init_generic_compat_pmu();
  2623  }
  2624  early_initcall(init_ppc64_pmu);
  2625  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

2025-03-02 Thread Shuai Xue




在 2025/2/17 10:42, Shuai Xue 写道:

changes since v3:
- squash patch 1 and 2 into one patch per Sathyanarayanan
- add comments note for dpc_process_error per Sathyanarayanan
- pick up Reviewed-by tag from Sathyanarayanan

changes since v2:
- moving the "err_port" rename to a separate patch per Sathyanarayanan
- rewrite comments of dpc_process_error per Sathyanarayanan
- remove NULL initialization for err_dev per Sathyanarayanan

changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and downstream ports per 
Keith

The AER driver has historically avoided reading the configuration space of an
endpoint or RCiEP that reported a fatal error, considering the link to that
device unreliable. Consequently, when a fatal error occurs, the AER and DPC
drivers do not report specific error types, resulting in logs like:

pcieport :30:03.0: EDR: EDR event received
pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
pcieport :30:03.0: DPC: ERR_FATAL detected
pcieport :30:03.0: AER: broadcast error_detected message
nvme nvme0: frozen state error detected, reset controller
nvme :34:00.0: ready 0ms after DPC
pcieport :30:03.0: AER: broadcast slot_reset message

AER status registers are sticky and Write-1-to-clear. If the link recovered
after hot reset, we can still safely access AER status of the error device.
In such case, report fatal errors which helps to figure out the error root
case.

After this patch set, the logs like:

pcieport :30:03.0: EDR: EDR event received
pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400
pcieport :30:03.0: DPC: ERR_FATAL detected
pcieport :30:03.0: AER: broadcast error_detected message
nvme nvme0: frozen state error detected, reset controller
pcieport :30:03.0: waiting 100 ms for downstream link, after activation
nvme :34:00.0: ready 0ms after DPC
nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), 
type=Data Link Layer, (Receiver ID)
nvme :34:00.0:   device [144d:a804] error status/mask=0010/00504000
nvme :34:00.0:[ 4] DLP(First)
pcieport :30:03.0: AER: broadcast slot_reset message

Shuai Xue (3):
   PCI/DPC: Clarify naming for error port in DPC Handling
   PCI/DPC: Run recovery on device that detected the error
   PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

  drivers/pci/pci.h  |  5 +++--
  drivers/pci/pcie/aer.c | 11 +++
  drivers/pci/pcie/dpc.c | 34 +++---
  drivers/pci/pcie/edr.c | 35 ++-
  drivers/pci/pcie/err.c |  9 +
  5 files changed, 64 insertions(+), 30 deletions(-)




Hi, All,

Gentle ping.

Thanks.
Shuai



Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error

2025-03-02 Thread Sathyanarayanan Kuppuswamy



On 2/16/25 6:42 PM, Shuai Xue wrote:

The current implementation of pcie_do_recovery() assumes that the
recovery process is executed on the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().

Use the SOURCE ID register to correctly identify the device that
detected the error. When passing the error device, the
pcie_do_recovery() will find the upstream bridge and walk bridges
potentially AER affected. And subsequent patches will be able to
accurately access AER status of the error device.

Should not observe any functional changes.

Signed-off-by: Shuai Xue 
---


Looks good to me

Reviewed-by: Kuppuswamy Sathyanarayanan 




  drivers/pci/pci.h  |  2 +-
  drivers/pci/pcie/dpc.c | 28 
  drivers/pci/pcie/edr.c |  7 ---
  3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..870d2fbd6ff2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -572,7 +572,7 @@ struct rcec_ea {
  void pci_save_dpc_state(struct pci_dev *dev);
  void pci_restore_dpc_state(struct pci_dev *dev);
  void pci_dpc_init(struct pci_dev *pdev);
-void dpc_process_error(struct pci_dev *pdev);
+struct pci_dev *dpc_process_error(struct pci_dev *pdev);
  pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
  bool pci_dpc_recovered(struct pci_dev *pdev);
  unsigned int dpc_tlp_log_len(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 1a54a0b657ae..ea3ea989afa7 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -253,10 +253,20 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev 
*dev,
return 1;
  }
  
-void dpc_process_error(struct pci_dev *pdev)

+/**
+ * dpc_process_error - handle the DPC error status
+ * @pdev: the port that experienced the containment event
+ *
+ * Return the device that detected the error.
+ *
+ * NOTE: The device reference count is increased, the caller must decrement
+ * the reference count by calling pci_dev_put().
+ */
+struct pci_dev *dpc_process_error(struct pci_dev *pdev)
  {
u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
struct aer_err_info info;
+   struct pci_dev *err_dev;
  
  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);

pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
@@ -279,6 +289,13 @@ void dpc_process_error(struct pci_dev *pdev)
 "software trigger" :
 "reserved error");
  
+	if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE ||

+   reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE)
+   err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+   PCI_BUS_NUM(source), source & 0xff);
+   else
+   err_dev = pci_dev_get(pdev);
+
/* show RP PIO error detail information */
if (pdev->dpc_rp_extensions &&
reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT &&
@@ -291,6 +308,8 @@ void dpc_process_error(struct pci_dev *pdev)
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
}
+
+   return err_dev;
  }
  
  static void pci_clear_surpdn_errors(struct pci_dev *pdev)

@@ -346,7 +365,7 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
  
  static irqreturn_t dpc_handler(int irq, void *context)

  {
-   struct pci_dev *err_port = context;
+   struct pci_dev *err_port = context, *err_dev;
  
  	/*

 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
@@ -357,10 +376,11 @@ static irqreturn_t dpc_handler(int irq, void *context)
return IRQ_HANDLED;
}
  
-	dpc_process_error(err_port);

+   err_dev = dpc_process_error(err_port);
  
  	/* We configure DPC so it only triggers on ERR_FATAL */

-   pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
+   pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
+   pci_dev_put(err_dev);
  
  	return IRQ_HANDLED;

  }
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index 521fca2f40cb..088f3e188f54 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, 
struct pci_dev *edev,
  
  static void edr_handle_event(acpi_handle handle, u32 event, void *data)

  {
-   struct pci_dev *pdev = data, *err_port;
+   struct pci_dev *pdev = data, *err_port, *err_dev;
pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
u16 status;
  
@@ -190,7 +190,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)

goto send_ost;
}
  
-	dpc_process_error(err_port);

+   err_dev = dpc_process_error(err_port);
pci_aer_raw_clear_status(err_port);
  
  	/*

@@ -198,7 +198,7 @@ static void edr_han

[PATCH 0/2] powerpc: gpio_mdio: Simplify gpio_mdio_probe()

2025-03-02 Thread Christophe JAILLET
While wondering if it was correct to call mdiobus_free() in the remove
function and only kfree() in the error handling of the probe, I
arrived at the conclusion that the code could be simpler here.

Patch 1 uses mdiobus_alloc_size() instead of a hand written
mdiobus_alloc() + kzalloc(). it also uses the devm_ version in order to
save some LoC (and answer my initial question)

Patch 2 uses devm_of_mdiobus_register() to completly remove the .remove()
function and save some more LoC.

Both patches are compile tested only.

Christophe JAILLET (2):
  powerpc: gpio_mdio: Use devm_mdiobus_alloc_size()
  powerpc: gpio_mdio: Use devm_of_mdiobus_register()

 arch/powerpc/platforms/pasemi/gpio_mdio.c | 41 ---
 1 file changed, 6 insertions(+), 35 deletions(-)

-- 
2.48.1




[PATCH 1/2] powerpc: gpio_mdio: Use devm_mdiobus_alloc_size()

2025-03-02 Thread Christophe JAILLET
Use mdiobus_alloc_size() instead of a hand written mdiobus_alloc() +
kzalloc().

This is less verbose and more robust. It also reduces memory fragmentation
and saves a few bytes of memory.

While at it, switch to devm_mdiobus_alloc_size() for extra simplification.

Signed-off-by: Christophe JAILLET 
---
This patch is compile tested only.

Some memory is saved because pahole states, on a x86_64,  that:

struct mii_bus {
...
/* size: 3640, cachelines: 57, members: 23 */
/* sum members: 3633, holes: 2, sum holes: 7 */
/* member types with holes: 3, total: 4, bit paddings: 1, total: 1 bit 
*/
/* paddings: 1, sum paddings: 3 */
/* forced alignments: 1, forced holes: 1, sum forced holes: 4 */
/* last cacheline: 56 bytes */
}

Because of the way allocation works, 4096 bytes are allocated. When
mdiobus_alloc_size() is used, struct gpio_priv fits in this "wasted" space
and so is available "for free".
---
 arch/powerpc/platforms/pasemi/gpio_mdio.c | 26 +--
 1 file changed, 5 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c 
b/arch/powerpc/platforms/pasemi/gpio_mdio.c
index e4538d471256..2c54f5f063b7 100644
--- a/arch/powerpc/platforms/pasemi/gpio_mdio.c
+++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c
@@ -213,15 +213,11 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
const unsigned int *prop;
int err;
 
-   err = -ENOMEM;
-   priv = kzalloc(sizeof(struct gpio_priv), GFP_KERNEL);
-   if (!priv)
-   goto out;
-
-   new_bus = mdiobus_alloc();
-
+   new_bus = devm_mdiobus_alloc_size(dev, sizeof(*priv));
if (!new_bus)
-   goto out_free_priv;
+   return -ENOMEM;
+
+   priv = new_bus->priv;
 
new_bus->name = "pasemi gpio mdio bus";
new_bus->read = &gpio_mdio_read;
@@ -230,7 +226,6 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
 
prop = of_get_property(np, "reg", NULL);
snprintf(new_bus->id, MII_BUS_ID_SIZE, "%x", *prop);
-   new_bus->priv = priv;
 
prop = of_get_property(np, "mdc-pin", NULL);
priv->mdc_pin = *prop;
@@ -246,17 +241,10 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
if (err != 0) {
pr_err("%s: Cannot register as MDIO bus, err %d\n",
new_bus->name, err);
-   goto out_free_irq;
+   return err;
}
 
return 0;
-
-out_free_irq:
-   kfree(new_bus);
-out_free_priv:
-   kfree(priv);
-out:
-   return err;
 }
 
 
@@ -267,10 +255,6 @@ static void gpio_mdio_remove(struct platform_device *dev)
mdiobus_unregister(bus);
 
dev_set_drvdata(&dev->dev, NULL);
-
-   kfree(bus->priv);
-   bus->priv = NULL;
-   mdiobus_free(bus);
 }
 
 static const struct of_device_id gpio_mdio_match[] =
-- 
2.48.1




[PATCH 2/2] powerpc: gpio_mdio: Use devm_of_mdiobus_register()

2025-03-02 Thread Christophe JAILLET
Use devm_of_mdiobus_register() in order to remove the now empty .remove()
function.

Doing so dev_set_drvdata() is now also unneeded. Remove it as well.

Signed-off-by: Christophe JAILLET 
---
This patch is compile tested only.
---
 arch/powerpc/platforms/pasemi/gpio_mdio.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c 
b/arch/powerpc/platforms/pasemi/gpio_mdio.c
index 2c54f5f063b7..6712ccb84c0a 100644
--- a/arch/powerpc/platforms/pasemi/gpio_mdio.c
+++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c
@@ -234,10 +234,8 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
priv->mdio_pin = *prop;
 
new_bus->parent = dev;
-   dev_set_drvdata(dev, new_bus);
-
-   err = of_mdiobus_register(new_bus, np);
 
+   err = devm_of_mdiobus_register(dev, new_bus, np);
if (err != 0) {
pr_err("%s: Cannot register as MDIO bus, err %d\n",
new_bus->name, err);
@@ -247,16 +245,6 @@ static int gpio_mdio_probe(struct platform_device *ofdev)
return 0;
 }
 
-
-static void gpio_mdio_remove(struct platform_device *dev)
-{
-   struct mii_bus *bus = dev_get_drvdata(&dev->dev);
-
-   mdiobus_unregister(bus);
-
-   dev_set_drvdata(&dev->dev, NULL);
-}
-
 static const struct of_device_id gpio_mdio_match[] =
 {
{
@@ -269,7 +257,6 @@ MODULE_DEVICE_TABLE(of, gpio_mdio_match);
 static struct platform_driver gpio_mdio_driver =
 {
.probe  = gpio_mdio_probe,
-   .remove = gpio_mdio_remove,
.driver = {
.name = "gpio-mdio-bitbang",
.of_match_table = gpio_mdio_match,
-- 
2.48.1




Re: [PATCH v4] powerpc/hugetlb: Disable gigantic hugepages if fadump is active

2025-03-02 Thread Sourabh Jain

Hello Ritesh,

Thanks for the review.


On 02/03/25 12:05, Ritesh Harjani (IBM) wrote:

Sourabh Jain  writes:


The fadump kernel boots with limited memory solely to collect the kernel
core dump. Having gigantic hugepages in the fadump kernel is of no use.

Sure got it.


Many times, the fadump kernel encounters OOM (Out of Memory) issues if
gigantic hugepages are allocated.

To address this, disable gigantic hugepages if fadump is active by
returning early from arch_hugetlb_valid_size() using
hugepages_supported(). When fadump is active, the global variable
hugetlb_disabled is set to true, which is later used by the
PowerPC-specific hugepages_supported() function to determine hugepage
support.

Returning early from arch_hugetlb_vali_size() not only disables
gigantic hugepages but also avoids unnecessary hstate initialization for
every hugepage size supported by the platform.

kernel logs related to hugepages with this patch included:
kernel argument passed: hugepagesz=1G hugepages=1

First kernel: gigantic hugepage got allocated
==

dmesg | grep -i "hugetlb"
-
HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages
HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page

$ cat /proc/meminfo | grep -i "hugetlb"
-
Hugetlb: 1048576 kB

Was this tested with patch [1] in your local tree?

[1]: 
https://web.git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=d629d7a8efc33

IIUC, this patch [1] disables the boot time allocation of hugepages.
Isn't it also disabling the boot time allocation for gigantic huge pages
passed by the cmdline params like hugepagesz=1G and hugepages=2 ?


Yes, I had the patch [1] in my tree.

My understanding is that gigantic pages are allocated before normal huge 
pages.


In mm/hugetlb.c:hugepages_setup(), we have:

if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
    hugetlb_hstate_alloc_pages(parsed_hstate);

I believe the above code allocates memory for gigantic pages, and 
mm/hugetlb.c:hugetlb_init()

is called later because it is a subsys_initcall.

So, by the time the kernel reaches hugetlb_init(), the gigantic pages 
are already

allocated. Isn't that right?

With this understanding, this patch avoids populating hstate so that 
gigantic

huge page allocation fails for the fadump kernel.

Please let me know your opinion.





HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages

This print comes from report_hugepages(). The only place from where
report_hugepages() gets called is hugetlb_init(). hugetlb_init() is what
is responsible for hugepages & gigantic hugepage allocations of the
passed kernel cmdline params.

But hugetlb_init() already checks for hugepages_supported() in the very
beginning. So I am not sure whether we need this extra patch to disable
gigantic hugepages allocation by the kernel cmdline params like
hugepagesz=1G and hugepages=2 type of options.

Hence I was wondering if you had this patch [1] in your tree when you were
testing this?

But I may be missing something. Could you please help clarify on whether
we really need this patch to disable gigantic hugetlb page allocations?


Fadump kernel: gigantic hugepage not allocated
===

dmesg | grep -i "hugetlb"
-
[0.00] HugeTLB: unsupported hugepagesz=1G
[0.00] HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring
[0.706375] HugeTLB support is disabled!
[0.773530] hugetlbfs: disabling because there are no supported hugepage 
sizes

$ cat /proc/meminfo | grep -i "hugetlb"
--


Cc: Hari Bathini 
Cc: Madhavan Srinivasan 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 
Cc: Ritesh Harjani (IBM)" 

I guess the extra " in the above was not adding me in the cc list.
Hence I missed to see this patch early.


Thanks for pointing it. I will fix it.



-ritesh



Reviewed-by: Christophe Leroy 
Signed-off-by: Sourabh Jain 
---
Changelog:

v1:
https://lore.kernel.org/all/20250121150419.1342794-1-sourabhj...@linux.ibm.com/

v2:
https://lore.kernel.org/all/20250124103220.111303-1-sourabhj...@linux.ibm.com/
  - disable gigantic hugepage in arch code, arch_hugetlb_valid_size()

v3:
https://lore.kernel.org/all/20250125104928.1-1-sourabhj...@linux.ibm.com/
  - Do not modify the initialization of the shift variable

v4:
- Update commit message to include how hugepages_supported() detects
   hugepages support when fadump is active
- Add Reviewed-by tag
- NO functional change

---
  arch/powerpc/mm/hugetlbpage.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 6b043180220a..88cfd182db4e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage

Re: [PATCH v4] powerpc/hugetlb: Disable gigantic hugepages if fadump is active

2025-03-02 Thread Sourabh Jain

Hello Ritesh,

Thanks for the review.

On 02/03/25 12:05, Ritesh Harjani (IBM) wrote:

Sourabh Jain  writes:


The fadump kernel boots with limited memory solely to collect the kernel
core dump. Having gigantic hugepages in the fadump kernel is of no use.

Sure got it.


Many times, the fadump kernel encounters OOM (Out of Memory) issues if
gigantic hugepages are allocated.

To address this, disable gigantic hugepages if fadump is active by
returning early from arch_hugetlb_valid_size() using
hugepages_supported(). When fadump is active, the global variable
hugetlb_disabled is set to true, which is later used by the
PowerPC-specific hugepages_supported() function to determine hugepage
support.

Returning early from arch_hugetlb_vali_size() not only disables
gigantic hugepages but also avoids unnecessary hstate initialization for
every hugepage size supported by the platform.

kernel logs related to hugepages with this patch included:
kernel argument passed: hugepagesz=1G hugepages=1

First kernel: gigantic hugepage got allocated
==

dmesg | grep -i "hugetlb"
-
HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages
HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages
HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page

$ cat /proc/meminfo | grep -i "hugetlb"
-
Hugetlb: 1048576 kB

Was this tested with patch [1] in your local tree?

[1]: 
https://web.git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=d629d7a8efc33

IIUC, this patch [1] disables the boot time allocation of hugepages.
Isn't it also disabling the boot time allocation for gigantic huge pages
passed by the cmdline params like hugepagesz=1G and hugepages=2 ?


Yes, I had the patch [1] in my tree.

My understanding is that gigantic pages are allocated before normal huge 
pages.


In hugepages_setup() in hugetlb.c, we have:

    if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate))
    hugetlb_hstate_alloc_pages(parsed_hstate);

I believe the above code allocates memory for gigantic pages, and 
hugetlb_init() is

called later because it is a subsys_initcall.

So, by the time the kernel reaches hugetlb_init(), the gigantic pages 
are already

allocated. Isn't that right?

Please let me know your opinion.

Thanks,
Sourabh Jain






HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages

This print comes from report_hugepages(). The only place from where
report_hugepages() gets called is hugetlb_init(). hugetlb_init() is what
is responsible for hugepages & gigantic hugepage allocations of the
passed kernel cmdline params.

But hugetlb_init() already checks for hugepages_supported() in the very
beginning. So I am not sure whether we need this extra patch to disable
gigantic hugepages allocation by the kernel cmdline params like
hugepagesz=1G and hugepages=2 type of options.

Hence I was wondering if you had this patch [1] in your tree when you were
testing this?

But I may be missing something. Could you please help clarify on whether
we really need this patch to disable gigantic hugetlb page allocations?


Fadump kernel: gigantic hugepage not allocated
===

dmesg | grep -i "hugetlb"
-
[0.00] HugeTLB: unsupported hugepagesz=1G
[0.00] HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring
[0.706375] HugeTLB support is disabled!
[0.773530] hugetlbfs: disabling because there are no supported hugepage 
sizes

$ cat /proc/meminfo | grep -i "hugetlb"
--


Cc: Hari Bathini 
Cc: Madhavan Srinivasan 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 
Cc: Ritesh Harjani (IBM)" 

I guess the extra " in the above was not adding me in the cc list.
Hence I missed to see this patch early.


Thanks for point it out. I will fix it.




-ritesh



Reviewed-by: Christophe Leroy 
Signed-off-by: Sourabh Jain 
---
Changelog:

v1:
https://lore.kernel.org/all/20250121150419.1342794-1-sourabhj...@linux.ibm.com/

v2:
https://lore.kernel.org/all/20250124103220.111303-1-sourabhj...@linux.ibm.com/
  - disable gigantic hugepage in arch code, arch_hugetlb_valid_size()

v3:
https://lore.kernel.org/all/20250125104928.1-1-sourabhj...@linux.ibm.com/
  - Do not modify the initialization of the shift variable

v4:
- Update commit message to include how hugepages_supported() detects
   hugepages support when fadump is active
- Add Reviewed-by tag
- NO functional change

---
  arch/powerpc/mm/hugetlbpage.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 6b043180220a..88cfd182db4e 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -138,6 +138,9 @@ bool __init arch_hugetlb_valid_size(unsigned long size)
int shift = __ffs(size);

Re: [PATCH v3] fs: introduce getfsxattrat and setfsxattrat syscalls

2025-03-02 Thread Pali Rohár
On Friday 28 February 2025 09:30:38 Andrey Albershteyn wrote:
> On 2025-02-21 20:15:24, Amir Goldstein wrote:
> > On Fri, Feb 21, 2025 at 7:13 PM Darrick J. Wong  wrote:
> > >
> > > On Tue, Feb 11, 2025 at 06:22:47PM +0100, Andrey Albershteyn wrote:
> > > > From: Andrey Albershteyn 
> > > >
> > > > Introduce getfsxattrat and setfsxattrat syscalls to manipulate inode
> > > > extended attributes/flags. The syscalls take parent directory fd and
> > > > path to the child together with struct fsxattr.
> > > >
> > > > This is an alternative to FS_IOC_FSSETXATTR ioctl with a difference
> > > > that file don't need to be open as we can reference it with a path
> > > > instead of fd. By having this we can manipulated inode extended
> > > > attributes not only on regular files but also on special ones. This
> > > > is not possible with FS_IOC_FSSETXATTR ioctl as with special files
> > > > we can not call ioctl() directly on the filesystem inode using fd.
> > > >
> > > > This patch adds two new syscalls which allows userspace to get/set
> > > > extended inode attributes on special files by using parent directory
> > > > and a path - *at() like syscall.
> > > >
> > > > Also, as vfs_fileattr_set() is now will be called on special files
> > > > too, let's forbid any other attributes except projid and nextents
> > > > (symlink can have an extent).
> > > >
> > > > CC: linux-...@vger.kernel.org
> > > > CC: linux-fsde...@vger.kernel.org
> > > > CC: linux-...@vger.kernel.org
> > > > Signed-off-by: Andrey Albershteyn 
> > > > ---
> > > > v1:
> > > > https://lore.kernel.org/linuxppc-dev/20250109174540.893098-1-aalbe...@kernel.org/
> > > >
> > > > Previous discussion:
> > > > https://lore.kernel.org/linux-xfs/20240520164624.665269-2-aalbe...@redhat.com/
> > > >
> > > > XFS has project quotas which could be attached to a directory. All
> > > > new inodes in these directories inherit project ID set on parent
> > > > directory.
> > > >
> > > > The project is created from userspace by opening and calling
> > > > FS_IOC_FSSETXATTR on each inode. This is not possible for special
> > > > files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left
> > > > with empty project ID. Those inodes then are not shown in the quota
> > > > accounting but still exist in the directory. Moreover, in the case
> > > > when special files are created in the directory with already
> > > > existing project quota, these inode inherit extended attributes.
> > > > This than leaves them with these attributes without the possibility
> > > > to clear them out. This, in turn, prevents userspace from
> > > > re-creating quota project on these existing files.
> > > > ---
> > > > Changes in v3:
> > > > - Remove unnecessary "dfd is dir" check as it checked in user_path_at()
> > > > - Remove unnecessary "same filesystem" check
> > > > - Use CLASS() instead of directly calling fdget/fdput
> > > > - Link to v2: 
> > > > https://lore.kernel.org/r/20250122-xattrat-syscall-v2-1-5b360d4fb...@kernel.org
> > > > ---
> > > >  arch/alpha/kernel/syscalls/syscall.tbl  |  2 +
> > > >  arch/arm/tools/syscall.tbl  |  2 +
> > > >  arch/arm64/tools/syscall_32.tbl |  2 +
> > > >  arch/m68k/kernel/syscalls/syscall.tbl   |  2 +
> > > >  arch/microblaze/kernel/syscalls/syscall.tbl |  2 +
> > > >  arch/mips/kernel/syscalls/syscall_n32.tbl   |  2 +
> > > >  arch/mips/kernel/syscalls/syscall_n64.tbl   |  2 +
> > > >  arch/mips/kernel/syscalls/syscall_o32.tbl   |  2 +
> > > >  arch/parisc/kernel/syscalls/syscall.tbl |  2 +
> > > >  arch/powerpc/kernel/syscalls/syscall.tbl|  2 +
> > > >  arch/s390/kernel/syscalls/syscall.tbl   |  2 +
> > > >  arch/sh/kernel/syscalls/syscall.tbl |  2 +
> > > >  arch/sparc/kernel/syscalls/syscall.tbl  |  2 +
> > > >  arch/x86/entry/syscalls/syscall_32.tbl  |  2 +
> > > >  arch/x86/entry/syscalls/syscall_64.tbl  |  2 +
> > > >  arch/xtensa/kernel/syscalls/syscall.tbl |  2 +
> > > >  fs/inode.c  | 75 
> > > > +
> > > >  fs/ioctl.c  | 16 +-
> > > >  include/linux/fileattr.h|  1 +
> > > >  include/linux/syscalls.h|  4 ++
> > > >  include/uapi/asm-generic/unistd.h   |  8 ++-
> > > >  21 files changed, 133 insertions(+), 3 deletions(-)
> > > >
> > >
> > > 
> > >
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index 
> > > > 6b4c77268fc0ecace4ac78a9ca777fbffc277f4a..b29db4fabaf67a6cbf541a86978b290411ec
> > > >  100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -23,6 +23,9 @@
> > > >  #include 
> > > >  #include 
> > > >  #include 
> > > > +#include 
> > > > +#include 
> > > > +#include 
> > > >  #include 
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include 
> > > > @@ -2953,3 +2956,75 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap,
> > > >   return mode & ~S_ISGID;
> > > >  }
> > > >  EXPORT_SYMBOL(mode_strip_s