Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
在 2025/3/3 11:43, Sathyanarayanan Kuppuswamy 写道: On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error occurs, the AER and DPC drivers do not report specific error types, resulting in logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller nvme :34:00.0: ready 0ms after DPC pcieport :30:03.0: AER: broadcast slot_reset message AER status registers are sticky and Write-1-to-clear. If the link recovered after hot reset, we can still safely access AER status of the error device. In such case, report fatal errors which helps to figure out the error root case. After this patch, the logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller pcieport :30:03.0: waiting 100 ms for downstream link, after activation nvme :34:00.0: ready 0ms after DPC nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0: [ 4] DLP (First) pcieport :30:03.0: AER: broadcast slot_reset message IMO, above info about device error details is more of a debug info. Since the main use of this info use to understand more details about the recovered DPC error. So I think is better to print with debug tag. Lets see what others think. Code wise, looks fine to me. thanks, looking forward to more feedback. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11 +++ drivers/pci/pcie/dpc.c | 2 +- drivers/pci/pcie/err.c | 9 + 4 files changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 870d2fbd6ff2..e852fa58b250 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -549,7 +549,8 @@ struct aer_err_info { struct pcie_tlp_log tlp; /* TLP Header */ }; -int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info); +int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info, + bool link_healthy); void aer_print_error(struct pci_dev *dev, struct aer_err_info *info); int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 508474e17183..bfb67db074f0 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1197,12 +1197,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue); * aer_get_device_error_info - read error status from dev and store it to info * @dev: pointer to the device expected to have a error record * @info: pointer to structure to store the error record + * @link_healthy: link is healthy or not * * Return 1 on success, 0 on error. * * Note that @info is reused among all error devices. Clear fields properly. */ -int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) +int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info, + bool link_healthy) { int type = pci_pcie_type(dev); int aer = dev->aer_cap; @@ -1226,7 +1228,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) } else if (type == PCI_EXP_TYPE_ROOT_PORT || type == PCI_EXP_TYPE_RC_EC || type == PCI_EXP_TYPE_DOWNSTREAM || - info->severity == AER_NONFATAL) { + info->severity == AER_NONFATAL || + (info->severity == AER_FATAL && link_healthy)) { /* Link is still healthy for IO reads */ pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, @@ -1258,11 +1261,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info) /* Report all before handle them, not to lost records by reset etc. */ for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) { - if (aer_get_device_error_info(e_info->dev[i], e_info)) + if (aer_get_device_error_info(e_info->dev[i], e_info, false)) aer_print_error(e_info->dev[i], e_info); } for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) { - if (aer_get_device_error_info(e_info->dev[i], e_info)) + if (aer_get_device_error_info(e_info->dev[i], e_info, false)) handle_error_source(e_info->dev[i], e
[powerpc:merge] BUILD SUCCESS 1304f486dbf1f455c9abe284c155747be90043b3
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge branch HEAD: 1304f486dbf1f455c9abe284c155747be90043b3 Automatic merge of 'next' into merge (2025-03-02 10:01) elapsed time: 1441m configs tested: 135 configs skipped: 4 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alpha allnoconfiggcc-14.2.0 alphaallyesconfiggcc-14.2.0 alpha defconfiggcc-14.2.0 arc allmodconfiggcc-13.2.0 arc allnoconfiggcc-13.2.0 arc allyesconfiggcc-13.2.0 arc axs103_defconfiggcc-13.2.0 arc defconfiggcc-13.2.0 archsdk_defconfiggcc-13.2.0 arc nsimosci_hs_smp_defconfiggcc-13.2.0 arc randconfig-001-20250302gcc-13.2.0 arc randconfig-002-20250302gcc-13.2.0 arm allmodconfiggcc-14.2.0 arm allnoconfigclang-17 arm allyesconfiggcc-14.2.0 arm defconfigclang-21 armmps2_defconfigclang-15 arm randconfig-001-20250302gcc-14.2.0 arm randconfig-002-20250302clang-21 arm randconfig-003-20250302gcc-14.2.0 arm randconfig-004-20250302clang-21 arm64allmodconfigclang-18 arm64 allnoconfiggcc-14.2.0 arm64 randconfig-001-20250302clang-18 arm64 randconfig-002-20250302gcc-14.2.0 arm64 randconfig-003-20250302gcc-14.2.0 arm64 randconfig-004-20250302clang-16 csky allnoconfiggcc-14.2.0 cskydefconfiggcc-14.2.0 csky randconfig-001-20250302gcc-14.2.0 csky randconfig-002-20250302gcc-14.2.0 hexagon allmodconfigclang-21 hexagon allnoconfigclang-21 hexagon allyesconfigclang-18 hexagon randconfig-001-20250302clang-21 hexagon randconfig-002-20250302clang-21 i386 allmodconfiggcc-12 i386 allnoconfiggcc-12 i386 allyesconfiggcc-12 i386buildonly-randconfig-001-20250302gcc-12 i386buildonly-randconfig-002-20250302clang-19 i386buildonly-randconfig-003-20250302gcc-12 i386buildonly-randconfig-004-20250302gcc-12 i386buildonly-randconfig-005-20250302gcc-12 i386buildonly-randconfig-006-20250302gcc-12 i386defconfigclang-19 loongarchalldefconfiggcc-14.2.0 loongarchallmodconfiggcc-14.2.0 loongarch allnoconfiggcc-14.2.0 loongarch randconfig-001-20250302gcc-14.2.0 loongarch randconfig-002-20250302gcc-14.2.0 m68k allmodconfiggcc-14.2.0 m68k allnoconfiggcc-14.2.0 m68k allyesconfiggcc-14.2.0 m68k multi_defconfiggcc-14.2.0 microblaze allmodconfiggcc-14.2.0 microblazeallnoconfiggcc-14.2.0 microblaze allyesconfiggcc-14.2.0 mips allnoconfiggcc-14.2.0 mips ath25_defconfigclang-16 mips eyeq5_defconfiggcc-14.2.0 nios2 allnoconfiggcc-14.2.0 nios2 randconfig-001-20250302gcc-14.2.0 nios2 randconfig-002-20250302gcc-14.2.0 openrisc allnoconfiggcc-14.2.0 openrisc allyesconfiggcc-14.2.0 openriscdefconfiggcc-14.2.0 parisc allmodconfiggcc-14.2.0 pariscallnoconfiggcc-14.2.0 parisc allyesconfiggcc-14.2.0 parisc defconfiggcc-14.2.0 pariscrandconfig-001-20250302gcc-14.2.0 pariscrandconfig-002-20250302gcc-14.2.0 powerpc allmodconfiggcc-14.2.0 powerpc allnoconfiggcc-14.2.0 powerpc allyesconfigclang-16 powerpc ebony_defconfigclang-18 powerpcgamecube_defconfigclang-
[PATCH v2 RESEND] sched/membarrier: Fix redundant load of membarrier_state
On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE is not selected, sync_core_before_usermode() is a no-op. In membarrier_mm_sync_core_before_usermode() the compiler does not eliminate redundant branches and load of mm->membarrier_state for this case as the atomic_read() cannot be optimized away. Here's a snippet of the code generated for finish_task_switch() on powerpc prior to this change: 1b786c: ld r26,2624(r30) # mm = rq->prev_mm; ... 1b78c8: cmpdi cr7,r26,0 1b78cc: beq cr7,1b78e4 1b78d0: ld r9,2312(r13)# current 1b78d4: ld r9,1888(r9) # current->mm 1b78d8: cmpdcr7,r26,r9 1b78dc: beq cr7,1b7a70 1b78e0: hwsync 1b78e4: cmplwi cr7,r27,128 ... 1b7a70: lwz r9,176(r26) # atomic_read(&mm->membarrier_state) 1b7a74: b 1b78e0 This was found while analyzing "perf c2c" reports on kernels prior to commit c1753fd02a00 ("mm: move mm_count into its own cache line") where mm_count was false sharing with membarrier_state. There is a minor improvement in the size of finish_task_switch(). The following are results from bloat-o-meter for ppc64le: GCC 7.5.0 - add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32) Function old new delta finish_task_switch 884 852 -32 GCC 12.2.1 -- add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32) Function old new delta finish_task_switch.isra 852 820 -32 LLVM 17.0.6 --- add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36) Function old new delta rt_mutex_schedule120 104 -16 finish_task_switch 792 772 -20 Results on aarch64: GCC 14.1.1 -- add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56) Function old new delta get_nohz_timer_target352 356 +4 e843419@0b02_d7e7_408 8 - -8 e843419@01bb_21d2_868 8 - -8 finish_task_switch.isra 592 548 -44 Signed-off-by: Nysal Jan K.A. Reviewed-by: Mathieu Desnoyers Reviewed-by: Michael Ellerman Reviewed-by: Segher Boessenkool --- V1 -> V2: - Add results for aarch64 - Add a comment describing the changes --- include/linux/sched/mm.h | 7 +++ 1 file changed, 7 insertions(+) diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 928a626725e6..b13474825130 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -531,6 +531,13 @@ enum { static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { + /* +* The atomic_read() below prevents CSE. The following should +* help the compiler generate more efficient code on architectures +* where sync_core_before_usermode() is a no-op. +*/ + if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE)) + return; if (current->mm != mm) return; if (likely(!(atomic_read(&mm->membarrier_state) & -- 2.48.1
Re: [PATCH v4 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
On 2/16/25 6:42 PM, Shuai Xue wrote: The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error occurs, the AER and DPC drivers do not report specific error types, resulting in logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller nvme :34:00.0: ready 0ms after DPC pcieport :30:03.0: AER: broadcast slot_reset message AER status registers are sticky and Write-1-to-clear. If the link recovered after hot reset, we can still safely access AER status of the error device. In such case, report fatal errors which helps to figure out the error root case. After this patch, the logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller pcieport :30:03.0: waiting 100 ms for downstream link, after activation nvme :34:00.0: ready 0ms after DPC nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message IMO, above info about device error details is more of a debug info. Since the main use of this info use to understand more details about the recovered DPC error. So I think is better to print with debug tag. Lets see what others think. Code wise, looks fine to me. Signed-off-by: Shuai Xue --- drivers/pci/pci.h | 3 ++- drivers/pci/pcie/aer.c | 11 +++ drivers/pci/pcie/dpc.c | 2 +- drivers/pci/pcie/err.c | 9 + 4 files changed, 19 insertions(+), 6 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 870d2fbd6ff2..e852fa58b250 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -549,7 +549,8 @@ struct aer_err_info { struct pcie_tlp_log tlp;/* TLP Header */ }; -int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info); +int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info, + bool link_healthy); void aer_print_error(struct pci_dev *dev, struct aer_err_info *info); int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index 508474e17183..bfb67db074f0 100644 --- a/drivers/pci/pcie/aer.c +++ b/drivers/pci/pcie/aer.c @@ -1197,12 +1197,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue); * aer_get_device_error_info - read error status from dev and store it to info * @dev: pointer to the device expected to have a error record * @info: pointer to structure to store the error record + * @link_healthy: link is healthy or not * * Return 1 on success, 0 on error. * * Note that @info is reused among all error devices. Clear fields properly. */ -int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) +int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info, + bool link_healthy) { int type = pci_pcie_type(dev); int aer = dev->aer_cap; @@ -1226,7 +1228,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info) } else if (type == PCI_EXP_TYPE_ROOT_PORT || type == PCI_EXP_TYPE_RC_EC || type == PCI_EXP_TYPE_DOWNSTREAM || - info->severity == AER_NONFATAL) { + info->severity == AER_NONFATAL || + (info->severity == AER_FATAL && link_healthy)) { /* Link is still healthy for IO reads */ pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, @@ -1258,11 +1261,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info) /* Report all before handle them, not to lost records by reset etc. */ for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) { - if (aer_get_device_error_info(e_info->dev[i], e_info)) + if (aer_get_device_error_info(e_info->dev[i], e_info, false)) aer_print_error(e_info->dev[i], e_info); } for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) { - if (aer_get_device_error_info(e_info->dev[i], e_info)) + if (aer_get_device_error_info(e_info->dev[i], e_info, false))
Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error
在 2025/3/3 11:36, Sathyanarayanan Kuppuswamy 写道: On 2/16/25 6:42 PM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that experienced the DPC event to pcie_do_recovery(). Use the SOURCE ID register to correctly identify the device that detected the error. When passing the error device, the pcie_do_recovery() will find the upstream bridge and walk bridges potentially AER affected. And subsequent patches will be able to accurately access AER status of the error device. Should not observe any functional changes. Signed-off-by: Shuai Xue --- Looks good to me Reviewed-by: Kuppuswamy Sathyanarayanan Thanks. Shuai
Re: [PATCH v2 2/4] powerpc/perf/hv-24x7: Avoid loading hv-24x7 during dump kernel
Hi Madhavan, kernel test robot noticed the following build errors: [auto build test ERROR on powerpc/next] [also build test ERROR on powerpc/fixes linus/master v6.14-rc4 next-20250228] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Madhavan-Srinivasan/powerpc-perf-hv-24x7-Avoid-loading-hv-24x7-during-dump-kernel/20250302-022531 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next patch link: https://lore.kernel.org/r/20250301182310.6832-2-maddy%40linux.ibm.com patch subject: [PATCH v2 2/4] powerpc/perf/hv-24x7: Avoid loading hv-24x7 during dump kernel config: powerpc64-randconfig-001-20250302 (https://download.01.org/0day-ci/archive/20250302/202503021908.ed1go0gi-...@intel.com/config) compiler: powerpc64-linux-gcc (GCC) 14.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250302/202503021908.ed1go0gi-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202503021908.ed1go0gi-...@intel.com/ All errors (new ones prefixed by >>): arch/powerpc/perf/hv-24x7.c: In function 'hv_24x7_init': >> arch/powerpc/perf/hv-24x7.c:1701:13: error: implicit declaration of function >> 'is_kdump_kernel' [-Wimplicit-function-declaration] 1701 | if (is_kdump_kernel() || is_fadump_active()) | ^~~ vim +/is_kdump_kernel +1701 arch/powerpc/perf/hv-24x7.c 1693 1694 static int hv_24x7_init(void) 1695 { 1696 int r; 1697 unsigned long hret; 1698 unsigned int pvr = mfspr(SPRN_PVR); 1699 struct hv_perf_caps caps; 1700 > 1701 if (is_kdump_kernel() || is_fadump_active()) 1702 return 0; 1703 1704 if (!firmware_has_feature(FW_FEATURE_LPAR)) { 1705 pr_debug("not a virtualized system, not enabling\n"); 1706 return -ENODEV; 1707 } 1708 1709 /* POWER8 only supports v1, while POWER9 only supports v2. */ 1710 if (PVR_VER(pvr) == PVR_POWER8 || PVR_VER(pvr) == PVR_POWER8E || 1711 PVR_VER(pvr) == PVR_POWER8NVL) 1712 interface_version = 1; 1713 else { 1714 interface_version = 2; 1715 1716 /* SMT8 in POWER9 needs to aggregate result elements. */ 1717 if (threads_per_core == 8) 1718 aggregate_result_elements = true; 1719 } 1720 1721 hret = hv_perf_caps_get(&caps); 1722 if (hret) { 1723 pr_debug("could not obtain capabilities, not enabling, rc=%ld\n", 1724 hret); 1725 return -ENODEV; 1726 } 1727 1728 hv_page_cache = kmem_cache_create("hv-page-4096", 4096, 4096, 0, NULL); 1729 if (!hv_page_cache) 1730 return -ENOMEM; 1731 1732 /* sampling not supported */ 1733 h_24x7_pmu.capabilities |= PERF_PMU_CAP_NO_INTERRUPT; 1734 1735 r = create_events_from_catalog(&event_group.attrs, 1736 &event_desc_group.attrs, 1737 &event_long_desc_group.attrs); 1738 1739 if (r) 1740 return r; 1741 1742 /* init cpuhotplug */ 1743 r = hv_24x7_cpu_hotplug_init(); 1744 if (r) 1745 return r; 1746 1747 r = perf_pmu_register(&h_24x7_pmu, h_24x7_pmu.name, -1); 1748 if (r) 1749 return r; 1750 1751 read_24x7_sys_info(); 1752 1753 return 0; 1754 } 1755 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH 0/5] Microwatt updates
[Sorry, I wanted to reply earlier, but it stayed in my drafts folder for a month] On Sat, Feb 01, 2025 at 12:22:51PM +1100, Paul Mackerras wrote: [snipped] > > 603 was a looong time ago, I don't recall the details. > > Regarding broadcast TLBIEs, the protocols and mechanisms for doing > that are known to be complex and slow in the IBM Power processors (ask > Derek Williams about that :). Anton found that in fact doing only > local TLBIEs and using IPIs gave *better* performance on IBM Power > systems than using hardware broadcast TLBIEs in many cases (the reason > being that software knows which other CPUs might have a given TLB > entry, often quite a small set, whereas hardware doesn't, and has to > send the invalidation to every CPU and wait for a response from every > CPU). Add to that, that most other SMP-capable CPU architectures > don't do broadcast TLB invalidations, Intel x86 for example. Actually it's coming to x86, at least on the AMD side: https://lore.kernel.org/all/20250206044346.3810242-1-r...@surriel.com/ with performance numbers which look rather good. I don't know how it looks like at the level of the hardware protocol, but implementing it on a single chip/socket is likely relatively simple. Gabriel > > > > the kernel already has code to deal with this. One of the patches in > > > this series provides a config option to allow platforms to select > > > unconditionally the behaviour where cross-CPU TLB invalidations are > > > handled using inter-processor interrupts. > > > > Are there plans to broadcast the (SMP cache invalidation) messages? > > Cache (i.e. instruction and data cache) - yes, they *are* coherent. > More precisely, the D caches are write-through, and all I and D caches > snoop writes to memory (including DMA writes) and invalidate any cache > lines being written to. > > > Will uwatt support some real bus protocol, for example? > > "Real" meaning using tri-state bus drivers, like we did in the 90s? :) > > > Again, congrats on this great milestone! Does this floating point > > support do square roots as well (aka "gpopt"; does it do "gfxopt" for > > that matter, fsel?) fsqrt is kinda tricky to get to work fully > > correctly :-) > > Yes, fsqrt and fsel are implemented in hardware, and are accurate to > the last bit. Also, the FPU handles denormalized values in hardware > (both input and output) and implements all exception handling as per > the ISA, including the trap-enabled overflow cases. Feel free to run > whatever tests you like and report bugs. But we're getting a bit > off-topic from the kernel patches. :) > > Paul. >
Re: [PATCH v2 1/4] powerpc/perf/core-book3s: Avoid loading platform pmu driver during dump kernel
Hi Madhavan, kernel test robot noticed the following build errors: [auto build test ERROR on powerpc/next] [also build test ERROR on powerpc/fixes linus/master v6.14-rc4 next-20250228] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Madhavan-Srinivasan/powerpc-perf-hv-24x7-Avoid-loading-hv-24x7-during-dump-kernel/20250302-022531 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next patch link: https://lore.kernel.org/r/20250301182310.6832-1-maddy%40linux.ibm.com patch subject: [PATCH v2 1/4] powerpc/perf/core-book3s: Avoid loading platform pmu driver during dump kernel config: powerpc64-randconfig-001-20250302 (https://download.01.org/0day-ci/archive/20250302/202503021712.evdjymst-...@intel.com/config) compiler: powerpc64-linux-gcc (GCC) 14.2.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250302/202503021712.evdjymst-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202503021712.evdjymst-...@intel.com/ All errors (new ones prefixed by >>): arch/powerpc/perf/core-book3s.c: In function 'init_ppc64_pmu': >> arch/powerpc/perf/core-book3s.c:2599:13: error: implicit declaration of >> function 'is_kdump_kernel' [-Wimplicit-function-declaration] 2599 | if (is_kdump_kernel() || is_fadump_active()) | ^~~ vim +/is_kdump_kernel +2599 arch/powerpc/perf/core-book3s.c 2587 2588 static int __init init_ppc64_pmu(void) 2589 { 2590 if (cpu_has_feature(CPU_FTR_HVMODE) && pmu_override) { 2591 pr_warn("disabling perf due to pmu_override= command line option.\n"); 2592 on_each_cpu(do_pmu_override, NULL, 1); 2593 return 0; 2594 } 2595 2596 /* 2597 * If the dump kernel is active, skip loading these drivers 2598 */ > 2599 if (is_kdump_kernel() || is_fadump_active()) 2600 return 0; 2601 2602 /* run through all the pmu drivers one at a time */ 2603 if (!init_power5_pmu()) 2604 return 0; 2605 else if (!init_power5p_pmu()) 2606 return 0; 2607 else if (!init_power6_pmu()) 2608 return 0; 2609 else if (!init_power7_pmu()) 2610 return 0; 2611 else if (!init_power8_pmu()) 2612 return 0; 2613 else if (!init_power9_pmu()) 2614 return 0; 2615 else if (!init_power10_pmu()) 2616 return 0; 2617 else if (!init_power11_pmu()) 2618 return 0; 2619 else if (!init_ppc970_pmu()) 2620 return 0; 2621 else 2622 return init_generic_compat_pmu(); 2623 } 2624 early_initcall(init_ppc64_pmu); 2625 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH v4 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
在 2025/2/17 10:42, Shuai Xue 写道: changes since v3: - squash patch 1 and 2 into one patch per Sathyanarayanan - add comments note for dpc_process_error per Sathyanarayanan - pick up Reviewed-by tag from Sathyanarayanan changes since v2: - moving the "err_port" rename to a separate patch per Sathyanarayanan - rewrite comments of dpc_process_error per Sathyanarayanan - remove NULL initialization for err_dev per Sathyanarayanan changes since v1: - rewrite commit log per Bjorn - refactor aer_get_device_error_info to reduce duplication per Keith - fix to avoid reporting fatal errors twice for root and downstream ports per Keith The AER driver has historically avoided reading the configuration space of an endpoint or RCiEP that reported a fatal error, considering the link to that device unreliable. Consequently, when a fatal error occurs, the AER and DPC drivers do not report specific error types, resulting in logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller nvme :34:00.0: ready 0ms after DPC pcieport :30:03.0: AER: broadcast slot_reset message AER status registers are sticky and Write-1-to-clear. If the link recovered after hot reset, we can still safely access AER status of the error device. In such case, report fatal errors which helps to figure out the error root case. After this patch set, the logs like: pcieport :30:03.0: EDR: EDR event received pcieport :30:03.0: DPC: containment event, status:0x0005 source:0x3400 pcieport :30:03.0: DPC: ERR_FATAL detected pcieport :30:03.0: AER: broadcast error_detected message nvme nvme0: frozen state error detected, reset controller pcieport :30:03.0: waiting 100 ms for downstream link, after activation nvme :34:00.0: ready 0ms after DPC nvme :34:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Data Link Layer, (Receiver ID) nvme :34:00.0: device [144d:a804] error status/mask=0010/00504000 nvme :34:00.0:[ 4] DLP(First) pcieport :30:03.0: AER: broadcast slot_reset message Shuai Xue (3): PCI/DPC: Clarify naming for error port in DPC Handling PCI/DPC: Run recovery on device that detected the error PCI/AER: Report fatal errors of RCiEP and EP if link recoverd drivers/pci/pci.h | 5 +++-- drivers/pci/pcie/aer.c | 11 +++ drivers/pci/pcie/dpc.c | 34 +++--- drivers/pci/pcie/edr.c | 35 ++- drivers/pci/pcie/err.c | 9 + 5 files changed, 64 insertions(+), 30 deletions(-) Hi, All, Gentle ping. Thanks. Shuai
Re: [PATCH v4 2/3] PCI/DPC: Run recovery on device that detected the error
On 2/16/25 6:42 PM, Shuai Xue wrote: The current implementation of pcie_do_recovery() assumes that the recovery process is executed on the device that detected the error. However, the DPC driver currently passes the error port that experienced the DPC event to pcie_do_recovery(). Use the SOURCE ID register to correctly identify the device that detected the error. When passing the error device, the pcie_do_recovery() will find the upstream bridge and walk bridges potentially AER affected. And subsequent patches will be able to accurately access AER status of the error device. Should not observe any functional changes. Signed-off-by: Shuai Xue --- Looks good to me Reviewed-by: Kuppuswamy Sathyanarayanan drivers/pci/pci.h | 2 +- drivers/pci/pcie/dpc.c | 28 drivers/pci/pcie/edr.c | 7 --- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index 01e51db8d285..870d2fbd6ff2 100644 --- a/drivers/pci/pci.h +++ b/drivers/pci/pci.h @@ -572,7 +572,7 @@ struct rcec_ea { void pci_save_dpc_state(struct pci_dev *dev); void pci_restore_dpc_state(struct pci_dev *dev); void pci_dpc_init(struct pci_dev *pdev); -void dpc_process_error(struct pci_dev *pdev); +struct pci_dev *dpc_process_error(struct pci_dev *pdev); pci_ers_result_t dpc_reset_link(struct pci_dev *pdev); bool pci_dpc_recovered(struct pci_dev *pdev); unsigned int dpc_tlp_log_len(struct pci_dev *dev); diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c index 1a54a0b657ae..ea3ea989afa7 100644 --- a/drivers/pci/pcie/dpc.c +++ b/drivers/pci/pcie/dpc.c @@ -253,10 +253,20 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev, return 1; } -void dpc_process_error(struct pci_dev *pdev) +/** + * dpc_process_error - handle the DPC error status + * @pdev: the port that experienced the containment event + * + * Return the device that detected the error. + * + * NOTE: The device reference count is increased, the caller must decrement + * the reference count by calling pci_dev_put(). + */ +struct pci_dev *dpc_process_error(struct pci_dev *pdev) { u16 cap = pdev->dpc_cap, status, source, reason, ext_reason; struct aer_err_info info; + struct pci_dev *err_dev; pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status); pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source); @@ -279,6 +289,13 @@ void dpc_process_error(struct pci_dev *pdev) "software trigger" : "reserved error"); + if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE || + reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) + err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus), + PCI_BUS_NUM(source), source & 0xff); + else + err_dev = pci_dev_get(pdev); + /* show RP PIO error detail information */ if (pdev->dpc_rp_extensions && reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT && @@ -291,6 +308,8 @@ void dpc_process_error(struct pci_dev *pdev) pci_aer_clear_nonfatal_status(pdev); pci_aer_clear_fatal_status(pdev); } + + return err_dev; } static void pci_clear_surpdn_errors(struct pci_dev *pdev) @@ -346,7 +365,7 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev) static irqreturn_t dpc_handler(int irq, void *context) { - struct pci_dev *err_port = context; + struct pci_dev *err_port = context, *err_dev; /* * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect @@ -357,10 +376,11 @@ static irqreturn_t dpc_handler(int irq, void *context) return IRQ_HANDLED; } - dpc_process_error(err_port); + err_dev = dpc_process_error(err_port); /* We configure DPC so it only triggers on ERR_FATAL */ - pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link); + pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link); + pci_dev_put(err_dev); return IRQ_HANDLED; } diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c index 521fca2f40cb..088f3e188f54 100644 --- a/drivers/pci/pcie/edr.c +++ b/drivers/pci/pcie/edr.c @@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev, static void edr_handle_event(acpi_handle handle, u32 event, void *data) { - struct pci_dev *pdev = data, *err_port; + struct pci_dev *pdev = data, *err_port, *err_dev; pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT; u16 status; @@ -190,7 +190,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data) goto send_ost; } - dpc_process_error(err_port); + err_dev = dpc_process_error(err_port); pci_aer_raw_clear_status(err_port); /* @@ -198,7 +198,7 @@ static void edr_han
[PATCH 0/2] powerpc: gpio_mdio: Simplify gpio_mdio_probe()
While wondering if it was correct to call mdiobus_free() in the remove function and only kfree() in the error handling of the probe, I arrived at the conclusion that the code could be simpler here. Patch 1 uses mdiobus_alloc_size() instead of a hand written mdiobus_alloc() + kzalloc(). it also uses the devm_ version in order to save some LoC (and answer my initial question) Patch 2 uses devm_of_mdiobus_register() to completly remove the .remove() function and save some more LoC. Both patches are compile tested only. Christophe JAILLET (2): powerpc: gpio_mdio: Use devm_mdiobus_alloc_size() powerpc: gpio_mdio: Use devm_of_mdiobus_register() arch/powerpc/platforms/pasemi/gpio_mdio.c | 41 --- 1 file changed, 6 insertions(+), 35 deletions(-) -- 2.48.1
[PATCH 1/2] powerpc: gpio_mdio: Use devm_mdiobus_alloc_size()
Use mdiobus_alloc_size() instead of a hand written mdiobus_alloc() + kzalloc(). This is less verbose and more robust. It also reduces memory fragmentation and saves a few bytes of memory. While at it, switch to devm_mdiobus_alloc_size() for extra simplification. Signed-off-by: Christophe JAILLET --- This patch is compile tested only. Some memory is saved because pahole states, on a x86_64, that: struct mii_bus { ... /* size: 3640, cachelines: 57, members: 23 */ /* sum members: 3633, holes: 2, sum holes: 7 */ /* member types with holes: 3, total: 4, bit paddings: 1, total: 1 bit */ /* paddings: 1, sum paddings: 3 */ /* forced alignments: 1, forced holes: 1, sum forced holes: 4 */ /* last cacheline: 56 bytes */ } Because of the way allocation works, 4096 bytes are allocated. When mdiobus_alloc_size() is used, struct gpio_priv fits in this "wasted" space and so is available "for free". --- arch/powerpc/platforms/pasemi/gpio_mdio.c | 26 +-- 1 file changed, 5 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c b/arch/powerpc/platforms/pasemi/gpio_mdio.c index e4538d471256..2c54f5f063b7 100644 --- a/arch/powerpc/platforms/pasemi/gpio_mdio.c +++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c @@ -213,15 +213,11 @@ static int gpio_mdio_probe(struct platform_device *ofdev) const unsigned int *prop; int err; - err = -ENOMEM; - priv = kzalloc(sizeof(struct gpio_priv), GFP_KERNEL); - if (!priv) - goto out; - - new_bus = mdiobus_alloc(); - + new_bus = devm_mdiobus_alloc_size(dev, sizeof(*priv)); if (!new_bus) - goto out_free_priv; + return -ENOMEM; + + priv = new_bus->priv; new_bus->name = "pasemi gpio mdio bus"; new_bus->read = &gpio_mdio_read; @@ -230,7 +226,6 @@ static int gpio_mdio_probe(struct platform_device *ofdev) prop = of_get_property(np, "reg", NULL); snprintf(new_bus->id, MII_BUS_ID_SIZE, "%x", *prop); - new_bus->priv = priv; prop = of_get_property(np, "mdc-pin", NULL); priv->mdc_pin = *prop; @@ -246,17 +241,10 @@ static int gpio_mdio_probe(struct platform_device *ofdev) if (err != 0) { pr_err("%s: Cannot register as MDIO bus, err %d\n", new_bus->name, err); - goto out_free_irq; + return err; } return 0; - -out_free_irq: - kfree(new_bus); -out_free_priv: - kfree(priv); -out: - return err; } @@ -267,10 +255,6 @@ static void gpio_mdio_remove(struct platform_device *dev) mdiobus_unregister(bus); dev_set_drvdata(&dev->dev, NULL); - - kfree(bus->priv); - bus->priv = NULL; - mdiobus_free(bus); } static const struct of_device_id gpio_mdio_match[] = -- 2.48.1
[PATCH 2/2] powerpc: gpio_mdio: Use devm_of_mdiobus_register()
Use devm_of_mdiobus_register() in order to remove the now empty .remove() function. Doing so dev_set_drvdata() is now also unneeded. Remove it as well. Signed-off-by: Christophe JAILLET --- This patch is compile tested only. --- arch/powerpc/platforms/pasemi/gpio_mdio.c | 15 +-- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c b/arch/powerpc/platforms/pasemi/gpio_mdio.c index 2c54f5f063b7..6712ccb84c0a 100644 --- a/arch/powerpc/platforms/pasemi/gpio_mdio.c +++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c @@ -234,10 +234,8 @@ static int gpio_mdio_probe(struct platform_device *ofdev) priv->mdio_pin = *prop; new_bus->parent = dev; - dev_set_drvdata(dev, new_bus); - - err = of_mdiobus_register(new_bus, np); + err = devm_of_mdiobus_register(dev, new_bus, np); if (err != 0) { pr_err("%s: Cannot register as MDIO bus, err %d\n", new_bus->name, err); @@ -247,16 +245,6 @@ static int gpio_mdio_probe(struct platform_device *ofdev) return 0; } - -static void gpio_mdio_remove(struct platform_device *dev) -{ - struct mii_bus *bus = dev_get_drvdata(&dev->dev); - - mdiobus_unregister(bus); - - dev_set_drvdata(&dev->dev, NULL); -} - static const struct of_device_id gpio_mdio_match[] = { { @@ -269,7 +257,6 @@ MODULE_DEVICE_TABLE(of, gpio_mdio_match); static struct platform_driver gpio_mdio_driver = { .probe = gpio_mdio_probe, - .remove = gpio_mdio_remove, .driver = { .name = "gpio-mdio-bitbang", .of_match_table = gpio_mdio_match, -- 2.48.1
Re: [PATCH v4] powerpc/hugetlb: Disable gigantic hugepages if fadump is active
Hello Ritesh, Thanks for the review. On 02/03/25 12:05, Ritesh Harjani (IBM) wrote: Sourabh Jain writes: The fadump kernel boots with limited memory solely to collect the kernel core dump. Having gigantic hugepages in the fadump kernel is of no use. Sure got it. Many times, the fadump kernel encounters OOM (Out of Memory) issues if gigantic hugepages are allocated. To address this, disable gigantic hugepages if fadump is active by returning early from arch_hugetlb_valid_size() using hugepages_supported(). When fadump is active, the global variable hugetlb_disabled is set to true, which is later used by the PowerPC-specific hugepages_supported() function to determine hugepage support. Returning early from arch_hugetlb_vali_size() not only disables gigantic hugepages but also avoids unnecessary hstate initialization for every hugepage size supported by the platform. kernel logs related to hugepages with this patch included: kernel argument passed: hugepagesz=1G hugepages=1 First kernel: gigantic hugepage got allocated == dmesg | grep -i "hugetlb" - HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page $ cat /proc/meminfo | grep -i "hugetlb" - Hugetlb: 1048576 kB Was this tested with patch [1] in your local tree? [1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=d629d7a8efc33 IIUC, this patch [1] disables the boot time allocation of hugepages. Isn't it also disabling the boot time allocation for gigantic huge pages passed by the cmdline params like hugepagesz=1G and hugepages=2 ? Yes, I had the patch [1] in my tree. My understanding is that gigantic pages are allocated before normal huge pages. In mm/hugetlb.c:hugepages_setup(), we have: if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate)) hugetlb_hstate_alloc_pages(parsed_hstate); I believe the above code allocates memory for gigantic pages, and mm/hugetlb.c:hugetlb_init() is called later because it is a subsys_initcall. So, by the time the kernel reaches hugetlb_init(), the gigantic pages are already allocated. Isn't that right? With this understanding, this patch avoids populating hstate so that gigantic huge page allocation fails for the fadump kernel. Please let me know your opinion. HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages This print comes from report_hugepages(). The only place from where report_hugepages() gets called is hugetlb_init(). hugetlb_init() is what is responsible for hugepages & gigantic hugepage allocations of the passed kernel cmdline params. But hugetlb_init() already checks for hugepages_supported() in the very beginning. So I am not sure whether we need this extra patch to disable gigantic hugepages allocation by the kernel cmdline params like hugepagesz=1G and hugepages=2 type of options. Hence I was wondering if you had this patch [1] in your tree when you were testing this? But I may be missing something. Could you please help clarify on whether we really need this patch to disable gigantic hugetlb page allocations? Fadump kernel: gigantic hugepage not allocated === dmesg | grep -i "hugetlb" - [0.00] HugeTLB: unsupported hugepagesz=1G [0.00] HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring [0.706375] HugeTLB support is disabled! [0.773530] hugetlbfs: disabling because there are no supported hugepage sizes $ cat /proc/meminfo | grep -i "hugetlb" -- Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM)" I guess the extra " in the above was not adding me in the cc list. Hence I missed to see this patch early. Thanks for pointing it. I will fix it. -ritesh Reviewed-by: Christophe Leroy Signed-off-by: Sourabh Jain --- Changelog: v1: https://lore.kernel.org/all/20250121150419.1342794-1-sourabhj...@linux.ibm.com/ v2: https://lore.kernel.org/all/20250124103220.111303-1-sourabhj...@linux.ibm.com/ - disable gigantic hugepage in arch code, arch_hugetlb_valid_size() v3: https://lore.kernel.org/all/20250125104928.1-1-sourabhj...@linux.ibm.com/ - Do not modify the initialization of the shift variable v4: - Update commit message to include how hugepages_supported() detects hugepages support when fadump is active - Add Reviewed-by tag - NO functional change --- arch/powerpc/mm/hugetlbpage.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 6b043180220a..88cfd182db4e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage
Re: [PATCH v4] powerpc/hugetlb: Disable gigantic hugepages if fadump is active
Hello Ritesh, Thanks for the review. On 02/03/25 12:05, Ritesh Harjani (IBM) wrote: Sourabh Jain writes: The fadump kernel boots with limited memory solely to collect the kernel core dump. Having gigantic hugepages in the fadump kernel is of no use. Sure got it. Many times, the fadump kernel encounters OOM (Out of Memory) issues if gigantic hugepages are allocated. To address this, disable gigantic hugepages if fadump is active by returning early from arch_hugetlb_valid_size() using hugepages_supported(). When fadump is active, the global variable hugetlb_disabled is set to true, which is later used by the PowerPC-specific hugepages_supported() function to determine hugepage support. Returning early from arch_hugetlb_vali_size() not only disables gigantic hugepages but also avoids unnecessary hstate initialization for every hugepage size supported by the platform. kernel logs related to hugepages with this patch included: kernel argument passed: hugepagesz=1G hugepages=1 First kernel: gigantic hugepage got allocated == dmesg | grep -i "hugetlb" - HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page HugeTLB: registered 2.00 MiB page size, pre-allocated 0 pages HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page $ cat /proc/meminfo | grep -i "hugetlb" - Hugetlb: 1048576 kB Was this tested with patch [1] in your local tree? [1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=d629d7a8efc33 IIUC, this patch [1] disables the boot time allocation of hugepages. Isn't it also disabling the boot time allocation for gigantic huge pages passed by the cmdline params like hugepagesz=1G and hugepages=2 ? Yes, I had the patch [1] in my tree. My understanding is that gigantic pages are allocated before normal huge pages. In hugepages_setup() in hugetlb.c, we have: if (hugetlb_max_hstate && hstate_is_gigantic(parsed_hstate)) hugetlb_hstate_alloc_pages(parsed_hstate); I believe the above code allocates memory for gigantic pages, and hugetlb_init() is called later because it is a subsys_initcall. So, by the time the kernel reaches hugetlb_init(), the gigantic pages are already allocated. Isn't that right? Please let me know your opinion. Thanks, Sourabh Jain HugeTLB: registered 1.00 GiB page size, pre-allocated 1 pages This print comes from report_hugepages(). The only place from where report_hugepages() gets called is hugetlb_init(). hugetlb_init() is what is responsible for hugepages & gigantic hugepage allocations of the passed kernel cmdline params. But hugetlb_init() already checks for hugepages_supported() in the very beginning. So I am not sure whether we need this extra patch to disable gigantic hugepages allocation by the kernel cmdline params like hugepagesz=1G and hugepages=2 type of options. Hence I was wondering if you had this patch [1] in your tree when you were testing this? But I may be missing something. Could you please help clarify on whether we really need this patch to disable gigantic hugetlb page allocations? Fadump kernel: gigantic hugepage not allocated === dmesg | grep -i "hugetlb" - [0.00] HugeTLB: unsupported hugepagesz=1G [0.00] HugeTLB: hugepages=1 does not follow a valid hugepagesz, ignoring [0.706375] HugeTLB support is disabled! [0.773530] hugetlbfs: disabling because there are no supported hugepage sizes $ cat /proc/meminfo | grep -i "hugetlb" -- Cc: Hari Bathini Cc: Madhavan Srinivasan Cc: Mahesh Salgaonkar Cc: Michael Ellerman Cc: Ritesh Harjani (IBM)" I guess the extra " in the above was not adding me in the cc list. Hence I missed to see this patch early. Thanks for point it out. I will fix it. -ritesh Reviewed-by: Christophe Leroy Signed-off-by: Sourabh Jain --- Changelog: v1: https://lore.kernel.org/all/20250121150419.1342794-1-sourabhj...@linux.ibm.com/ v2: https://lore.kernel.org/all/20250124103220.111303-1-sourabhj...@linux.ibm.com/ - disable gigantic hugepage in arch code, arch_hugetlb_valid_size() v3: https://lore.kernel.org/all/20250125104928.1-1-sourabhj...@linux.ibm.com/ - Do not modify the initialization of the shift variable v4: - Update commit message to include how hugepages_supported() detects hugepages support when fadump is active - Add Reviewed-by tag - NO functional change --- arch/powerpc/mm/hugetlbpage.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c index 6b043180220a..88cfd182db4e 100644 --- a/arch/powerpc/mm/hugetlbpage.c +++ b/arch/powerpc/mm/hugetlbpage.c @@ -138,6 +138,9 @@ bool __init arch_hugetlb_valid_size(unsigned long size) int shift = __ffs(size);
Re: [PATCH v3] fs: introduce getfsxattrat and setfsxattrat syscalls
On Friday 28 February 2025 09:30:38 Andrey Albershteyn wrote: > On 2025-02-21 20:15:24, Amir Goldstein wrote: > > On Fri, Feb 21, 2025 at 7:13 PM Darrick J. Wong wrote: > > > > > > On Tue, Feb 11, 2025 at 06:22:47PM +0100, Andrey Albershteyn wrote: > > > > From: Andrey Albershteyn > > > > > > > > Introduce getfsxattrat and setfsxattrat syscalls to manipulate inode > > > > extended attributes/flags. The syscalls take parent directory fd and > > > > path to the child together with struct fsxattr. > > > > > > > > This is an alternative to FS_IOC_FSSETXATTR ioctl with a difference > > > > that file don't need to be open as we can reference it with a path > > > > instead of fd. By having this we can manipulated inode extended > > > > attributes not only on regular files but also on special ones. This > > > > is not possible with FS_IOC_FSSETXATTR ioctl as with special files > > > > we can not call ioctl() directly on the filesystem inode using fd. > > > > > > > > This patch adds two new syscalls which allows userspace to get/set > > > > extended inode attributes on special files by using parent directory > > > > and a path - *at() like syscall. > > > > > > > > Also, as vfs_fileattr_set() is now will be called on special files > > > > too, let's forbid any other attributes except projid and nextents > > > > (symlink can have an extent). > > > > > > > > CC: linux-...@vger.kernel.org > > > > CC: linux-fsde...@vger.kernel.org > > > > CC: linux-...@vger.kernel.org > > > > Signed-off-by: Andrey Albershteyn > > > > --- > > > > v1: > > > > https://lore.kernel.org/linuxppc-dev/20250109174540.893098-1-aalbe...@kernel.org/ > > > > > > > > Previous discussion: > > > > https://lore.kernel.org/linux-xfs/20240520164624.665269-2-aalbe...@redhat.com/ > > > > > > > > XFS has project quotas which could be attached to a directory. All > > > > new inodes in these directories inherit project ID set on parent > > > > directory. > > > > > > > > The project is created from userspace by opening and calling > > > > FS_IOC_FSSETXATTR on each inode. This is not possible for special > > > > files such as FIFO, SOCK, BLK etc. Therefore, some inodes are left > > > > with empty project ID. Those inodes then are not shown in the quota > > > > accounting but still exist in the directory. Moreover, in the case > > > > when special files are created in the directory with already > > > > existing project quota, these inode inherit extended attributes. > > > > This than leaves them with these attributes without the possibility > > > > to clear them out. This, in turn, prevents userspace from > > > > re-creating quota project on these existing files. > > > > --- > > > > Changes in v3: > > > > - Remove unnecessary "dfd is dir" check as it checked in user_path_at() > > > > - Remove unnecessary "same filesystem" check > > > > - Use CLASS() instead of directly calling fdget/fdput > > > > - Link to v2: > > > > https://lore.kernel.org/r/20250122-xattrat-syscall-v2-1-5b360d4fb...@kernel.org > > > > --- > > > > arch/alpha/kernel/syscalls/syscall.tbl | 2 + > > > > arch/arm/tools/syscall.tbl | 2 + > > > > arch/arm64/tools/syscall_32.tbl | 2 + > > > > arch/m68k/kernel/syscalls/syscall.tbl | 2 + > > > > arch/microblaze/kernel/syscalls/syscall.tbl | 2 + > > > > arch/mips/kernel/syscalls/syscall_n32.tbl | 2 + > > > > arch/mips/kernel/syscalls/syscall_n64.tbl | 2 + > > > > arch/mips/kernel/syscalls/syscall_o32.tbl | 2 + > > > > arch/parisc/kernel/syscalls/syscall.tbl | 2 + > > > > arch/powerpc/kernel/syscalls/syscall.tbl| 2 + > > > > arch/s390/kernel/syscalls/syscall.tbl | 2 + > > > > arch/sh/kernel/syscalls/syscall.tbl | 2 + > > > > arch/sparc/kernel/syscalls/syscall.tbl | 2 + > > > > arch/x86/entry/syscalls/syscall_32.tbl | 2 + > > > > arch/x86/entry/syscalls/syscall_64.tbl | 2 + > > > > arch/xtensa/kernel/syscalls/syscall.tbl | 2 + > > > > fs/inode.c | 75 > > > > + > > > > fs/ioctl.c | 16 +- > > > > include/linux/fileattr.h| 1 + > > > > include/linux/syscalls.h| 4 ++ > > > > include/uapi/asm-generic/unistd.h | 8 ++- > > > > 21 files changed, 133 insertions(+), 3 deletions(-) > > > > > > > > > > > > > > > > > diff --git a/fs/inode.c b/fs/inode.c > > > > index > > > > 6b4c77268fc0ecace4ac78a9ca777fbffc277f4a..b29db4fabaf67a6cbf541a86978b290411ec > > > > 100644 > > > > --- a/fs/inode.c > > > > +++ b/fs/inode.c > > > > @@ -23,6 +23,9 @@ > > > > #include > > > > #include > > > > #include > > > > +#include > > > > +#include > > > > +#include > > > > #include > > > > #define CREATE_TRACE_POINTS > > > > #include > > > > @@ -2953,3 +2956,75 @@ umode_t mode_strip_sgid(struct mnt_idmap *idmap, > > > > return mode & ~S_ISGID; > > > > } > > > > EXPORT_SYMBOL(mode_strip_s