Re: [Intel-wired-lan] [PATCH net-next v9 0/6] net: napi: add CPU affinity to napi->config
Hello:

This series was applied to netdev/net-next.git (main) by Jakub Kicinski:

On Mon, 24 Feb 2025 16:22:21 -0700 you wrote:
> Drivers usually need to re-apply the user-set IRQ affinity to their IRQs
> after reset. However, since there can be only one IRQ affinity notifier
> for each IRQ, registering IRQ notifiers conflicts with the ARFS rmap
> management in the core (which also registers separate IRQ affinity
> notifiers).
>
> Move the IRQ affinity management to the napi struct. This way we can have
> a unified IRQ notifier to re-apply the user-set affinity and also manage
> the ARFS rmaps.
>
> [...]

Here is the summary with links:
  - [net-next,v9,1/6] net: move aRFS rmap management and CPU affinity to core
    https://git.kernel.org/netdev/net-next/c/bd7c00605ee0
  - [net-next,v9,2/6] net: ena: use napi's aRFS rmap notifers
    https://git.kernel.org/netdev/net-next/c/de340d8206bf
  - [net-next,v9,3/6] ice: clear NAPI's IRQ numbers in ice_vsi_clear_napi_queues()
    https://git.kernel.org/netdev/net-next/c/30b78ba3d4fe
  - [net-next,v9,4/6] ice: use napi's irq affinity and rmap IRQ notifiers
    https://git.kernel.org/netdev/net-next/c/4063af296762
  - [net-next,v9,5/6] idpf: use napi's irq affinity
    https://git.kernel.org/netdev/net-next/c/deab38f8f011
  - [net-next,v9,6/6] selftests: drv-net: add tests for napi IRQ affinity notifiers
    https://git.kernel.org/netdev/net-next/c/185646a8a0a8

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
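The core idea of the series is that the user-set affinity lives with the NAPI
instance rather than with a driver-private IRQ affinity notifier, so it
survives resets and no longer collides with the core's aRFS rmap notifier.
A rough, non-authoritative sketch of that concept follows; the structure and
function names are simplified stand-ins, not the exact upstream API added by
the series.

/*
 * Conceptual sketch only -- simplified names, not the series' real structures.
 */
#include <linux/cpumask.h>

struct napi_affinity_cfg {
	cpumask_t affinity_mask;	/* user-set mask, preserved across resets */
	int irq;			/* IRQ currently bound to this NAPI */
};

/*
 * Single core-owned affinity notifier: remember what the user wrote so the
 * core can re-apply it after a driver reset, and hook aRFS rmap updates in
 * the same place instead of per-driver IRQ affinity notifiers.
 */
static void napi_affinity_notify(struct napi_affinity_cfg *cfg,
				 const cpumask_t *new_mask)
{
	cpumask_copy(&cfg->affinity_mask, new_mask);
	/* aRFS rmap update for cfg->irq would also happen here */
}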
Re: [Intel-wired-lan] [PATCH net] idpf: fix checksums set in idpf_rx_rsc()
On 2/26/25 23:12, Eric Dumazet wrote:

idpf_rx_rsc() uses skb_transport_offset(skb) while the transport header
is not set yet. This triggers the following warning for
CONFIG_DEBUG_NET=y builds:

DEBUG_NET_WARN_ON_ONCE(!skb_transport_header_was_set(skb))

[   69.261620] WARNING: CPU: 7 PID: 0 at ./include/linux/skbuff.h:3020 idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261629] Modules linked in: vfat fat dummy bridge intel_uncore_frequency_tpmi intel_uncore_frequency_common intel_vsec_tpmi idpf intel_vsec cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd libeth
[   69.261644] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Tainted: G S W 6.14.0-smp-DEV #1697
[   69.261648] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[   69.261650] RIP: 0010:idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261677]  ? __warn (kernel/panic.c:242 kernel/panic.c:748)
[   69.261682]  ? idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261687]  ? report_bug (lib/bug.c:?)
[   69.261690]  ? handle_bug (arch/x86/kernel/traps.c:285)
[   69.261694]  ? exc_invalid_op (arch/x86/kernel/traps.c:309)
[   69.261697]  ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[   69.261700]  ? __pfx_idpf_vport_splitq_napi_poll (drivers/net/ethernet/intel/idpf/idpf_txrx.c:4011) idpf
[   69.261704]  ? idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261708]  ? idpf_vport_splitq_napi_poll (drivers/net/ethernet/intel/idpf/idpf_txrx.c:3072) idpf
[   69.261712]  __napi_poll (net/core/dev.c:7194)
[   69.261716]  net_rx_action (net/core/dev.c:7265)
[   69.261718]  ? __qdisc_run (net/sched/sch_generic.c:293)
[   69.261721]  ? sched_clock (arch/x86/include/asm/preempt.h:84 arch/x86/kernel/tsc.c:288)
[   69.261726]  handle_softirqs (kernel/softirq.c:561)

Fixes: 3a8845af66edb ("idpf: add RX splitq napi poll support")
Signed-off-by: Eric Dumazet

thank you for the fix,
Acked-by: Przemek Kitszel

Cc: Alan Brady
Cc: Joshua Hay
Cc: Willem de Bruijn
---
 drivers/net/ethernet/intel/idpf/idpf_txrx.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 9be6a6b59c4e1414f993de39698b00fffa7d2940..977741c4149805b13b3b77fdfb612c514e2530e6 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -3013,7 +3013,6 @@ static int idpf_rx_rsc(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 	skb_shinfo(skb)->gso_size = rsc_seg_len;
 
 	skb_reset_network_header(skb);
-	len = skb->len - skb_transport_offset(skb);
 
 	if (ipv4) {
 		struct iphdr *ipv4h = ip_hdr(skb);
@@ -3022,6 +3021,7 @@ static int idpf_rx_rsc(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 
 		/* Reset and set transport header offset in skb */
 		skb_set_transport_header(skb, sizeof(struct iphdr));
+		len = skb->len - skb_transport_offset(skb);
 
 		/* Compute the TCP pseudo header checksum*/
 		tcp_hdr(skb)->check =
@@ -3031,6 +3031,7 @@ static int idpf_rx_rsc(struct idpf_rx_queue *rxq, struct sk_buff *skb,
 		skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
 
 		skb_set_transport_header(skb, sizeof(struct ipv6hdr));
+		len = skb->len - skb_transport_offset(skb);
 
 		tcp_hdr(skb)->check =
 			~tcp_v6_check(len, &ipv6h->saddr, &ipv6h->daddr, 0);
 	}
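The ordering matters because skb_transport_offset() is only meaningful once
the transport header has been set; computing 'len' before
skb_set_transport_header() reads an unset offset, which the CONFIG_DEBUG_NET
assertion catches. Below is a minimal sketch of the corrected IPv4 flow,
assuming an skb whose data starts at the network header; it is an
illustration of the ordering, not the full idpf function.

#include <linux/ip.h>
#include <net/tcp.h>

/* Sketch of the fixed ordering for the IPv4 branch: set the transport
 * header first, then derive the TCP payload length from it. */
static void rsc_csum_fixup_ipv4(struct sk_buff *skb)
{
	struct iphdr *iph;
	unsigned int len;

	skb_reset_network_header(skb);
	iph = ip_hdr(skb);

	/* the transport offset becomes valid only after this call */
	skb_set_transport_header(skb, sizeof(struct iphdr));
	len = skb->len - skb_transport_offset(skb);

	/* seed the TCP pseudo-header checksum, as in the patch */
	tcp_hdr(skb)->check = ~tcp_v4_check(len, iph->saddr, iph->daddr, 0);
}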
Re: [Intel-wired-lan] [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support multiple consumers
> -----Original Message-----
> From: Leon Romanovsky
> Sent: Monday, February 24, 2025 11:56 PM
> To: Nikolova, Tatyana E
> Cc: j...@nvidia.com; intel-wired-...@lists.osuosl.org;
> linux-r...@vger.kernel.org; net...@vger.kernel.org; Ertman, David M
>
> Subject: Re: [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support
> multiple consumers
>
> On Mon, Feb 24, 2025 at 11:04:28PM -0600, Tatyana Nikolova wrote:
> > From: Dave Ertman
> >
> > To support RDMA for E2000 product, the idpf driver will use the IDC
> > interface with the irdma auxiliary driver, thus becoming a second
> > consumer of it. This requires the IDC be updated to support multiple
> > consumers. The use of exported symbols no longer makes sense because it
> > will require all core drivers (ice/idpf) that can interface with irdma
> > auxiliary driver to be loaded even if hardware is not present for those
> > drivers.
>
> In the auxiliary bus world, the core drivers (ice/idpf) need to create
> auxiliary devices only if a specific device is present. That auxiliary
> device will trigger the load of the specific module (irdma in our case).
>
> EXPORT_SYMBOL won't trigger the load of the irdma driver, but the
> opposite is true: a load of irdma will trigger the load of the ice/idpf
> drivers (depending on their exported symbols).
>
> >
> > To address this, implement an ops struct that will be a universal set of
> > naked function pointers that will be populated by each core driver for
> > the irdma auxiliary driver to call.
>
> No, we worked very hard to make proper HW discovery and driver autoload,
> let's not go back. For now, it is a no-go.

Hi Leon,

I am a little confused about what the problem here is. The main issue I
pull from your response is: removing exported symbols will stop ice/idpf
from autoloading when irdma loads. Is this correct, or did I miss your
point?

But, if there is an ice or idpf supported device present in the system,
the appropriate driver will have already been loaded anyway (and gone
through its probe flow to create auxiliary devices). If it is not loaded,
then the system owner has either unloaded it manually or blacklisted it.
This would not cause an issue anyway, since irdma and ice/idpf can load
in any order.

>
> <...>
>
> > +/* Following APIs are implemented by core PCI driver */
> > +struct idc_rdma_core_ops {
> > +	int (*vc_send_sync)(struct idc_rdma_core_dev_info *cdev_info, u8 *msg,
> > +			    u16 len, u8 *recv_msg, u16 *recv_len);
> > +	int (*vc_queue_vec_map_unmap)(struct idc_rdma_core_dev_info *cdev_info,
> > +				      struct idc_rdma_qvlist_info *qvl_info,
> > +				      bool map);
> > +	/* vport_dev_ctrl is for RDMA CORE driver to indicate it is either ready
> > +	 * for individual vport aux devices, or it is leaving the state where it
> > +	 * can support vports and they need to be downed
> > +	 */
> > +	int (*vport_dev_ctrl)(struct idc_rdma_core_dev_info *cdev_info,
> > +			      bool up);
> > +	int (*request_reset)(struct idc_rdma_core_dev_info *cdev_info,
> > +			     enum idc_rdma_reset_type reset_type);
> > +};
>
> Core driver can call callbacks in irdma, like you already have for
> irdma_iidc_event_handler(), but all calls from irdma to the core driver
> must be through exported symbols. It gives us a race-free world in the
> whole driver except one very specific place (irdma_iidc_event_handler).

I am confused here as well. Calling a function through an exported symbol,
or calling the same function from a function pointer, should not affect
the generation of a race condition, as the same function is being called.
What is inherently better about an exported symbol versus a function
pointer when considering race conditions?

Also, why is calling a function pointer from the irdma module ok, but
calling one from the core module not?

Again - thank you for the review, and if I completely missed your points,
please let me know!

Thanks
DaveE

>
> Thanks
Re: [Intel-wired-lan] [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support multiple consumers
On Wed, Feb 26, 2025 at 05:36:44PM +, Ertman, David M wrote:
> > -----Original Message-----
> > From: Leon Romanovsky
> > Sent: Monday, February 24, 2025 11:56 PM
> > To: Nikolova, Tatyana E
> > Cc: j...@nvidia.com; intel-wired-...@lists.osuosl.org;
> > linux-r...@vger.kernel.org; net...@vger.kernel.org; Ertman, David M
> >
> > Subject: Re: [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support
> > multiple consumers
> >
> > On Mon, Feb 24, 2025 at 11:04:28PM -0600, Tatyana Nikolova wrote:
> > > From: Dave Ertman
> > >
> > > To support RDMA for E2000 product, the idpf driver will use the IDC
> > > interface with the irdma auxiliary driver, thus becoming a second
> > > consumer of it. This requires the IDC be updated to support multiple
> > > consumers. The use of exported symbols no longer makes sense because it
> > > will require all core drivers (ice/idpf) that can interface with irdma
> > > auxiliary driver to be loaded even if hardware is not present for those
> > > drivers.
> >
> > In the auxiliary bus world, the core drivers (ice/idpf) need to create
> > auxiliary devices only if a specific device is present. That auxiliary
> > device will trigger the load of the specific module (irdma in our case).
> >
> > EXPORT_SYMBOL won't trigger the load of the irdma driver, but the
> > opposite is true: a load of irdma will trigger the load of the ice/idpf
> > drivers (depending on their exported symbols).
> >
> > >
> > > To address this, implement an ops struct that will be a universal set of
> > > naked function pointers that will be populated by each core driver for
> > > the irdma auxiliary driver to call.
> >
> > No, we worked very hard to make proper HW discovery and driver autoload,
> > let's not go back. For now, it is a no-go.
>
> Hi Leon,
>
> I am a little confused about what the problem here is. The main issue I
> pull from your response is: removing exported symbols will stop ice/idpf
> from autoloading when irdma loads. Is this correct, or did I miss your
> point?

It is one of the main points.

>
> But, if there is an ice or idpf supported device present in the system,
> the appropriate driver will have already been loaded anyway (and gone
> through its probe flow to create auxiliary devices). If it is not loaded,
> then the system owner has either unloaded it manually or blacklisted it.
> This would not cause an issue anyway, since irdma and ice/idpf can load
> in any order.

There are two assumptions above, and both are untrue:
1. Users never issue a "modprobe irdma" command alone and will always call
   the whole chain ("modprobe ice ...") first.
2. You open-code the module subsystem properly, with reference counters,
   ownership and locks, to protect the function pointers that are
   set/cleared dynamically.

> >
> > <...>
> >
> > > +/* Following APIs are implemented by core PCI driver */
> > > +struct idc_rdma_core_ops {
> > > +	int (*vc_send_sync)(struct idc_rdma_core_dev_info *cdev_info, u8 *msg,
> > > +			    u16 len, u8 *recv_msg, u16 *recv_len);
> > > +	int (*vc_queue_vec_map_unmap)(struct idc_rdma_core_dev_info *cdev_info,
> > > +				      struct idc_rdma_qvlist_info *qvl_info,
> > > +				      bool map);
> > > +	/* vport_dev_ctrl is for RDMA CORE driver to indicate it is either ready
> > > +	 * for individual vport aux devices, or it is leaving the state where it
> > > +	 * can support vports and they need to be downed
> > > +	 */
> > > +	int (*vport_dev_ctrl)(struct idc_rdma_core_dev_info *cdev_info,
> > > +			      bool up);
> > > +	int (*request_reset)(struct idc_rdma_core_dev_info *cdev_info,
> > > +			     enum idc_rdma_reset_type reset_type);
> > > +};
> >
> > Core driver can call callbacks in irdma, like you already have for
> > irdma_iidc_event_handler(), but all calls from irdma to the core driver
> > must be through exported symbols. It gives us a race-free world in the
> > whole driver except one very specific place (irdma_iidc_event_handler).
>
> I am confused here as well. Calling a function through an exported symbol,
> or calling the same function from a function pointer, should not affect
> the generation of a race condition, as the same function is being called.
> What is inherently better about an exported symbol versus a function
> pointer when considering race conditions?

An exported symbol guarantees that the function exists in the core module.
The module subsystem will ensure that the core module cannot be unloaded
until all users are gone. A function pointer has no such guarantees.

>
> Also, why is calling a function pointer from the irdma module ok, but
> calling one from the core module not?

Because we need to make sure that the core module doesn't disappear while
irdma executes its flow. The opposite is not true, because the core module
controls the irdma devices and is aware of when the irdma module is
loaded/unloaded.

Thanks

>
> Again - thank you for the review, and if I completely missed your points,
> please let me know!
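Leon's distinction can be made concrete: a call through an exported symbol
makes the module loader pin the exporting module automatically, whereas a
bare ops pointer needs that pinning (and pointer-lifetime management)
re-implemented by hand. A rough sketch of what the ops-struct approach would
have to open-code follows; the names are illustrative, not the proposed IDC
API.

#include <linux/module.h>
#include <linux/types.h>
#include <linux/errno.h>

/* Illustrative only: an ops table that records which module owns it. */
struct rdma_core_ops_sketch {
	struct module *owner;
	int (*vc_send_sync)(void *cdev_info, u8 *msg, u16 len);
};

static int core_ops_call_sketch(struct rdma_core_ops_sketch *ops,
				void *cdev_info, u8 *msg, u16 len)
{
	int ret;

	/* Pin the core module so it cannot be unloaded mid-call; a symbol
	 * reference created by EXPORT_SYMBOL gets this for free from the
	 * module loader's dependency tracking. */
	if (!ops || !try_module_get(ops->owner))
		return -ENODEV;

	ret = ops->vc_send_sync(cdev_info, msg, len);

	module_put(ops->owner);
	return ret;
}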
[Intel-wired-lan] [PATCH] e1000e: Link flap workaround option for false IRP events
Issue is seen on some Lenovo desktop workstations where there is a false
IRP event which triggers a link flap. The condition is rare and only seen
on networks where link speed may differ along the path between nodes
(e.g. 10M/100M).

Intel are not able to determine the root cause but provided a workaround
that does fix the issue. Tested extensively at Lenovo.

Add a module option to enable this workaround for users who are impacted
by this issue.

Signed-off-by: Mark Pearson
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 286155efcedf..06774fb4b2dd 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -37,6 +37,10 @@ static int debug = -1;
 module_param(debug, int, 0);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+static int false_irp_workaround;
+module_param(false_irp_workaround, int, 0);
+MODULE_PARM_DESC(false_irp_workaround, "Enable workaround for rare false IRP event causing link flap");
+
 static const struct e1000_info *e1000_info_tbl[] = {
 	[board_82571]		= &e1000_82571_info,
 	[board_82572]		= &e1000_82572_info,
@@ -1757,6 +1761,21 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
 	/* read ICR disables interrupts using IAM */
 
 	if (icr & E1000_ICR_LSC) {
 		hw->mac.get_link_status = true;
+
+		/*
+		 * False IRP workaround
+		 * Issue seen on Lenovo P5 and P7 workstations where if there
+		 * are different link speeds in the network a false IRP event
+		 * is received, leading to a link flap.
+		 * Intel unable to determine root cause. This read prevents
+		 * the issue occurring.
+		 */
+		if (false_irp_workaround) {
+			u16 phy_data;
+
+			e1e_rphy(hw, PHY_REG(772, 26), &phy_data);
+		}
+
 		/* ICH8 workaround-- Call gig speed drop workaround on cable
 		 * disconnect (LSC) before accessing any PHY registers */
-- 
2.43.0
[Intel-wired-lan] [tnguy-next-queue:main] BUILD SUCCESS 287044abff8291993ce9565ac6e6a72b85e33b85
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue.git main
branch HEAD: 287044abff8291993ce9565ac6e6a72b85e33b85  sctp: Remove unused payload from sctp_idatahdr

elapsed time: 1442m

configs tested: 181
configs skipped: 5

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
    alpha      allnoconfig                        gcc-14.2.0
    alpha      allyesconfig                       clang-21
    alpha      defconfig                          gcc-14.2.0
    arc        allmodconfig                       clang-18
    arc        allmodconfig                       gcc-13.2.0
    arc        allnoconfig                        gcc-14.2.0
    arc        allyesconfig                       clang-18
    arc        allyesconfig                       gcc-13.2.0
    arc        defconfig                          gcc-14.2.0
    arc        randconfig-001-20250226            gcc-13.2.0
    arc        randconfig-002-20250226            gcc-13.2.0
    arm        allmodconfig                       clang-18
    arm        allmodconfig                       gcc-14.2.0
    arm        allnoconfig                        gcc-14.2.0
    arm        allyesconfig                       clang-18
    arm        allyesconfig                       gcc-14.2.0
    arm        defconfig                          gcc-14.2.0
    arm        randconfig-001-20250226            gcc-14.2.0
    arm        randconfig-002-20250226            clang-21
    arm        randconfig-003-20250226            gcc-14.2.0
    arm        randconfig-004-20250226            gcc-14.2.0
    arm64      allmodconfig                       clang-18
    arm64      allnoconfig                        gcc-14.2.0
    arm64      defconfig                          gcc-14.2.0
    arm64      randconfig-001-20250226            gcc-14.2.0
    arm64      randconfig-002-20250226            gcc-14.2.0
    arm64      randconfig-003-20250226            clang-21
    arm64      randconfig-004-20250226            gcc-14.2.0
    csky       allnoconfig                        gcc-14.2.0
    csky       defconfig                          gcc-14.2.0
    csky       randconfig-001-20250226            gcc-14.2.0
    csky       randconfig-002-20250226            gcc-14.2.0
    hexagon    allmodconfig                       clang-21
    hexagon    allnoconfig                        gcc-14.2.0
    hexagon    allyesconfig                       clang-21
    hexagon    defconfig                          gcc-14.2.0
    hexagon    randconfig-001-20250226            clang-21
    hexagon    randconfig-002-20250226            clang-21
    i386       allmodconfig                       clang-19
    i386       allmodconfig                       gcc-12
    i386       allnoconfig                        clang-19
    i386       allnoconfig                        gcc-12
    i386       allyesconfig                       clang-19
    i386       allyesconfig                       gcc-12
    i386       buildonly-randconfig-001-20250226  clang-19
    i386       buildonly-randconfig-001-20250226  gcc-12
    i386       buildonly-randconfig-002-20250226  clang-19
    i386       buildonly-randconfig-002-20250226  gcc-12
    i386       buildonly-randconfig-003-20250226  clang-19
    i386       buildonly-randconfig-003-20250226  gcc-12
    i386       buildonly-randconfig-004-20250226  clang-19
    i386       buildonly-randconfig-005-20250226  clang-19
    i386       buildonly-randconfig-005-20250226  gcc-12
    i386       buildonly-randconfig-006-20250226  clang-19
    i386       buildonly-randconfig-006-20250226  gcc-12
    i386       defconfig                          clang-19
    i386       randconfig-001-20250226            clang-19
    i386       randconfig-002-20250226            clang-19
    i386       randconfig-003-20250226            clang-19
    i386       randconfig-004-20250226            clang-19
    i386       randconfig-005-20250226            clang-19
    i386       randconfig-006-20250226            clang-19
    i386       randconfig-007-20250226            clang-19
    i386       randconfig-011-20250226            gcc-12
    i386       randconfig-012-20250226            gcc-12
    i386       randconfig-013-20250226            gcc-12
    i386       randconfig-014-20250226            gcc-12
    i386       randconfig-015-20250226            gcc-12
    i386       randconfig-016-20250226            gcc-12
    i386       randconfig-017-20250226            gcc-12
    loongarch  allmodconfig                       gcc-14.2.0
    loongarch  allnoconfig                        gcc-14.2.0
    loongarch  defconfig                          gcc-14.2.0
    loongarch  randconfig-001-20250226            gcc-14.2.0
    loongarch  randconfig-002-20250226            gcc-14.2.0
    m68k       allmodconfig                       gcc-14.2.0
    m68k       allnoconfig                        gcc-14.2.0
    m68k       allyesconfig                       gcc-14.2.0
    m68k       defconfig                          gcc-14.2.0
    m68k
Re: [Intel-wired-lan] [RFC net-next v2 1/2] devlink: add whole device devlink instance
Wed, Feb 26, 2025 at 04:06:19PM +0100, przemyslaw.kits...@intel.com wrote:
>On 2/26/25 15:48, Jiri Pirko wrote:
>> Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kits...@intel.com wrote:
>> > On 2/25/25 15:35, Jiri Pirko wrote:
>> > > Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kits...@intel.com wrote:
>>
>> [...]
>>
>> > > > output, for all PFs and VFs on given device:
>> > > >
>> > > > pci/:af:00:
>> > > >   name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>> > > >   resources:
>> > > >     name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>> > > >     name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>> > > >
>> > > > What is contributing to the hardness, this is not just one for all ice
>> > > > PFs, but one per device, which we distinguish via pci BDF.
>> > >
>> > > How?
>> >
>> > code is in ice_adapter_index()
>>
>> If you pass 2 pfs of the same device to a VM with random BDF, you get 2
>> ice_adapters, correct?
>
>Right now, yes

That is a bug.

>
>>
>> [...]
>
>What I want is to keep two ice_adapters for two actual devices (SDNs)
Re: [Intel-wired-lan] [RFC net-next v2 1/2] devlink: add whole device devlink instance
Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kits...@intel.com wrote:
>On 2/25/25 15:35, Jiri Pirko wrote:
>> Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kits...@intel.com wrote:

[...]

>> > output, for all PFs and VFs on given device:
>> >
>> > pci/:af:00:
>> >   name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
>> >   resources:
>> >     name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
>> >     name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
>> >
>> > What is contributing to the hardness, this is not just one for all ice
>> > PFs, but one per device, which we distinguish via pci BDF.
>>
>> How?
>
>code is in ice_adapter_index()

If you pass 2 pfs of the same device to a VM with random BDF, you get 2
ice_adapters, correct?

[...]
Re: [Intel-wired-lan] [RFC net-next v2 1/2] devlink: add whole device devlink instance
On 2/26/25 15:48, Jiri Pirko wrote:
> Tue, Feb 25, 2025 at 04:40:49PM +0100, przemyslaw.kits...@intel.com wrote:
> > On 2/25/25 15:35, Jiri Pirko wrote:
> > > Tue, Feb 25, 2025 at 12:30:49PM +0100, przemyslaw.kits...@intel.com wrote:
>
> [...]
>
> > > > output, for all PFs and VFs on given device:
> > > >
> > > > pci/:af:00:
> > > >   name rss size 8 unit entry size_min 0 size_max 24 size_gran 1
> > > >   resources:
> > > >     name lut_512 size 0 unit entry size_min 0 size_max 16 size_gran 1
> > > >     name lut_2048 size 8 unit entry size_min 0 size_max 8 size_gran 1
> > > >
> > > > What is contributing to the hardness, this is not just one for all ice
> > > > PFs, but one per device, which we distinguish via pci BDF.
> > >
> > > How?
> >
> > code is in ice_adapter_index()
>
> If you pass 2 pfs of the same device to a VM with random BDF, you get 2
> ice_adapters, correct?

Right now, yes

>
> [...]

What I want is to keep two ice_adapters for two actual devices (SDNs)
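The objection is that a bus/device/function address is not a stable identity
for "the same physical adapter" once PFs are passed into a VM with arbitrary
BDFs. A hedged illustration of keying the shared per-device structure by
something BDF-independent, for example the PCIe Device Serial Number,
follows; the xarray and the names are assumptions made for this sketch, not
the actual ice_adapter code.

#include <linux/pci.h>
#include <linux/xarray.h>

static DEFINE_XARRAY(adapter_xa_sketch);	/* key -> shared per-device state */

static void *adapter_lookup_sketch(struct pci_dev *pdev)
{
	u64 dsn = pci_get_dsn(pdev);	/* identical for all PFs of one adapter */
	unsigned long key;

	/* Fall back to the BDF only when no DSN capability is present; on a
	 * 64-bit kernel the full DSN fits into the xarray index. */
	key = dsn ? (unsigned long)dsn :
		    PCI_DEVID(pdev->bus->number, pdev->devfn);

	return xa_load(&adapter_xa_sketch, key);
}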
Re: [Intel-wired-lan] [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support multiple consumers
> -----Original Message-----
> From: Leon Romanovsky
> Sent: Wednesday, February 26, 2025 10:50 AM
> To: Ertman, David M
> Cc: Nikolova, Tatyana E ; j...@nvidia.com;
> intel-wired-...@lists.osuosl.org; linux-r...@vger.kernel.org;
> net...@vger.kernel.org
> Subject: Re: [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support
> multiple consumers
>
> On Wed, Feb 26, 2025 at 05:36:44PM +, Ertman, David M wrote:
> > > -----Original Message-----
> > > From: Leon Romanovsky
> > > Sent: Monday, February 24, 2025 11:56 PM
> > > To: Nikolova, Tatyana E
> > > Cc: j...@nvidia.com; intel-wired-...@lists.osuosl.org;
> > > linux-r...@vger.kernel.org; net...@vger.kernel.org; Ertman, David M
> > >
> > > Subject: Re: [iwl-next v4 1/1] iidc/ice/irdma: Update IDC to support
> > > multiple consumers
> > >
> > > On Mon, Feb 24, 2025 at 11:04:28PM -0600, Tatyana Nikolova wrote:
> > > > From: Dave Ertman
> > > >
> > > > To support RDMA for E2000 product, the idpf driver will use the IDC
> > > > interface with the irdma auxiliary driver, thus becoming a second
> > > > consumer of it. This requires the IDC be updated to support multiple
> > > > consumers. The use of exported symbols no longer makes sense because it
> > > > will require all core drivers (ice/idpf) that can interface with irdma
> > > > auxiliary driver to be loaded even if hardware is not present for those
> > > > drivers.
> > >
> > > In the auxiliary bus world, the core drivers (ice/idpf) need to create
> > > auxiliary devices only if a specific device is present. That auxiliary
> > > device will trigger the load of the specific module (irdma in our case).
> > >
> > > EXPORT_SYMBOL won't trigger the load of the irdma driver, but the
> > > opposite is true: a load of irdma will trigger the load of the ice/idpf
> > > drivers (depending on their exported symbols).
> > >
> > > >
> > > > To address this, implement an ops struct that will be a universal set of
> > > > naked function pointers that will be populated by each core driver for
> > > > the irdma auxiliary driver to call.
> > >
> > > No, we worked very hard to make proper HW discovery and driver autoload,
> > > let's not go back. For now, it is a no-go.
> >
> > Hi Leon,
> >
> > I am a little confused about what the problem here is. The main issue I
> > pull from your response is: removing exported symbols will stop ice/idpf
> > from autoloading when irdma loads. Is this correct, or did I miss your
> > point?
>
> It is one of the main points.
>
> >
> > But, if there is an ice or idpf supported device present in the system,
> > the appropriate driver will have already been loaded anyway (and gone
> > through its probe flow to create auxiliary devices). If it is not loaded,
> > then the system owner has either unloaded it manually or blacklisted it.
> > This would not cause an issue anyway, since irdma and ice/idpf can load
> > in any order.
>
> There are two assumptions above, and both are untrue:
> 1. Users never issue a "modprobe irdma" command alone and will always call
>    the whole chain ("modprobe ice ...") first.
> 2. You open-code the module subsystem properly, with reference counters,
>    ownership and locks, to protect the function pointers that are
>    set/cleared dynamically.

Ah, I see your reasoning now. Our goal was to make the two modules
independent, with no prescribed load order mandated, and utilize the
auxiliary bus and device subsystem to handle load order and unload of one
or the other module.

The auxiliary device only has the lifespan of the core PCI driver, so if
the core driver unloads, then the auxiliary device gets destroyed, and the
associated link based off it will be gone. We wanted to be able to unload
and reload either of the modules (core or irdma) and have the interaction
be able to restart with a new probe.

All our inter-driver function calls are protected by the device lock on
the auxiliary device for the duration of the call.

> >
> > > <...>
> > >
> > > > +/* Following APIs are implemented by core PCI driver */
> > > > +struct idc_rdma_core_ops {
> > > > +	int (*vc_send_sync)(struct idc_rdma_core_dev_info *cdev_info, u8 *msg,
> > > > +			    u16 len, u8 *recv_msg, u16 *recv_len);
> > > > +	int (*vc_queue_vec_map_unmap)(struct idc_rdma_core_dev_info *cdev_info,
> > > > +				      struct idc_rdma_qvlist_info *qvl_info,
> > > > +				      bool map);
> > > > +	/* vport_dev_ctrl is for RDMA CORE driver to indicate it is either ready
> > > > +	 * for individual vport aux devices, or it is leaving the state where it
> > > > +	 * can support vports and they need to be downed
> > > > +	 */
> > > > +	int (*vport_dev_ctrl)(struct idc_rdma_core_dev_info *cdev_info,
> > > > +			      bool up);
> > > > +	int (*request_reset)(struct idc_rdma_core_dev_info *cdev_info,
> > > > +			     enum i
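The locking pattern described in that last paragraph can be sketched as
follows: hold the auxiliary device's lock for the duration of a cross-module
call so the callback cannot be torn down underneath it. The names are
illustrative, not the IDC code.

#include <linux/auxiliary_bus.h>
#include <linux/device.h>
#include <linux/errno.h>

static int idc_guarded_call_sketch(struct auxiliary_device *adev,
				   int (*op)(struct auxiliary_device *adev))
{
	int ret;

	device_lock(&adev->dev);	/* serializes with unbind/remove of adev */
	ret = op ? op(adev) : -ENODEV;
	device_unlock(&adev->dev);

	return ret;
}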
Re: [Intel-wired-lan] [PATCH] e1000e: Link flap workaround option for false IRP events
On Wed, Feb 26, 2025 at 02:44:12PM -0500, Mark Pearson wrote:
> Issue is seen on some Lenovo desktop workstations where there
> is a false IRP event which triggers a link flap.
> Condition is rare and only seen on networks where link speed
> may differ along the path between nodes (e.g 10M/100M)
>
> Intel are not able to determine root cause but provided a
> workaround that does fix the issue. Tested extensively at Lenovo.
>
> Adding a module option to enable this workaround for users
> who are impacted by this issue.

Why is a module option needed? Does the workaround itself introduce
issues? Please describe those issues?

In general, module options are not liked. So please include in the commit
message why a module option is the only option.

> Signed-off-by: Mark Pearson
> ---
>  drivers/net/ethernet/intel/e1000e/netdev.c | 19 +++
>  1 file changed, 19 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 286155efcedf..06774fb4b2dd 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -37,6 +37,10 @@ static int debug = -1;
>  module_param(debug, int, 0);
>  MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
>
> +static int false_irp_workaround;
> +module_param(false_irp_workaround, int, 0);
> +MODULE_PARM_DESC(false_irp_workaround, "Enable workaround for rare false IRP event causing link flap");
> +
>  static const struct e1000_info *e1000_info_tbl[] = {
>  	[board_82571]		= &e1000_82571_info,
>  	[board_82572]		= &e1000_82572_info,
> @@ -1757,6 +1761,21 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
>  	/* read ICR disables interrupts using IAM */
>
>  	if (icr & E1000_ICR_LSC) {
>  		hw->mac.get_link_status = true;
> +
> +		/*
> +		 * False IRP workaround
> +		 * Issue seen on Lenovo P5 and P7 workstations where if there
> +		 * are different link speeds in the network a false IRP event
> +		 * is received, leading to a link flap.
> +		 * Intel unable to determine root cause. This read prevents
> +		 * the issue occurring
> +		 */
> +		if (false_irp_workaround) {
> +			u16 phy_data;
> +
> +			e1e_rphy(hw, PHY_REG(772, 26), &phy_data);

Please add some #define for these magic numbers, so we have some idea
what PHY register you are actually reading. That in itself might help
explain how the workaround actually works.

	Andrew

---
pw-bot: cr
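What Andrew is asking for would look roughly like the following. The macro
names are placeholders invented here, and the meaning of PHY page 772,
register 26 is exactly the part the commit message says is not understood,
so treat this purely as a sketch of the requested style, not as the fix.

/* Placeholder names only -- the semantics of this PHY register are unknown. */
#define E1000_FALSE_IRP_WA_PHY_PAGE	772
#define E1000_FALSE_IRP_WA_PHY_REG	26

static void e1000e_false_irp_workaround(struct e1000_hw *hw)
{
	u16 phy_data;

	/* Dummy PHY read observed to suppress the spurious link-state event */
	e1e_rphy(hw, PHY_REG(E1000_FALSE_IRP_WA_PHY_PAGE,
			     E1000_FALSE_IRP_WA_PHY_REG), &phy_data);
}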
[Intel-wired-lan] [PATCH net-next v10 0/4] fix the DMA API misuse problem for page_pool
This patchset fixes the DMA API misuse problem described below:

A networking driver with page_pool support may hand over a page that is
still DMA-mapped to the network stack, and try to reuse that page after
the network stack is done with it and passes it back to the page_pool, to
avoid the penalty of DMA mapping/unmapping. With all the caching in the
network stack, some pages may be held in the network stack without
returning to the page_pool soon enough, and with a VF disable causing the
driver to be unbound, the page_pool does not stop the driver from doing
its unbind work. Instead, page_pool uses a workqueue to periodically check
whether some pages have come back from the network stack; if so, it does
the DMA-unmapping-related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver has already
unbound may leak resources or at worst corrupt memory. Fundamentally, the
page pool code cannot allow DMA mappings to outlive the driver they belong
to.

By using the 'struct page_pool_item' referenced by page->pp_item,
page_pool is not only able to keep track of the inflight pages, so it can
do the DMA unmapping if some pages are still handled in the networking
stack when page_pool_destroy() is called, but the networking stack is also
able to find the page_pool owning a page when returning pages back into
the page_pool:

1. When a page is added to the page_pool, an item is taken from
   pool->hold_items, its 'pp_netmem' is set to point to that page, and
   item->state and item->pp_netmem are set accordingly in order to keep
   track of that page; refill from pool->release_items when
   pool->hold_items is empty, or use an item from pool->slow_items when
   the fast items run out.

2. When a page is released from the page_pool, it is able to tell which
   page_pool this page belongs to by masking off the lower bits of the
   pointer to the page_pool_item *item, as the 'struct page_pool_item_block'
   is stored at the top of a struct page. After clearing pp_item->state,
   the item for the released page is added back to pool->release_items so
   that it can be reused for new pages, or is simply freed when it came
   from pool->slow_items.

3. When page_pool_destroy() is called, item->state is used to tell if a
   specific item is in use/DMA mapped or not by scanning all the item
   blocks in pool->item_blocks; item->netmem can then be used to do the
   DMA unmapping if the corresponding inflight page is DMA mapped.

From the below performance data, the overhead is not so obvious due to
performance variations on the arm64 server, and is less than 1 ns on the
x86 server for time_bench_page_pool01_fast_path() and
time_bench_page_pool02_ptr_ring(); there is about 10~20 ns of overhead for
time_bench_page_pool03_slow(), see more detail in [2].

arm64 server:

Before this patchset:
        fast_path     ptr_ring      slow
 1.     31.171 ns     60.980 ns     164.917 ns
 2.     28.824 ns     60.891 ns     170.241 ns
 3.     14.236 ns     60.583 ns     164.355 ns

With patchset:
 6.     26.163 ns     53.781 ns     189.450 ns
 7.     26.189 ns     53.798 ns     189.466 ns

X86 server:

| Test name   | Cycles | 1-5   |      | Nanosec | 1-5    |        | %      |
| (tasklet_*) | Before | After | diff | Before  | After  | diff   | change |
|-------------+--------+-------+------+---------+--------+--------+--------|
| fast_path   |     19 |    19 |    0 |   5.399 |  5.492 |  0.093 |    1.7 |
| ptr_ring    |     54 |    57 |    3 |  15.090 | 15.849 |  0.759 |    5.0 |
| slow        |    238 |   284 |   46 |  66.134 | 78.909 | 12.775 |   19.3 |

About 16 bytes of additional memory is also needed for each page_pool
owned page to fix the DMA API misuse problem.

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26...@kernel.org/T/
2. https://lore.kernel.org/all/f558df7a-d983-4fc5-8358-faf251994...@kernel.org/

CC: Alexander Lobakin
CC: Robin Murphy
CC: Alexander Duyck
CC: Andrew Morton
CC: Gaurav Batra
CC: Matthew Rosato
CC: IOMMU
CC: MM

Change log:

V10:
  1. Add nl API to dump item memory usage.
  2. Use __acquires() and __releases() to avoid 'context imbalance' warning.

V9:
  1. Drop the fix of a possible time window problem for NAPI recycling.
  2. Add a design description for the fix in patch 2.

V8:
  1. Drop the last 3 patches as they cause observable performance
     degradation for x86 systems.
  2. Remove the rcu read lock in page_pool_napi_local().
  3. Rename the item functions more consistently.

V7:
  1. Fix a use-after-free bug reported by KASAN as mentioned by Jakub.
  2. Fix the 'netmem' variable not being set up correctly, as mentioned by
     Simon.

V6:
  1. Repost based on latest net-next.
  2. Rename page_pool_to_pp() to page_pool_get_pp().

V5:
  1. Support unlimited inflight pages.
  2. Add some optimization to avoid the overhead of fixing the bug.

V4:
  1. Use scanning to do the unmapping.
  2. Split dma sync skipping into a separate patch.

V3:
  1. Target net-
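A condensed sketch of the bookkeeping the cover letter describes: items cycle
between hold and release lists, each item records its in-flight state and its
page, and the owning pool is recoverable from an item pointer because items
are packed into page-sized blocks whose header stores the pool. The field and
helper names below follow the text above but are paraphrased, not copied from
the patches.

#include <linux/list.h>
#include <linux/mm.h>		/* PAGE_MASK */

struct page_pool;		/* opaque here */

struct pp_item_sketch {
	unsigned long state;		/* non-zero while the page is in flight */
	unsigned long pp_netmem;	/* encoded pointer to the tracked page */
	struct list_head list;		/* on hold_items or release_items */
};

struct pp_item_block_sketch {
	struct page_pool *pp;		/* owning pool */
	struct pp_item_sketch items[];	/* fills the rest of one struct page */
};

/* Items live inside one page-sized block, so masking an item pointer down
 * to the page boundary finds the block header and thus the owning pool. */
static inline struct page_pool *pp_item_to_pool(struct pp_item_sketch *item)
{
	struct pp_item_block_sketch *block;

	block = (void *)((unsigned long)item & PAGE_MASK);
	return block->pp;
}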
[Intel-wired-lan] [PATCH net-next v10 1/4] page_pool: introduce page_pool_get_pp() API
Introduce page_pool_get_pp() API to avoid caller accessing page->pp
directly, in order to make the following patch more reviewable as the
following patch will change page->pp to page->pp_item to fix the DMA API
misuse problem.

Signed-off-by: Yunsheng Lin
---
 drivers/net/ethernet/freescale/fec_main.c           |  8 +---
 .../net/ethernet/google/gve/gve_buffer_mgmt_dqo.c   |  2 +-
 drivers/net/ethernet/intel/iavf/iavf_txrx.c         |  6 --
 drivers/net/ethernet/intel/idpf/idpf_txrx.c         | 14 +-
 drivers/net/ethernet/intel/libeth/rx.c              |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    |  3 ++-
 drivers/net/netdevsim/netdev.c                      |  6 --
 drivers/net/wireless/mediatek/mt76/mt76.h           |  2 +-
 include/net/libeth/rx.h                             |  3 ++-
 include/net/page_pool/helpers.h                     |  5 +
 10 files changed, 34 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index a86cfebedaa8..4ade1553557a 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1038,7 +1038,8 @@ static void fec_enet_bd_init(struct net_device *dev)
 			struct page *page = txq->tx_buf[i].buf_p;
 
 			if (page)
-				page_pool_put_page(page->pp, page, 0, false);
+				page_pool_put_page(page_pool_get_pp(page),
+						   page, 0, false);
 		}
 
 		txq->tx_buf[i].buf_p = NULL;
@@ -1576,7 +1577,7 @@ fec_enet_tx_queue(struct net_device *ndev, u16 queue_id, int budget)
 			xdp_return_frame_rx_napi(xdpf);
 		} else { /* recycle pages of XDP_TX frames */
 			/* The dma_sync_size = 0 as XDP_TX has already synced DMA for_device */
-			page_pool_put_page(page->pp, page, 0, true);
+			page_pool_put_page(page_pool_get_pp(page), page, 0, true);
 		}
 
 		txq->tx_buf[index].buf_p = NULL;
@@ -3343,7 +3344,8 @@ static void fec_enet_free_buffers(struct net_device *ndev)
 			} else {
 				struct page *page = txq->tx_buf[i].buf_p;
 
-				page_pool_put_page(page->pp, page, 0, false);
+				page_pool_put_page(page_pool_get_pp(page),
+						   page, 0, false);
 			}
 
 			txq->tx_buf[i].buf_p = NULL;
diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 403f0f335ba6..87422b8828ff 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -210,7 +210,7 @@ void gve_free_to_page_pool(struct gve_rx_ring *rx,
 	if (!page)
 		return;
 
-	page_pool_put_full_page(page->pp, page, allow_direct);
+	page_pool_put_full_page(page_pool_get_pp(page), page, allow_direct);
 	buf_state->page_info.page = NULL;
 }
 
diff --git a/drivers/net/ethernet/intel/iavf/iavf_txrx.c b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
index 422312b8b54a..72f17eaac277 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_txrx.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_txrx.c
@@ -1197,7 +1197,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 			     const struct libeth_fqe *rx_buffer,
 			     unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buffer->page,
 			rx_buffer->offset + hr, size, rx_buffer->truesize);
@@ -1214,7 +1215,8 @@ static void iavf_add_rx_frag(struct sk_buff *skb,
 static struct sk_buff *iavf_build_skb(const struct libeth_fqe *rx_buffer,
 				      unsigned int size)
 {
-	u32 hr = rx_buffer->page->pp->p.offset;
+	struct page_pool *pool = page_pool_get_pp(rx_buffer->page);
+	u32 hr = pool->p.offset;
 	struct sk_buff *skb;
 	void *va;
 
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.c b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
index 2747dc6a..248495f587d0 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.c
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.c
@@ -385,7 +385,8 @@ static void idpf_rx_page_rel(struct libeth_fqe *rx_buf)
 	if (unlikely(!rx_buf->page))
 		return;
 
-	page_pool_put_full_page(rx_buf->page->pp, rx_buf->page, false);
+	page_pool_put_full_page(page_pool_get_pp(rx_buf->page), rx_buf->page,
+
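The hunk that adds the helper itself (include/net/page_pool/helpers.h in the
diffstat) is not visible above. At this point in the series it is presumably
just a thin accessor along the following lines, with the body redirected
through page->pp_item by the later patches; treat this as an assumption, not
the patch text.

/* Presumed shape of the new helper -- a plain accessor for now. */
static inline struct page_pool *page_pool_get_pp(struct page *page)
{
	return page->pp;
}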
[Intel-wired-lan] [PATCH iwl-next v1] ice: refactor the Tx scheduler feature
Embed ice_get_tx_topo_user_sel() inside its only caller:
ice_devlink_tx_sched_layers_get(). Instead of jumping from a wrapper to
the function that does the "get" operation, the caller now does the "get"
itself.

Remove an unnecessary comment and use str_enabled_disabled() in
ice_init_tx_topology().

Suggested-by: Marcin Szycik
Reviewed-by: Michal Swiatkowski
Reviewed-by: Jedrzej Jagielski
Reviewed-by: Przemek Kitszel
Reviewed-by: Aleksandr Loktionov
Signed-off-by: Mateusz Polchlopek
---
 .../net/ethernet/intel/ice/devlink/devlink.c | 56 +++
 drivers/net/ethernet/intel/ice/ice_ddp.c     |  2 -
 drivers/net/ethernet/intel/ice/ice_main.c    |  8 +--
 3 files changed, 23 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/devlink/devlink.c b/drivers/net/ethernet/intel/ice/devlink/devlink.c
index fcb199efbea5..2355e21d115c 100644
--- a/drivers/net/ethernet/intel/ice/devlink/devlink.c
+++ b/drivers/net/ethernet/intel/ice/devlink/devlink.c
@@ -529,41 +529,6 @@ ice_devlink_reload_empr_finish(struct ice_pf *pf,
 	return 0;
 }
 
-/**
- * ice_get_tx_topo_user_sel - Read user's choice from flash
- * @pf: pointer to pf structure
- * @layers: value read from flash will be saved here
- *
- * Reads user's preference for Tx Scheduler Topology Tree from PFA TLV.
- *
- * Return: zero when read was successful, negative values otherwise.
- */
-static int ice_get_tx_topo_user_sel(struct ice_pf *pf, uint8_t *layers)
-{
-	struct ice_aqc_nvm_tx_topo_user_sel usr_sel = {};
-	struct ice_hw *hw = &pf->hw;
-	int err;
-
-	err = ice_acquire_nvm(hw, ICE_RES_READ);
-	if (err)
-		return err;
-
-	err = ice_aq_read_nvm(hw, ICE_AQC_NVM_TX_TOPO_MOD_ID, 0,
-			      sizeof(usr_sel), &usr_sel, true, true, NULL);
-	if (err)
-		goto exit_release_res;
-
-	if (usr_sel.data & ICE_AQC_NVM_TX_TOPO_USER_SEL)
-		*layers = ICE_SCHED_5_LAYERS;
-	else
-		*layers = ICE_SCHED_9_LAYERS;
-
-exit_release_res:
-	ice_release_nvm(hw);
-
-	return err;
-}
-
 /**
  * ice_update_tx_topo_user_sel - Save user's preference in flash
  * @pf: pointer to pf structure
@@ -610,19 +575,36 @@ static int ice_update_tx_topo_user_sel(struct ice_pf *pf, int layers)
  * @id: the parameter ID to set
  * @ctx: context to store the parameter value
  *
+ * Reads user's preference for Tx Scheduler Topology Tree from PFA TLV.
+ *
  * Return: zero on success and negative value on failure.
  */
 static int ice_devlink_tx_sched_layers_get(struct devlink *devlink, u32 id,
 					   struct devlink_param_gset_ctx *ctx)
 {
+	struct ice_aqc_nvm_tx_topo_user_sel usr_sel = {};
 	struct ice_pf *pf = devlink_priv(devlink);
+	struct ice_hw *hw = &pf->hw;
 	int err;
 
-	err = ice_get_tx_topo_user_sel(pf, &ctx->val.vu8);
+	err = ice_acquire_nvm(hw, ICE_RES_READ);
 	if (err)
 		return err;
 
-	return 0;
+	err = ice_aq_read_nvm(hw, ICE_AQC_NVM_TX_TOPO_MOD_ID, 0,
+			      sizeof(usr_sel), &usr_sel, true, true, NULL);
+	if (err)
+		goto exit_release_res;
+
+	if (usr_sel.data & ICE_AQC_NVM_TX_TOPO_USER_SEL)
+		ctx->val.vu8 = ICE_SCHED_5_LAYERS;
+	else
+		ctx->val.vu8 = ICE_SCHED_9_LAYERS;
+
+exit_release_res:
+	ice_release_nvm(hw);
+
+	return err;
 }
 
 /**
diff --git a/drivers/net/ethernet/intel/ice/ice_ddp.c b/drivers/net/ethernet/intel/ice/ice_ddp.c
index 69d5b1a28491..a2f738eaf02e 100644
--- a/drivers/net/ethernet/intel/ice/ice_ddp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ddp.c
@@ -2324,8 +2324,6 @@ enum ice_ddp_state ice_copy_and_init_pkg(struct ice_hw *hw, const u8 *buf,
  * @flags: pointer to descriptor flags
  * @set: 0-get, 1-set topology
  *
- * The function will get or set Tx topology
- *
  * Return: zero when set was successful, negative values otherwise.
  */
 static int
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index b084839eb811..9d9cad81b336 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -4544,10 +4544,10 @@ ice_init_tx_topology(struct ice_hw *hw, const struct firmware *firmware)
 	dev = ice_pf_to_dev(pf);
 	err = ice_cfg_tx_topo(hw, firmware->data, firmware->size);
 	if (!err) {
-		if (hw->num_tx_sched_layers > num_tx_sched_layers)
-			dev_info(dev, "Tx scheduling layers switching feature disabled\n");
-		else
-			dev_info(dev, "Tx scheduling layers switching feature enabled\n");
+		dev_info(dev, "Tx scheduling layers switching feature %s\n",
+			 str_enabled_disabled(hw->num_tx_sched_layers <=
+					      num_tx_sched_layers));
+		/* if there w
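str_enabled_disabled(), used in the ice_main.c hunk, comes from
<linux/string_choices.h>; it simply maps a bool to the string "enabled" or
"disabled", which is what lets the two dev_info() branches collapse into a
single call. Its behaviour is equivalent to the following (shown for
reference, not part of the patch):

/* Equivalent behaviour of the <linux/string_choices.h> helper. */
static inline const char *str_enabled_disabled(bool v)
{
	return v ? "enabled" : "disabled";
}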