Re: [Intel-wired-lan] [RFC net-next 1/1] idpf: Don't hard code napi_struct size
From: Joe Damato
Date: Mon, 30 Sep 2024 15:17:46 -0700

> On Mon, Sep 30, 2024 at 03:10:41PM +0200, Przemek Kitszel wrote:
>> On 9/30/24 14:38, Alexander Lobakin wrote:
>>> From: Alexander Lobakin
>>> Date: Mon, 30 Sep 2024 14:33:45 +0200
>>>
>>>> From: Joe Damato
>>>> Date: Wed, 25 Sep 2024 18:00:17 +0000
>>
>>> struct napi_struct doesn't have any such fields and doesn't depend on
>>> the kernel configuration, that's why it's hardcoded.
>>> Please don't change that, just adjust the hardcoded values when needed.
>>
>> This is the crucial point, and I agree with Olek.
>>
>> If you will find it more readable/future proof, feel free to add
>> comments like /* napi_struct */ near their "400" part in the hardcode.
>>
>> Side note: you could just run this as a part of your netdev series,
>> given you will properly CC.
>
> I've already sent the official patch because I didn't hear back on
> this RFC.
>
> Sorry, but I respectfully disagree with you both on this; I don't
> think it makes sense to have code that will break if fields are
> added to napi_struct, thereby requiring anyone who works on the core
> to update this code over and over again.
>
> I understand that the sizeofs are "meaningless" because of your
> desire to optimize cachelines, but IMHO and, again, respectfully, it
> seems strange that any adjustments to core should require a change
> to this code.

But if you change any core API, let's say rename a field used in several
drivers, you anyway need to adjust the affected drivers. It's a common
practice that some core changes require touching drivers.

Moreover, sizeof(struct napi_struct) doesn't get changed often, so I
don't see any issue in adjusting one line in idpf by just increasing one
value there by sizeof(your_new_field).

If you do that, then:

+ you get notified that you may affect the performance of different
  drivers (napi_struct is usually embedded into perf-critical
  structures in drivers);
+ I get notified (Cced) that someone's change will affect idpf, so I'll
  be aware of it and review the change;
- you need to adjust one line in idpf.

Is it just me, or do these '+'s easily outweigh that sole '-'?

> I really do not want to include a patch to change the size of
> napi_struct in idpf as part of my RFC which is totally unrelated to
> idpf and will detract from the review of my core changes.

One line won't distract anyone. The kernel tree contains, let's say, one
patch from me where I needed to modify around 20 drivers within one
commit due to a core code structure change -- the number of locs I
changed in the drivers was way bigger than the number of locs I changed
in the core. And there's a ton of such commits in there. Again, it's a
common practice.

> Perhaps my change is unacceptable, but there should be a way to deal
> with this that doesn't require everyone working on core networking
> code to update idpf, right?

We can't isolate the core code from the drivers up to the point that you
wouldn't need to touch the drivers at all when working on core changes;
we're not Windows. The drivers and the core are inside one tree so that
such changes can be made easily, and no code inside the whole tree is
ABI (excl uAPI).

Thanks,
Olek
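For context, the "hardcoded values" being defended here are compile-time
size checks; a minimal sketch of the pattern, with illustrative numbers
and field names (not the actual idpf code):

    #include <linux/build_bug.h>
    #include <linux/netdevice.h>

    /* Hypothetical hot-path vector; the real idpf structures differ. */
    struct q_vector {
            struct napi_struct napi; /* 400 bytes on the config in question */
            u16 idx;
            /* ... further fields, grouped by cacheline ... */
    };

    /* The debated pattern: a hand-maintained literal that must be bumped
     * by sizeof(new_field) whenever core grows napi_struct. */
    static_assert(offsetof(struct q_vector, idx) == 400,
                  "napi_struct changed size; re-check layout with pahole");

This makes the tradeoff concrete: the assert breaks the idpf build on any
napi_struct growth, which is exactly the notification Olek wants and the
maintenance burden Joe objects to.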
[Intel-wired-lan] [PATCH iwl-next v1 1/1] igc: remove autoneg parameter from igc_mac_info
Since the igc driver doesn't support forced speed configuration and its
current related hardware doesn't support it either, there is no use of
the mac.autoneg parameter. Moreover, in one case this usage might result
in a NULL pointer dereference due to an uninitialized function pointer,
phy.ops.force_speed_duplex. Therefore, remove this parameter from the
igc code.

Signed-off-by: Vitaly Lifshits
---
 drivers/net/ethernet/intel/igc/igc_diag.c    |   3 +-
 drivers/net/ethernet/intel/igc/igc_ethtool.c |  13 +-
 drivers/net/ethernet/intel/igc/igc_hw.h      |   1 -
 drivers/net/ethernet/intel/igc/igc_mac.c     | 316 +--
 drivers/net/ethernet/intel/igc/igc_main.c    |   1 -
 drivers/net/ethernet/intel/igc/igc_phy.c     |  24 +-
 6 files changed, 163 insertions(+), 195 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_diag.c b/drivers/net/ethernet/intel/igc/igc_diag.c
index cc621970c0cd..a43d7244ee70 100644
--- a/drivers/net/ethernet/intel/igc/igc_diag.c
+++ b/drivers/net/ethernet/intel/igc/igc_diag.c
@@ -173,8 +173,7 @@ bool igc_link_test(struct igc_adapter *adapter, u64 *data)
 	*data = 0;
 
 	/* add delay to give enough time for autonegotioation to finish */
-	if (adapter->hw.mac.autoneg)
-		ssleep(5);
+	ssleep(5);
 
 	link_up = igc_has_link(adapter);
 	if (!link_up) {
diff --git a/drivers/net/ethernet/intel/igc/igc_ethtool.c b/drivers/net/ethernet/intel/igc/igc_ethtool.c
index 5b0c6f433767..817838677817 100644
--- a/drivers/net/ethernet/intel/igc/igc_ethtool.c
+++ b/drivers/net/ethernet/intel/igc/igc_ethtool.c
@@ -1821,11 +1821,8 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
 	ethtool_link_ksettings_add_link_mode(cmd, advertising, 2500baseT_Full);
 
 	/* set autoneg settings */
-	if (hw->mac.autoneg == 1) {
-		ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
-		ethtool_link_ksettings_add_link_mode(cmd, advertising,
-						     Autoneg);
-	}
+	ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
+	ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
 
 	/* Set pause flow control settings */
 	ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
@@ -1878,10 +1875,7 @@ static int igc_ethtool_get_link_ksettings(struct net_device *netdev,
 		cmd->base.duplex = DUPLEX_UNKNOWN;
 	}
 	cmd->base.speed = speed;
-	if (hw->mac.autoneg)
-		cmd->base.autoneg = AUTONEG_ENABLE;
-	else
-		cmd->base.autoneg = AUTONEG_DISABLE;
+	cmd->base.autoneg = AUTONEG_ENABLE;
 
 	/* MDI-X => 2; MDI =>1; Invalid =>0 */
 	if (hw->phy.media_type == igc_media_type_copper)
@@ -1955,7 +1949,6 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
 		advertised |= ADVERTISE_10_HALF;
 
 	if (cmd->base.autoneg == AUTONEG_ENABLE) {
-		hw->mac.autoneg = 1;
 		hw->phy.autoneg_advertised = advertised;
 		if (adapter->fc_autoneg)
 			hw->fc.requested_mode = igc_fc_default;
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h b/drivers/net/ethernet/intel/igc/igc_hw.h
index e1c572e0d4ef..d9d1a1a11daf 100644
--- a/drivers/net/ethernet/intel/igc/igc_hw.h
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -92,7 +92,6 @@ struct igc_mac_info {
 
 	bool asf_firmware_present;
 	bool arc_subsystem_valid;
-	bool autoneg;
 	bool autoneg_failed;
 	bool get_link_status;
 };
diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
index a5c4b19d71a2..d344e0a1cd5e 100644
--- a/drivers/net/ethernet/intel/igc/igc_mac.c
+++ b/drivers/net/ethernet/intel/igc/igc_mac.c
@@ -386,14 +386,6 @@ s32 igc_check_for_copper_link(struct igc_hw *hw)
 	 */
 	igc_check_downshift(hw);
 
-	/* If we are forcing speed/duplex, then we simply return since
-	 * we have already determined whether we have link or not.
-	 */
-	if (!mac->autoneg) {
-		ret_val = -IGC_ERR_CONFIG;
-		goto out;
-	}
-
 	/* Auto-Neg is enabled.  Auto Speed Detection takes care
 	 * of MAC speed/duplex configuration.  So we only need to
 	 * configure Collision Distance in the MAC.
@@ -468,173 +460,171 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
 		goto out;
 	}
 
-	/* Check for the case where we have copper media and auto-neg is
-	 * enabled.  In this case, we need to check and see if Auto-Neg
-	 * has completed, and if so, how the PHY and link partner has
-	 * flow control configured.
+	/* In auto-neg, we need to check and see if Auto-Neg has completed,
+	 * and if so, how the PHY and link partner has flow control
+	 * configured.
 	 */
-	if (mac->auto
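To illustrate the NULL pointer dereference the commit message mentions,
here is a minimal, self-contained sketch of the failure pattern
(simplified stand-in types, not the actual igc call chain):

    #include <stdio.h>

    struct hw;
    struct phy_ops {
            /* igc hardware never initializes this callback */
            int (*force_speed_duplex)(struct hw *hw);
    };
    struct hw {
            struct phy_ops ops;
            int autoneg;
    };

    static int setup_copper_link(struct hw *hw)
    {
            if (!hw->autoneg)
                    /* Pre-patch: the autoneg == 0 path could reach a
                     * callback that was never set -> NULL dereference. */
                    return hw->ops.force_speed_duplex(hw);
            return 0;       /* autoneg path: MAC auto speed detection */
    }

    int main(void)
    {
            struct hw hw = { .autoneg = 1 }; /* the only state igc supports */
            printf("%d\n", setup_copper_link(&hw));
            return 0;
    }

Removing the autoneg flag removes the dead branch, and with it the only
route to the uninitialized pointer.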
[Intel-wired-lan] [PATCH iwl-net v1 1/1] e1000e: Remove Meteor Lake SMBUS workarounds
This is a partial revert of commit 76a0a3f9cc2f ("e1000e: fix force
smbus during suspend flow"). That commit fixed a sporadic PHY access
issue but introduced a regression in runtime suspend flows.

The original issue on Meteor Lake systems was rare in terms of both the
reproduction rate and the number of systems affected. After the
integration of commit 0a6ad4d9e169 ("e1000e: avoid failing the system
during pm_suspend"), PHY access loss can no longer cause a system-level
suspend failure, as it only occurs when the LAN cable is disconnected
and is recovered during the system resume flow. Therefore, its
functional impact is low, and priority is given to stabilizing runtime
suspend.

Fixes: 76a0a3f9cc2f ("e1000e: fix force smbus during suspend flow")
Signed-off-by: Vitaly Lifshits
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ce227b56cf72..2f9655cf5dd9 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1205,12 +1205,10 @@ s32 e1000_enable_ulp_lpt_lp(struct e1000_hw *hw, bool to_sx)
 	if (ret_val)
 		goto out;
 
-	if (hw->mac.type != e1000_pch_mtp) {
-		ret_val = e1000e_force_smbus(hw);
-		if (ret_val) {
-			e_dbg("Failed to force SMBUS: %d\n", ret_val);
-			goto release;
-		}
+	ret_val = e1000e_force_smbus(hw);
+	if (ret_val) {
+		e_dbg("Failed to force SMBUS: %d\n", ret_val);
+		goto release;
 	}
 
 	/* Si workaround for ULP entry flow on i127/rev6 h/w. Enable
@@ -1273,13 +1271,6 @@ s32 e1000_enable_ulp_lpt_lp(struct e1000_hw *hw, bool to_sx)
 	}
 
 release:
-	if (hw->mac.type == e1000_pch_mtp) {
-		ret_val = e1000e_force_smbus(hw);
-		if (ret_val)
-			e_dbg("Failed to force SMBUS over MTL system: %d\n",
-			      ret_val);
-	}
-
 	hw->phy.ops.release(hw);
 out:
 	if (ret_val)
-- 
2.34.1
Re: [Intel-wired-lan] [net-next v3 1/2] e1000e: Link NAPI instances to queues and IRQs
On 10/1/2024 1:50 PM, Simon Horman wrote:
> On Mon, Sep 30, 2024 at 05:12:31PM +0000, Joe Damato wrote:
>> Add support for netdev-genl, allowing users to query IRQ, NAPI, and
>> queue information.
>>
>> After this patch is applied, note the IRQs assigned to my NIC:
>>
>> $ cat /proc/interrupts | grep ens | cut -f1 --delimiter=':'
>>  50
>>  51
>>  52
>>
>> While e1000e allocates 3 IRQs (RX, TX, and other), it looks like e1000e
>> only has a single NAPI, so I've associated the NAPI with the RX IRQ (50
>> on my system, seen above).
>>
>> Note the output from the cli:
>>
>> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>    --dump napi-get --json='{"ifindex": 2}'
>> [{'id': 145, 'ifindex': 2, 'irq': 50}]
>>
>> This device supports only 1 rx and 1 tx queue, so querying that:
>>
>> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>>    --dump queue-get --json='{"ifindex": 2}'
>> [{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'rx'},
>>  {'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'tx'}]
>>
>> Signed-off-by: Joe Damato
>
> Reviewed-by: Simon Horman

Acked-by: Vitaly Lifshits
Re: [Intel-wired-lan] [RFC net-next 1/1] idpf: Don't hard code napi_struct size
On Tue, Oct 01, 2024 at 03:14:07PM +0200, Alexander Lobakin wrote:
> From: Joe Damato
> Date: Mon, 30 Sep 2024 15:17:46 -0700
>
>> On Mon, Sep 30, 2024 at 03:10:41PM +0200, Przemek Kitszel wrote:
>>> On 9/30/24 14:38, Alexander Lobakin wrote:
>>>> From: Alexander Lobakin
>>>> Date: Mon, 30 Sep 2024 14:33:45 +0200
>>>>
>>>>> From: Joe Damato
>>>>> Date: Wed, 25 Sep 2024 18:00:17 +0000
>>>
>>>> struct napi_struct doesn't have any such fields and doesn't depend on
>>>> the kernel configuration, that's why it's hardcoded.
>>>> Please don't change that, just adjust the hardcoded values when needed.
>>>
>>> This is the crucial point, and I agree with Olek.
>>>
>>> If you will find it more readable/future proof, feel free to add
>>> comments like /* napi_struct */ near their "400" part in the hardcode.
>>>
>>> Side note: you could just run this as a part of your netdev series,
>>> given you will properly CC.
>>
>> I've already sent the official patch because I didn't hear back on
>> this RFC.
>>
>> Sorry, but I respectfully disagree with you both on this; I don't
>> think it makes sense to have code that will break if fields are
>> added to napi_struct, thereby requiring anyone who works on the core
>> to update this code over and over again.
>>
>> I understand that the sizeofs are "meaningless" because of your
>> desire to optimize cachelines, but IMHO and, again, respectfully, it
>> seems strange that any adjustments to core should require a change
>> to this code.
>
> But if you change any core API, let's say rename a field used in several
> drivers, you anyway need to adjust the affected drivers.

Sorry, but that's a totally different argument. There are obvious cases
where touching certain parts of core would require changes to drivers,
yes. I agree that if I change an API or a struct field name, or remove
an enum, then this affects drivers which must be updated.

idpf does not fall in this category as it relies on the *size* of the
structure, not the field names. Adding a new field wouldn't break any of
your existing code accessing fields in the struct since I haven't
removed a field. Adding a new field may adjust the size.

According to your previous email: idpf cares about the size because it
wants the cachelines to look a certain way in pahole. OK, but why is
that the concern of someone working on core code? It doesn't make sense
to me that if I add a new field, I now need to look at pahole output for
idpf to make sure you will approve of the cacheline placement.

> It's a common practice that some core changes require touching drivers.
> Moreover, sizeof(struct napi_struct) doesn't get changed often, so I
> don't see any issue in adjusting one line in idpf by just increasing one
> value there by sizeof(your_new_field).

The problem is: what if everyone starts doing this? Trying to rely on
the specific size of some core data structure in their driver code for
cacheline placement? Do I then need to sift through each driver with
pahole and adjust the cacheline placement of each affected structure
because I added a field to napi_struct?

The burden of optimizing cacheline placement should fall on the driver
maintainers, not the person adding the field to napi_struct. It would be
different if I deleted a field or renamed a field. In those cases: sure,
that's my issue to deal with by changing the affected drivers. Just like
it would be if I removed an enum value.

> If you do that, then:
> + you get notified that you may affect the performance of different
>   drivers (napi_struct is usually embedded into perf-critical
>   structures in drivers);

But I don't want to think about idpf; it's not my responsibility at all
to ensure that the cacheline placement in idpf is optimal.

> + I get notified (Cced) that someone's change will affect idpf, so I'll
>   be aware of it and review the change;
> - you need to adjust one line in idpf.

Adjust one line in idpf and then go through another revision if the
maintainers don't like what the cachelines look like in pahole? And I
need to do this for something totally unrelated that idpf won't even
support (because I'm not adding support for it in the RFC)?

> Is it just me, or do these '+'s easily outweigh that sole '-'?

I disagree even more than I disagreed yesterday.
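Schematically, the two positions in this thread correspond to two assert
styles (field names and offsets hypothetical, not the actual idpf code):

    /* What idpf does today, per Olek: a hand-maintained literal. */
    static_assert(offsetof(struct idpf_q_vector, tx) == 400 + 8);

    /* What the RFC proposed: derive the value from the core struct, so
     * growing napi_struct never breaks the driver build. */
    static_assert(offsetof(struct idpf_q_vector, tx) ==
                  sizeof(struct napi_struct) + 8);

The first style turns every napi_struct growth into a driver build break
(and a CC to the idpf maintainers); the second keeps the build green but
silently shifts the cachelines.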
Re: [Intel-wired-lan] [net-next v3 1/2] e1000e: Link NAPI instances to queues and IRQs
On Mon, Sep 30, 2024 at 05:12:31PM +0000, Joe Damato wrote:
> Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
> information.
>
> After this patch is applied, note the IRQs assigned to my NIC:
>
> $ cat /proc/interrupts | grep ens | cut -f1 --delimiter=':'
>  50
>  51
>  52
>
> While e1000e allocates 3 IRQs (RX, TX, and other), it looks like e1000e
> only has a single NAPI, so I've associated the NAPI with the RX IRQ (50
> on my system, seen above).
>
> Note the output from the cli:
>
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>    --dump napi-get --json='{"ifindex": 2}'
> [{'id': 145, 'ifindex': 2, 'irq': 50}]
>
> This device supports only 1 rx and 1 tx queue, so querying that:
>
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>    --dump queue-get --json='{"ifindex": 2}'
> [{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'rx'},
>  {'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'tx'}]
>
> Signed-off-by: Joe Damato

Reviewed-by: Simon Horman
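For readers unfamiliar with the linking API in play: netif_napi_set_irq()
and netif_queue_set_napi() are the core helpers that make the IRQ and
queue associations visible to netdev-genl. A minimal sketch of where such
calls typically land in a driver (not the verbatim e1000e hunk; adapter
layout is hypothetical):

    static int example_open(struct net_device *netdev)
    {
            struct example_adapter *adapter = netdev_priv(netdev);

            /* Tell the core which IRQ drives this NAPI instance... */
            netif_napi_set_irq(&adapter->napi, adapter->rx_irq);

            /* ...and which queues the NAPI services, so the napi-get /
             * queue-get dumps shown above have something to report. */
            netif_queue_set_napi(netdev, 0, NETDEV_QUEUE_TYPE_RX,
                                 &adapter->napi);
            netif_queue_set_napi(netdev, 0, NETDEV_QUEUE_TYPE_TX,
                                 &adapter->napi);

            return 0;
    }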
[Intel-wired-lan] [tnguy-net-queue:100GbE] BUILD SUCCESS 09d0fb5cb30ebcaed4a33028ae383f5a1463e2b2
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue.git 100GbE
branch HEAD: 09d0fb5cb30ebcaed4a33028ae383f5a1463e2b2  idpf: deinit virtchnl transaction manager after vport and vectors

elapsed time: 829m

configs tested: 94
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
    alpha          allnoconfig                   gcc-14.1.0
    alpha          allyesconfig                  clang-20
    alpha          defconfig                     gcc-14.1.0
    arc            allmodconfig                  clang-20
    arc            allnoconfig                   gcc-14.1.0
    arc            allyesconfig                  clang-20
    arc            axs103_defconfig              gcc-14.1.0
    arc            defconfig                     gcc-14.1.0
    arm            allmodconfig                  clang-20
    arm            allnoconfig                   gcc-14.1.0
    arm            allyesconfig                  clang-20
    arm            collie_defconfig              gcc-14.1.0
    arm            davinci_all_defconfig         gcc-14.1.0
    arm            defconfig                     gcc-14.1.0
    arm            lpc18xx_defconfig             gcc-14.1.0
    arm            versatile_defconfig           gcc-14.1.0
    arm64          allmodconfig                  clang-20
    arm64          allnoconfig                   gcc-14.1.0
    arm64          defconfig                     gcc-14.1.0
    csky           allnoconfig                   gcc-14.1.0
    csky           defconfig                     gcc-14.1.0
    hexagon        allmodconfig                  clang-20
    hexagon        allnoconfig                   gcc-14.1.0
    hexagon        allyesconfig                  clang-20
    hexagon        defconfig                     gcc-14.1.0
    i386           allmodconfig                  clang-18
    i386           allnoconfig                   clang-18
    i386           allyesconfig                  clang-18
    i386           defconfig                     clang-18
    loongarch      allmodconfig                  gcc-14.1.0
    loongarch      allnoconfig                   gcc-14.1.0
    loongarch      defconfig                     gcc-14.1.0
    m68k           allmodconfig                  gcc-14.1.0
    m68k           allnoconfig                   gcc-14.1.0
    m68k           allyesconfig                  gcc-14.1.0
    m68k           defconfig                     gcc-14.1.0
    microblaze     allmodconfig                  gcc-14.1.0
    microblaze     allnoconfig                   gcc-14.1.0
    microblaze     allyesconfig                  gcc-14.1.0
    microblaze     defconfig                     gcc-14.1.0
    mips           allnoconfig                   gcc-14.1.0
    mips           db1xxx_defconfig              gcc-14.1.0
    mips           rs90_defconfig                gcc-14.1.0
    nios2          allnoconfig                   gcc-14.1.0
    nios2          defconfig                     gcc-14.1.0
    openrisc       allnoconfig                   clang-20
    openrisc       allyesconfig                  gcc-14.1.0
    openrisc       defconfig                     gcc-12
    openrisc       or1klitex_defconfig           gcc-14.1.0
    parisc         allmodconfig                  gcc-14.1.0
    parisc         allnoconfig                   clang-20
    parisc         allyesconfig                  gcc-14.1.0
    parisc         defconfig                     gcc-12
    parisc64       defconfig                     gcc-14.1.0
    powerpc        allmodconfig                  gcc-14.1.0
    powerpc        allnoconfig                   clang-20
    powerpc        allyesconfig                  gcc-14.1.0
    powerpc        canyonlands_defconfig         gcc-14.1.0
    powerpc        cell_defconfig                gcc-14.1.0
    powerpc        fsp2_defconfig                gcc-14.1.0
    powerpc        mpc834x_itxgp_defconfig       gcc-14.1.0
    powerpc        wii_defconfig                 gcc-14.1.0
    riscv          allmodconfig                  gcc-14.1.0
    riscv          allnoconfig                   clang-20
    riscv          allyesconfig                  gcc-14.1.0
    riscv          defconfig                     gcc-12
    s390           allmodconfig                  clang-20
    s390           allmodconfig                  gcc-14.1.0
    s390           allnoconfig                   clang-20
    s390           allyesconfig                  gcc-14.1.0
    s390           defconfig                     gcc-12
    sh             alldefconfig                  gcc-14.1.0
    sh             allmodconfig                  gcc-14.1.0
    sh             allnoconfig                   gcc-14.1.0
    sh             allyesconfig                  gcc-14.1.0
    sh             defconfig                     gcc-12
    sh             ecovec24-romimage_defconfig   gcc-14.1.0
    sh             titan_defconfig               gcc-14.1.0
    sh             ul2_defconfig                 gcc-14.1.0
    sparc          allmodconfig                  gcc-14.1.0
    sparc64        defconfig                     gcc-12
    um             allmodconfig                  clang-20
    um             allnoconfig                   clang-20
    um             allyesconfig                  clang-20
    um             defconfig                     gcc-12
    um             i386_defconfig                gcc-12
    um             x86_64_defconfig
Re: [Intel-wired-lan] [PATCH iwl-next v10 07/14] iavf: add support for indirect access to PHC time
On 8/21/2024 4:31 PM, Alexander Lobakin wrote:
> From: Wojciech Drewek
> Date: Wed, 21 Aug 2024 14:15:32 +0200
>
>> From: Jacob Keller
>>
>> Implement support for reading the PHC time indirectly via the
>> VIRTCHNL_OP_1588_PTP_GET_TIME operation.
>
> [...]
>
>> +/**
>> + * iavf_queue_ptp_cmd - Queue PTP command for sending over virtchnl
>> + * @adapter: private adapter structure
>> + * @cmd: the command structure to send
>> + *
>> + * Queue the given command structure into the PTP virtchnl command
>> + * queue to send to the PF.
>> + */
>> +static void iavf_queue_ptp_cmd(struct iavf_adapter *adapter,
>> +			       struct iavf_ptp_aq_cmd *cmd)
>> +{
>> +	mutex_lock(&adapter->ptp.aq_cmd_lock);
>> +	list_add_tail(&cmd->list, &adapter->ptp.aq_cmds);
>> +	mutex_unlock(&adapter->ptp.aq_cmd_lock);
>> +
>> +	adapter->aq_required |= IAVF_FLAG_AQ_SEND_PTP_CMD;
>> +	mod_delayed_work(adapter->wq, &adapter->watchdog_task, 0);
>
> Are you sure you need delayed_work here? delayed_work is used only when
> you need to run it after a delay. If the delay is always 0, then you
> only need work_struct and queue_work().

Sorry for the late response, but I was on a quite long vacation.

Regarding your question - I think it is needed to have
mod_delayed_work() here, at least as for now. I agree with your
suggestion to use queue_work(), but this function takes work_struct as a
parameter, and in our case the adapter->watchdog_task field is of struct
delayed_work type. It uses the function iavf_watchdog_task(), which does
plenty of things (including sending the ptp cmd). Changing
adapter->watchdog_task to be just struct work_struct is not applicable
here, as in the iavf_main.c file we add it to the queue in a few places
with different delays.

To make a long story short - I agree with your argument, but as long as
this 0 delay is not causing a performance regression, I will stick with
this solution implemented by Jake.

>> +}
>> +
>> +/**
>> + * iavf_send_phc_read - Send request to read PHC time
>
> [...]
>
>> +static int iavf_ptp_gettimex64(struct ptp_clock_info *info,
>> +			       struct timespec64 *ts,
>> +			       struct ptp_system_timestamp *sts)
>> +{
>> +	struct iavf_adapter *adapter = iavf_clock_to_adapter(info);
>> +
>> +	if (!adapter->ptp.initialized)
>> +		return -ENODEV;
>
> Why is it -ENODEV here, but -EOPNOTSUPP several functions above? Are
> you sure these codes are the ones expected by the upper layers?
>
>> +
>> +	return iavf_read_phc_indirect(adapter, ts, sts);
>> +}
>
> [...]
>
>> diff --git a/drivers/net/ethernet/intel/iavf/iavf_ptp.h b/drivers/net/ethernet/intel/iavf/iavf_ptp.h
>> index c2ed24cef926..0bb4bddc1495 100644
>> --- a/drivers/net/ethernet/intel/iavf/iavf_ptp.h
>> +++ b/drivers/net/ethernet/intel/iavf/iavf_ptp.h
>> @@ -6,9 +6,13 @@
>>
>>  #include "iavf_types.h"
>>
>> +#define iavf_clock_to_adapter(info)				\
>> +	container_of_const(info, struct iavf_adapter, ptp.info)
>
> It's only used in one file, are you sure you need it here in the
> header? Or will it be used in later patches?
>
> [...]
>
>> +void iavf_virtchnl_send_ptp_cmd(struct iavf_adapter *adapter)
>> +{
>> +	struct device *dev = &adapter->pdev->dev;
>> +	struct iavf_ptp_aq_cmd *cmd;
>> +	int err;
>> +
>> +	if (!adapter->ptp.initialized) {
>
> BTW does it make sense to introduce ptp.initialized since you can
> always check ptp.clock for being %NULL and it will be the same?
>
>> +		/* This shouldn't be possible to hit, since no messages should
>> +		 * be queued if PTP is not initialized.
>> +		 */
>> +		pci_err(adapter->pdev, "PTP is not initialized\n");
>> +		adapter->aq_required &= ~IAVF_FLAG_AQ_SEND_PTP_CMD;
>> +		return;
>> +	}
>> +
>> +	mutex_lock(&adapter->ptp.aq_cmd_lock);
>> +	cmd = list_first_entry_or_null(&adapter->ptp.aq_cmds,
>> +				       struct iavf_ptp_aq_cmd, list);
>> +	if (!cmd) {
>> +		/* no further PTP messages to send */
>> +		adapter->aq_required &= ~IAVF_FLAG_AQ_SEND_PTP_CMD;
>> +		goto out_unlock;
>> +	}
>> +
>> +	if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
>> +		/* bail because we already have a command pending */
>> +		dev_err(dev, "Cannot send PTP command %d, command %d pending\n",
>
> pci_err()
>
>> +			cmd->v_opcode, adapter->current_op);
>> +		goto out_unlock;
>> +	}
>> +
>> +	err = iavf_send_pf_msg(adapter, cmd->v_opcode, cmd->msg, cmd->msglen);
>> +	if (!err) {
>> +		/* Command was sent without errors, so we can remove it from
>> +		 * the list and discard it.
>> +		 */
>> +		list_del(&cmd->list);
>> +		kfree(cmd);
>> +	} else {
>> +		/* We failed to send the command, try again next cycle */
>> +		dev_warn(dev, "Failed to send PTP command %d\n", cmd->v_opcode);
>
> pci_err() I'd say.
>
>> +	}
>> +
>> +	if (list_empty(&adapter->ptp.aq_cmds))
>> +		/* no further PTP messages to send */
>> +
[Intel-wired-lan] [PATCH iwl-next 1/2] ice: rework of dump serdes equalizer values feature
Refactor the function ice_get_tx_rx_equa() to iterate over a new table
of params instead of making multiple calls to
ice_aq_get_phy_equalization(). A subsequent commit will extend that
function by adding more serdes equalizer values to the dump.

Shorten the fields of struct ice_serdes_equalization_to_ethtool for
readability purposes.

Reviewed-by: Przemek Kitszel
Signed-off-by: Mateusz Polchlopek
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 93 ++--
 drivers/net/ethernet/intel/ice/ice_ethtool.h | 22 ++-
 2 files changed, 38 insertions(+), 77 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 2924ac61300d..19e7a9d93928 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -693,75 +693,36 @@ static int ice_get_port_topology(struct ice_hw *hw, u8 lport,
 static int ice_get_tx_rx_equa(struct ice_hw *hw, u8 serdes_num,
 			      struct ice_serdes_equalization_to_ethtool *ptr)
 {
+	static const int tx = ICE_AQC_OP_CODE_TX_EQU;
+	static const int rx = ICE_AQC_OP_CODE_RX_EQU;
+	struct {
+		int data_in;
+		int opcode;
+		int *out;
+	} aq_params[] = {
+		{ ICE_AQC_TX_EQU_PRE1, tx, &ptr->tx_equ_pre1 },
+		{ ICE_AQC_TX_EQU_PRE3, tx, &ptr->tx_equ_pre3 },
+		{ ICE_AQC_TX_EQU_ATTEN, tx, &ptr->tx_equ_atten },
+		{ ICE_AQC_TX_EQU_POST1, tx, &ptr->tx_equ_post1 },
+		{ ICE_AQC_TX_EQU_PRE2, tx, &ptr->tx_equ_pre2 },
+		{ ICE_AQC_RX_EQU_PRE2, rx, &ptr->rx_equ_pre2 },
+		{ ICE_AQC_RX_EQU_PRE1, rx, &ptr->rx_equ_pre1 },
+		{ ICE_AQC_RX_EQU_POST1, rx, &ptr->rx_equ_post1 },
+		{ ICE_AQC_RX_EQU_BFLF, rx, &ptr->rx_equ_bflf },
+		{ ICE_AQC_RX_EQU_BFHF, rx, &ptr->rx_equ_bfhf },
+		{ ICE_AQC_RX_EQU_DRATE, rx, &ptr->rx_equ_drate },
+	};
 	int err;
 
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE1,
-					  ICE_AQC_OP_CODE_TX_EQU, serdes_num,
-					  &ptr->tx_equalization_pre1);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE3,
-					  ICE_AQC_OP_CODE_TX_EQU, serdes_num,
-					  &ptr->tx_equalization_pre3);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_ATTEN,
-					  ICE_AQC_OP_CODE_TX_EQU, serdes_num,
-					  &ptr->tx_equalization_atten);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_POST1,
-					  ICE_AQC_OP_CODE_TX_EQU, serdes_num,
-					  &ptr->tx_equalization_post1);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE2,
-					  ICE_AQC_OP_CODE_TX_EQU, serdes_num,
-					  &ptr->tx_equalization_pre2);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_PRE2,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_pre2);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_PRE1,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_pre1);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_POST1,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_post1);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_BFLF,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_bflf);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_BFHF,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_bfhf);
-	if (err)
-		return err;
-
-	err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_DRATE,
-					  ICE_AQC_OP_CODE_RX_EQU, serdes_num,
-					  &ptr->rx_equalization_drate);
-	if (err)
-		return err;
+	for (int i = 0; i < ARRAY_SIZE(aq_params); i++) {
+		err
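The hunk is cut off above; judging from the deleted calls, the new loop
presumably reduces to something like this (a reconstruction, not the
verbatim patch):

    	for (int i = 0; i < ARRAY_SIZE(aq_params); i++) {
    		err = ice_aq_get_phy_equalization(hw, aq_params[i].data_in,
    						  aq_params[i].opcode,
    						  serdes_num,
    						  aq_params[i].out);
    		if (err)
    			return err;
    	}

    	return 0;

which is why the follow-up patch can add seventeen more registers by
adding table entries alone.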
[Intel-wired-lan] [PATCH iwl-next 2/2] ice: extend dump serdes equalizer values feature
Extend the work done in commit 70838938e89c ("ice: Implement driver
functionality to dump serdes equalizer values") by adding a new set of
Rx registers that can be read using the command:

  $ ethtool -d interface_name

Rx equalization parameters are E810 PHY registers used by the end user
to gather information about configuration and status, to debug link and
connection issues in the field.

Reviewed-by: Przemek Kitszel
Signed-off-by: Mateusz Polchlopek
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 17 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c    | 17 +
 drivers/net/ethernet/intel/ice/ice_ethtool.h    | 17 +
 3 files changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 1f01f3501d6b..1489a8ceec51 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1492,6 +1492,23 @@ struct ice_aqc_dnl_equa_param {
 #define ICE_AQC_RX_EQU_BFLF (0x13 << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_RX_EQU_BFHF (0x14 << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_RX_EQU_DRATE (0x15 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINHF (0x20 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINLF (0x21 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINDC (0x22 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_BW (0x23 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_GAIN (0x30 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_GAIN2 (0x31 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_2 (0x32 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_3 (0x33 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_4 (0x34 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_5 (0x35 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_6 (0x36 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_7 (0x37 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_8 (0x38 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_9 (0x39 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_10 (0x3A << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_11 (0x3B << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_12 (0x3C << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_TX_EQU_PRE1 0x0
 #define ICE_AQC_TX_EQU_PRE3 0x3
 #define ICE_AQC_TX_EQU_ATTEN 0x4
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 19e7a9d93928..3072634bf049 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -711,6 +711,23 @@ static int ice_get_tx_rx_equa(struct ice_hw *hw, u8 serdes_num,
 		{ ICE_AQC_RX_EQU_BFLF, rx, &ptr->rx_equ_bflf },
 		{ ICE_AQC_RX_EQU_BFHF, rx, &ptr->rx_equ_bfhf },
 		{ ICE_AQC_RX_EQU_DRATE, rx, &ptr->rx_equ_drate },
+		{ ICE_AQC_RX_EQU_CTLE_GAINHF, rx, &ptr->rx_equ_ctle_gainhf },
+		{ ICE_AQC_RX_EQU_CTLE_GAINLF, rx, &ptr->rx_equ_ctle_gainlf },
+		{ ICE_AQC_RX_EQU_CTLE_GAINDC, rx, &ptr->rx_equ_ctle_gaindc },
+		{ ICE_AQC_RX_EQU_CTLE_BW, rx, &ptr->rx_equ_ctle_bw },
+		{ ICE_AQC_RX_EQU_DFE_GAIN, rx, &ptr->rx_equ_dfe_gain },
+		{ ICE_AQC_RX_EQU_DFE_GAIN2, rx, &ptr->rx_equ_dfe_gain_2 },
+		{ ICE_AQC_RX_EQU_DFE_2, rx, &ptr->rx_equ_dfe_2 },
+		{ ICE_AQC_RX_EQU_DFE_3, rx, &ptr->rx_equ_dfe_3 },
+		{ ICE_AQC_RX_EQU_DFE_4, rx, &ptr->rx_equ_dfe_4 },
+		{ ICE_AQC_RX_EQU_DFE_5, rx, &ptr->rx_equ_dfe_5 },
+		{ ICE_AQC_RX_EQU_DFE_6, rx, &ptr->rx_equ_dfe_6 },
+		{ ICE_AQC_RX_EQU_DFE_7, rx, &ptr->rx_equ_dfe_7 },
+		{ ICE_AQC_RX_EQU_DFE_8, rx, &ptr->rx_equ_dfe_8 },
+		{ ICE_AQC_RX_EQU_DFE_9, rx, &ptr->rx_equ_dfe_9 },
+		{ ICE_AQC_RX_EQU_DFE_10, rx, &ptr->rx_equ_dfe_10 },
+		{ ICE_AQC_RX_EQU_DFE_11, rx, &ptr->rx_equ_dfe_11 },
+		{ ICE_AQC_RX_EQU_DFE_12, rx, &ptr->rx_equ_dfe_12 },
 	};
 	int err;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.h b/drivers/net/ethernet/intel/ice/ice_ethtool.h
index 98eb9c51d687..8f2ad1c172c0 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.h
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.h
@@ -16,6 +16,23 @@ struct ice_serdes_equalization_to_ethtool {
 	int rx_equ_bflf;
 	int rx_equ_bfhf;
 	int rx_equ_drate;
+	int rx_equ_ctle_gainhf;
+	int rx_equ_ctle_gainlf;
+	int rx_equ_ctle_gaindc;
+	int rx_equ_ctle_bw;
+	int rx_equ_dfe_gain;
+	int rx_equ_dfe_gain_2;
+	int rx_equ_dfe_2;
+	int rx_equ_dfe_3;
+	int rx_equ_dfe_4;
+	int rx_equ_dfe_5;
+	int rx_equ_dfe_6;
+	int rx_equ_dfe_7;
+	int rx_equ_dfe_8;
+	int rx_equ_dfe_9;
+	int rx_equ_dfe_10;
+	int rx_equ_dfe_11;
+	int rx_equ_dfe_12;
 	int tx_equ_pre1;
[Intel-wired-lan] [PATCH iwl-next 0/2] Extend the dump serdes equalizer values feature
Extend the work done in commit 70838938e89c ("ice: Implement driver
functionality to dump serdes equalizer values") by refactoring the
ice_get_tx_rx_equa() function, shortening struct field names, and adding
new Rx registers that can be read using the command:

  $ ethtool -d interface_name

Mateusz Polchlopek (2):
  ice: rework of dump serdes equalizer values feature
  ice: extend dump serdes equalizer values feature

 .../net/ethernet/intel/ice/ice_adminq_cmd.h |  17 +++
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 110 +++---
 drivers/net/ethernet/intel/ice/ice_ethtool.h |  39 +--
 3 files changed, 89 insertions(+), 77 deletions(-)

base-commit: 3b26fcd8529dc446af79c6023a1850260ceeaba7
-- 
2.38.1
Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
"Arthur Fabre" writes: > On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote: >> > Thinking about it more, my only relectance for a registration API is how >> > to communicate the ID back to other consumers (our discussion below). >> > >> >> >> >> > Dynamically registering fields means you have to share the returned ID >> >> > with whoever is interested, which sounds tricky. >> >> > If an XDP program sets a field like packet_id, every tracing >> >> > program that looks at it, and userspace service, would need to know what >> >> > the ID of that field is. >> >> > Is there a way to easily share that ID with all of them? >> >> >> >> Right, so I'll admit this was one of the handwavy bits of my original >> >> proposal, but I don't think it's unsolvable. You could do something like >> >> (once, on application initialisation): >> >> >> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name >> >> for introspection? */ >> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key); >> >> >> >> and then just get the key out of that map from all programs that want to >> >> use it? >> > >> > Passing it out of band works (whether it's via a pinned map like you >> > described, or through other means like a unix socket or some other >> > API), it's just more complicated. >> > >> > Every consumer also needs to know about that API. That won't work with >> > standard tools. For example if we set a PACKET_ID KV, maybe we could >> > give it to pwru so it could track packets using it? >> > Without registering keys, we could pass it as a cli flag. With >> > registration, we'd have to have some helper to get the KV ID. >> > >> > It also introduces ordering dependencies between the services on >> > startup, eg packet tracing hooks could only be attached once our XDP >> > service has registered a PACKET_ID KV, and they could query it's API. >> >> Yeah, we definitely need a way to make that accessible and not too >> cumbersome. >> >> I suppose what we really need is a way to map an application-specific >> identifier to an ID. And, well, those identifiers could just be (string) >> names? That's what we do for CO-RE, after all. So you'd get something >> like: >> >> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */ >> >> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve >> */ >> >> and we make that idempotent, so that two callers using the same name and >> size will just get the same id back; and if called with BPF_NO_CREATE, >> you'll get -ENOENT if it hasn't already been registered by someone else? >> >> We could even make this a sub-command of the bpf() syscall if we want it >> to be UAPI, but that's not strictly necessary, applications can also >> just call the registration from a syscall program at startup... > > That's a nice API, it makes sharing the IDs much easier. > > We still have to worry about collisions (what if two different things > want to add their own "packet_id" field?). But at least: > > * "Any string" has many more possibilities than 0-64 keys. Yes, and it's easy to namespace (by prefixing, like APPNAME_my_metaname). But sure, if everyone uses very generic names that will be a problem. > * bpf_register_metadata() could return an error if a field is already > registered, instead of silently letting an application overwrite > metadata I don't think we can fundamentally solve the collision problem if we also need to allow sharing keys (on purpose). 
I.e., we can't distinguish between "these two applications deliberately want to share the packet_id field" and "these two applications accidentally both picked packet_id as their internal key". I suppose we could pre-define some extra reserved keys if there turns out to be widely used identifiers that many applications want. But I'm not sure if that should be there from the start, it quickly becomes very speculative(packet_id comes to mind as one that might be generally useful, but I'm really not sure, TBH). > (although arguably we could have add a BPF_NOEXIST style flag > to the KV set() to kind of do the same). A global registry will need locking, so not sure it's a good idea to support inserting into it in the fast path? > At least internally, it still feels like we'd maintain a registry of > these string fields and make them configurable for each service to avoid > collisions. Yeah, see above. Some kind of coordination (like a registry) is impossible to avoid if you *want* to share data, but not sure how common that is? -Toke
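None of this API exists in the kernel today; purely as a sketch of the
registration flow under discussion (all names and flags hypothetical):

    /* In the XDP service, once at startup (e.g. from a syscall program): */
    __u64 key = bpf_register_metadata_field("MYAPP_packet_id", 8, BPF_CREATE);

    /* In a tracing tool that wants to read the same field later: */
    __s64 id = bpf_register_metadata_field("MYAPP_packet_id", 8, BPF_NO_CREATE);
    if (id == -ENOENT) {
            /* the XDP service hasn't registered the field yet */
            return -1;
    }

The MYAPP_ prefix is the namespacing convention suggested above; the
idempotent BPF_CREATE variant would also let both sides register without
coordinating startup order.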
Re: [Intel-wired-lan] [net-next v3 2/2] e1000: Link NAPI instances to queues and IRQs
On Mon, Sep 30, 2024 at 05:12:32PM +0000, Joe Damato wrote:
> Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
> information.
>
> After this patch is applied, note the IRQ assigned to my NIC:
>
> $ cat /proc/interrupts | grep enp0s8 | cut -f1 --delimiter=':'
>  18
>
> Note the output from the cli:
>
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>   --dump napi-get --json='{"ifindex": 2}'
> [{'id': 513, 'ifindex': 2, 'irq': 18}]
>
> This device supports only 1 rx and 1 tx queue, so querying that:
>
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>   --dump queue-get --json='{"ifindex": 2}'
> [{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'},
>  {'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'tx'}]
>
> Signed-off-by: Joe Damato

Reviewed-by: Simon Horman
Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi writes:
> > >
> > > >> > We could combine such a registration API with your header format, so
> > > >> > that the registration just becomes a way of allocating one of the keys
> > > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > > >> > This would basically amount to moving the "service config file" into the
> > > >> > kernel, since that seems to be the only common denominator we can rely
> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > >> > for BPF management have shown).
> > > >>
> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > >> registry to enforce that the key has been registered beforehand?
> > > >>
> > > >> >
> > > >> > -Toke
> > > >>
> > > >> Thanks for all the feedback!
> > > >
> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > > each packet, right?
> > >
> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > a global registry for offsets would sidestep this, but have the
> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > the memmove() suggestion will probably lead to the least pain.
> > >
> > > For the HW metadata we could sidestep this by always having a fixed
> > > struct for it (but using the same set/get() API with reserved keys). The
> > > only drawback of doing that is that we statically reserve a bit of
> > > space, but I'm not sure that is such a big issue in practice (at least
> > > not until this becomes so popular that the space starts to be contended;
> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >
> > I am fine with the proposed approach, but I think we need to verify what is the
> > impact on performances (in the worst case??)
>
> If drivers are responsible for populating the hardware metadata before
> XDP, we could make sure drivers set the fields in order to avoid any
> memmove() (and maybe even provide a helper to ensure this?).

nope, since the current APIs introduced by Stanislav are consuming NIC
metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash,
timestamping, ..) into the packet (this is what Toke is proposing, right?).
In this case kfunc calling order makes a difference.
We can even think to add a single kfunc to store all the info for all the NIC
metadata (maybe via a helping struct) but it seems not scalable to me and we
are losing kfunc versatility.

Regards,
Lorenzo

> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > > so suitable for nic hw metadata since:
> > > > - it grows backward
> > > > - it is probably in a different cacheline with respect to xdp_frame
> > > > - nic hw metadata will not start at fixed and immutable address, but it depends
> > > >   on the running ebpf program
> > > >
> > > > What about having something like:
> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
> > > >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> > >
> > > AFAIU, none of this will live in the (current) XDP metadata area. It
> > > will all live just after the xdp_frame struct (so sharing the space with
> > > the metadata area in the sense that adding more metadata kv fields will
> > > decrease the amount of space that is usable by the current XDP metadata
> > > APIs).
> > >
> > > -Toke
> > >
> >
> > ah, ok. I was thinking the proposed approach was to put them in the current
> > metadata field.
>
> I've also been thinking of putting this new KV stuff at the start of the
> headroom (I think that's what you're saying, Toke?). It has a few nice
> advantages:
>
> * It coexists nicely with the current XDP / TC metadata support.
>   Those users won't be able to overwrite / corrupt the KV metadata.
>   KV users won't need to call xdp_adjust_meta() (which would be awkward -
>   how would they know how much space the KV implementation needs).
>
> * We don't have to move all the metadata every time we call
>   xdp_adjust_head() (or the kernel equivalent).
>
> Are there any performance implications of that, e.g. for caching?
>
> This would also grow "upwards" which is more natural, but I think
> either way the KV API would hide whether it's downwards or upwards from
> users.
>
> > Regards,
> > Lorenzo
Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
> > Thinking about it more, my only reluctance for a registration API is how
> > to communicate the ID back to other consumers (our discussion below).
> >
> >>
> >> > Dynamically registering fields means you have to share the returned ID
> >> > with whoever is interested, which sounds tricky.
> >> > If an XDP program sets a field like packet_id, every tracing
> >> > program that looks at it, and userspace service, would need to know what
> >> > the ID of that field is.
> >> > Is there a way to easily share that ID with all of them?
> >>
> >> Right, so I'll admit this was one of the handwavy bits of my original
> >> proposal, but I don't think it's unsolvable. You could do something like
> >> (once, on application initialisation):
> >>
> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name
> >> for introspection? */
> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >>
> >> and then just get the key out of that map from all programs that want to
> >> use it?
> >
> > Passing it out of band works (whether it's via a pinned map like you
> > described, or through other means like a unix socket or some other
> > API), it's just more complicated.
> >
> > Every consumer also needs to know about that API. That won't work with
> > standard tools. For example if we set a PACKET_ID KV, maybe we could
> > give it to pwru so it could track packets using it?
> > Without registering keys, we could pass it as a cli flag. With
> > registration, we'd have to have some helper to get the KV ID.
> >
> > It also introduces ordering dependencies between the services on
> > startup, eg packet tracing hooks could only be attached once our XDP
> > service has registered a PACKET_ID KV, and they could query its API.
>
> Yeah, we definitely need a way to make that accessible and not too
> cumbersome.
>
> I suppose what we really need is a way to map an application-specific
> identifier to an ID. And, well, those identifiers could just be (string)
> names? That's what we do for CO-RE, after all. So you'd get something
> like:
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
>
> and we make that idempotent, so that two callers using the same name and
> size will just get the same id back; and if called with BPF_NO_CREATE,
> you'll get -ENOENT if it hasn't already been registered by someone else?
>
> We could even make this a sub-command of the bpf() syscall if we want it
> to be UAPI, but that's not strictly necessary, applications can also
> just call the registration from a syscall program at startup...

That's a nice API, it makes sharing the IDs much easier.

We still have to worry about collisions (what if two different things
want to add their own "packet_id" field?). But at least:

* "Any string" has many more possibilities than 0-64 keys.
* bpf_register_metadata() could return an error if a field is already
  registered, instead of silently letting an application overwrite
  metadata (although arguably we could add a BPF_NOEXIST style flag
  to the KV set() to kind of do the same).

At least internally, it still feels like we'd maintain a registry of
these string fields and make them configurable for each service to avoid
collisions.

> >> We could combine such a registration API with your header format, so
> >> that the registration just becomes a way of allocating one of the keys
> >> from 0-63 (and the registry just becomes a global copy of the header).
> >> This would basically amount to moving the "service config file" into the
> >> kernel, since that seems to be the only common denominator we can rely
> >> on between BPF applications (as all attempts to write a common daemon
> >> for BPF management have shown).
> >
> > That sounds reasonable. And I guess we'd have set() check the global
> > registry to enforce that the key has been registered beforehand?
>
> Yes, exactly. And maybe check that the size matches as well, just to
> remove the obvious footgun of accidentally stepping on each other's
> toes?
>
> > Thanks for all the feedback!
>
> You're welcome! Thanks for working on this :)
>
> -Toke
Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi writes:
> >
> > >> > We could combine such a registration API with your header format, so
> > >> > that the registration just becomes a way of allocating one of the keys
> > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > >> > This would basically amount to moving the "service config file" into the
> > >> > kernel, since that seems to be the only common denominator we can rely
> > >> > on between BPF applications (as all attempts to write a common daemon
> > >> > for BPF management have shown).
> > >>
> > >> That sounds reasonable. And I guess we'd have set() check the global
> > >> registry to enforce that the key has been registered beforehand?
> > >>
> > >> >
> > >> > -Toke
> > >>
> > >> Thanks for all the feedback!
> > >
> > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > impact on performances (especially for xdp) since, based on the kfunc calls
> > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > each packet, right?
> >
> > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > a global registry for offsets would sidestep this, but have the
> > synchronisation issues we discussed up-thread. So on balance, I think
> > the memmove() suggestion will probably lead to the least pain.
> >
> > For the HW metadata we could sidestep this by always having a fixed
> > struct for it (but using the same set/get() API with reserved keys). The
> > only drawback of doing that is that we statically reserve a bit of
> > space, but I'm not sure that is such a big issue in practice (at least
> > not until this becomes so popular that the space starts to be contended;
> > but surely 256 bytes ought to be enough for everybody, right? :)).
>
> I am fine with the proposed approach, but I think we need to verify what is the
> impact on performances (in the worst case??)

If drivers are responsible for populating the hardware metadata before
XDP, we could make sure drivers set the fields in order to avoid any
memmove() (and maybe even provide a helper to ensure this?).

> > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > so suitable for nic hw metadata since:
> > > - it grows backward
> > > - it is probably in a different cacheline with respect to xdp_frame
> > > - nic hw metadata will not start at fixed and immutable address, but it depends
> > >   on the running ebpf program
> > >
> > > What about having something like:
> > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at the end
> > >   of the metadata area :)). Here we can reuse the same KV approach if it is fast
> > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> >
> > AFAIU, none of this will live in the (current) XDP metadata area. It
> > will all live just after the xdp_frame struct (so sharing the space with
> > the metadata area in the sense that adding more metadata kv fields will
> > decrease the amount of space that is usable by the current XDP metadata
> > APIs).
> >
> > -Toke
> >
>
> ah, ok. I was thinking the proposed approach was to put them in the current
> metadata field.

I've also been thinking of putting this new KV stuff at the start of the
headroom (I think that's what you're saying, Toke?). It has a few nice
advantages:

* It coexists nicely with the current XDP / TC metadata support.
  Those users won't be able to overwrite / corrupt the KV metadata.
  KV users won't need to call xdp_adjust_meta() (which would be awkward -
  how would they know how much space the KV implementation needs).

* We don't have to move all the metadata every time we call
  xdp_adjust_head() (or the kernel equivalent).

Are there any performance implications of that, e.g. for caching?

This would also grow "upwards" which is more natural, but I think
either way the KV API would hide whether it's downwards or upwards from
users.

> Regards,
> Lorenzo
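For orientation, the buffer layout Arthur is describing looks roughly
like this (a sketch of the proposal, not a committed kernel layout):

    +-----------+-------------------+------------------+------+-------------+
    | xdp_frame | KV metadata area  | free headroom    | meta | packet data |
    | struct    | (proposed: fixed  | (consumed by     |      |             |
    |           | start, grows up)  | xdp_adjust_head) |      |             |
    +-----------+-------------------+------------------+------+-------------+

The existing xdp_adjust_meta() area keeps growing downwards from the
packet data, while the KV area sits at a stable offset right after the
xdp_frame struct.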
Re: [Intel-wired-lan] [PATCH iwl-next v10 07/14] iavf: add support for indirect access to PHC time
From: Mateusz Polchlopek
Date: Tue, 1 Oct 2024 09:20:14 +0200

> On 8/21/2024 4:31 PM, Alexander Lobakin wrote:
>> From: Wojciech Drewek
>> Date: Wed, 21 Aug 2024 14:15:32 +0200
>>
>>> From: Jacob Keller
>>>
>>> Implement support for reading the PHC time indirectly via the
>>> VIRTCHNL_OP_1588_PTP_GET_TIME operation.
>>
>> [...]
>>
>>> +/**
>>> + * iavf_queue_ptp_cmd - Queue PTP command for sending over virtchnl
>>> + * @adapter: private adapter structure
>>> + * @cmd: the command structure to send
>>> + *
>>> + * Queue the given command structure into the PTP virtchnl command
>>> + * queue to send to the PF.
>>> + */
>>> +static void iavf_queue_ptp_cmd(struct iavf_adapter *adapter,
>>> +			       struct iavf_ptp_aq_cmd *cmd)
>>> +{
>>> +	mutex_lock(&adapter->ptp.aq_cmd_lock);
>>> +	list_add_tail(&cmd->list, &adapter->ptp.aq_cmds);
>>> +	mutex_unlock(&adapter->ptp.aq_cmd_lock);
>>> +
>>> +	adapter->aq_required |= IAVF_FLAG_AQ_SEND_PTP_CMD;
>>> +	mod_delayed_work(adapter->wq, &adapter->watchdog_task, 0);
>>
>> Are you sure you need delayed_work here? delayed_work is used only when
>> you need to run it after a delay. If the delay is always 0, then you
>> only need work_struct and queue_work().
>
> Sorry for the late response, but I was on a quite long vacation.
>
> Regarding your question - I think it is needed to have
> mod_delayed_work() here, at least as for now. I agree with your
> suggestion to use queue_work(), but this function takes work_struct as
> a parameter, and in our case the adapter->watchdog_task field is of
> struct delayed_work type. It uses the function iavf_watchdog_task(),
> which does plenty of things (including sending the ptp cmd). Changing
> adapter->watchdog_task to be just struct work_struct is not applicable
> here, as in the iavf_main.c file we add it to the queue in a few places
> with different delays.

Aaaah, I'm sorry, I didn't notice that watchdog_task is used in other
places, not only here. Ack, leave it as it is.

> To make a long story short - I agree with your argument, but as long as
> this 0 delay is not causing a performance regression, I will stick with
> this solution implemented by Jake.

Thanks,
Olek
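For reference, the workqueue distinction being discussed, as a minimal
sketch (iavf specifics elided):

    #include <linux/workqueue.h>

    static void my_handler(struct work_struct *work) { /* ... */ }

    /* Plain work item: runs as soon as a worker thread is free. */
    static DECLARE_WORK(my_work, my_handler);
    /* queue_work(system_wq, &my_work); */

    /* Delayed work item: carries a timer, so the same item can be
     * (re)scheduled with different delays from different call sites --
     * which is exactly why iavf keeps watchdog_task as delayed_work. */
    static DECLARE_DELAYED_WORK(my_dwork, my_handler);
    /* mod_delayed_work(system_wq, &my_dwork, 0); -- 0 acts like queue_work() */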
Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound
On 9/25/24 09:57, Yunsheng Lin wrote:
> Networking driver with page_pool support may hand over page
> still with dma mapping to network stack and try to reuse that
> page after network stack is done with it and passes it back
> to page_pool to avoid the penalty of dma mapping/unmapping.
> With all the caching in the network stack, some pages may be
> held in the network stack without returning to the page_pool
> soon enough, and with VF disable causing the driver unbound,
> the page_pool does not stop the driver from doing its
> unbinding work; instead page_pool uses a workqueue to check
> periodically if there are some pages coming back from the
> network stack, and if there are any, it will do the dma
> unmapping related cleanup work.
>
> As mentioned in [1], attempting DMA unmaps after the driver
> has already unbound may leak resources or at worst corrupt
> memory. Fundamentally, the page pool code cannot allow DMA
> mappings to outlive the driver they belong to.
>
> Currently it seems there are at least two cases that the page
> is not released fast enough, causing dma unmapping done after
> the driver has already unbound:
> 1. ipv4 packet defragmentation timeout: this seems to cause
>    delay up to 30 secs.
> 2. skb_defer_free_flush(): this may cause infinite delay if
>    there is no triggering for net_rx_action().
>
> In order not to do the dma unmapping after the driver has already
> unbound and stall the unloading of the networking driver, add
> the pool->items array to record all the pages, including the ones
> which are handed over to the network stack, so the page_pool can
> do the dma unmapping for those pages when page_pool_destroy()
> is called. As the pool->items need to be large enough to avoid
> performance degradation, add an 'item_full' stat to indicate the
> allocation failure due to unavailability of pool->items.

This looks really invasive, with room for potentially large performance
regressions or worse. At the very least it does not look suitable for
net.

Is the problem only tied to VF drivers? It's a pity all the page_pool
users will have to pay a bill for it...

/P
Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT
Lorenzo Bianconi writes: >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote: >> > > Lorenzo Bianconi writes: >> > > >> > > >> > We could combine such a registration API with your header format, so >> > > >> > that the registration just becomes a way of allocating one of the >> > > >> > keys >> > > >> > from 0-63 (and the registry just becomes a global copy of the >> > > >> > header). >> > > >> > This would basically amount to moving the "service config file" >> > > >> > into the >> > > >> > kernel, since that seems to be the only common denominator we can >> > > >> > rely >> > > >> > on between BPF applications (as all attempts to write a common >> > > >> > daemon >> > > >> > for BPF management have shown). >> > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check the global >> > > >> registry to enforce that the key has been registered beforehand? >> > > >> >> > > >> > >> > > >> > -Toke >> > > >> >> > > >> Thanks for all the feedback! >> > > > >> > > > I like this 'fast' KV approach but I guess we should really evaluate >> > > > its >> > > > impact on performances (especially for xdp) since, based on the kfunc >> > > > calls >> > > > order in the ebpf program, we can have one or multiple memmove/memcpy >> > > > for >> > > > each packet, right? >> > > >> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using >> > > a global registry for offsets would sidestep this, but have the >> > > synchronisation issues we discussed up-thread. So on balance, I think >> > > the memmove() suggestion will probably lead to the least pain. >> > > >> > > For the HW metadata we could sidestep this by always having a fixed >> > > struct for it (but using the same set/get() API with reserved keys). The >> > > only drawback of doing that is that we statically reserve a bit of >> > > space, but I'm not sure that is such a big issue in practice (at least >> > > not until this becomes to popular that the space starts to be contended; >> > > but surely 256 bytes ought to be enough for everybody, right? :)). >> > >> > I am fine with the proposed approach, but I think we need to verify what >> > is the >> > impact on performances (in the worst case??) >> >> If drivers are responsible for populating the hardware metadata before >> XDP, we could make sure drivers set the fields in order to avoid any >> memove() (and maybe even provide a helper to ensure this?). > > nope, since the current APIs introduced by Stanislav are consuming NIC > metadata in kfuncs (mainly for af_xdp) and, according to my understanding, > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash, > timestamping, ..) into the packet (this is what Toke is proposing, right?). > In this case kfunc calling order makes a difference. > We can think even to add single kfunc to store all the info for all the NIC > metadata (maybe via a helping struct) but it seems not scalable to me and we > are losing kfunc versatility. Yes, I agree we should have separate kfuncs for each metadata field. Which means it makes a lot of sense to just use the same setter API that we use for the user-registered metadata fields, but using reserved keys. 
So something like:

  #define BPF_METADATA_HW_HASH       BIT(60)
  #define BPF_METADATA_HW_TIMESTAMP  BIT(61)
  #define BPF_METADATA_HW_VLAN       BIT(62)
  #define BPF_METADATA_RESERVED      (0x << 48)

  bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);

As for the internal representation, we can just have the kfunc do
something like:

  int bpf_packet_metadata_set(field_id, value)
  {
          switch (field_id) {
          case BPF_METADATA_HW_HASH:
                  pkt->xdp_hw_meta.hash = value;
                  break;
          [...]
          default:
                  /* do the key packing thing */
          }
  }

that way the order of setting the HW fields doesn't matter, only the
order of the user-defined metadata.

>> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff
>> > > > is not so suitable for nic hw metadata since:
>> > > > - it grows backward
>> > > > - it is probably in a different cacheline with respect to xdp_frame
>> > > > - nic hw metadata will not start at a fixed and immutable address,
>> > > >   but it depends on the running ebpf program
>> > > >
>> > > > What about having something like:
>> > > > - fixed hw nic metadata: just after the xdp_frame struct (or, if you
>> > > >   want, at the end of the metadata area :)). Here we can reuse the
>> > > >   same KV approach if it is fast
>> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
>> > >
>> > > AFAIU, none of this will live in the (current) XDP metadata area. It
>> > > will all live just after the xdp_frame struct (so sharing the space with
>> > > the metadata area in the sense that adding more metadata kv fields will
>> > > decrease the amount of space that is usable by the current XDP metadata
>> > > APIs).
>> > >
>> > > -Toke
>> > >
>> >
>> > ah, ok. I was thinking the proposed app
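To see why the user-defined keys stay ordering-sensitive while the
reserved HW keys above are not, here is a toy model in plain C — invented
names and layout, not any proposed kernel code. Entries in the packed
area are kept sorted by key, so setting a lower key after a higher one
shifts the tail with memmove(), while ascending-order sets only append:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define META_MAX 256	/* "256 bytes ought to be enough for everybody" */

struct entry {
	uint8_t key;	/* 0-63, allocated from the hypothetical registry */
	uint64_t value;
};

struct meta_area {
	uint8_t buf[META_MAX];	/* packed entries, sorted by key */
	int used;
};

/* Insert an entry keeping the area sorted by key; returns bytes moved. */
static int meta_set(struct meta_area *m, uint8_t key, uint64_t value)
{
	struct entry e = { .key = key, .value = value };
	int off = 0;

	/* find the insertion point for this key */
	while (off < m->used && ((struct entry *)(m->buf + off))->key < key)
		off += sizeof(struct entry);

	/* out-of-order set: shift the tail to make room (the per-packet
	 * cost Lorenzo is worried about; zero when keys are set in
	 * ascending order) */
	int tail = m->used - off;
	memmove(m->buf + off + sizeof(e), m->buf + off, tail);
	memcpy(m->buf + off, &e, sizeof(e));
	m->used += sizeof(e);
	return tail;
}

int main(void)
{
	struct meta_area m = { .used = 0 };

	printf("moved %d bytes\n", meta_set(&m, 5, 0xaa));	/* moved 0 */
	printf("moved %d bytes\n", meta_set(&m, 9, 0xbb));	/* moved 0 */
	printf("moved %d bytes\n", meta_set(&m, 1, 0xcc));	/* moved 32 */
	return 0;
}

A reserved key in the fixed struct, by contrast, is a plain field store
at a constant offset, which is why the switch-based kfunc above makes HW
field ordering irrelevant.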
Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound
Hi Paolo,

Thanks for taking the time.

On Tue, 1 Oct 2024 at 16:32, Paolo Abeni wrote:
>
> On 9/25/24 09:57, Yunsheng Lin wrote:
> > Networking driver with page_pool support may hand over page
> > still with dma mapping to network stack and try to reuse that
> > page after network stack is done with it and passes it back
> > to page_pool to avoid the penalty of dma mapping/unmapping.
> > With all the caching in the network stack, some pages may be
> > held in the network stack without returning to the page_pool
> > soon enough, and with VF disable causing the driver unbound,
> > the page_pool does not stop the driver from doing it's
> > unbounding work, instead page_pool uses workqueue to check
> > if there is some pages coming back from the network stack
> > periodically, if there is any, it will do the dma unmmapping
> > related cleanup work.
> >
> > As mentioned in [1], attempting DMA unmaps after the driver
> > has already unbound may leak resources or at worst corrupt
> > memory. Fundamentally, the page pool code cannot allow DMA
> > mappings to outlive the driver they belong to.
> >
> > Currently it seems there are at least two cases that the page
> > is not released fast enough causing dma unmmapping done after
> > driver has already unbound:
> > 1. ipv4 packet defragmentation timeout: this seems to cause
> >    delay up to 30 secs.
> > 2. skb_defer_free_flush(): this may cause infinite delay if
> >    there is no triggering for net_rx_action().
> >
> > In order not to do the dma unmmapping after driver has already
> > unbound and stall the unloading of the networking driver, add
> > the pool->items array to record all the pages including the ones
> > which are handed over to network stack, so the page_pool can
> > do the dma unmmapping for those pages when page_pool_destroy()
> > is called. As the pool->items need to be large enough to avoid
> > performance degradation, add a 'item_full' stat to indicate the
> > allocation failure due to unavailability of pool->items.
>
> This looks really invasive, with room for potentially large performance
> regressions or worse. At very least it does not look suitable for net.

Perhaps, and you are right we need to measure performance before
pulling it but...

>
> Is the problem only tied to VFs drivers? It's a pity all the page_pool
> users will have to pay a bill for it...

It's not. The problem happens when an SKB has been scheduled for
recycling and has already been mapped via page_pool. If the driver
disappears in the meantime, page_pool will free all the packets it
holds in its private rings (both slow and fast), but is not in control
of the SKB anymore. So any packets coming back for recycling *after*
that point cannot unmap memory properly.

As discussed this can either lead to memory corruption and resource
leaking, or worse as seen in the bug report panics. I am fine with
this going into -next, but it really is a bugfix, although I am not
100% sure that the Fixes: tag in the current patch is correct.

Thanks
/Ilias

>
> /P
>
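Restating Ilias's race as a tiny sketch (invented names, user-space
model, no real page_pool API): once destroy has run because the device
went away, a page still out in the stack eventually comes back to a pool
that can no longer unmap it safely.

#include <stdio.h>
#include <stdbool.h>

struct toy_pool {
	bool device_gone;	/* set once the underlying device unbinds */
};

static void toy_pool_destroy(struct toy_pool *p)
{
	/* frees the private rings, but pages held by in-flight SKBs
	 * are out of the pool's reach at this point */
	p->device_gone = true;
}

static void toy_recycle(struct toy_pool *p, unsigned long dma)
{
	if (p->device_gone) {
		/* unmapping now goes through a dead device: a leak,
		 * memory corruption, or the reported IOMMU panic */
		printf("BUG: late unmap of dma %#lx\n", dma);
		return;
	}
	printf("recycled dma %#lx\n", dma);
}

int main(void)
{
	struct toy_pool pool = { .device_gone = false };

	toy_recycle(&pool, 0x1000);	/* fine: device still bound */
	toy_pool_destroy(&pool);	/* device disappears */
	toy_recycle(&pool, 0x2000);	/* the problematic late return */
	return 0;
}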
Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound
On Wed, 2 Oct 2024 at 09:46, Ilias Apalodimas wrote:
>
> Hi Paolo,
>
> Thanks for taking the time.
>
> On Tue, 1 Oct 2024 at 16:32, Paolo Abeni wrote:
> >
> > On 9/25/24 09:57, Yunsheng Lin wrote:
> > > Networking driver with page_pool support may hand over page
> > > still with dma mapping to network stack and try to reuse that
> > > page after network stack is done with it and passes it back
> > > to page_pool to avoid the penalty of dma mapping/unmapping.
> > > With all the caching in the network stack, some pages may be
> > > held in the network stack without returning to the page_pool
> > > soon enough, and with VF disable causing the driver unbound,
> > > the page_pool does not stop the driver from doing it's
> > > unbounding work, instead page_pool uses workqueue to check
> > > if there is some pages coming back from the network stack
> > > periodically, if there is any, it will do the dma unmmapping
> > > related cleanup work.
> > >
> > > As mentioned in [1], attempting DMA unmaps after the driver
> > > has already unbound may leak resources or at worst corrupt
> > > memory. Fundamentally, the page pool code cannot allow DMA
> > > mappings to outlive the driver they belong to.
> > >
> > > Currently it seems there are at least two cases that the page
> > > is not released fast enough causing dma unmmapping done after
> > > driver has already unbound:
> > > 1. ipv4 packet defragmentation timeout: this seems to cause
> > >    delay up to 30 secs.
> > > 2. skb_defer_free_flush(): this may cause infinite delay if
> > >    there is no triggering for net_rx_action().
> > >
> > > In order not to do the dma unmmapping after driver has already
> > > unbound and stall the unloading of the networking driver, add
> > > the pool->items array to record all the pages including the ones
> > > which are handed over to network stack, so the page_pool can
> > > do the dma unmmapping for those pages when page_pool_destroy()
> > > is called. As the pool->items need to be large enough to avoid
> > > performance degradation, add a 'item_full' stat to indicate the
> > > allocation failure due to unavailability of pool->items.
> >
> > This looks really invasive, with room for potentially large performance
> > regressions or worse. At very least it does not look suitable for net.
>
> Perhaps, and you are right we need to measure performance before
> pulling it but...
>
> >
> > Is the problem only tied to VFs drivers? It's a pity all the page_pool
> > users will have to pay a bill for it...
>
> It's not. The problem happens when an SKB has been scheduled for
> recycling and has already been mapped via page_pool. If the driver
> disappears in the meantime,

Apologies, this wasn't correct. It's the device that has to disappear,
not the driver.

> page_pool will free all the packets it
> holds in its private rings (both slow and fast), but is not in control
> of the SKB anymore. So any packets coming back for recycling *after*
> that point cannot unmap memory properly.
>
> As discussed this can either lead to memory corruption and resource
> leaking, or worse as seen in the bug report panics. I am fine with
> this going into -next, but it really is a bugfix, although I am not
> 100% sure that the Fixes: tag in the current patch is correct.
>
> Thanks
> /Ilias
>
> >
> > /P
> >
[Intel-wired-lan] [RFC net-next v4 0/9] Add support for per-NAPI config via netlink
Greetings:

Welcome to RFC v4. Significant changes have been made since RFC v3 [1];
please see the changelog below for details.

A couple of important call-outs for this revision for reviewers:

1. idpf embeds a napi_struct in an internal data structure and includes
   an assertion on the size of napi_struct. The maintainers have stated
   that they think anyone touching napi_struct should update the
   assertion [2], so I've done this in patch 3.

   Even though the assertion has been updated, I've given the cacheline
   placement of napi_struct within idpf's internals no thought or
   consideration.

   Would appreciate other opinions on this; I think idpf should be
   fixed. It seems unreasonable to me that anyone changing the size of a
   struct in the core should need to think about cachelines in idpf.

2. This revision seems to work (see below for a full walkthrough). Is
   this the behavior we want? Am I missing some use case or some
   behavioral thing other folks need?

3. Re a previous point made by Stanislav regarding "taking over a NAPI
   ID" when the channel count changes: mlx5 seems to call napi_disable
   followed by netif_napi_del for the old queues and then calls
   napi_enable for the new ones. In this RFC, the NAPI ID generation is
   deferred to napi_enable. This means we won't end up with two of the
   same NAPI IDs added to the hash at the same time (I am pretty sure).

   Can we assume all drivers will napi_disable the old queues before
   napi_enable the new ones? If yes, we might not need to worry about a
   NAPI ID takeover function.

4. I made the decision to remove the WARN_ON_ONCE that (I think?) Jakub
   previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs != rxqs);)
   because this was triggering on every kernel boot with my mlx5 NIC.

5. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
   thinking this is a bit strange. I think it's strange that we might be
   short some number of NAPI configs, but it seems like most people are
   in favor of this approach, so I've left it.

I'd appreciate thoughts from reviewers on the above items, if at all
possible. Once those are addressed, modulo any feedback on the
code/whitespace wrapping I still need to do, I think this is close to an
official submission.

Now, on to the implementation:

This implementation allocates an array of "struct napi_config" in
net_device, and each NAPI instance is assigned an index into the config
array. Per-NAPI settings like:

  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

are persisted in napi_config and restored on napi_disable/napi_enable,
respectively (see the sketch after this list).

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:

  - mlx5, which is the basis of the examples below
  - mlx4, which has TX-only NAPIs, just to highlight that case. I have
    only compile tested this patch; I don't have this hardware.
  - bnxt, which I have only compile tested. I don't have this hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)

This iteration seems to persist NAPI IDs and settings even when resizing
queues, see below, so I think maybe this is getting close to where we
want to land in terms of functionality.
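For reviewers skimming, a condensed user-space model of the persistence
scheme just described — the member and helper names (config pointer,
toy_napi_save/restore) are stand-ins for illustration, not the actual
patch:

#include <stdio.h>

struct napi_config {
	unsigned long	gro_flush_timeout;
	unsigned int	defer_hard_irqs;
	unsigned int	napi_id;
};

struct toy_napi {
	/* live values used on the hot path */
	unsigned long	gro_flush_timeout;
	unsigned int	defer_hard_irqs;
	unsigned int	napi_id;
	/* slot in the net_device-owned array, assigned once; the array
	 * is sized from max(txqs, rxqs) at alloc_netdev_mqs() time */
	struct napi_config *config;
};

/* napi_disable() path: park the settings in the persistent slot */
static void toy_napi_save(struct toy_napi *n)
{
	n->config->gro_flush_timeout = n->gro_flush_timeout;
	n->config->defer_hard_irqs = n->defer_hard_irqs;
	n->config->napi_id = n->napi_id;
}

/* napi_enable() path: restore instead of starting from scratch, so a
 * queue resize keeps both the settings and the NAPI ID stable */
static void toy_napi_restore(struct toy_napi *n)
{
	n->gro_flush_timeout = n->config->gro_flush_timeout;
	n->defer_hard_irqs = n->config->defer_hard_irqs;
	n->napi_id = n->config->napi_id;
}

int main(void)
{
	struct napi_config slot = { .gro_flush_timeout = 1,
				    .defer_hard_irqs = 111,
				    .napi_id = 345 };
	struct toy_napi napi = { .config = &slot };

	(void)toy_napi_save;
	toy_napi_restore(&napi);	/* driver re-enables after a resize */
	printf("napi %u: timeout=%lu defer=%u\n",
	       napi.napi_id, napi.gro_flush_timeout, napi.defer_hard_irqs);
	return 0;
}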
Here's how it works when I test it on my system:

# start with 2 queues
$ ethtool -l eth4 | grep Combined | tail -1
Combined:       2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
        --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527},
 {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 2 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output the current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
        --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100, 'gro-flush-timeout': 2, 'id': 345, 'ifindex': 7, 'irq': 527},
 {'defer-hard-irqs': 100, 'gro-flush-timeout': 2, 'id': 344, 'ifindex': 7, 'irq': 327}]

Now set NAPI ID 345, via its NAPI ID, to specific values:

$ sudo ./tools/net/ynl/cli.py \
        --spec Documentation/netlink/specs/netdev.yaml \
        --do napi-set \
        --json='{"id": 345, "defer-hard-irqs": 111, "gro-flush-timeout": 1}'
None

Now output the current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
        --dump napi-get --json='{"ifindex"
[Intel-wired-lan] [RFC net-next v4 3/9] net: napi: Make gro_flush_timeout per-NAPI
Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and to the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.

Note that idpf embeds napi_struct in its internals and has established a
series of asserts that involve the size of the napi structure. Since
this change increases the napi_struct size from 400 to 416 (according to
pahole on my system), I've increased the assertion in idpf by 16 bytes.
No attention whatsoever was paid to the cacheline placement of idpf
internals as a result of this change.

Signed-off-by: Joe Damato
---
 .../networking/net_cachelines/net_device.rst |  2 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h  |  2 +-
 include/linux/netdevice.h                    |  3 +-
 net/core/dev.c                               | 12 +++---
 net/core/dev.h                               | 40 +++
 net/core/net-sysfs.c                         |  2 +-
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index eeeb7c925ec5..3d02ae79c850 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -98,7 +98,6 @@ struct_netdev_queue*        _rx                    read_mostly
 unsigned_int                num_rx_queues
 unsigned_int                real_num_rx_queues     -              read_mostly        get_rps_cpu
 struct_bpf_prog*            xdp_prog               -              read_mostly        netif_elide_gro()
-unsigned_long               gro_flush_timeout      -              read_mostly        napi_complete_done
 unsigned_int                gro_max_size           -              read_mostly        skb_gro_receive
 unsigned_int                gro_ipv4_max_size      -              read_mostly        skb_gro_receive
 rx_handler_func_t*          rx_handler             read_mostly    -                  __netif_receive_skb_core
@@ -182,4 +181,5 @@ struct_devlink_port*        devlink_port
 struct_dpll_pin*            dpll_pin
 struct hlist_head           page_pools
 struct dim_irq_moder*       irq_moder
+unsigned_long               gro_flush_timeout
 u32                         napi_defer_hard_irqs

diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index f0537826f840..fcdf73486d46 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -438,7 +438,7 @@ struct idpf_q_vector {
 	__cacheline_group_end_aligned(cold);
 };
 libeth_cacheline_set_assert(struct idpf_q_vector, 112,
-			    424 + 2 * sizeof(struct dim),
+			    440 + 2 * sizeof(struct dim),
 			    8 + sizeof(cpumask_var_t));
 
 struct idpf_rx_queue_stats {

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55764efc5c93..33897edd16c8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
 	struct list_head	dev_list;
 	struct hlist_node	napi_hash_node;
 	int			irq;
+	unsigned long		gro_flush_timeout;
 	u32			defer_hard_irqs;
 };
 
@@ -2075,7 +2076,6 @@ struct net_device {
 	int			ifindex;
 	unsigned int		real_num_rx_queues;
 	struct netdev_rx_queue	*_rx;
-	unsigned long		gro_flush_timeout;
 	unsigned int		gro_max_size;
 	unsigned int		gro_ipv4_max_size;
 	rx_handler_func_t __rcu	*rx_handler;
 
@@ -2398,6 +2398,7 @@ struct net_device {
 	/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB).
 	 */
 	struct dim_irq_moder	*irq_moder;
+	unsigned long		gro_flush_timeout;
 	u32			napi_defer_hard_irqs;
 
 	u8			priv[] ____cacheline_aligned

diff --git a/net/core/dev.c b/net/core/dev.c
index 748739958d2a..056ed44f766f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6226,12 +6226,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
 	if (work_done) {
 		if (n->gro_bitmask)
-			timeout = READ_ONCE(n->dev->gro_flush_timeout);
+			timeout = napi_get_gro_flush_timeout(n);
 		n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n
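For context, a plausible shape for the net/core/dev.h accessors that the
dev.c hunk above calls, plus the sysfs fan-out the commit message
describes. This is a sketch under those assumptions, not the actual
patch — only napi_get_gro_flush_timeout() is visible in the diff, and
netdev_set_gro_flush_timeout() is an invented name for the write path:

/* Sketch only; assumes kernel context. napi_struct gained a
 * gro_flush_timeout member in the hunk above, read/written locklessly,
 * hence the READ_ONCE/WRITE_ONCE pairing. */
#include <linux/netdevice.h>

static inline unsigned long
napi_get_gro_flush_timeout(const struct napi_struct *n)
{
	return READ_ONCE(n->gro_flush_timeout);
}

static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
					      unsigned long timeout)
{
	WRITE_ONCE(n->gro_flush_timeout, timeout);
}

/* sysfs write path: update the net_device field (which sysfs reads
 * return) and fan the value out to every NAPI on the device */
static inline void netdev_set_gro_flush_timeout(struct net_device *dev,
						unsigned long timeout)
{
	struct napi_struct *napi;

	WRITE_ONCE(dev->gro_flush_timeout, timeout);
	list_for_each_entry(napi, &dev->napi_list, dev_list)
		napi_set_gro_flush_timeout(napi, timeout);
}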