Re: [Intel-wired-lan] [RFC net-next 1/1] idpf: Don't hard code napi_struct size

2024-10-01 Thread Alexander Lobakin
From: Joe Damato 
Date: Mon, 30 Sep 2024 15:17:46 -0700

> On Mon, Sep 30, 2024 at 03:10:41PM +0200, Przemek Kitszel wrote:
>> On 9/30/24 14:38, Alexander Lobakin wrote:
>>> From: Alexander Lobakin 
>>> Date: Mon, 30 Sep 2024 14:33:45 +0200
>>>
 From: Joe Damato 
 Date: Wed, 25 Sep 2024 18:00:17 +
>>
>>> struct napi_struct doesn't have any such fields and doesn't depend on
>>> the kernel configuration, that's why it's hardcoded.
>>> Please don't change that, just adjust the hardcoded values when needed.
>>
>> This is the crucial point, and I agree with Olek.
>>
>> If you will find it more readable/future proof, feel free to add
>> comments like /* napi_struct */ near their "400" part in the hardcode.
>>
>> Side note: you could just run this as a part of your netdev series,
>> given you will properly CC.
> 
> I've already sent the official patch because I didn't hear back on
> this RFC.
> 
> Sorry, but I respectfully disagree with you both on this; I don't
> think it makes sense to have code that will break if fields are
> added to napi_struct thereby requiring anyone who works on the core
> to update this code over and over again.
> 
> I understand that the sizeofs are "meaningless" because of your
> desire to optimize cachelines, but IMHO and, again, respectfully, it
> seems strange that any adjustments to core should require a change
> to this code.

But if you change any core API, say by renaming a field used in several
drivers, you need to adjust the affected drivers anyway.
It's a common practice that some core changes require touching drivers.
Moreover, sizeof(struct napi_struct) doesn't get changed often, so I
don't see any issue in adjusting one line in idpf by just increasing one
value there by sizeof(your_new_field).

If you do that, then:
+ you get notified that you may affect the performance of different
  drivers (napi_struct is usually embedded into perf-critical
  structures in drivers);
+ I get notified (Cced) that someone's change will affect idpf, so I'll
  be aware of it and review the change;
- you need to adjust one line in idpf.

Is it just me, or do these '+'s easily outweigh that sole '-'?

> 
> I really do not want to include a patch to change the size of
> napi_struct in idpf as part of my RFC which is totally unrelated to
> idpf and will detract from the review of my core changes.

One line won't distract anyone. The kernel tree contains, for example, a
patch of mine where I needed to modify around 20 drivers within one
commit due to a core code structure change; the number of lines I changed
in the drivers was way bigger than the number I changed in the core. And
there's a ton of such commits in there. Again, it's common practice.

> 
> Perhaps my change is unacceptable, but there should be a way to deal
> with this that doesn't require everyone working on core networking
> code to update idpf, right?

We can't isolate the core code from the drivers to the point that you
wouldn't need to touch the drivers at all when working on core changes;
we're not Windows. The drivers and the core live in one tree precisely
so that such changes can be made easily, and no code inside the whole
tree is ABI (excl uAPI).

Thanks,
Olek


[Intel-wired-lan] [PATCH iwl-next v1 1/1] igc: remove autoneg parameter from igc_mac_info

2024-10-01 Thread Vitaly Lifshits
Since the igc driver doesn't support forced speed configuration and
its current related hardware doesn't support it either, there is no
use for the mac.autoneg parameter. Moreover, in one case its usage
might result in a NULL pointer dereference due to an uninitialized
function pointer, phy.ops.force_speed_duplex.

Therefore, remove this parameter from the igc code.

Signed-off-by: Vitaly Lifshits 
---
 drivers/net/ethernet/intel/igc/igc_diag.c|   3 +-
 drivers/net/ethernet/intel/igc/igc_ethtool.c |  13 +-
 drivers/net/ethernet/intel/igc/igc_hw.h  |   1 -
 drivers/net/ethernet/intel/igc/igc_mac.c | 316 +--
 drivers/net/ethernet/intel/igc/igc_main.c|   1 -
 drivers/net/ethernet/intel/igc/igc_phy.c |  24 +-
 6 files changed, 163 insertions(+), 195 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_diag.c 
b/drivers/net/ethernet/intel/igc/igc_diag.c
index cc621970c0cd..a43d7244ee70 100644
--- a/drivers/net/ethernet/intel/igc/igc_diag.c
+++ b/drivers/net/ethernet/intel/igc/igc_diag.c
@@ -173,8 +173,7 @@ bool igc_link_test(struct igc_adapter *adapter, u64 *data)
*data = 0;
 
/* add delay to give enough time for autonegotioation to finish */
-   if (adapter->hw.mac.autoneg)
-   ssleep(5);
+   ssleep(5);
 
link_up = igc_has_link(adapter);
if (!link_up) {
diff --git a/drivers/net/ethernet/intel/igc/igc_ethtool.c 
b/drivers/net/ethernet/intel/igc/igc_ethtool.c
index 5b0c6f433767..817838677817 100644
--- a/drivers/net/ethernet/intel/igc/igc_ethtool.c
+++ b/drivers/net/ethernet/intel/igc/igc_ethtool.c
@@ -1821,11 +1821,8 @@ static int igc_ethtool_get_link_ksettings(struct 
net_device *netdev,
ethtool_link_ksettings_add_link_mode(cmd, advertising, 
2500baseT_Full);
 
/* set autoneg settings */
-   if (hw->mac.autoneg == 1) {
-   ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
-   ethtool_link_ksettings_add_link_mode(cmd, advertising,
-Autoneg);
-   }
+   ethtool_link_ksettings_add_link_mode(cmd, supported, Autoneg);
+   ethtool_link_ksettings_add_link_mode(cmd, advertising, Autoneg);
 
/* Set pause flow control settings */
ethtool_link_ksettings_add_link_mode(cmd, supported, Pause);
@@ -1878,10 +1875,7 @@ static int igc_ethtool_get_link_ksettings(struct 
net_device *netdev,
cmd->base.duplex = DUPLEX_UNKNOWN;
}
cmd->base.speed = speed;
-   if (hw->mac.autoneg)
-   cmd->base.autoneg = AUTONEG_ENABLE;
-   else
-   cmd->base.autoneg = AUTONEG_DISABLE;
+   cmd->base.autoneg = AUTONEG_ENABLE;
 
/* MDI-X => 2; MDI =>1; Invalid =>0 */
if (hw->phy.media_type == igc_media_type_copper)
@@ -1955,7 +1949,6 @@ igc_ethtool_set_link_ksettings(struct net_device *netdev,
advertised |= ADVERTISE_10_HALF;
 
if (cmd->base.autoneg == AUTONEG_ENABLE) {
-   hw->mac.autoneg = 1;
hw->phy.autoneg_advertised = advertised;
if (adapter->fc_autoneg)
hw->fc.requested_mode = igc_fc_default;
diff --git a/drivers/net/ethernet/intel/igc/igc_hw.h 
b/drivers/net/ethernet/intel/igc/igc_hw.h
index e1c572e0d4ef..d9d1a1a11daf 100644
--- a/drivers/net/ethernet/intel/igc/igc_hw.h
+++ b/drivers/net/ethernet/intel/igc/igc_hw.h
@@ -92,7 +92,6 @@ struct igc_mac_info {
bool asf_firmware_present;
bool arc_subsystem_valid;
 
-   bool autoneg;
bool autoneg_failed;
bool get_link_status;
 };
diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c 
b/drivers/net/ethernet/intel/igc/igc_mac.c
index a5c4b19d71a2..d344e0a1cd5e 100644
--- a/drivers/net/ethernet/intel/igc/igc_mac.c
+++ b/drivers/net/ethernet/intel/igc/igc_mac.c
@@ -386,14 +386,6 @@ s32 igc_check_for_copper_link(struct igc_hw *hw)
 */
igc_check_downshift(hw);
 
-   /* If we are forcing speed/duplex, then we simply return since
-* we have already determined whether we have link or not.
-*/
-   if (!mac->autoneg) {
-   ret_val = -IGC_ERR_CONFIG;
-   goto out;
-   }
-
/* Auto-Neg is enabled.  Auto Speed Detection takes care
 * of MAC speed/duplex configuration.  So we only need to
 * configure Collision Distance in the MAC.
@@ -468,173 +460,171 @@ s32 igc_config_fc_after_link_up(struct igc_hw *hw)
goto out;
}
 
-   /* Check for the case where we have copper media and auto-neg is
-* enabled.  In this case, we need to check and see if Auto-Neg
-* has completed, and if so, how the PHY and link partner has
-* flow control configured.
+   /* In auto-neg, we need to check and see if Auto-Neg has completed,
+* and if so, how the PHY and link partner has flow control
+* configured.
 */
-   if (mac->auto

[Intel-wired-lan] [PATCH iwl-net v1 1/1] e1000e: Remove Meteor Lake SMBUS workarounds

2024-10-01 Thread Vitaly Lifshits
This is a partial revert of commit 76a0a3f9cc2f ("e1000e: fix force smbus
during suspend flow"). That commit fixed a sporadic PHY access issue but
introduced a regression in runtime suspend flows.
The original issue on Meteor Lake systems was rare in terms of
reproduction rate and the number of systems affected.

After the integration of commit 0a6ad4d9e169 ("e1000e: avoid failing the
system during pm_suspend"), PHY access loss can no longer cause a
system-level suspend failure. The issue only occurs when the LAN cable is
disconnected, and it is recovered during the system resume flow.
Therefore, its functional impact is low, and priority is given to
stabilizing runtime suspend.

Fixes: 76a0a3f9cc2f ("e1000e: fix force smbus during suspend flow")
Signed-off-by: Vitaly Lifshits 
---
 drivers/net/ethernet/intel/e1000e/ich8lan.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c 
b/drivers/net/ethernet/intel/e1000e/ich8lan.c
index ce227b56cf72..2f9655cf5dd9 100644
--- a/drivers/net/ethernet/intel/e1000e/ich8lan.c
+++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c
@@ -1205,12 +1205,10 @@ s32 e1000_enable_ulp_lpt_lp(struct e1000_hw *hw, bool 
to_sx)
if (ret_val)
goto out;
 
-   if (hw->mac.type != e1000_pch_mtp) {
-   ret_val = e1000e_force_smbus(hw);
-   if (ret_val) {
-   e_dbg("Failed to force SMBUS: %d\n", ret_val);
-   goto release;
-   }
+   ret_val = e1000e_force_smbus(hw);
+   if (ret_val) {
+   e_dbg("Failed to force SMBUS: %d\n", ret_val);
+   goto release;
}
 
/* Si workaround for ULP entry flow on i127/rev6 h/w.  Enable
@@ -1273,13 +1271,6 @@ s32 e1000_enable_ulp_lpt_lp(struct e1000_hw *hw, bool 
to_sx)
}
 
 release:
-   if (hw->mac.type == e1000_pch_mtp) {
-   ret_val = e1000e_force_smbus(hw);
-   if (ret_val)
-   e_dbg("Failed to force SMBUS over MTL system: %d\n",
- ret_val);
-   }
-
hw->phy.ops.release(hw);
 out:
if (ret_val)
-- 
2.34.1



Re: [Intel-wired-lan] [net-next v3 1/2] e1000e: Link NAPI instances to queues and IRQs

2024-10-01 Thread Lifshits, Vitaly




On 10/1/2024 1:50 PM, Simon Horman wrote:

On Mon, Sep 30, 2024 at 05:12:31PM +, Joe Damato wrote:

Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
information.

After this patch is applied, note the IRQs assigned to my NIC:

$ cat /proc/interrupts | grep ens | cut -f1 --delimiter=':'
  50
  51
  52

While e1000e allocates 3 IRQs (RX, TX, and other), it looks like e1000e
only has a single NAPI, so I've associated the NAPI with the RX IRQ (50
on my system, seen above).

Note the output from the cli:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump napi-get --json='{"ifindex": 2}'
[{'id': 145, 'ifindex': 2, 'irq': 50}]

This device supports only 1 rx and 1 tx queue, so querying that:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
--dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'rx'},
  {'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'tx'}]

Signed-off-by: Joe Damato 


Reviewed-by: Simon Horman 



Acked-by: Vitaly Lifshits 


Re: [Intel-wired-lan] [RFC net-next 1/1] idpf: Don't hard code napi_struct size

2024-10-01 Thread Joe Damato
On Tue, Oct 01, 2024 at 03:14:07PM +0200, Alexander Lobakin wrote:
> From: Joe Damato 
> Date: Mon, 30 Sep 2024 15:17:46 -0700
> 
> > On Mon, Sep 30, 2024 at 03:10:41PM +0200, Przemek Kitszel wrote:
> >> On 9/30/24 14:38, Alexander Lobakin wrote:
> >>> From: Alexander Lobakin 
> >>> Date: Mon, 30 Sep 2024 14:33:45 +0200
> >>>
>  From: Joe Damato 
>  Date: Wed, 25 Sep 2024 18:00:17 +
> >>
> >>> struct napi_struct doesn't have any such fields and doesn't depend on
> >>> the kernel configuration, that's why it's hardcoded.
> >>> Please don't change that, just adjust the hardcoded values when needed.
> >>
> >> This is the crucial point, and I agree with Olek.
> >>
> >> If you will find it more readable/future proof, feel free to add
> >> comments like /* napi_struct */ near their "400" part in the hardcode.
> >>
> >> Side note: you could just run this as a part of your netdev series,
> >> given you will properly CC.
> > 
> > I've already sent the official patch because I didn't hear back on
> > this RFC.
> > 
> > Sorry, but I respectfully disagree with you both on this; I don't
> > think it makes sense to have code that will break if fields are
> > added to napi_struct thereby requiring anyone who works on the core
> > to update this code over and over again.
> > 
> > I understand that the sizeofs are "meaningless" because of your
> > desire to optimize cachelines, but IMHO and, again, respectfully, it
> > seems strange that any adjustments to core should require a change
> > to this code.
> 
> But if you change any core API, say by renaming a field used in several
> drivers, you need to adjust the affected drivers anyway.

Sorry, but that's a totally different argument.

There are obvious cases where touching certain parts of core would
require changes to drivers, yes. I agree on that if I change an API
or a struct field name, or remove an enum, then this affects drivers
which must be updated.

idpf does not fall in this category as it relies on the *size* of
the structure, not the field names. Adding a new field wouldn't
break any of your existing code accessing fields in the struct since
I haven't removed a field.

Adding a new field may adjust the size. According to your previous
email: idpf cares about the size because it wants the cachelines to
look a certain way in pahole. OK, but why is that the concern of
someone working on core code?

It doesn't make sense to me that if I add a new field, I now need to
look at pahole output for idpf to make sure you will approve of the
cacheline placement.

> It's a common practice that some core changes require touching drivers.
> Moreover, sizeof(struct napi_struct) doesn't get changed often, so I
> don't see any issue in adjusting one line in idpf by just increasing one
> value there by sizeof(your_new_field).

The problem is: what if everyone starts doing this? Trying to rely
on the specific size of some core data structure in their driver
code for cacheline placement?

Do I then need to sift through each driver with pahole and adjust
the cacheline placement of each affected structure because I added a
field to napi_struct?

The burden of optimizing cache line placement should fall on the
driver maintainers, not the person adding the field to napi_struct.

It would be different if I deleted a field or renamed a field. In
those cases: sure that's my issue to deal with changing the affected
drivers. Just like it would be if I removed an enum value.

> If you do that, then:
> + you get notified that you may affect the performance of different
>   drivers (napi_struct is usually embedded into perf-critical
>   structures in drivers);

But I don't want to think about idpf; it's not my responsibility at
all to ensure that the cacheline placement in idpf is optimal.

> + I get notified (Cced) that someone's change will affect idpf, so I'll
>   be aware of it and review the change;
> - you need to adjust one line in idpf.

Adjust one line in idpf and then go through another revision if the
maintainers don't like what the cachelines look like in pahole?

And I need to do this for something totally unrelated that idpf
won't even support (because I'm not adding support for it in the
RFC) ?

> Is it just me, or do these '+'s easily outweigh that sole '-'?

I disagree even more than I disagreed yesterday.


Re: [Intel-wired-lan] [net-next v3 1/2] e1000e: Link NAPI instances to queues and IRQs

2024-10-01 Thread Simon Horman
On Mon, Sep 30, 2024 at 05:12:31PM +, Joe Damato wrote:
> Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
> information.
> 
> After this patch is applied, note the IRQs assigned to my NIC:
> 
> $ cat /proc/interrupts | grep ens | cut -f1 --delimiter=':'
>  50
>  51
>  52
> 
> While e1000e allocates 3 IRQs (RX, TX, and other), it looks like e1000e
> only has a single NAPI, so I've associated the NAPI with the RX IRQ (50
> on my system, seen above).
> 
> Note the output from the cli:
> 
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>--dump napi-get --json='{"ifindex": 2}'
> [{'id': 145, 'ifindex': 2, 'irq': 50}]
> 
> This device supports only 1 rx and 1 tx queue, so querying that:
> 
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>--dump queue-get --json='{"ifindex": 2}'
> [{'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'rx'},
>  {'id': 0, 'ifindex': 2, 'napi-id': 145, 'type': 'tx'}]
> 
> Signed-off-by: Joe Damato 

Reviewed-by: Simon Horman 



[Intel-wired-lan] [tnguy-net-queue:100GbE] BUILD SUCCESS 09d0fb5cb30ebcaed4a33028ae383f5a1463e2b2

2024-10-01 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue.git 100GbE
branch HEAD: 09d0fb5cb30ebcaed4a33028ae383f5a1463e2b2  idpf: deinit virtchnl 
transaction manager after vport and vectors

elapsed time: 829m

configs tested: 94
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alpha       allnoconfig                 gcc-14.1.0
alpha       allyesconfig                clang-20
alpha       defconfig                   gcc-14.1.0
arc         allmodconfig                clang-20
arc         allnoconfig                 gcc-14.1.0
arc         allyesconfig                clang-20
arc         axs103_defconfig            gcc-14.1.0
arc         defconfig                   gcc-14.1.0
arm         allmodconfig                clang-20
arm         allnoconfig                 gcc-14.1.0
arm         allyesconfig                clang-20
arm         collie_defconfig            gcc-14.1.0
arm         davinci_all_defconfig       gcc-14.1.0
arm         defconfig                   gcc-14.1.0
arm         lpc18xx_defconfig           gcc-14.1.0
arm         versatile_defconfig         gcc-14.1.0
arm64       allmodconfig                clang-20
arm64       allnoconfig                 gcc-14.1.0
arm64       defconfig                   gcc-14.1.0
csky        allnoconfig                 gcc-14.1.0
csky        defconfig                   gcc-14.1.0
hexagon     allmodconfig                clang-20
hexagon     allnoconfig                 gcc-14.1.0
hexagon     allyesconfig                clang-20
hexagon     defconfig                   gcc-14.1.0
i386        allmodconfig                clang-18
i386        allnoconfig                 clang-18
i386        allyesconfig                clang-18
i386        defconfig                   clang-18
loongarch   allmodconfig                gcc-14.1.0
loongarch   allnoconfig                 gcc-14.1.0
loongarch   defconfig                   gcc-14.1.0
m68k        allmodconfig                gcc-14.1.0
m68k        allnoconfig                 gcc-14.1.0
m68k        allyesconfig                gcc-14.1.0
m68k        defconfig                   gcc-14.1.0
microblaze  allmodconfig                gcc-14.1.0
microblaze  allnoconfig                 gcc-14.1.0
microblaze  allyesconfig                gcc-14.1.0
microblaze  defconfig                   gcc-14.1.0
mips        allnoconfig                 gcc-14.1.0
mips        db1xxx_defconfig            gcc-14.1.0
mips        rs90_defconfig              gcc-14.1.0
nios2       allnoconfig                 gcc-14.1.0
nios2       defconfig                   gcc-14.1.0
openrisc    allnoconfig                 clang-20
openrisc    allyesconfig                gcc-14.1.0
openrisc    defconfig                   gcc-12
openrisc    or1klitex_defconfig         gcc-14.1.0
parisc      allmodconfig                gcc-14.1.0
parisc      allnoconfig                 clang-20
parisc      allyesconfig                gcc-14.1.0
parisc      defconfig                   gcc-12
parisc64    defconfig                   gcc-14.1.0
powerpc     allmodconfig                gcc-14.1.0
powerpc     allnoconfig                 clang-20
powerpc     allyesconfig                gcc-14.1.0
powerpc     canyonlands_defconfig       gcc-14.1.0
powerpc     cell_defconfig              gcc-14.1.0
powerpc     fsp2_defconfig              gcc-14.1.0
powerpc     mpc834x_itxgp_defconfig     gcc-14.1.0
powerpc     wii_defconfig               gcc-14.1.0
riscv       allmodconfig                gcc-14.1.0
riscv       allnoconfig                 clang-20
riscv       allyesconfig                gcc-14.1.0
riscv       defconfig                   gcc-12
s390        allmodconfig                clang-20
s390        allmodconfig                gcc-14.1.0
s390        allnoconfig                 clang-20
s390        allyesconfig                gcc-14.1.0
s390        defconfig                   gcc-12
sh          alldefconfig                gcc-14.1.0
sh          allmodconfig                gcc-14.1.0
sh          allnoconfig                 gcc-14.1.0
sh          allyesconfig                gcc-14.1.0
sh          defconfig                   gcc-12
sh          ecovec24-romimage_defconfig gcc-14.1.0
sh          titan_defconfig             gcc-14.1.0
sh          ul2_defconfig               gcc-14.1.0
sparc       allmodconfig                gcc-14.1.0
sparc64     defconfig                   gcc-12
um          allmodconfig                clang-20
um          allnoconfig                 clang-20
um          allyesconfig                clang-20
um          defconfig                   gcc-12
um          i386_defconfig              gcc-12
um          x86_64_defconfig

Re: [Intel-wired-lan] [PATCH iwl-next v10 07/14] iavf: add support for indirect access to PHC time

2024-10-01 Thread Mateusz Polchlopek




On 8/21/2024 4:31 PM, Alexander Lobakin wrote:

From: Wojciech Drewek 
Date: Wed, 21 Aug 2024 14:15:32 +0200


From: Jacob Keller 

Implement support for reading the PHC time indirectly via the
VIRTCHNL_OP_1588_PTP_GET_TIME operation.


[...]


+/**
+ * iavf_queue_ptp_cmd - Queue PTP command for sending over virtchnl
+ * @adapter: private adapter structure
+ * @cmd: the command structure to send
+ *
+ * Queue the given command structure into the PTP virtchnl command queue to
+ * send to the PF.
+ */
+static void iavf_queue_ptp_cmd(struct iavf_adapter *adapter,
+  struct iavf_ptp_aq_cmd *cmd)
+{
+   mutex_lock(&adapter->ptp.aq_cmd_lock);
+   list_add_tail(&cmd->list, &adapter->ptp.aq_cmds);
+   mutex_unlock(&adapter->ptp.aq_cmd_lock);
+
+   adapter->aq_required |= IAVF_FLAG_AQ_SEND_PTP_CMD;
+   mod_delayed_work(adapter->wq, &adapter->watchdog_task, 0);


Are you sure you need delayed_work here? delayed_work is used only when
you need to run it after a delay. If the delay is always 0, then you
only need work_struct and queue_work().



Sorry for the late response, I was on quite a long vacation.

Regarding your question: I think mod_delayed_work() is needed here, at
least for now. I agree with your suggestion to use queue_work(), but
that function takes a work_struct as its parameter, and in our case the
adapter->watchdog_task field is of type struct delayed_work. It runs
iavf_watchdog_task(), which does plenty of things (including sending
the PTP command). Changing adapter->watchdog_task to a plain
struct work_struct is not applicable here, as in several places in
iavf_main.c we queue it with different delays.

To make a long story short: I agree with your argument, but as long as
the 0 delay does not cause a performance regression, I will stick with
the solution implemented by Jake.


+}
+
+/**
+ * iavf_send_phc_read - Send request to read PHC time


[...]


+static int iavf_ptp_gettimex64(struct ptp_clock_info *info,
+  struct timespec64 *ts,
+  struct ptp_system_timestamp *sts)
+{
+   struct iavf_adapter *adapter = iavf_clock_to_adapter(info);
+
+   if (!adapter->ptp.initialized)
+   return -ENODEV;


Why is it -ENODEV here but -EOPNOTSUPP several functions above? Are you
sure these codes are the ones expected by the upper layers?


+
+   return iavf_read_phc_indirect(adapter, ts, sts);
+}


[...]


diff --git a/drivers/net/ethernet/intel/iavf/iavf_ptp.h 
b/drivers/net/ethernet/intel/iavf/iavf_ptp.h
index c2ed24cef926..0bb4bddc1495 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ptp.h
+++ b/drivers/net/ethernet/intel/iavf/iavf_ptp.h
@@ -6,9 +6,13 @@
  
  #include "iavf_types.h"
  
+#define iavf_clock_to_adapter(info)\

+   container_of_const(info, struct iavf_adapter, ptp.info)


It's only used in one file, are you sure you need it here in the header?
Or it will be used in later patches?

[...]


+void iavf_virtchnl_send_ptp_cmd(struct iavf_adapter *adapter)
+{
+   struct device *dev = &adapter->pdev->dev;
+   struct iavf_ptp_aq_cmd *cmd;
+   int err;
+
+   if (!adapter->ptp.initialized) {


BTW does it make sense to introduce ptp.initialized since you can always
check ptp.clock for being %NULL and it will be the same?


+   /* This shouldn't be possible to hit, since no messages should
+* be queued if PTP is not initialized.
+*/
+   pci_err(adapter->pdev, "PTP is not initialized\n");
+   adapter->aq_required &= ~IAVF_FLAG_AQ_SEND_PTP_CMD;
+   return;
+   }
+
+   mutex_lock(&adapter->ptp.aq_cmd_lock);
+   cmd = list_first_entry_or_null(&adapter->ptp.aq_cmds,
+  struct iavf_ptp_aq_cmd, list);
+   if (!cmd) {
+   /* no further PTP messages to send */
+   adapter->aq_required &= ~IAVF_FLAG_AQ_SEND_PTP_CMD;
+   goto out_unlock;
+   }
+
+   if (adapter->current_op != VIRTCHNL_OP_UNKNOWN) {
+   /* bail because we already have a command pending */
+   dev_err(dev, "Cannot send PTP command %d, command %d pending\n",


pci_err()


+   cmd->v_opcode, adapter->current_op);
+   goto out_unlock;
+   }
+
+   err = iavf_send_pf_msg(adapter, cmd->v_opcode, cmd->msg, cmd->msglen);
+   if (!err) {
+   /* Command was sent without errors, so we can remove it from
+* the list and discard it.
+*/
+   list_del(&cmd->list);
+   kfree(cmd);
+   } else {
+   /* We failed to send the command, try again next cycle */
+   dev_warn(dev, "Failed to send PTP command %d\n", cmd->v_opcode);


pci_err() I'd say.


+   }
+
+   if (list_empty(&adapter->ptp.aq_cmds))
+   /* no further PTP messages to send */
+

[Intel-wired-lan] [PATCH iwl-next 1/2] ice: rework of dump serdes equalizer values feature

2024-10-01 Thread Mateusz Polchlopek
Refactor the function ice_get_tx_rx_equa() to iterate over a new table
of params instead of making multiple calls to ice_aq_get_phy_equalization().

A subsequent commit will extend that function by adding more serdes
equalizer values to the dump.

Shorten the field names of struct ice_serdes_equalization_to_ethtool
for readability.

Reviewed-by: Przemek Kitszel 
Signed-off-by: Mateusz Polchlopek 
---
 drivers/net/ethernet/intel/ice/ice_ethtool.c | 93 ++--
 drivers/net/ethernet/intel/ice/ice_ethtool.h | 22 ++---
 2 files changed, 38 insertions(+), 77 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c 
b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 2924ac61300d..19e7a9d93928 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -693,75 +693,36 @@ static int ice_get_port_topology(struct ice_hw *hw, u8 
lport,
 static int ice_get_tx_rx_equa(struct ice_hw *hw, u8 serdes_num,
  struct ice_serdes_equalization_to_ethtool *ptr)
 {
+   static const int tx = ICE_AQC_OP_CODE_TX_EQU;
+   static const int rx = ICE_AQC_OP_CODE_RX_EQU;
+   struct {
+   int data_in;
+   int opcode;
+   int *out;
+   } aq_params[] = {
+   { ICE_AQC_TX_EQU_PRE1, tx, &ptr->tx_equ_pre1 },
+   { ICE_AQC_TX_EQU_PRE3, tx, &ptr->tx_equ_pre3 },
+   { ICE_AQC_TX_EQU_ATTEN, tx, &ptr->tx_equ_atten },
+   { ICE_AQC_TX_EQU_POST1, tx, &ptr->tx_equ_post1 },
+   { ICE_AQC_TX_EQU_PRE2, tx, &ptr->tx_equ_pre2 },
+   { ICE_AQC_RX_EQU_PRE2, rx, &ptr->rx_equ_pre2 },
+   { ICE_AQC_RX_EQU_PRE1, rx, &ptr->rx_equ_pre1 },
+   { ICE_AQC_RX_EQU_POST1, rx, &ptr->rx_equ_post1 },
+   { ICE_AQC_RX_EQU_BFLF, rx, &ptr->rx_equ_bflf },
+   { ICE_AQC_RX_EQU_BFHF, rx, &ptr->rx_equ_bfhf },
+   { ICE_AQC_RX_EQU_DRATE, rx, &ptr->rx_equ_drate },
+   };
int err;
 
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE1,
- ICE_AQC_OP_CODE_TX_EQU, serdes_num,
- &ptr->tx_equalization_pre1);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE3,
- ICE_AQC_OP_CODE_TX_EQU, serdes_num,
- &ptr->tx_equalization_pre3);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_ATTEN,
- ICE_AQC_OP_CODE_TX_EQU, serdes_num,
- &ptr->tx_equalization_atten);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_POST1,
- ICE_AQC_OP_CODE_TX_EQU, serdes_num,
- &ptr->tx_equalization_post1);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_TX_EQU_PRE2,
- ICE_AQC_OP_CODE_TX_EQU, serdes_num,
- &ptr->tx_equalization_pre2);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_PRE2,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_pre2);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_PRE1,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_pre1);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_POST1,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_post1);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_BFLF,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_bflf);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_BFHF,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_bfhf);
-   if (err)
-   return err;
-
-   err = ice_aq_get_phy_equalization(hw, ICE_AQC_RX_EQU_DRATE,
- ICE_AQC_OP_CODE_RX_EQU, serdes_num,
- &ptr->rx_equalization_drate);
-   if (err)
-   return err;
+   for (int i = 0; i < ARRAY_SIZE(aq_params); i++) {
+   err

[Intel-wired-lan] [PATCH iwl-next 2/2] ice: extend dump serdes equalizer values feature

2024-10-01 Thread Mateusz Polchlopek
Extend the work done in commit 70838938e89c ("ice: Implement driver
functionality to dump serdes equalizer values") by adding a new set of
Rx registers that can be read using the command:
  $ ethtool -d interface_name

Rx equalization parameters are E810 PHY registers used by end users to
gather information about configuration and status, in order to debug
link and connection issues in the field.

Reviewed-by: Przemek Kitszel 
Signed-off-by: Mateusz Polchlopek 
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 17 +
 drivers/net/ethernet/intel/ice/ice_ethtool.c| 17 +
 drivers/net/ethernet/intel/ice/ice_ethtool.h| 17 +
 3 files changed, 51 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h 
b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 1f01f3501d6b..1489a8ceec51 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -1492,6 +1492,23 @@ struct ice_aqc_dnl_equa_param {
 #define ICE_AQC_RX_EQU_BFLF (0x13 << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_RX_EQU_BFHF (0x14 << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_RX_EQU_DRATE (0x15 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINHF (0x20 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINLF (0x21 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_GAINDC (0x22 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_CTLE_BW (0x23 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_GAIN (0x30 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_GAIN2 (0x31 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_2 (0x32 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_3 (0x33 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_4 (0x34 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_5 (0x35 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_6 (0x36 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_7 (0x37 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_8 (0x38 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_9 (0x39 << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_10 (0x3A << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_11 (0x3B << ICE_AQC_RX_EQU_SHIFT)
+#define ICE_AQC_RX_EQU_DFE_12 (0x3C << ICE_AQC_RX_EQU_SHIFT)
 #define ICE_AQC_TX_EQU_PRE1 0x0
 #define ICE_AQC_TX_EQU_PRE3 0x3
 #define ICE_AQC_TX_EQU_ATTEN 0x4
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.c 
b/drivers/net/ethernet/intel/ice/ice_ethtool.c
index 19e7a9d93928..3072634bf049 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.c
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.c
@@ -711,6 +711,23 @@ static int ice_get_tx_rx_equa(struct ice_hw *hw, u8 
serdes_num,
{ ICE_AQC_RX_EQU_BFLF, rx, &ptr->rx_equ_bflf },
{ ICE_AQC_RX_EQU_BFHF, rx, &ptr->rx_equ_bfhf },
{ ICE_AQC_RX_EQU_DRATE, rx, &ptr->rx_equ_drate },
+   { ICE_AQC_RX_EQU_CTLE_GAINHF, rx, &ptr->rx_equ_ctle_gainhf },
+   { ICE_AQC_RX_EQU_CTLE_GAINLF, rx, &ptr->rx_equ_ctle_gainlf },
+   { ICE_AQC_RX_EQU_CTLE_GAINDC, rx, &ptr->rx_equ_ctle_gaindc },
+   { ICE_AQC_RX_EQU_CTLE_BW, rx, &ptr->rx_equ_ctle_bw },
+   { ICE_AQC_RX_EQU_DFE_GAIN, rx, &ptr->rx_equ_dfe_gain },
+   { ICE_AQC_RX_EQU_DFE_GAIN2, rx, &ptr->rx_equ_dfe_gain_2 },
+   { ICE_AQC_RX_EQU_DFE_2, rx, &ptr->rx_equ_dfe_2 },
+   { ICE_AQC_RX_EQU_DFE_3, rx, &ptr->rx_equ_dfe_3 },
+   { ICE_AQC_RX_EQU_DFE_4, rx, &ptr->rx_equ_dfe_4 },
+   { ICE_AQC_RX_EQU_DFE_5, rx, &ptr->rx_equ_dfe_5 },
+   { ICE_AQC_RX_EQU_DFE_6, rx, &ptr->rx_equ_dfe_6 },
+   { ICE_AQC_RX_EQU_DFE_7, rx, &ptr->rx_equ_dfe_7 },
+   { ICE_AQC_RX_EQU_DFE_8, rx, &ptr->rx_equ_dfe_8 },
+   { ICE_AQC_RX_EQU_DFE_9, rx, &ptr->rx_equ_dfe_9 },
+   { ICE_AQC_RX_EQU_DFE_10, rx, &ptr->rx_equ_dfe_10 },
+   { ICE_AQC_RX_EQU_DFE_11, rx, &ptr->rx_equ_dfe_11 },
+   { ICE_AQC_RX_EQU_DFE_12, rx, &ptr->rx_equ_dfe_12 },
};
int err;
 
diff --git a/drivers/net/ethernet/intel/ice/ice_ethtool.h b/drivers/net/ethernet/intel/ice/ice_ethtool.h
index 98eb9c51d687..8f2ad1c172c0 100644
--- a/drivers/net/ethernet/intel/ice/ice_ethtool.h
+++ b/drivers/net/ethernet/intel/ice/ice_ethtool.h
@@ -16,6 +16,23 @@ struct ice_serdes_equalization_to_ethtool {
int rx_equ_bflf;
int rx_equ_bfhf;
int rx_equ_drate;
+   int rx_equ_ctle_gainhf;
+   int rx_equ_ctle_gainlf;
+   int rx_equ_ctle_gaindc;
+   int rx_equ_ctle_bw;
+   int rx_equ_dfe_gain;
+   int rx_equ_dfe_gain_2;
+   int rx_equ_dfe_2;
+   int rx_equ_dfe_3;
+   int rx_equ_dfe_4;
+   int rx_equ_dfe_5;
+   int rx_equ_dfe_6;
+   int rx_equ_dfe_7;
+   int rx_equ_dfe_8;
+   int rx_equ_dfe_9;
+   int rx_equ_dfe_10;
+   int rx_equ_dfe_11;
+   int rx_equ_dfe_12;
int tx_equ_pre1;
 

[Intel-wired-lan] [PATCH iwl-next 0/2] Extend the dump serdes equalizer values feature

2024-10-01 Thread Mateusz Polchlopek
Extend the work done in commit 70838938e89c ("ice: Implement driver
functionality to dump serdes equalizer values") by refactoring the
ice_get_tx_rx_equa() function, shortening struct field names and adding
new Rx registers that can be read using the command:
 $ ethtool -d interface_name

Mateusz Polchlopek (2):
  ice: rework of dump serdes equalizer values feature
  ice: extend dump serdes equalizer values feature

 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  17 +++
 drivers/net/ethernet/intel/ice/ice_ethtool.c  | 110 +++---
 drivers/net/ethernet/intel/ice/ice_ethtool.h  |  39 +--
 3 files changed, 89 insertions(+), 77 deletions(-)


base-commit: 3b26fcd8529dc446af79c6023a1850260ceeaba7
-- 
2.38.1



Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-01 Thread Toke Høiland-Jørgensen
"Arthur Fabre"  writes:

> On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
>> > Thinking about it more, my only reluctance for a registration API is how
>> > to communicate the ID back to other consumers (our discussion below).
>> >
>> >>
>> >> > Dynamically registering fields means you have to share the returned ID
>> >> > with whoever is interested, which sounds tricky.
>> >> > If an XDP program sets a field like packet_id, every tracing
>> >> > program that looks at it, and userspace service, would need to know what
>> >> > the ID of that field is.
>> >> > Is there a way to easily share that ID with all of them?
>> >>
>> >> Right, so I'll admit this was one of the handwavy bits of my original
>> >> proposal, but I don't think it's unsolvable. You could do something like
>> >> (once, on application initialisation):
>> >>
>> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name 
>> >> for introspection? */
>> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
>> >>
>> >> and then just get the key out of that map from all programs that want to
>> >> use it?
>> >
>> > Passing it out of band works (whether it's via a pinned map like you
>> > described, or through other means like a unix socket or some other
>> > API), it's just more complicated.
>> >
>> > Every consumer also needs to know about that API. That won't work with
>> > standard tools. For example if we set a PACKET_ID KV, maybe we could
>> > give it to pwru so it could track packets using it?
>> > Without registering keys, we could pass it as a cli flag. With
>> > registration, we'd have to have some helper to get the KV ID.
>> >
>> > It also introduces ordering dependencies between the services on
>> > startup, eg packet tracing hooks could only be attached once our XDP
>> > service has registered a PACKET_ID KV, and they could query its API.
>>
>> Yeah, we definitely need a way to make that accessible and not too
>> cumbersome.
>>
>> I suppose what we really need is a way to map an application-specific
>> identifier to an ID. And, well, those identifiers could just be (string)
>> names? That's what we do for CO-RE, after all. So you'd get something
>> like:
>>
>> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>>
>> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve 
>> */
>>
>> and we make that idempotent, so that two callers using the same name and
>> size will just get the same id back; and if called with BPF_NO_CREATE,
>> you'll get -ENOENT if it hasn't already been registered by someone else?
>>
>> We could even make this a sub-command of the bpf() syscall if we want it
>> to be UAPI, but that's not strictly necessary, applications can also
>> just call the registration from a syscall program at startup...
>
> That's a nice API, it makes sharing the IDs much easier.
>
> We still have to worry about collisions (what if two different things
> want to add their own "packet_id" field?). But at least:
>
> * "Any string" has many more possibilities than 0-64 keys.

Yes, and it's easy to namespace (by prefixing, like
APPNAME_my_metaname). But sure, if everyone uses very generic names that
will be a problem.

> * bpf_register_metadata() could return an error if a field is already
> registered, instead of silently letting an application overwrite
> metadata

I don't think we can fundamentally solve the collision problem if we
also need to allow sharing keys (on purpose). I.e., we can't distinguish
between "these two applications deliberately want to share the packet_id
field" and "these two applications accidentally both picked packet_id as
their internal key".

I suppose we could pre-define some extra reserved keys if there turn out
to be widely used identifiers that many applications want. But I'm not
sure that should be there from the start; it quickly becomes very
speculative (packet_id comes to mind as one that might be generally
useful, but I'm really not sure, TBH).

> (although arguably we could add a BPF_NOEXIST-style flag
> to the KV set() to kind of do the same).

A global registry will need locking, so not sure it's a good idea to
support inserting into it in the fast path?

> At least internally, it still feels like we'd maintain a registry of
> these string fields and make them configurable for each service to avoid
> collisions.

Yeah, see above. Some kind of coordination (like a registry) is
impossible to avoid if you *want* to share data, but not sure how
common that is?

-Toke
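As a rough userspace model of the idempotent name-to-ID registration sketched above — registering the same name twice returns the same ID, resolving an unknown name fails with -ENOENT, and a size mismatch is rejected — something like the following could work. All identifiers here (meta_register(), MAX_FIELDS) are illustrative, not a proposed kernel API:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_FIELDS 64

struct meta_field {
	char name[32];
	uint32_t size;
	bool used;
};

static struct meta_field registry[MAX_FIELDS];

/* create=true mimics BPF_CREATE, create=false mimics BPF_NO_CREATE */
int meta_register(const char *name, uint32_t size, bool create)
{
	int free_slot = -1;

	for (int id = 0; id < MAX_FIELDS; id++) {
		if (registry[id].used) {
			if (!strcmp(registry[id].name, name)) {
				/* idempotent, but size must match */
				if (registry[id].size != size)
					return -EINVAL;
				return id;
			}
		} else if (free_slot < 0) {
			free_slot = id;
		}
	}

	if (!create)
		return -ENOENT;
	if (free_slot < 0)
		return -ENOSPC;

	strncpy(registry[free_slot].name, name,
		sizeof(registry[free_slot].name) - 1);
	registry[free_slot].size = size;
	registry[free_slot].used = true;
	return free_slot;
}
```

In the real thing the registry would of course need locking and would live in the kernel, but the lookup-by-name semantics would be the same.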



Re: [Intel-wired-lan] [net-next v3 2/2] e1000: Link NAPI instances to queues and IRQs

2024-10-01 Thread Simon Horman
On Mon, Sep 30, 2024 at 05:12:32PM +, Joe Damato wrote:
> Add support for netdev-genl, allowing users to query IRQ, NAPI, and queue
> information.
> 
> After this patch is applied, note the IRQ assigned to my NIC:
> 
> $ cat /proc/interrupts | grep enp0s8 | cut -f1 --delimiter=':'
>  18
> 
> Note the output from the cli:
> 
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>  --dump napi-get --json='{"ifindex": 2}'
> [{'id': 513, 'ifindex': 2, 'irq': 18}]
> 
> This device supports only 1 rx and 1 tx queue, so querying that:
> 
> $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
>  --dump queue-get --json='{"ifindex": 2}'
> [{'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'rx'},
>  {'id': 0, 'ifindex': 2, 'napi-id': 513, 'type': 'tx'}]
> 
> Signed-off-by: Joe Damato 

Reviewed-by: Simon Horman 



Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-01 Thread Lorenzo Bianconi
> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi  writes:
> > > 
> > > >> > We could combine such a registration API with your header format, so
> > > >> > that the registration just becomes a way of allocating one of the 
> > > >> > keys
> > > >> > from 0-63 (and the registry just becomes a global copy of the 
> > > >> > header).
> > > >> > This would basically amount to moving the "service config file" into 
> > > >> > the
> > > >> > kernel, since that seems to be the only common denominator we can 
> > > >> > rely
> > > >> > on between BPF applications (as all attempts to write a common daemon
> > > >> > for BPF management have shown).
> > > >> 
> > > >> That sounds reasonable. And I guess we'd have set() check the global
> > > >> registry to enforce that the key has been registered beforehand?
> > > >> 
> > > >> >
> > > >> > -Toke
> > > >> 
> > > >> Thanks for all the feedback!
> > > >
> > > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > > impact on performance (especially for xdp) since, based on the kfunc calls
> > > > order in the ebpf program, we can have one or multiple memmove/memcpy 
> > > > for
> > > > each packet, right?
> > > 
> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > > a global registry for offsets would sidestep this, but have the
> > > synchronisation issues we discussed up-thread. So on balance, I think
> > > the memmove() suggestion will probably lead to the least pain.
> > > 
> > > For the HW metadata we could sidestep this by always having a fixed
> > > struct for it (but using the same set/get() API with reserved keys). The
> > > only drawback of doing that is that we statically reserve a bit of
> > > space, but I'm not sure that is such a big issue in practice (at least
> > > not until this becomes so popular that the space starts to be contended;
> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >
> > I am fine with the proposed approach, but I think we need to verify what is the
> > impact on performance (in the worst case?)
> 
> If drivers are responsible for populating the hardware metadata before
> XDP, we could make sure drivers set the fields in order to avoid any
> memmove() (and maybe even provide a helper to ensure this?).

Nope, since the current APIs introduced by Stanislav consume NIC
metadata in kfuncs (mainly for AF_XDP) and, according to my understanding,
we want to add a kfunc to store the info for each piece of NIC metadata
(e.g. rx-hash, timestamping, ...) into the packet (this is what Toke is
proposing, right?). In this case kfunc calling order makes a difference.
We could even think of adding a single kfunc to store all the info for all
the NIC metadata (maybe via a helper struct), but that seems not scalable
to me and we would lose kfunc versatility.

Regards,
Lorenzo

> 
> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is 
> > > > not
> > > > so suitable for nic hw metadata since:
> > > > - it grows backward 
> > > > - it is probably in a different cacheline with respect to xdp_frame
> > > > - nic hw metadata will not start at fixed and immutable address, but it 
> > > > depends
> > > >   on the running ebpf program
> > > >
> > > > What about having something like:
> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at 
> > > > the end
> > > >   of the metadata area :)). Here he can reuse the same KV approach if 
> > > > it is fast
> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> > > 
> > > AFAIU, none of this will live in the (current) XDP metadata area. It
> > > will all live just after the xdp_frame struct (so sharing the space with
> > > the metadata area in the sense that adding more metadata kv fields will
> > > decrease the amount of space that is usable by the current XDP metadata
> > > APIs).
> > > 
> > > -Toke
> > > 
> >
> > ah, ok. I was thinking the proposed approach was to put them in the current
> > metadata field.
> 
> I've also been thinking of putting this new KV stuff at the start of the
> headroom (I think that's what you're saying Toke?). It has a few nice
> advantanges:
> 
> * It coexists nicely with the current XDP / TC metadata support.
> Those users won't be able to overwrite / corrupt the KV metadata.
> KV users won't need to call xdp_adjust_meta() (which would be awkward -
> how would they know how much space the KV implementation needs).
> 
> * We don't have to move all the metadata every time we call
> xdp_adjust_head() (or the kernel equivalent).
> 
> Are there any performance implications of that, e.g. for caching?
> 
> This would also grow "upwards" which is more natural, but I think 
> either way the KV API would hide whether it's downwards or upwards from
> users.
> 
> >
> > Regards,
> > Lorenzo
> 




Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-01 Thread Arthur Fabre
On Mon Sep 30, 2024 at 12:52 PM CEST, Toke Høiland-Jørgensen wrote:
> > Thinking about it more, my only reluctance for a registration API is how
> > to communicate the ID back to other consumers (our discussion below).
> >
> >>
> >> > Dynamically registering fields means you have to share the returned ID
> >> > with whoever is interested, which sounds tricky.
> >> > If an XDP program sets a field like packet_id, every tracing
> >> > program that looks at it, and userspace service, would need to know what
> >> > the ID of that field is.
> >> > Is there a way to easily share that ID with all of them?
> >>
> >> Right, so I'll admit this was one of the handwavy bits of my original
> >> proposal, but I don't think it's unsolvable. You could do something like
> >> (once, on application initialisation):
> >>
> >> __u64 my_key = bpf_register_metadata_field(my_size); /* maybe add a name 
> >> for introspection? */
> >> bpf_map_update(&shared_application_config, &my_key_index, &my_key);
> >>
> >> and then just get the key out of that map from all programs that want to
> >> use it?
> >
> > Passing it out of band works (whether it's via a pinned map like you
> > described, or through other means like a unix socket or some other
> > API), it's just more complicated.
> >
> > Every consumer also needs to know about that API. That won't work with
> > standard tools. For example if we set a PACKET_ID KV, maybe we could
> > give it to pwru so it could track packets using it?
> > Without registering keys, we could pass it as a cli flag. With
> > registration, we'd have to have some helper to get the KV ID.
> >
> > It also introduces ordering dependencies between the services on
> > startup, eg packet tracing hooks could only be attached once our XDP
> > service has registered a PACKET_ID KV, and they could query its API.
>
> Yeah, we definitely need a way to make that accessible and not too
> cumbersome.
>
> I suppose what we really need is a way to map an application-specific
> identifier to an ID. And, well, those identifiers could just be (string)
> names? That's what we do for CO-RE, after all. So you'd get something
> like:
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_CREATE); /* register */
>
> id = bpf_register_metadata_field("packet_id", 8, BPF_NO_CREATE); /* resolve */
>
> and we make that idempotent, so that two callers using the same name and
> size will just get the same id back; and if called with BPF_NO_CREATE,
> you'll get -ENOENT if it hasn't already been registered by someone else?
>
> We could even make this a sub-command of the bpf() syscall if we want it
> to be UAPI, but that's not strictly necessary, applications can also
> just call the registration from a syscall program at startup...

That's a nice API, it makes sharing the IDs much easier.

We still have to worry about collisions (what if two different things
want to add their own "packet_id" field?). But at least:

* "Any string" has many more possibilities than 0-64 keys.

* bpf_register_metadata() could return an error if a field is already
registered, instead of silently letting an application overwrite
metadata (although arguably we could add a BPF_NOEXIST-style flag
to the KV set() to kind of do the same).

At least internally, it still feels like we'd maintain a registry of
these string fields and make them configurable for each service to avoid
collisions.

> >> We could combine such a registration API with your header format, so
> >> that the registration just becomes a way of allocating one of the keys
> >> from 0-63 (and the registry just becomes a global copy of the header).
> >> This would basically amount to moving the "service config file" into the
> >> kernel, since that seems to be the only common denominator we can rely
> >> on between BPF applications (as all attempts to write a common daemon
> >> for BPF management have shown).
> >
> > That sounds reasonable. And I guess we'd have set() check the global
> > registry to enforce that the key has been registered beforehand?
>
> Yes, exactly. And maybe check that the size matches as well just to
> remove the obvious footgun of accidentally stepping on each other's
> toes?
>
> > Thanks for all the feedback!
>
> You're welcome! Thanks for working on this :)
>
> -Toke



Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-01 Thread Arthur Fabre
On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi  writes:
> > 
> > >> > We could combine such a registration API with your header format, so
> > >> > that the registration just becomes a way of allocating one of the keys
> > >> > from 0-63 (and the registry just becomes a global copy of the header).
> > >> > This would basically amount to moving the "service config file" into 
> > >> > the
> > >> > kernel, since that seems to be the only common denominator we can rely
> > >> > on between BPF applications (as all attempts to write a common daemon
> > >> > for BPF management have shown).
> > >> 
> > >> That sounds reasonable. And I guess we'd have set() check the global
> > >> registry to enforce that the key has been registered beforehand?
> > >> 
> > >> >
> > >> > -Toke
> > >> 
> > >> Thanks for all the feedback!
> > >
> > > I like this 'fast' KV approach but I guess we should really evaluate its
> > > impact on performance (especially for xdp) since, based on the kfunc calls
> > > order in the ebpf program, we can have one or multiple memmove/memcpy for
> > > each packet, right?
> > 
> > Yes, with Arthur's scheme, performance will be ordering dependent. Using
> > a global registry for offsets would sidestep this, but have the
> > synchronisation issues we discussed up-thread. So on balance, I think
> > the memmove() suggestion will probably lead to the least pain.
> > 
> > For the HW metadata we could sidestep this by always having a fixed
> > struct for it (but using the same set/get() API with reserved keys). The
> > only drawback of doing that is that we statically reserve a bit of
> > space, but I'm not sure that is such a big issue in practice (at least
> > not until this becomes so popular that the space starts to be contended;
> > but surely 256 bytes ought to be enough for everybody, right? :)).
>
> I am fine with the proposed approach, but I think we need to verify what is the
> impact on performance (in the worst case?)

If drivers are responsible for populating the hardware metadata before
XDP, we could make sure drivers set the fields in order to avoid any
memmove() (and maybe even provide a helper to ensure this?).

> > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is not
> > > so suitable for nic hw metadata since:
> > > - it grows backward 
> > > - it is probably in a different cacheline with respect to xdp_frame
> > > - nic hw metadata will not start at fixed and immutable address, but it 
> > > depends
> > >   on the running ebpf program
> > >
> > > What about having something like:
> > > - fixed hw nic metadata: just after xdp_frame struct (or if you want at 
> > > the end
> > >   of the metadata area :)). Here he can reuse the same KV approach if it 
> > > is fast
> > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
> > 
> > AFAIU, none of this will live in the (current) XDP metadata area. It
> > will all live just after the xdp_frame struct (so sharing the space with
> > the metadata area in the sense that adding more metadata kv fields will
> > decrease the amount of space that is usable by the current XDP metadata
> > APIs).
> > 
> > -Toke
> > 
>
> ah, ok. I was thinking the proposed approach was to put them in the current
> metadata field.

I've also been thinking of putting this new KV stuff at the start of the
headroom (I think that's what you're saying, Toke?). It has a few nice
advantages:

* It coexists nicely with the current XDP / TC metadata support.
Those users won't be able to overwrite / corrupt the KV metadata.
KV users won't need to call xdp_adjust_meta() (which would be awkward -
how would they know how much space the KV implementation needs).

* We don't have to move all the metadata every time we call
xdp_adjust_head() (or the kernel equivalent).

Are there any performance implications of that, e.g. for caching?

This would also grow "upwards" which is more natural, but I think 
either way the KV API would hide whether it's downwards or upwards from
users.

>
> Regards,
> Lorenzo
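As a toy illustration of why set() ordering matters in the packed key/value scheme discussed in this thread — a 64-bit bitmap recording which of keys 0-63 are present, with fixed-size values packed in ascending key order right after it — the following userspace sketch shows how setting a lower key after a higher one forces a memmove(). The layout and names are assumptions for illustration, not the actual proposal:

```c
#include <stdint.h>
#include <string.h>

struct kv_area {
	uint64_t bitmap;       /* bit k set => key k is present */
	uint64_t vals[64];     /* values packed in ascending key order */
};

/* number of present keys below 'key' == packed slot index */
static int slot_of(const struct kv_area *a, unsigned int key)
{
	return __builtin_popcountll(a->bitmap & ((1ULL << key) - 1));
}

void kv_set(struct kv_area *a, unsigned int key, uint64_t val)
{
	int slot = slot_of(a, key);
	int count = __builtin_popcountll(a->bitmap);

	if (!(a->bitmap & (1ULL << key))) {
		/* shift later values up one slot to make room */
		memmove(&a->vals[slot + 1], &a->vals[slot],
			(count - slot) * sizeof(a->vals[0]));
		a->bitmap |= 1ULL << key;
	}
	a->vals[slot] = val;
}

int kv_get(const struct kv_area *a, unsigned int key, uint64_t *val)
{
	if (!(a->bitmap & (1ULL << key)))
		return -1;
	*val = a->vals[slot_of(a, key)];
	return 0;
}
```

Setting keys in ascending order always appends (the memmove() copies zero bytes), while setting a low key after a high one has to shift everything above it — which is exactly the ordering-dependent cost discussed above.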



Re: [Intel-wired-lan] [PATCH iwl-next v10 07/14] iavf: add support for indirect access to PHC time

2024-10-01 Thread Alexander Lobakin
From: Mateusz Polchlopek 
Date: Tue, 1 Oct 2024 09:20:14 +0200

> 
> 
> On 8/21/2024 4:31 PM, Alexander Lobakin wrote:
>> From: Wojciech Drewek 
>> Date: Wed, 21 Aug 2024 14:15:32 +0200
>>
>>> From: Jacob Keller 
>>>
>>> Implement support for reading the PHC time indirectly via the
>>> VIRTCHNL_OP_1588_PTP_GET_TIME operation.
>>
>> [...]
>>
>>> +/**
>>> + * iavf_queue_ptp_cmd - Queue PTP command for sending over virtchnl
>>> + * @adapter: private adapter structure
>>> + * @cmd: the command structure to send
>>> + *
>>> + * Queue the given command structure into the PTP virtchnl command
>>> queue tos
>>> + * end to the PF.
>>> + */
>>> +static void iavf_queue_ptp_cmd(struct iavf_adapter *adapter,
>>> +   struct iavf_ptp_aq_cmd *cmd)
>>> +{
>>> +    mutex_lock(&adapter->ptp.aq_cmd_lock);
>>> +    list_add_tail(&cmd->list, &adapter->ptp.aq_cmds);
>>> +    mutex_unlock(&adapter->ptp.aq_cmd_lock);
>>> +
>>> +    adapter->aq_required |= IAVF_FLAG_AQ_SEND_PTP_CMD;
>>> +    mod_delayed_work(adapter->wq, &adapter->watchdog_task, 0);
>>
>> Are you sure you need delayed_work here? delayed_work is used only when
>> you need to run it after a delay. If the delay is always 0, then you
>> only need work_struct and queue_work().
>>
> 
> Sorry for the late response, but I was on quite a long vacation.
> 
> Regarding your question - I think mod_delayed_work() is needed here,
> at least for now. I agree with your suggestion to use queue_work(),
> but that function takes a work_struct as its parameter, and in our
> case the adapter->watchdog_task field is of struct delayed_work type.
> It uses the function iavf_watchdog_task(), which does plenty of
> things (including sending the PTP cmd). Changing adapter->watchdog_task
> to be just a struct work_struct is not applicable here, as in a few
> places in iavf_main.c we queue it with different delays.

Aaaah I'm sorry I didn't notice that watchdog_task is used in other
places, not only here.
Ack, leave it as it is.

> 
> To make a long story short - I agree with your argument, but as long
> as this 0 delay is not causing a performance regression I will stick
> with the solution implemented by Jake.

Thanks,
Olek


Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound

2024-10-01 Thread Paolo Abeni

On 9/25/24 09:57, Yunsheng Lin wrote:

A networking driver with page_pool support may hand over a page that
still has a DMA mapping to the network stack, and try to reuse that
page after the network stack is done with it and passes it back
to the page_pool, to avoid the penalty of DMA mapping/unmapping.
With all the caching in the network stack, some pages may be
held in the network stack without returning to the page_pool
soon enough, and with a VF disable causing the driver to unbind,
the page_pool does not stop the driver from doing its
unbinding work; instead the page_pool uses a workqueue to
periodically check if some pages are coming back from the network
stack, and if there are any, it does the DMA unmapping
related cleanup work.

As mentioned in [1], attempting DMA unmaps after the driver
has already unbound may leak resources or at worst corrupt
memory. Fundamentally, the page pool code cannot allow DMA
mappings to outlive the driver they belong to.

Currently it seems there are at least two cases where the page
is not released fast enough, causing DMA unmapping to be done after
the driver has already unbound:
1. ipv4 packet defragmentation timeout: this seems to cause
a delay of up to 30 secs.
2. skb_defer_free_flush(): this may cause infinite delay if
there is no triggering for net_rx_action().

In order not to do the DMA unmapping after the driver has already
unbound and stall the unloading of the networking driver, add
the pool->items array to record all the pages, including the ones
which are handed over to the network stack, so the page_pool can
do the DMA unmapping for those pages when page_pool_destroy()
is called. As pool->items needs to be large enough to avoid
performance degradation, add an 'item_full' stat to indicate
allocation failure due to unavailability of pool->items.


This looks really invasive, with room for potentially large performance 
regressions or worse. At the very least it does not look suitable for net.


Is the problem only tied to VFs drivers? It's a pity all the page_pool 
users will have to pay a bill for it...


/P
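As a rough illustration of the pool->items idea from the patch description — record every page handed out in a fixed array so that destroy can still unmap whatever the network stack has not returned — a toy model might look like this. Names, sizes, and the "unmap" bookkeeping are assumptions for illustration; the real page_pool code is considerably more involved:

```c
#include <stddef.h>
#include <stdbool.h>

#define POOL_ITEMS 4

struct toy_pool {
	void *items[POOL_ITEMS];   /* outstanding (still-mapped) pages */
	int item_full;             /* alloc failures: no free item slot */
	int unmapped;              /* pages "unmapped" so far */
};

/* record a page at alloc time; fail if the items array is exhausted */
bool pool_track(struct toy_pool *p, void *page)
{
	for (int i = 0; i < POOL_ITEMS; i++) {
		if (!p->items[i]) {
			p->items[i] = page;
			return true;
		}
	}
	p->item_full++;
	return false;
}

/* page came back in time: unmap it and free its slot */
void pool_return(struct toy_pool *p, void *page)
{
	for (int i = 0; i < POOL_ITEMS; i++) {
		if (p->items[i] == page) {
			p->items[i] = NULL;
			p->unmapped++;
			return;
		}
	}
}

/* driver unbinding: unmap everything still outstanding */
void pool_destroy(struct toy_pool *p)
{
	for (int i = 0; i < POOL_ITEMS; i++) {
		if (p->items[i]) {
			p->items[i] = NULL;
			p->unmapped++;
		}
	}
}
```

The performance concern in the thread is precisely the cost of that extra per-page tracking on the allocation fast path, plus the sizing of the items array.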



Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-01 Thread Toke Høiland-Jørgensen
Lorenzo Bianconi  writes:

>> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
>> > > Lorenzo Bianconi  writes:
>> > > 
>> > > >> > We could combine such a registration API with your header format, so
>> > > >> > that the registration just becomes a way of allocating one of the 
>> > > >> > keys
>> > > >> > from 0-63 (and the registry just becomes a global copy of the 
>> > > >> > header).
>> > > >> > This would basically amount to moving the "service config file" 
>> > > >> > into the
>> > > >> > kernel, since that seems to be the only common denominator we can 
>> > > >> > rely
>> > > >> > on between BPF applications (as all attempts to write a common 
>> > > >> > daemon
>> > > >> > for BPF management have shown).
>> > > >> 
>> > > >> That sounds reasonable. And I guess we'd have set() check the global
>> > > >> registry to enforce that the key has been registered beforehand?
>> > > >> 
>> > > >> >
>> > > >> > -Toke
>> > > >> 
>> > > >> Thanks for all the feedback!
>> > > >
>> > > > I like this 'fast' KV approach but I guess we should really evaluate its
>> > > > impact on performance (especially for xdp) since, based on the kfunc calls
>> > > > order in the ebpf program, we can have one or multiple memmove/memcpy 
>> > > > for
>> > > > each packet, right?
>> > > 
>> > > Yes, with Arthur's scheme, performance will be ordering dependent. Using
>> > > a global registry for offsets would sidestep this, but have the
>> > > synchronisation issues we discussed up-thread. So on balance, I think
>> > > the memmove() suggestion will probably lead to the least pain.
>> > > 
>> > > For the HW metadata we could sidestep this by always having a fixed
>> > > struct for it (but using the same set/get() API with reserved keys). The
>> > > only drawback of doing that is that we statically reserve a bit of
>> > > space, but I'm not sure that is such a big issue in practice (at least
>> > > not until this becomes so popular that the space starts to be contended;
>> > > but surely 256 bytes ought to be enough for everybody, right? :)).
>> >
>> > I am fine with the proposed approach, but I think we need to verify what is the
>> > impact on performance (in the worst case?)
>> 
>> If drivers are responsible for populating the hardware metadata before
>> XDP, we could make sure drivers set the fields in order to avoid any
>> memmove() (and maybe even provide a helper to ensure this?).
>
> Nope, since the current APIs introduced by Stanislav consume NIC
> metadata in kfuncs (mainly for AF_XDP) and, according to my understanding,
> we want to add a kfunc to store the info for each piece of NIC metadata
> (e.g. rx-hash, timestamping, ...) into the packet (this is what Toke is
> proposing, right?). In this case kfunc calling order makes a difference.
> We could even think of adding a single kfunc to store all the info for all
> the NIC metadata (maybe via a helper struct), but that seems not scalable
> to me and we would lose kfunc versatility.

Yes, I agree we should have separate kfuncs for each metadata field.
Which means it makes a lot of sense to just use the same setter API that
we use for the user-registered metadata fields, but using reserved keys.
So something like:

#define BPF_METADATA_HW_HASH  BIT(60)
#define BPF_METADATA_HW_TIMESTAMP BIT(61)
#define BPF_METADATA_HW_VLAN  BIT(62)
#define BPF_METADATA_RESERVED (0x << 48)

bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);


As for the internal representation, we can just have the kfunc do
something like:

int bpf_packet_metadata_set(field_id, value) {
  switch(field_id) {
case BPF_METADATA_HW_HASH:
  pkt->xdp_hw_meta.hash = value;
  break;
[...]
default:
  /* do the key packing thing */
  }
}


that way the order of setting the HW fields doesn't matter, only the
user-defined metadata.
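A compilable toy version of the sketch above — reserved high-bit field IDs dispatched into a fixed struct, so HW fields never trigger the packed-KV repacking — could look like the following. The field IDs, struct layout, and function name are made up for illustration; only the dispatch idea comes from the sketch:

```c
#include <stdint.h>

#define META_HW_HASH      (1ULL << 60)
#define META_HW_TIMESTAMP (1ULL << 61)
#define META_HW_VLAN      (1ULL << 62)

/* fixed-layout stand-in for pkt->xdp_hw_meta */
struct hw_meta {
	uint32_t hash;
	uint64_t timestamp;
	uint16_t vlan;
};

int meta_set(struct hw_meta *hw, uint64_t field_id, uint64_t value)
{
	switch (field_id) {
	case META_HW_HASH:
		hw->hash = (uint32_t)value;
		return 0;
	case META_HW_TIMESTAMP:
		hw->timestamp = value;
		return 0;
	case META_HW_VLAN:
		hw->vlan = (uint16_t)value;
		return 0;
	default:
		/* user-defined keys would take the packed-KV path here */
		return -1;
	}
}
```

Because each reserved key writes to a fixed offset, the order in which the HW-metadata kfuncs are called no longer affects cost — only the user-defined keys pay the repacking price.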

>> > > > Moreover, I still think the metadata area in the xdp_frame/xdp_buff is 
>> > > > not
>> > > > so suitable for nic hw metadata since:
>> > > > - it grows backward 
>> > > > - it is probably in a different cacheline with respect to xdp_frame
>> > > > - nic hw metadata will not start at fixed and immutable address, but 
>> > > > it depends
>> > > >   on the running ebpf program
>> > > >
>> > > > What about having something like:
>> > > > - fixed hw nic metadata: just after xdp_frame struct (or if you want 
>> > > > at the end
>> > > >   of the metadata area :)). Here he can reuse the same KV approach if 
>> > > > it is fast
>> > > > - user defined metadata: in the metadata area of the xdp_frame/xdp_buff
>> > > 
>> > > AFAIU, none of this will live in the (current) XDP metadata area. It
>> > > will all live just after the xdp_frame struct (so sharing the space with
>> > > the metadata area in the sense that adding more metadata kv fields will
>> > > decrease the amount of space that is usable by the current XDP metadata
>> > > APIs).
>> > > 
>> > > -Toke
>> > > 
>> >
>> > ah, ok. I was thinking the proposed app

Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound

2024-10-01 Thread Ilias Apalodimas
Hi Paolo,

Thanks for taking the time.

On Tue, 1 Oct 2024 at 16:32, Paolo Abeni  wrote:
>
> On 9/25/24 09:57, Yunsheng Lin wrote:
> > Networking driver with page_pool support may hand over page
> > still with dma mapping to network stack and try to reuse that
> > page after network stack is done with it and passes it back
> > to page_pool to avoid the penalty of dma mapping/unmapping.
> > With all the caching in the network stack, some pages may be
> > held in the network stack without returning to the page_pool
> > soon enough, and with VF disable causing the driver unbound,
> > the page_pool does not stop the driver from doing its
> > unbinding work; instead page_pool uses a workqueue to check
> > if there is some pages coming back from the network stack
> > periodically, if there are any, it will do the dma unmapping
> > related cleanup work.
> >
> > As mentioned in [1], attempting DMA unmaps after the driver
> > has already unbound may leak resources or at worst corrupt
> > memory. Fundamentally, the page pool code cannot allow DMA
> > mappings to outlive the driver they belong to.
> >
> > Currently it seems there are at least two cases that the page
> > is not released fast enough, causing dma unmapping to be done after
> > driver has already unbound:
> > 1. ipv4 packet defragmentation timeout: this seems to cause
> > delay up to 30 secs.
> > 2. skb_defer_free_flush(): this may cause infinite delay if
> > there is no triggering for net_rx_action().
> >
> > In order not to do the DMA unmapping after the driver has already
> > unbound, and to avoid stalling the unloading of the networking
> > driver, add a pool->items array to record all the pages, including
> > the ones handed over to the network stack, so the page_pool can
> > do the DMA unmapping for those pages when page_pool_destroy()
> > is called. As pool->items needs to be large enough to avoid
> > performance degradation, add an 'item_full' stat to indicate
> > allocation failures due to unavailability of pool->items.
>
> This looks really invasive, with room for potentially large performance
> regressions or worse. At the very least it does not look suitable for net.

Perhaps, and you are right we need to measure performance before
pulling it but...

>
> Is the problem only tied to VFs drivers? It's a pity all the page_pool
> users will have to pay a bill for it...

It's not. The problem happens when an SKB has been scheduled for
recycling and has already been mapped via page_pool. If the driver
disappears in the meantime, page_pool will free all the packets it
holds in its private rings (both slow and fast), but is not in control
of the SKB anymore. So any packets coming back for recycling *after*
that point cannot unmap memory properly.
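To make the race concrete, here is a minimal userspace sketch of the lifetime problem (all names are illustrative stand-ins, not the real page_pool API):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the lifetime problem: once the pool is destroyed
 * (device gone), pages still held by the stack can no longer be
 * DMA-unmapped through it. */

struct fake_page {
	bool dma_mapped;
};

struct fake_pool {
	bool destroyed;      /* device/driver gone */
	int unmap_failures;  /* returns we could no longer unmap */
};

/* Teardown: the pool frees its own rings, but pages the stack still
 * holds are out of its control at this point. */
static void fake_pool_destroy(struct fake_pool *pool)
{
	pool->destroyed = true;
}

/* A page coming back from the stack for recycling. */
static void fake_pool_return(struct fake_pool *pool, struct fake_page *page)
{
	if (pool->destroyed) {
		/* Too late: no device left to unmap against. */
		pool->unmap_failures++;
		return;
	}
	page->dma_mapped = false;
}
```

The proposed pool->items array closes exactly this window by letting page_pool_destroy() unmap every outstanding page before the device goes away.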

As discussed, this can lead either to memory corruption and resource
leaks, or worse, to the panics seen in the bug report. I am fine with
this going into -next, but it really is a bugfix, although I am not
100% sure that the Fixes: tag in the current patch is correct.

Thanks
/Ilias
>
> /P
>


Re: [Intel-wired-lan] [PATCH net v2 2/2] page_pool: fix IOMMU crash when driver has already unbound

2024-10-01 Thread Ilias Apalodimas
On Wed, 2 Oct 2024 at 09:46, Ilias Apalodimas
 wrote:
>
> Hi Paolo,
>
> Thanks for taking the time.
>
> On Tue, 1 Oct 2024 at 16:32, Paolo Abeni  wrote:
> >
> > On 9/25/24 09:57, Yunsheng Lin wrote:
> > > A networking driver with page_pool support may hand over a page
> > > still holding a DMA mapping to the network stack and try to reuse
> > > that page after the network stack is done with it and passes it
> > > back to the page_pool, to avoid the penalty of DMA
> > > mapping/unmapping. With all the caching in the network stack, some
> > > pages may be held in the network stack without returning to the
> > > page_pool soon enough, and with a VF disable causing the driver to
> > > unbind, the page_pool does not stop the driver from doing its
> > > unbinding work; instead the page_pool uses a workqueue to
> > > periodically check whether any pages have come back from the
> > > network stack, and if so, it does the DMA unmapping related
> > > cleanup work.
> > >
> > > As mentioned in [1], attempting DMA unmaps after the driver
> > > has already unbound may leak resources or at worst corrupt
> > > memory. Fundamentally, the page pool code cannot allow DMA
> > > mappings to outlive the driver they belong to.
> > >
> > > Currently there seem to be at least two cases where the page
> > > is not released fast enough, causing the DMA unmapping to happen
> > > after the driver has already unbound:
> > > 1. ipv4 packet defragmentation timeout: this seems to cause
> > > delays of up to 30 secs.
> > > 2. skb_defer_free_flush(): this may cause infinite delay if
> > > nothing triggers net_rx_action().
> > >
> > > In order not to do the DMA unmapping after the driver has already
> > > unbound, and to avoid stalling the unloading of the networking
> > > driver, add a pool->items array to record all the pages, including
> > > the ones handed over to the network stack, so the page_pool can
> > > do the DMA unmapping for those pages when page_pool_destroy()
> > > is called. As pool->items needs to be large enough to avoid
> > > performance degradation, add an 'item_full' stat to indicate
> > > allocation failures due to unavailability of pool->items.
> >
> > This looks really invasive, with room for potentially large performance
> > regressions or worse. At the very least it does not look suitable for net.
>
> Perhaps, and you are right we need to measure performance before
> pulling it but...
>
> >
> > Is the problem only tied to VFs drivers? It's a pity all the page_pool
> > users will have to pay a bill for it...
>
> It's not. The problem happens when an SKB has been scheduled for
> recycling and has already been mapped via page_pool. If the driver
> disappears in the meantime,

Apologies, this wasn't correct. It's the device that has to disappear,
not the driver.

> page_pool will free all the packets it
> holds in its private rings (both slow and fast), but is not in control
> of the SKB anymore. So any packets coming back for recycling *after*
> that point cannot unmap memory properly.
>
> As discussed, this can lead either to memory corruption and resource
> leaks, or worse, to the panics seen in the bug report. I am fine with
> this going into -next, but it really is a bugfix, although I am not
> 100% sure that the Fixes: tag in the current patch is correct.
>
> Thanks
> /Ilias
> >
> > /P
> >


[Intel-wired-lan] [RFC net-next v4 0/9] Add support for per-NAPI config via netlink

2024-10-01 Thread Joe Damato
Greetings:

Welcome to RFC v4.

Very important and significant changes have been made since RFC v3 [1],
please see the changelog below for details.

A couple important call outs for this revision for reviewers:

  1. idpf embeds a napi_struct in an internal data structure and
 includes an assertion on the size of napi_struct. The maintainers
 have stated that they think anyone touching napi_struct should update
 the assertion [2], so I've done this in patch 3. 

 Even though the assertion has been updated, I've given the
 cacheline placement of napi_struct within idpf's internals no
 thought or consideration.

 Would appreciate other opinions on this; I think idpf should be
 fixed. It seems unreasonable to me that anyone changing the size of
 a struct in the core should need to think about cachelines in idpf.

  2. This revision seems to work (see below for a full walk through). Is
 this the behavior we want? Am I missing some use case or some
 behavioral thing other folks need?

  3. Re a previous point made by Stanislav regarding "taking over a NAPI
 ID" when the channel count changes: mlx5 seems to call napi_disable
 followed by netif_napi_del for the old queues and then calls
 napi_enable for the new ones. In this RFC, the NAPI ID generation
 is deferred to napi_enable. This means we won't end up with two of
 the same NAPI IDs added to the hash at the same time (I am pretty
 sure).

 Can we assume all drivers will napi_disable the old queues before
 napi_enable the new ones? If yes, we might not need to worry about
 a NAPI ID takeover function.
 
  4. I made the decision to remove the WARN_ON_ONCE that (I think?)
 Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs
 != rxqs);) because this was triggering on every kernel boot with my
 mlx5 NIC.

  5. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
 thinking this is a bit strange. I think it's strange that we might
 be short some number of NAPI configs, but it seems like most people
 are in favor of this approach, so I've left it.
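For reference on point 1, the assertion pattern in question boils down to a compile-time size check. A minimal sketch (the real macro in idpf is libeth_cacheline_set_assert; the structs below are purely illustrative):

```c
#include <assert.h> /* static_assert (C11) */

/* Illustrative stand-in for the embedded core struct; 400 matches the
 * pre-series napi_struct size reported by pahole on my config. */
struct fake_napi_struct {
	char opaque[400];
};

/* Driver-private wrapper embedding the core struct, as idpf does with
 * napi_struct inside idpf_q_vector. */
struct fake_q_vector {
	struct fake_napi_struct napi;
	int other_state;
};

/* Any core change that grows fake_napi_struct breaks the build here,
 * which is exactly how the hardcoded idpf assertion forces anyone
 * growing the core struct to revisit the driver's layout. */
static_assert(sizeof(struct fake_napi_struct) == 400,
	      "core struct size changed; update the hardcoded value");
```

This is the mechanism at the center of the disagreement: the check is deliberate, but it also means every core size change touches the driver.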

I'd appreciate thoughts from reviewers on the above items, if at all
possible. Once those are addressed, modulo any feedback on the
code/white space wrapping I still need to do, I think this is close to
an official submission.

Now, on to the implementation:

This implementation allocates an array of "struct napi_config" in
net_device and each NAPI instance is assigned an index into the config
array.

Per-NAPI settings like:
  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

are persisted in napi_config and restored on napi_disable/napi_enable
respectively.
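A minimal sketch of that save/restore cycle (illustrative names, not the exact kernel helpers):

```c
#include <assert.h>

/* Sketch of the persistence scheme described above: per-NAPI settings
 * live in a config slot owned by the netdev and are copied out/in
 * around disable/enable, so they survive queue resizes. */

struct fake_napi_config {
	unsigned long gro_flush_timeout;
	unsigned int defer_hard_irqs;
	unsigned int napi_id;
};

struct fake_napi {
	unsigned long gro_flush_timeout;
	unsigned int defer_hard_irqs;
	unsigned int napi_id;
	struct fake_napi_config *config; /* slot in the netdev's array */
};

static void fake_napi_disable(struct fake_napi *n)
{
	/* Save live settings back into the persistent slot. */
	n->config->gro_flush_timeout = n->gro_flush_timeout;
	n->config->defer_hard_irqs = n->defer_hard_irqs;
	n->config->napi_id = n->napi_id;
}

static void fake_napi_enable(struct fake_napi *n)
{
	/* Restore settings, so a recreated NAPI keeps its ID and knobs. */
	n->gro_flush_timeout = n->config->gro_flush_timeout;
	n->defer_hard_irqs = n->config->defer_hard_irqs;
	n->napi_id = n->config->napi_id;
}
```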

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
  - mlx5 which is the basis of the examples below
  - mlx4 which has TX-only NAPIs, just to highlight that case. I have
only compile-tested this patch; I don't have this hardware.
  - bnxt which I have only compile-tested. I don't have this
hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)

This iteration seems to persist NAPI IDs and settings even when resizing
queues, see below, so I think maybe this is getting close to where we
want to land in terms of functionality.

Here's how it works when I test it on my system:

# start with 2 queues

$ ethtool -l eth4 | grep Combined | tail -1
Combined:   2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
 --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 2 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
 --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 2,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 2,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345, via its NAPI ID to specific values:

$ sudo ./tools/net/ynl/cli.py \
  --spec Documentation/netlink/specs/netdev.yaml \
  --do napi-set \
  --json='{"id": 345,
   "defer-hard-irqs": 111,
   "gro-flush-timeout": 1}'
None

Now output current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
 --dump napi-get --json='{"ifindex"

[Intel-wired-lan] [RFC net-next v4 3/9] net: napi: Make gro_flush_timeout per-NAPI

2024-10-01 Thread Joe Damato
Allow per-NAPI gro_flush_timeout setting.

The existing sysfs parameter is respected; writes to sysfs will write to
all NAPI structs for the device and the net_device gro_flush_timeout
field. Reads from sysfs will read from the net_device field.

The ability to set gro_flush_timeout on specific NAPI instances will be
added in a later commit, via netdev-genl.
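The intended semantics can be sketched as follows (illustrative names only: a sysfs write fans out to every NAPI instance plus the net_device field, while a read looks only at the net_device field):

```c
#include <assert.h>

#define FAKE_NUM_NAPIS 2

struct fake_dev {
	/* netdev-level field, the value sysfs reads report */
	unsigned long gro_flush_timeout;
	/* per-NAPI copies, what napi_complete_done() actually consults */
	unsigned long napi_gro_flush_timeout[FAKE_NUM_NAPIS];
};

/* sysfs write: update the netdev field and all NAPI instances. */
static void fake_sysfs_write(struct fake_dev *dev, unsigned long val)
{
	dev->gro_flush_timeout = val;
	for (int i = 0; i < FAKE_NUM_NAPIS; i++)
		dev->napi_gro_flush_timeout[i] = val;
}

/* sysfs read: only the netdev field is reported. */
static unsigned long fake_sysfs_read(const struct fake_dev *dev)
{
	return dev->gro_flush_timeout;
}
```

A later per-NAPI netlink write would then touch only one slot of the per-NAPI array, leaving the netdev field and the other instances alone.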

Note that idpf embeds napi_struct in its internals and has
established a series of asserts that involve the size of the napi
structure. Since this change increases the napi_struct size from 400 to
416 (according to pahole on my system), I've increased the assertion in
idpf by 16 bytes. No attention whatsoever was paid to the cacheline
placement of idpf internals as a result of this change.

Signed-off-by: Joe Damato 
---
 .../networking/net_cachelines/net_device.rst  |  2 +-
 drivers/net/ethernet/intel/idpf/idpf_txrx.h   |  2 +-
 include/linux/netdevice.h |  3 +-
 net/core/dev.c| 12 +++---
 net/core/dev.h| 40 +++
 net/core/net-sysfs.c  |  2 +-
 6 files changed, 51 insertions(+), 10 deletions(-)

diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index eeeb7c925ec5..3d02ae79c850 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -98,7 +98,6 @@ struct_netdev_queue*            _rx                     read_mostly
 unsigned_int                    num_rx_queues
 unsigned_int                    real_num_rx_queues      -               read_mostly         get_rps_cpu
 struct_bpf_prog*                xdp_prog                -               read_mostly         netif_elide_gro()
-unsigned_long                   gro_flush_timeout       -               read_mostly         napi_complete_done
 unsigned_int                    gro_max_size            -               read_mostly         skb_gro_receive
 unsigned_int                    gro_ipv4_max_size       -               read_mostly         skb_gro_receive
 rx_handler_func_t*              rx_handler              read_mostly     -                   __netif_receive_skb_core
@@ -182,4 +181,5 @@ struct_devlink_port*     devlink_port
 struct_dpll_pin*                dpll_pin
 struct hlist_head               page_pools
 struct dim_irq_moder*           irq_moder
+unsigned_long                   gro_flush_timeout
 u32                             napi_defer_hard_irqs
diff --git a/drivers/net/ethernet/intel/idpf/idpf_txrx.h b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
index f0537826f840..fcdf73486d46 100644
--- a/drivers/net/ethernet/intel/idpf/idpf_txrx.h
+++ b/drivers/net/ethernet/intel/idpf/idpf_txrx.h
@@ -438,7 +438,7 @@ struct idpf_q_vector {
__cacheline_group_end_aligned(cold);
 };
 libeth_cacheline_set_assert(struct idpf_q_vector, 112,
-   424 + 2 * sizeof(struct dim),
+   440 + 2 * sizeof(struct dim),
8 + sizeof(cpumask_var_t));
 
 struct idpf_rx_queue_stats {
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 55764efc5c93..33897edd16c8 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -377,6 +377,7 @@ struct napi_struct {
struct list_headdev_list;
struct hlist_node   napi_hash_node;
int irq;
+   unsigned long   gro_flush_timeout;
u32 defer_hard_irqs;
 };
 
@@ -2075,7 +2076,6 @@ struct net_device {
int ifindex;
unsigned intreal_num_rx_queues;
struct netdev_rx_queue  *_rx;
-   unsigned long   gro_flush_timeout;
unsigned intgro_max_size;
unsigned intgro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@@ -2398,6 +2398,7 @@ struct net_device {
 
/** @irq_moder: dim parameters used if IS_ENABLED(CONFIG_DIMLIB). */
struct dim_irq_moder*irq_moder;
+   unsigned long   gro_flush_timeout;
u32 napi_defer_hard_irqs;
 
u8  priv[] cacheline_aligned
diff --git a/net/core/dev.c b/net/core/dev.c
index 748739958d2a..056ed44f766f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6226,12 +6226,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 
if (work_done) {
if (n->gro_bitmask)
-   timeout = READ_ONCE(n->dev->gro_flush_timeout);
+   timeout = napi_get_gro_flush_timeout(n);
n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n