from:"Stanislav Fomichev"

Re: [Intel-wired-lan] [PATCH iwl-next, v2 1/2] selftests/bpf: xdp_hw_metadata reduce sleep interval

2024-03-01 Thread Stanislav Fomichev

On 03/02, Song Yoong Siang wrote:
> In current ping-pong design, xdp_hw_metadata will wait until the packet
> transmition completely done, then only start to receive the next packet.
> 
> The current sleep interval is 10ms, which is unnecessary large. Typically,
> a NIC does not need such a long time to transmit a packet. Furthermore,
> during this 10ms sleep time, the app is unable to receive incoming packets.
> 
> Therefore, this commit reduce sleep interval to 10us, so that
> xdp_hw_metadata able to support periodic packets with shorter interval.
> 10us * 500 = 5ms should be enough for packet transmission and status
> retrival.
> 
> Signed-off-by: Song Yoong Siang 

Acked-by: Stanislav Fomichev

Re: [Intel-wired-lan] [PATCH v7 net-next 11/15] testing: net-drv: add basic shaper test

2024-09-11 Thread Stanislav Fomichev

On 09/10, Paolo Abeni wrote:
> Leverage a basic/dummy netdevsim implementation to do functional
> coverage for NL interface.
> 
> Signed-off-by: Paolo Abeni 
> ---
> v5 -> v6:
>   - additional test-cases for delegation and queue reconf
> 
> v4 -> v5:
>   - updated to new driver API
>   - more consistent indentation
> 
> rfc v1 -> v2:
>   - added more test-cases WRT nesting and grouping
> ---
>  drivers/net/Kconfig   |   1 +
>  drivers/net/netdevsim/ethtool.c   |   2 +
>  drivers/net/netdevsim/netdev.c|  39 ++
>  tools/testing/selftests/drivers/net/Makefile  |   1 +
>  tools/testing/selftests/drivers/net/shaper.py | 457 ++
>  .../testing/selftests/net/lib/py/__init__.py  |   1 +
>  tools/testing/selftests/net/lib/py/ynl.py |   5 +
>  7 files changed, 506 insertions(+)
>  create mode 100755 tools/testing/selftests/drivers/net/shaper.py
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 9920b3a68ed1..1fd5acdc73c6 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -641,6 +641,7 @@ config NETDEVSIM
>   depends on PTP_1588_CLOCK_MOCK || PTP_1588_CLOCK_MOCK=n
>   select NET_DEVLINK
>   select PAGE_POOL
> + select NET_SHAPER
>   help
> This driver is a developer testing tool and software model that can
> be used to test various control path networking APIs, especially
> diff --git a/drivers/net/netdevsim/ethtool.c b/drivers/net/netdevsim/ethtool.c
> index 1436905bc106..5fe1eaef99b5 100644
> --- a/drivers/net/netdevsim/ethtool.c
> +++ b/drivers/net/netdevsim/ethtool.c
> @@ -103,8 +103,10 @@ nsim_set_channels(struct net_device *dev, struct 
> ethtool_channels *ch)
>   struct netdevsim *ns = netdev_priv(dev);
>   int err;
>  
> + mutex_lock(&dev->lock);
>   err = netif_set_real_num_queues(dev, ch->combined_count,
>   ch->combined_count);
> + mutex_unlock(&dev->lock);
>   if (err)
>   return err;
>  
> diff --git a/drivers/net/netdevsim/netdev.c b/drivers/net/netdevsim/netdev.c
> index 017a6102be0a..cad85bb0cf54 100644
> --- a/drivers/net/netdevsim/netdev.c
> +++ b/drivers/net/netdevsim/netdev.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -475,6 +476,43 @@ static int nsim_stop(struct net_device *dev)
>   return 0;
>  }
>  
> +static int nsim_shaper_set(struct net_shaper_binding *binding,
> +const struct net_shaper *shaper,
> +struct netlink_ext_ack *extack)
> +{
> + return 0;
> +}
> +
> +static int nsim_shaper_del(struct net_shaper_binding *binding,
> +const struct net_shaper_handle *handle,
> +struct netlink_ext_ack *extack)
> +{
> + return 0;
> +}
> +
> +static int nsim_shaper_group(struct net_shaper_binding *binding,
> +  int leaves_count,
> +  const struct net_shaper *leaves,
> +  const struct net_shaper *root,
> +  struct netlink_ext_ack *extack)
> +{
> + return 0;
> +}
> +
> +static void nsim_shaper_cap(struct net_shaper_binding *binding,
> + enum net_shaper_scope scope,
> + unsigned long *flags)
> +{
> + *flags = ULONG_MAX;
> +}
> +
> +static const struct net_shaper_ops nsim_shaper_ops = {
> + .set= nsim_shaper_set,
> + .delete = nsim_shaper_del,
> + .group  = nsim_shaper_group,
> + .capabilities   = nsim_shaper_cap,
> +};
> +
>  static const struct net_device_ops nsim_netdev_ops = {
>   .ndo_start_xmit = nsim_start_xmit,
>   .ndo_set_rx_mode= nsim_set_rx_mode,
> @@ -496,6 +534,7 @@ static const struct net_device_ops nsim_netdev_ops = {
>   .ndo_bpf= nsim_bpf,
>   .ndo_open   = nsim_open,
>   .ndo_stop   = nsim_stop,
> + .net_shaper_ops = &nsim_shaper_ops,
>  };
>  
>  static const struct net_device_ops nsim_vf_netdev_ops = {
> diff --git a/tools/testing/selftests/drivers/net/Makefile 
> b/tools/testing/selftests/drivers/net/Makefile
> index 39fb97a8c1df..25aec5c081df 100644
> --- a/tools/testing/selftests/drivers/net/Makefile
> +++ b/tools/testing/selftests/drivers/net/Makefile
> @@ -9,6 +9,7 @@ TEST_PROGS := \
>   ping.py \
>   queues.py \
>   stats.py \
> + shaper.py
>  # end of TEST_PROGS
>  
>  include ../../lib.mk
> diff --git a/tools/testing/selftests/drivers/net/shaper.py 
> b/tools/testing/selftests/drivers/net/shaper.py
> new file mode 100755
> index ..3504d51985bc
> --- /dev/null
> +++ b/tools/testing/selftests/drivers/net/shaper.py
> @@ -0,0 +1,457 @@
> +#!/usr/bin/env python3
> +# SPDX-License-Identifier: GPL-2.0
> +
> +from lib.py import ksft_run, ksft_exit, ksft_eq, ksft_t

Re: [Intel-wired-lan] [RFC net-next v4 0/9] Add support for per-NAPI config via netlink

2024-10-04 Thread Stanislav Fomichev

On 10/03, Joe Damato wrote:
> On Thu, Oct 03, 2024 at 04:53:13PM -0700, Joe Damato wrote:
> > On Thu, Oct 03, 2024 at 04:29:37PM -0700, Stanislav Fomichev wrote:
> > > On 10/01, Joe Damato wrote:
> > 
> > [...]
> >  
> > > >   2. This revision seems to work (see below for a full walk through). Is
> > > >  this the behavior we want? Am I missing some use case or some
> > > >  behavioral thing other folks need?
> > > 
> > > The walk through looks good!
> > 
> > Thanks for taking a look.
> > 
> > > >   3. Re a previous point made by Stanislav regarding "taking over a NAPI
> > > >  ID" when the channel count changes: mlx5 seems to call napi_disable
> > > >  followed by netif_napi_del for the old queues and then calls
> > > >  napi_enable for the new ones. In this RFC, the NAPI ID generation
> > > >  is deferred to napi_enable. This means we won't end up with two of
> > > >  the same NAPI IDs added to the hash at the same time (I am pretty
> > > >  sure).
> > > 
> > > 
> > > [..]
> > > 
> > > >  Can we assume all drivers will napi_disable the old queues before
> > > >  napi_enable the new ones? If yes, we might not need to worry about
> > > >  a NAPI ID takeover function.
> > > 
> > > With the explicit driver opt-in via netif_napi_add_config, this
> > > shouldn't matter? When somebody gets to converting the drivers that
> > > don't follow this common pattern they'll have to solve the takeover
> > > part :-)
> > 
> > That is true; that's a good point.
> 
> Actually, sorry, that isn't strictly true. NAPI ID generation is
> moved for everything to napi_enable; they just are (or are not)
> persisted depending on whether the driver opted in to add_config or
> not.
> 
> So, the change does affect all drivers. NAPI IDs won't be generated
> and added to the hash until napi_enable and they will be removed
> from the hash in napi_disable... even if you didn't opt-in to having
> storage.
> 
> Opt-ing in to storage via netif_napi_add_config just means that your
> NAPI IDs (and other settings) will be persistent.
> 
> Sorry about my confusion when replying earlier.

AFAIA, all control operations (ethtool or similar ones via netlink),
should grab rtnl lock. So as long as both enable/disable happen
under rtnl (and in my mind they should), I don't think there is gonna
be any user-visible side-effects of your change. But I might be wrong,
let's see if others can come up with some corner cases..

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-04 Thread Stanislav Fomichev

On 10/03, Daniel Xu wrote:
> Hi Stan,
> 
> On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> > On 10/03, Arthur Fabre wrote:
> > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > > Stanislav Fomichev  writes:
> > > > > 
> > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > > >> Lorenzo Bianconi  writes:
> > > > > >> 
> > > > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > > >> >> > > Lorenzo Bianconi  writes:
> > > > > >> >> > > 
> > > > > >> >> > > >> > We could combine such a registration API with your 
> > > > > >> >> > > >> > header format, so
> > > > > >> >> > > >> > that the registration just becomes a way of allocating 
> > > > > >> >> > > >> > one of the keys
> > > > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy 
> > > > > >> >> > > >> > of the header).
> > > > > >> >> > > >> > This would basically amount to moving the "service 
> > > > > >> >> > > >> > config file" into the
> > > > > >> >> > > >> > kernel, since that seems to be the only common 
> > > > > >> >> > > >> > denominator we can rely
> > > > > >> >> > > >> > on between BPF applications (as all attempts to write 
> > > > > >> >> > > >> > a common daemon
> > > > > >> >> > > >> > for BPF management have shown).
> > > > > >> >> > > >> 
> > > > > >> >> > > >> That sounds reasonable. And I guess we'd have set() 
> > > > > >> >> > > >> check the global
> > > > > >> >> > > >> registry to enforce that the key has been registered 
> > > > > >> >> > > >> beforehand?
> > > > > >> >> > > >> 
> > > > > >> >> > > >> >
> > > > > >> >> > > >> > -Toke
> > > > > >> >> > > >> 
> > > > > >> >> > > >> Thanks for all the feedback!
> > > > > >> >> > > >
> > > > > >> >> > > > I like this 'fast' KV approach but I guess we should 
> > > > > >> >> > > > really evaluate its
> > > > > >> >> > > > impact on performances (especially for xdp) since, based 
> > > > > >> >> > > > on the kfunc calls
> > > > > >> >> > > > order in the ebpf program, we can have one or multiple 
> > > > > >> >> > > > memmove/memcpy for
> > > > > >> >> > > > each packet, right?
> > > > > >> >> > > 
> > > > > >> >> > > Yes, with Arthur's scheme, performance will be ordering 
> > > > > >> >> > > dependent. Using
> > > > > >> >> > > a global registry for offsets would sidestep this, but have 
> > > > > >> >> > > the
> > > > > >> >> > > synchronisation issues we discussed up-thread. So on 
> > > > > >> >> > > balance, I think
> > > > > >> >> > > the memmove() suggestion will probably lead to the least 
> > > > > >> >> > > pain.
> > > > > >> >> > > 
> > > > > >> >> > > For the HW metadata we could sidestep this by always having 
> > > > > >> >> > > a fixed
> > > > > >> >> > > struct for it (but using the same set/get() API with 
> > > > > >> >> > > reserved keys). The
> > > > > >> >> > > only drawback of doing that is that we statically reserve a 
> > > > > >> >> > > bit of

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-04 Thread Stanislav Fomichev

On 10/04, Jesper Dangaard Brouer wrote:
> 
> 
> On 04/10/2024 04.13, Daniel Xu wrote:
> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> > > On 10/03, Arthur Fabre wrote:
> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > > > > On 10/02, Toke Høiland-Jørgensen wrote:
> > > > > > Stanislav Fomichev  writes:
> > > > > > 
> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > > > > > > Lorenzo Bianconi  writes:
> > > > > > > > 
> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > > > > > > > > > > Lorenzo Bianconi  writes:
> > > > > > > > > > > > 
> [...]
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we should 
> > > > > > > > > > > > > really evaluate its
> > > > > > > > > > > > > impact on performances (especially for xdp) since, 
> > > > > > > > > > > > > based on the kfunc calls
> > > > > > > > > > > > > order in the ebpf program, we can have one or 
> > > > > > > > > > > > > multiple memmove/memcpy for
> > > > > > > > > > > > > each packet, right?
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be ordering 
> > > > > > > > > > > > dependent. Using
> 
> I really like the *compact* Key-Value (KV) store idea from Arthur.
>  - The question is it is fast enough?
> 
> I've promised Arthur to XDP micro-benchmark this, if he codes this up to
> be usable in the XDP code path.  Listening to the LPC recording I heard
> that Alexei also saw potential and other use-case for this kind of
> fast-and-compact KV approach.
> 
> I have high hopes for the performance, as Arthur uses POPCNT instruction
> which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1
> and Reciprocal throughput 0.25.
> 
>  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
>  [2] https://www.agner.org/optimize/instruction_tables.pdf
> 
> [...]
> > > > There are two different use-cases for the metadata:
> > > > 
> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are only a
> > > >few well known fields, and only XDP can access them to set them as
> > > >metadata, so storing them in a struct somewhere could make sense.
> > > > 
> > > > * Arbitrary metadata used by services. Eg a TC filter could set a field
> > > >describing which service a packet is for, and that could be reused 
> > > > for
> > > >iptables, routing, socket dispatch...
> > > >Similarly we could set a "packet_id" field that uniquely identifies a
> > > >packet so we can trace it throughout the network stack (through
> > > >clones, encap, decap, userspace services...).
> > > >The skb->mark, but with more room, and better support for sharing it.
> > > > 
> > > > We can only know the layout ahead of time for the first one. And they're
> > > > similar enough in their requirements (need to be stored somewhere in the
> > > > SKB, have a way of retrieving each one individually, that it seems to
> > > > make sense to use a common API).
> > > 
> > > Why not have the following layout then?
> > > 
> > > +---+---++--+
> > > | more headroom | user-defined meta | hw-meta (potentially fixed skb 
> > > format) | data |
> > > +---+---++--+
> > >  ^
> > > ^
> > >  data_meta
> > >   data
> > > 
> > > You obviously still have a problem of communicating the layout if you
> > > have some redirects in between, but you, in theory still have this
> > > problem with user-defined metadata anyway (unless I'm missing
> > > something).
> > >

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-07 Thread Stanislav Fomichev

On 10/06, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev  writes:
> 
> > On 10/04, Jesper Dangaard Brouer wrote:
> >> 
> >> 
> >> On 04/10/2024 04.13, Daniel Xu wrote:
> >> > On Thu, Oct 03, 2024 at 01:26:08PM GMT, Stanislav Fomichev wrote:
> >> > > On 10/03, Arthur Fabre wrote:
> >> > > > On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> >> > > > > On 10/02, Toke Høiland-Jørgensen wrote:
> >> > > > > > Stanislav Fomichev  writes:
> >> > > > > > 
> >> > > > > > > On 10/01, Toke Høiland-Jørgensen wrote:
> >> > > > > > > > Lorenzo Bianconi  writes:
> >> > > > > > > > 
> >> > > > > > > > > > On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi 
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > > > Lorenzo Bianconi  writes:
> >> > > > > > > > > > > > 
> >> [...]
> >> > > > > > > > > > > > > 
> >> > > > > > > > > > > > > I like this 'fast' KV approach but I guess we 
> >> > > > > > > > > > > > > should really evaluate its
> >> > > > > > > > > > > > > impact on performances (especially for xdp) since, 
> >> > > > > > > > > > > > > based on the kfunc calls
> >> > > > > > > > > > > > > order in the ebpf program, we can have one or 
> >> > > > > > > > > > > > > multiple memmove/memcpy for
> >> > > > > > > > > > > > > each packet, right?
> >> > > > > > > > > > > > 
> >> > > > > > > > > > > > Yes, with Arthur's scheme, performance will be 
> >> > > > > > > > > > > > ordering dependent. Using
> >> 
> >> I really like the *compact* Key-Value (KV) store idea from Arthur.
> >>  - The question is it is fast enough?
> >> 
> >> I've promised Arthur to XDP micro-benchmark this, if he codes this up to
> >> be usable in the XDP code path.  Listening to the LPC recording I heard
> >> that Alexei also saw potential and other use-case for this kind of
> >> fast-and-compact KV approach.
> >> 
> >> I have high hopes for the performance, as Arthur uses POPCNT instruction
> >> which is *very* fast[1]. I checked[2] AMD Zen 3 and 4 have Ops/Latency=1
> >> and Reciprocal throughput 0.25.
> >> 
> >>  [1] https://www.agner.org/optimize/blog/read.php?i=853#848
> >>  [2] https://www.agner.org/optimize/instruction_tables.pdf
> >> 
> >> [...]
> >> > > > There are two different use-cases for the metadata:
> >> > > > 
> >> > > > * "Hardware" metadata (like the hash, rx_timestamp...). There are 
> >> > > > only a
> >> > > >few well known fields, and only XDP can access them to set them as
> >> > > >metadata, so storing them in a struct somewhere could make sense.
> >> > > > 
> >> > > > * Arbitrary metadata used by services. Eg a TC filter could set a 
> >> > > > field
> >> > > >describing which service a packet is for, and that could be 
> >> > > > reused for
> >> > > >iptables, routing, socket dispatch...
> >> > > >Similarly we could set a "packet_id" field that uniquely 
> >> > > > identifies a
> >> > > >packet so we can trace it throughout the network stack (through
> >> > > >clones, encap, decap, userspace services...).
> >> > > >The skb->mark, but with more room, and better support for sharing 
> >> > > > it.
> >> > > > 
> >> > > > We can only know the layout ahead of time for the first one. And 
> >> > > > they're
> >> > > > similar enough in their requirements (need to be stored somewhere in 
> >> > > > the
> >> > > > SKB, have a way of retrieving each one individually, that it seems to
> >> > > > make sense to use a common API).
> >> > > 
> >>

Re: [Intel-wired-lan] [RFC net-next v4 0/9] Add support for per-NAPI config via netlink

2024-10-03 Thread Stanislav Fomichev

On 10/01, Joe Damato wrote:
> Greetings:
> 
> Welcome to RFC v4.
> 
> Very important and significant changes have been made since RFC v3 [1],
> please see the changelog below for details.
> 
> A couple important call outs for this revision for reviewers:
> 
>   1. idpf embeds a napi_struct in an internal data structure and
>  includes an assertion on the size of napi_struct. The maintainers
>  have stated that they think anyone touching napi_struct should update
>  the assertion [2], so I've done this in patch 3. 
> 
>  Even though the assertion has been updated, I've given the
>  cacheline placement of napi_struct within idpf's internals no
>  thought or consideration.
> 
>  Would appreciate other opinions on this; I think idpf should be
>  fixed. It seems unreasonable to me that anyone changing the size of
>  a struct in the core should need to think about cachelines in idpf.

[..]

>   2. This revision seems to work (see below for a full walk through). Is
>  this the behavior we want? Am I missing some use case or some
>  behavioral thing other folks need?

The walk through looks good!


>   3. Re a previous point made by Stanislav regarding "taking over a NAPI
>  ID" when the channel count changes: mlx5 seems to call napi_disable
>  followed by netif_napi_del for the old queues and then calls
>  napi_enable for the new ones. In this RFC, the NAPI ID generation
>  is deferred to napi_enable. This means we won't end up with two of
>  the same NAPI IDs added to the hash at the same time (I am pretty
>  sure).


[..]

>  Can we assume all drivers will napi_disable the old queues before
>  napi_enable the new ones? If yes, we might not need to worry about
>  a NAPI ID takeover function.

With the explicit driver opt-in via netif_napi_add_config, this
shouldn't matter? When somebody gets to converting the drivers that
don't follow this common pattern they'll have to solve the takeover
part :-)

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-03 Thread Stanislav Fomichev

On 10/03, Arthur Fabre wrote:
> On Thu Oct 3, 2024 at 12:49 AM CEST, Stanislav Fomichev wrote:
> > On 10/02, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev  writes:
> > > 
> > > > On 10/01, Toke Høiland-Jørgensen wrote:
> > > >> Lorenzo Bianconi  writes:
> > > >> 
> > > >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> > > >> >> > > Lorenzo Bianconi  writes:
> > > >> >> > > 
> > > >> >> > > >> > We could combine such a registration API with your header 
> > > >> >> > > >> > format, so
> > > >> >> > > >> > that the registration just becomes a way of allocating one 
> > > >> >> > > >> > of the keys
> > > >> >> > > >> > from 0-63 (and the registry just becomes a global copy of 
> > > >> >> > > >> > the header).
> > > >> >> > > >> > This would basically amount to moving the "service config 
> > > >> >> > > >> > file" into the
> > > >> >> > > >> > kernel, since that seems to be the only common denominator 
> > > >> >> > > >> > we can rely
> > > >> >> > > >> > on between BPF applications (as all attempts to write a 
> > > >> >> > > >> > common daemon
> > > >> >> > > >> > for BPF management have shown).
> > > >> >> > > >> 
> > > >> >> > > >> That sounds reasonable. And I guess we'd have set() check 
> > > >> >> > > >> the global
> > > >> >> > > >> registry to enforce that the key has been registered 
> > > >> >> > > >> beforehand?
> > > >> >> > > >> 
> > > >> >> > > >> >
> > > >> >> > > >> > -Toke
> > > >> >> > > >> 
> > > >> >> > > >> Thanks for all the feedback!
> > > >> >> > > >
> > > >> >> > > > I like this 'fast' KV approach but I guess we should really 
> > > >> >> > > > evaluate its
> > > >> >> > > > impact on performances (especially for xdp) since, based on 
> > > >> >> > > > the kfunc calls
> > > >> >> > > > order in the ebpf program, we can have one or multiple 
> > > >> >> > > > memmove/memcpy for
> > > >> >> > > > each packet, right?
> > > >> >> > > 
> > > >> >> > > Yes, with Arthur's scheme, performance will be ordering 
> > > >> >> > > dependent. Using
> > > >> >> > > a global registry for offsets would sidestep this, but have the
> > > >> >> > > synchronisation issues we discussed up-thread. So on balance, I 
> > > >> >> > > think
> > > >> >> > > the memmove() suggestion will probably lead to the least pain.
> > > >> >> > > 
> > > >> >> > > For the HW metadata we could sidestep this by always having a 
> > > >> >> > > fixed
> > > >> >> > > struct for it (but using the same set/get() API with reserved 
> > > >> >> > > keys). The
> > > >> >> > > only drawback of doing that is that we statically reserve a bit 
> > > >> >> > > of
> > > >> >> > > space, but I'm not sure that is such a big issue in practice 
> > > >> >> > > (at least
> > > >> >> > > not until this becomes to popular that the space starts to be 
> > > >> >> > > contended;
> > > >> >> > > but surely 256 bytes ought to be enough for everybody, right? 
> > > >> >> > > :)).
> > > >> >> >
> > > >> >> > I am fine with the proposed approach, but I think we need to 
> > > >> >> > verify what is the
> > > >> >> > impact on performances (in the worst case??)
> > > >> >>

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-09-26 Thread Stanislav Fomichev

On 09/26, Lorenzo Bianconi wrote:
> > Lorenzo Bianconi  writes:
> > 
> > >> I'm hinting at some complications here (with the EFAULT return) that
> > >> needs to be resolved: there is no guarantee that a given packet will be
> > >> in sync with the current status of the registered metadata, so we need
> > >> explicit checks for this. If metadata entries are de-registered again
> > >> this also means dealing with holes and/or reshuffling the metadata
> > >> layout to reuse the released space (incidentally, this is the one place
> > >> where a TLV format would have advantages).
> > >
> > > I like this approach but it seems to me more suitable for 'sw' metadata
> > > (this is main Arthur and Jakub use case iiuc) where the userspace would
> > > enable/disable these functionalities system-wide.
> > > Regarding device hw metadata (e.g. checksum offload) I can see some issues
> > > since on a system we can have multiple NICs with different capabilities.
> > > If we consider current codebase, stmmac driver supports only rx timestamp,
> > > while mlx5 supports all of them. In a theoretical system with these two
> > > NICs, since pkt_metadata_registry is global system-wide, we will end-up
> > > with quite a lot of holes for the stmmac, right? (I am not sure if this
> > > case is relevant or not). In other words, we will end-up with a fixed
> > > struct for device rx hw metadata (like xdp_rx_meta). So I am wondering
> > > if we really need all this complexity for xdp rx hw metadata?
> > 
> > Well, the "holes" will be there anyway (in your static struct approach).
> > They would just correspond to parts of the "struct xdp_rx_meta" being
> > unset.
> 
> yes, what I wanted to say is I have the feeling we will end up 90% of the
> times in the same fields architecture and the cases where we can save some
> space seem very limited. Anyway, I am fine to discuss about a common
> architecture.
> 
> > 
> > What the "userspace can turn off the fields system wide" would
> > accomplish is to *avoid* the holes if you know that you will never need
> > them. E.g., say a system administrator know that they have no networks
> > that use (offloaded) VLANs. They could then disable the vlan offload
> > field system-wide, and thus reclaim the four bytes taken up by the
> > "vlan" member of struct xdp_rx_meta, freeing that up for use by
> > application metadata.
> 
> Even if I like the idea of having a common approach for this kernel feature,
> hw metadata seems to me quite a corner case with respect of 'user-defined
> metadata', since:
> - I do not think it is a common scenario to disable hw offload capabilities
>   (e.g checksum offload in production)
> - I guess it is not just enough to disable them via bpf, but the user/sysadmin
>   will need even to configure the NIC via ethtool (so a 2-steps process).
> 
> I think we should pay attention to not overcomplicate something that is 99%
> enabled and just need to be fast. E.g I can see an issue of putting the hw rx
> metadata in metadata field since metadata grows backward and we will probably
> end up putting them in a different cacheline with respect to xdp_frame
> (xdp_headroom is usually 256B).
> 
> > 
> > However, it may well be that the complexity of allowing fields to be
> > turned off is not worth the gains. At least as long as there are only
> > the couple of HW metadata fields that we have currently. Having the
> > flexibility to change our minds later would be good, though, which is
> > mostly a matter of making the API exposed to BPF and/or userspace
> > flexible enough to allow us to move things around in memory in the
> > future. Which was basically my thought with the API I sketched out in
> > the previous email. I.e., you could go:
> > 
> > ret = bpf_get_packet_metadata_field(pkt, METADATA_ID_HW_HASH,
> > &my_hash_vlaue, sizeof(u32))
> 
> yes, my plan is to add dedicated bpf kfuncs to store hw metadata in
> xdp_frame/xdp_buff
> 
> > 
> > 
> > ...and the METADATA_ID_HW_HASH would be a constant defined by the
> > kernel, for which the bpf_get_packet_metadata_field() kfunc just has a
> > hardcoded lookup into struct xdp_rx_meta. And then, if we decide to move
> > the field in the future, we just change the kfunc implementation, with
> > no impact to the BPF programs calling it.
> > 
> 
> maybe we can use what we Stanislav have already added (maybe removing xdp
> prefix):
> 
> enum xdp_rx_metadata {
>   XDP_METADATA_KFUNC_RX_TIMESTAMP,
>   XDP_METADATA_KFUNC_RX_HASH,
>   XDP_METADATA_KFUNC_RX_VLAN_TAG
> };

I think it was one of the ideas floating around back then for the
xdp->skb case (including redirection): have an extra kfunc that the bpf
program can call and make this kfunc store the metadata (in the metadata area)
in some fixed format. Then the kernel can consume it.

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-02 Thread Stanislav Fomichev

On 10/02, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev  writes:
> 
> > On 10/01, Toke Høiland-Jørgensen wrote:
> >> Lorenzo Bianconi  writes:
> >> 
> >> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> >> >> > > Lorenzo Bianconi  writes:
> >> >> > > 
> >> >> > > >> > We could combine such a registration API with your header 
> >> >> > > >> > format, so
> >> >> > > >> > that the registration just becomes a way of allocating one of 
> >> >> > > >> > the keys
> >> >> > > >> > from 0-63 (and the registry just becomes a global copy of the 
> >> >> > > >> > header).
> >> >> > > >> > This would basically amount to moving the "service config 
> >> >> > > >> > file" into the
> >> >> > > >> > kernel, since that seems to be the only common denominator we 
> >> >> > > >> > can rely
> >> >> > > >> > on between BPF applications (as all attempts to write a common 
> >> >> > > >> > daemon
> >> >> > > >> > for BPF management have shown).
> >> >> > > >> 
> >> >> > > >> That sounds reasonable. And I guess we'd have set() check the 
> >> >> > > >> global
> >> >> > > >> registry to enforce that the key has been registered beforehand?
> >> >> > > >> 
> >> >> > > >> >
> >> >> > > >> > -Toke
> >> >> > > >> 
> >> >> > > >> Thanks for all the feedback!
> >> >> > > >
> >> >> > > > I like this 'fast' KV approach but I guess we should really 
> >> >> > > > evaluate its
> >> >> > > > impact on performances (especially for xdp) since, based on the 
> >> >> > > > kfunc calls
> >> >> > > > order in the ebpf program, we can have one or multiple 
> >> >> > > > memmove/memcpy for
> >> >> > > > each packet, right?
> >> >> > > 
> >> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. 
> >> >> > > Using
> >> >> > > a global registry for offsets would sidestep this, but have the
> >> >> > > synchronisation issues we discussed up-thread. So on balance, I 
> >> >> > > think
> >> >> > > the memmove() suggestion will probably lead to the least pain.
> >> >> > > 
> >> >> > > For the HW metadata we could sidestep this by always having a fixed
> >> >> > > struct for it (but using the same set/get() API with reserved 
> >> >> > > keys). The
> >> >> > > only drawback of doing that is that we statically reserve a bit of
> >> >> > > space, but I'm not sure that is such a big issue in practice (at 
> >> >> > > least
> >> >> > > not until this becomes to popular that the space starts to be 
> >> >> > > contended;
> >> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >> >> >
> >> >> > I am fine with the proposed approach, but I think we need to verify 
> >> >> > what is the
> >> >> > impact on performances (in the worst case??)
> >> >> 
> >> >> If drivers are responsible for populating the hardware metadata before
> >> >> XDP, we could make sure drivers set the fields in order to avoid any
> >> >> memove() (and maybe even provide a helper to ensure this?).
> >> >
> >> > nope, since the current APIs introduced by Stanislav are consuming NIC
> >> > metadata in kfuncs (mainly for af_xdp) and, according to my 
> >> > understanding,
> >> > we want to add a kfunc to store the info for each NIC metadata (e.g 
> >> > rx-hash,
> >> > timestamping, ..) into the packet (this is what Toke is proposing, 
> >> > right?).
> >> > In this case kfunc calling order makes a difference.
> >> > We can think even to add single kfunc to store all the info for all the 
> >

Re: [Intel-wired-lan] [RFC bpf-next 0/4] Add XDP rx hw hints support performing XDP_REDIRECT

2024-10-02 Thread Stanislav Fomichev

On 10/01, Toke Høiland-Jørgensen wrote:
> Lorenzo Bianconi  writes:
> 
> >> On Mon Sep 30, 2024 at 1:49 PM CEST, Lorenzo Bianconi wrote:
> >> > > Lorenzo Bianconi  writes:
> >> > > 
> >> > > >> > We could combine such a registration API with your header format, 
> >> > > >> > so
> >> > > >> > that the registration just becomes a way of allocating one of the 
> >> > > >> > keys
> >> > > >> > from 0-63 (and the registry just becomes a global copy of the 
> >> > > >> > header).
> >> > > >> > This would basically amount to moving the "service config file" 
> >> > > >> > into the
> >> > > >> > kernel, since that seems to be the only common denominator we can 
> >> > > >> > rely
> >> > > >> > on between BPF applications (as all attempts to write a common 
> >> > > >> > daemon
> >> > > >> > for BPF management have shown).
> >> > > >> 
> >> > > >> That sounds reasonable. And I guess we'd have set() check the global
> >> > > >> registry to enforce that the key has been registered beforehand?
> >> > > >> 
> >> > > >> >
> >> > > >> > -Toke
> >> > > >> 
> >> > > >> Thanks for all the feedback!
> >> > > >
> >> > > > I like this 'fast' KV approach but I guess we should really evaluate 
> >> > > > its
> >> > > > impact on performances (especially for xdp) since, based on the 
> >> > > > kfunc calls
> >> > > > order in the ebpf program, we can have one or multiple 
> >> > > > memmove/memcpy for
> >> > > > each packet, right?
> >> > > 
> >> > > Yes, with Arthur's scheme, performance will be ordering dependent. 
> >> > > Using
> >> > > a global registry for offsets would sidestep this, but have the
> >> > > synchronisation issues we discussed up-thread. So on balance, I think
> >> > > the memmove() suggestion will probably lead to the least pain.
> >> > > 
> >> > > For the HW metadata we could sidestep this by always having a fixed
> >> > > struct for it (but using the same set/get() API with reserved keys). 
> >> > > The
> >> > > only drawback of doing that is that we statically reserve a bit of
> >> > > space, but I'm not sure that is such a big issue in practice (at least
> >> > > not until this becomes to popular that the space starts to be 
> >> > > contended;
> >> > > but surely 256 bytes ought to be enough for everybody, right? :)).
> >> >
> >> > I am fine with the proposed approach, but I think we need to verify what 
> >> > is the
> >> > impact on performances (in the worst case??)
> >> 
> >> If drivers are responsible for populating the hardware metadata before
> >> XDP, we could make sure drivers set the fields in order to avoid any
> >> memove() (and maybe even provide a helper to ensure this?).
> >
> > nope, since the current APIs introduced by Stanislav are consuming NIC
> > metadata in kfuncs (mainly for af_xdp) and, according to my understanding,
> > we want to add a kfunc to store the info for each NIC metadata (e.g rx-hash,
> > timestamping, ..) into the packet (this is what Toke is proposing, right?).
> > In this case kfunc calling order makes a difference.
> > We can think even to add single kfunc to store all the info for all the NIC
> > metadata (maybe via a helping struct) but it seems not scalable to me and we
> > are losing kfunc versatility.
> 
> Yes, I agree we should have separate kfuncs for each metadata field.
> Which means it makes a lot of sense to just use the same setter API that
> we use for the user-registered metadata fields, but using reserved keys.
> So something like:
> 
> #define BPF_METADATA_HW_HASH  BIT(60)
> #define BPF_METADATA_HW_TIMESTAMP BIT(61)
> #define BPF_METADATA_HW_VLAN  BIT(62)
> #define BPF_METADATA_RESERVED (0x << 48)
> 
> bpf_packet_metadata_set(pkt, BPF_METADATA_HW_HASH, hash_value);
> 
> 
> As for the internal representation, we can just have the kfunc do
> something like:
> 
> int bpf_packet_metadata_set(field_id, value) {
>   switch(field_id) {
> case BPF_METADATA_HW_HASH:
>   pkt->xdp_hw_meta.hash = value;
>   break;
> [...]
> default:
>   /* do the key packing thing */
>   }
> }
> 
> 
> that way the order of setting the HW fields doesn't matter, only the
> user-defined metadata.

Can you expand on why we need the flexibility of picking the metadata fields
here? Presumably we are talking about the use-cases where the XDP program
is doing redirect/pass and it doesn't really know who's the final
consumer is (might be another xdp program or might be the xdp->skb
kernel case), so the only sensible option here seems to be store everything?

Re: [Intel-wired-lan] [PATCH bpf-next v4 1/4] xsk: Add launch time hardware offload support to XDP Tx metadata

2025-01-07 Thread Stanislav Fomichev

On 01/06, Song Yoong Siang wrote:
> Extend the XDP Tx metadata framework so that user can requests launch time
> hardware offload, where the Ethernet device will schedule the packet for
> transmission at a pre-determined time called launch time. The value of
> launch time is communicated from user space to Ethernet driver via
> launch_time field of struct xsk_tx_metadata.
> 
> Suggested-by: Stanislav Fomichev 
> Signed-off-by: Song Yoong Siang 
> ---
>  Documentation/netlink/specs/netdev.yaml  |  4 ++
>  Documentation/networking/xsk-tx-metadata.rst | 64 
>  include/net/xdp_sock.h   | 10 +++
>  include/net/xdp_sock_drv.h   |  1 +
>  include/uapi/linux/if_xdp.h  | 10 +++
>  include/uapi/linux/netdev.h  |  3 +
>  net/core/netdev-genl.c   |  2 +
>  net/xdp/xsk.c|  3 +
>  tools/include/uapi/linux/if_xdp.h| 10 +++
>  tools/include/uapi/linux/netdev.h|  3 +
>  10 files changed, 110 insertions(+)
> 
> diff --git a/Documentation/netlink/specs/netdev.yaml 
> b/Documentation/netlink/specs/netdev.yaml
> index cbb544bd6c84..e59c8a14f7d1 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -70,6 +70,10 @@ definitions:
>  name: tx-checksum
>  doc:
>L3 checksum HW offload is supported by the driver.
> +  -
> +name: tx-launch-time
> +doc:
> +  Launch time HW offload is supported by the driver.
>-
>  name: queue-type
>  type: enum
> diff --git a/Documentation/networking/xsk-tx-metadata.rst 
> b/Documentation/networking/xsk-tx-metadata.rst
> index e76b0cfc32f7..3cec089747ce 100644
> --- a/Documentation/networking/xsk-tx-metadata.rst
> +++ b/Documentation/networking/xsk-tx-metadata.rst
> @@ -50,6 +50,10 @@ The flags field enables the particular offload:
>checksum. ``csum_start`` specifies byte offset of where the checksumming
>should start and ``csum_offset`` specifies byte offset where the
>device should store the computed checksum.
> +- ``XDP_TXMD_FLAGS_LAUNCH_TIME``: requests the device to schedule the
> +  packet for transmission at a pre-determined time called launch time. The
> +  value of launch time is indicated by ``launch_time`` field of
> +  ``union xsk_tx_metadata``.
>  
>  Besides the flags above, in order to trigger the offloads, the first
>  packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
> @@ -65,6 +69,65 @@ In this case, when running in ``XDK_COPY`` mode, the TX 
> checksum
>  is calculated on the CPU. Do not enable this option in production because
>  it will negatively affect performance.
>  
> +Launch Time
> +===
> +
> +The value of the requested launch time should be based on the device's PTP
> +Hardware Clock (PHC) to ensure accuracy. AF_XDP takes a different data path
> +compared to the ETF queuing discipline, which organizes packets and delays
> +their transmission. Instead, AF_XDP immediately hands off the packets to
> +the device driver without rearranging their order or holding them prior to
> +transmission. In scenarios where the launch time offload feature is
> +disabled, the device driver is expected to disregard the launch time
> +request. For correct interpretation and meaningful operation, the launch
> +time should never be set to a value larger than the farthest programmable
> +time in the future (the horizon). Different devices have different hardware
> +limitations on the launch time offload feature.
> +
> +stmmac driver
> +-
> +
> +For stmmac, TSO and launch time (TBS) features are mutually exclusive for
> +each individual Tx Queue. By default, the driver configures Tx Queue 0 to
> +support TSO and the rest of the Tx Queues to support TBS. The launch time
> +hardware offload feature can be enabled or disabled by using the tc-etf
> +command to call the driver's ndo_setup_tc() callback.
> +
> +The value of the launch time that is programmed in the Enhanced Normal
> +Transmit Descriptors is a 32-bit value, where the most significant 8 bits
> +represent the time in seconds and the remaining 24 bits represent the time
> +in 256 ns increments. The programmed launch time is compared against the
> +PTP time (bits[39:8]) and rolls over after 256 seconds. Therefore, the
> +horizon of the launch time for dwmac4 and dwxlgmac2 is 128 seconds in the
> +future.
> +
> +The stmmac driver maintains FIFO behavior and does not perform packet
> +reordering. This means that a packet with a launch time request will block
> +other packets in the same Tx Queue until it is transmitted.

Re: [Intel-wired-lan] [PATCH bpf-next v4 2/4] selftests/bpf: Add Launch Time request to xdp_hw_metadata

2025-01-07 Thread Stanislav Fomichev

On 01/06, Song Yoong Siang wrote:
> Add Launch Time hw offload request to xdp_hw_metadata. User can configure
> the delta of launch time to HW RX-time by using "-l" argument. The default
> delta is 100,000,000 nanosecond.
> 
> Signed-off-by: Song Yoong Siang 
> ---
>  tools/testing/selftests/bpf/xdp_hw_metadata.c | 30 +--
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c 
> b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> index 6f7b15d6c6ed..795c1d14e02d 100644
> --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> @@ -13,6 +13,7 @@
>   * - UDP 9091 packets trigger TX reply
>   * - TX HW timestamp is requested and reported back upon completion
>   * - TX checksum is requested
> + * - TX launch time HW offload is requested for transmission
>   */
>  
>  #include 
> @@ -64,6 +65,8 @@ int rxq;
>  bool skip_tx;
>  __u64 last_hw_rx_timestamp;
>  __u64 last_xdp_rx_timestamp;
> +__u64 last_launch_time;
> +__u64 launch_time_delta_to_hw_rx_timestamp = 1; /* 0.1 second */
>  
>  void test__fail(void) { /* for network_helpers.c */ }
>  
> @@ -298,6 +301,8 @@ static bool complete_tx(struct xsk *xsk, clockid_t 
> clock_id)
>   if (meta->completion.tx_timestamp) {
>   __u64 ref_tstamp = gettime(clock_id);
>  
> + print_tstamp_delta("HW Launch-time", "HW TX-complete-time",
> +last_launch_time, 
> meta->completion.tx_timestamp);
>   print_tstamp_delta("HW TX-complete-time", "User 
> TX-complete-time",
>  meta->completion.tx_timestamp, ref_tstamp);
>   print_tstamp_delta("XDP RX-time", "User TX-complete-time",
> @@ -395,6 +400,14 @@ static void ping_pong(struct xsk *xsk, void *rx_packet, 
> clockid_t clock_id)
>  xsk, ntohs(udph->check), ntohs(want_csum),
>  meta->request.csum_start, meta->request.csum_offset);
>  
> + /* Set the value of launch time */
> + meta->flags |= XDP_TXMD_FLAGS_LAUNCH_TIME;
> + meta->request.launch_time = last_hw_rx_timestamp +
> + launch_time_delta_to_hw_rx_timestamp;
> + last_launch_time = meta->request.launch_time;
> + print_tstamp_delta("HW RX-time", "HW Launch-time", last_hw_rx_timestamp,
> +meta->request.launch_time);
> +
>   memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity 
> */
>   tx_desc->options |= XDP_TX_METADATA;
>   tx_desc->len = len;
> @@ -402,10 +415,14 @@ static void ping_pong(struct xsk *xsk, void *rx_packet, 
> clockid_t clock_id)
>   xsk_ring_prod__submit(&xsk->tx, 1);
>  }
>  
> +#define SLEEP_PER_ITERATION_IN_US 10
> +#define SLEEP_PER_ITERATION_IN_NS (SLEEP_PER_ITERATION_IN_US * 1000)
> +#define MAX_ITERATION(x) (((x) / SLEEP_PER_ITERATION_IN_NS) + 500)
>  static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, 
> clockid_t clock_id)
>  {
>   const struct xdp_desc *rx_desc;
>   struct pollfd fds[rxq + 1];
> + int max_iterations;
>   __u64 comp_addr;
>   __u64 addr;
>   __u32 idx = 0;
> @@ -418,6 +435,9 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, 
> int server_fd, clockid_t
>   fds[i].revents = 0;
>   }
>  
> + /* Calculate max iterations to wait for transmit completion */
> + max_iterations = MAX_ITERATION(launch_time_delta_to_hw_rx_timestamp);
> +
>   fds[rxq].fd = server_fd;
>   fds[rxq].events = POLLIN;
>   fds[rxq].revents = 0;
> @@ -477,10 +497,10 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, 
> int server_fd, clockid_t
>   if (ret)
>   printf("kick_tx ret=%d\n", ret);
>  

[..]

> - for (int j = 0; j < 500; j++) {
> + for (int j = 0; j < max_iterations; 
> j++) {
>   if (complete_tx(xsk, clock_id))
>   break;
> - usleep(10);
> + 
> usleep(SLEEP_PER_ITERATION_IN_US);

nit: instead of doing MAX_ITERATION/max_iterations, can we simplify this
to the following?

static u64 now(void)
{
clock_gettime(...);
return ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

/* wait 5 seconds + cover launch time */
deadline = now() + 5 * NSEC_PER_SEC + launch_time_delta_to_hw_rx_timestamp;
while (true) {
if (complete_tx())
break;
if (now() >= deadline)
break;
usleep(10);
}

It is a bit more readable than converting time to wait to the
iterations..

Re: [Intel-wired-lan] [PATCH bpf-next v4 3/4] net: stmmac: Add launch time support to XDP ZC

2025-01-07 Thread Stanislav Fomichev

On 01/06, Song Yoong Siang wrote:
> Enable launch time (Time-Based Scheduling) support to XDP zero copy via XDP
> Tx metadata framework.
> 
> This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata on
> Intel Tiger Lake platform. Below are the test steps and result.
> 
> Test Steps:
> 1. Add mqprio qdisc:
>$ sudo tc qdisc add dev enp0s30f4 handle 8001: parent root mqprio num_tc
>  4 map 0 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 queues 1@0 1@1 1@2 1@3 hw 0
> 
> 2. Enable launch time hardware offload on hardware queue 1:
>$ sudo tc qdisc replace dev enp0s30f4 parent 8001:2 etf offload clockid
>  CLOCK_TAI delta 50
> 
> 3. Add an ingress qdisc:
>$ sudo tc qdisc add dev enp0s30f4 ingress
> 
> 4. Add a flower filter to route incoming packet with VLAN priority 1 into
>hardware queue 1:
>$ sudo tc filter add dev enp0s30f4 parent : protocol 802.1Q flower
>  vlan_prio 1 hw_tc 1
> 
> 5. Enable VLAN tag stripping:
>$ sudo ethtool -K enp0s30f4 rxvlan on
> 
> 6. Start xdp_hw_metadata selftest application:
>$ sudo ./xdp_hw_metadata enp0s30f4 -l 10
> 
> 7. Send an UDP packet with VLAN priority 1 to port 9091 of DUT.

Tangential: I wonder whether we can add the setup steps to the
xdp_hw_metadata tool? It is useful to have one command to run that
takes care of all the details. Same way it already enables HW
tstamping.. 

Or, if not the full setup, some kind of detection we can signal to the
user that some things might be missing?

Re: [Intel-wired-lan] [PATCH bpf-next v6 2/4] selftests/bpf: Add launch time request to xdp_hw_metadata

2025-01-23 Thread Stanislav Fomichev

On 01/16, Song Yoong Siang wrote:
> Add launch time hardware offload request to xdp_hw_metadata. Users can
> configure the delta of launch time relative to HW RX-time using the "-l"
> argument. By default, the delta is set to 0 ns, which means the launch time
> is disabled. By setting the delta to a non-zero value, the launch time
> hardware offload feature will be enabled and requested. Additionally, users
> can configure the Tx Queue to be enabled with the launch time hardware
> offload using the "-L" argument. By default, Tx Queue 0 will be used.
> 
> Signed-off-by: Song Yoong Siang 

Forgot to add:

Acked-by: Stanislav Fomichev

Re: [Intel-wired-lan] [PATCH bpf-next v6 4/4] igc: Add launch time support to XDP ZC

2025-01-23 Thread Stanislav Fomichev

On 01/23, Florian Bezdeka wrote:
> Hi all,
> 
> On Thu, 2025-01-23 at 16:41 +, Song, Yoong Siang wrote:
> > On Thursday, January 23, 2025 11:40 PM, Bouska, Zdenek 
> >  wrote:
> > > 
> > > Hi Siang,
> > > 
> > > I tested this patch series on 6.13 with Intel I226-LM (rev 04).
> > > 
> > > I also applied patch "selftests/bpf: Actuate tx_metadata_len in 
> > > xdp_hw_metadata" [1]
> > > and "selftests/bpf: Enable Tx hwtstamp in xdp_hw_metadata" [2] so that TX 
> > > timestamps
> > > work.
> > > 
> > > HW RX-timestamp was small (0.5956 instead of 1737373125.5956):
> > > 
> > > HW RX-time:   595572448 (sec:0.5956) delta to User RX-time 
> > > sec:1737373124.9873 (1737373124987318.750 usec)
> > > XDP RX-time:   1737373125582798388 (sec:1737373125.5828) delta to User 
> > > RX-time sec:0.0001 (92.733 usec)
> > > 
> > > Igc's raw HW RX-timestamp in front of frame data was overwritten by BPF 
> > > program on
> > > line 90 in tools/testing/selftests/bpf: meta->hint_valid = 0;
> > > 
> > > "HW timestamp has been copied into local variable" comment is outdated on
> > > line 2813 in drivers/net/ethernet/intel/igc/igc_main.c after
> > > commit 069b142f5819 igc: Add support for PTP .getcyclesx64() [3].
> > > 
> > > Workaround is to add unused data to xdp_meta struct:
> > > 
> > > --- a/tools/testing/selftests/bpf/xdp_metadata.h
> > > +++ b/tools/testing/selftests/bpf/xdp_metadata.h
> > > @@ -49,4 +49,5 @@ struct xdp_meta {
> > >__s32 rx_vlan_tag_err;
> > >};
> > >enum xdp_meta_field hint_valid;
> > > +   __u8 avoid_IGC_TS_HDR_LEN[16];
> > > };
> > > 
> > 
> > Hi Zdenek Bouska, 
> > 
> > Thanks for your help on testing this patch set.
> > You are right, there is some issue with the Rx hw timestamp,
> > I will submit the bug fix patch when the solution is finalized,
> > but the fix will not be part of this launch time patch set.
> > Until then, you can continue to use your WA.
> 
> I think there is no simple fix for that. That needs some discussion
> around the "expectations" to the headroom / meta data area in front of
> the actual packet data.

By 'simple' you mean without some new UAPI to signal the size of that
'reserved area' by the driver? I don't see any other easy way out as well :-/

Re: [Intel-wired-lan] [PATCH bpf-next v5 1/4] xsk: Add launch time hardware offload support to XDP Tx metadata

2025-01-15 Thread Stanislav Fomichev

On 01/14, Song Yoong Siang wrote:
> Extend the XDP Tx metadata framework so that user can requests launch time
> hardware offload, where the Ethernet device will schedule the packet for
> transmission at a pre-determined time called launch time. The value of
> launch time is communicated from user space to Ethernet driver via
> launch_time field of struct xsk_tx_metadata.
> 
> Suggested-by: Stanislav Fomichev 
> Signed-off-by: Song Yoong Siang 

Acked-by: Stanislav Fomichev

Re: [Intel-wired-lan] [PATCH bpf-next v5 2/4] selftests/bpf: Add launch time request to xdp_hw_metadata

2025-01-15 Thread Stanislav Fomichev

On 01/15, Song, Yoong Siang wrote:
> On Wednesday, January 15, 2025 10:57 PM, Daniel Borkmann 
>  wrote:
> >On 1/14/25 4:27 PM, Song Yoong Siang wrote:
> >[...]
> >> +  /* Add mqprio qdisc with TC and hardware queue one-to-one mapping */
> >> +  char map[256] = {0};
> >> +  char queues[256] = {0};
> >> +
> >> +  for (i = 0; i < rxq; i++) {
> >> +  char buf[8];
> >> +
> >> +  snprintf(buf, sizeof(buf), "%d ", i);
> >> +  strcat(map, buf);
> >> +
> >> +  snprintf(buf, sizeof(buf), "1@%d ", i);
> >> +  strcat(queues, buf);
> >> +  }
> >> +  run_command("sudo tc qdisc add dev %s handle 8001: parent root mqprio
> >num_tc %d map %s queues %s hw 0",
> >> +  ifname, rxq, map, queues);
> >
> >Fyi, above triggers selftest build errors:
> >
> >   xdp_hw_metadata.c: In function ‘main’:
> >   xdp_hw_metadata.c:763:45: error: ‘%d’ directive output may be truncated
> >writing between 1 and 10 bytes into a region of size 8 [-Werror=format-
> >truncation=]
> > 763 | snprintf(buf, sizeof(buf), "%d ", i);
> > | ^~
> > TEST-OBJ [test_progs] arg_parsing.test.o
> >   xdp_hw_metadata.c:763:44: note: directive argument in the range [0,
> >2147483646]
> > 763 | snprintf(buf, sizeof(buf), "%d ", i);
> > |^
> >   xdp_hw_metadata.c:763:17: note: ‘snprintf’ output between 3 and 12 bytes 
> > into
> >a destination of size 8
> > 763 | snprintf(buf, sizeof(buf), "%d ", i);
> > | ^~~~
> >   xdp_hw_metadata.c:766:47: error: ‘%d’ directive output may be truncated
> >writing between 1 and 10 bytes into a region of size 6 [-Werror=format-
> >truncation=]
> > 766 | snprintf(buf, sizeof(buf), "1@%d ", i);
> > |   ^~
> >   xdp_hw_metadata.c:766:44: note: directive argument in the range [0,
> >2147483646]
> > 766 | snprintf(buf, sizeof(buf), "1@%d ", i);
> > |^~~
> >   xdp_hw_metadata.c:766:17: note: ‘snprintf’ output between 5 and 14 bytes 
> > into
> >a destination of size 8
> > 766 | snprintf(buf, sizeof(buf), "1@%d ", i);
> > | ^~
> 
> Thanks for pointing this out.
> Btw, do you know which build command will trigger these errors?
> I am using "make -C tools/testing/selftests/bpf" but cannot
> reproduce the build error.
> 
> Thanks & Regards
> Siang

Last time I used the following to reproduce similar issues on my side:

make -C tools/testing/selftests TARGETS="bpf" LLVM=1 
USERCFLAGS="-Wformat-truncation"

You can also try to use something like asprintf instead of managing the buffer
sizes manually.

Re: [Intel-wired-lan] [PATCH bpf-next v4 3/4] net: stmmac: Add launch time support to XDP ZC

2025-01-09 Thread Stanislav Fomichev

On 01/09, Song, Yoong Siang wrote:
> On Wednesday, January 8, 2025 1:08 AM, Stanislav Fomichev 
>  wrote:
> >On 01/06, Song Yoong Siang wrote:
> >> Enable launch time (Time-Based Scheduling) support to XDP zero copy via XDP
> >> Tx metadata framework.
> >>
> >> This patch is tested with tools/testing/selftests/bpf/xdp_hw_metadata on
> >> Intel Tiger Lake platform. Below are the test steps and result.
> >>
> >> Test Steps:
> >> 1. Add mqprio qdisc:
> >>$ sudo tc qdisc add dev enp0s30f4 handle 8001: parent root mqprio num_tc
> >>  4 map 0 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 queues 1@0 1@1 1@2 1@3 hw 0
> >>
> >> 2. Enable launch time hardware offload on hardware queue 1:
> >>$ sudo tc qdisc replace dev enp0s30f4 parent 8001:2 etf offload clockid
> >>  CLOCK_TAI delta 50
> >>
> >> 3. Add an ingress qdisc:
> >>$ sudo tc qdisc add dev enp0s30f4 ingress
> >>
> >> 4. Add a flower filter to route incoming packet with VLAN priority 1 into
> >>hardware queue 1:
> >>$ sudo tc filter add dev enp0s30f4 parent : protocol 802.1Q flower
> >>  vlan_prio 1 hw_tc 1
> >>
> >> 5. Enable VLAN tag stripping:
> >>$ sudo ethtool -K enp0s30f4 rxvlan on
> >>
> >> 6. Start xdp_hw_metadata selftest application:
> >>$ sudo ./xdp_hw_metadata enp0s30f4 -l 10
> >>
> >> 7. Send an UDP packet with VLAN priority 1 to port 9091 of DUT.
> >
> >Tangential: I wonder whether we can add the setup steps to the
> >xdp_hw_metadata tool? It is useful to have one command to run that
> >takes care of all the details. Same way it already enables HW
> >tstamping..
> >
> >Or, if not the full setup, some kind of detection we can signal to the
> >user that some things might be missing?
> 
> Sure. I can try to add the setup steps into xdp_hw_metadata
> by using ioctl() function. However, there are some device specific
> command, like the number of queue, their priority,
> how they route the incoming packet, etc. I will try to find out
> a common way that can work for most of the devices,
> at least work for both igc and stmmac.

We can query the number of queues (and everything else you need) in the
tool, similar to what we do in
testing/selftests/drivers/net/hw/ncdevmem.c, take a look. But if it's
too complicated, maybe at least print these commands on startup and tell
the user to run them.

The reason I'm asking is that I hope that we eventually can run this
tool from (automatic) testing/selftests/drivers/net/hw testsuite to
make sure the metadata stuff keeps working.

Re: [Intel-wired-lan] [PATCH bpf-next v4 1/4] xsk: Add launch time hardware offload support to XDP Tx metadata

2025-01-09 Thread Stanislav Fomichev

On 01/09, Song, Yoong Siang wrote:
> On Wednesday, January 8, 2025 12:50 AM, Stanislav Fomichev 
>  wrote:
> >On 01/06, Song Yoong Siang wrote:
> >> Extend the XDP Tx metadata framework so that user can requests launch time
> >> hardware offload, where the Ethernet device will schedule the packet for
> >> transmission at a pre-determined time called launch time. The value of
> >> launch time is communicated from user space to Ethernet driver via
> >> launch_time field of struct xsk_tx_metadata.
> >>
> >> Suggested-by: Stanislav Fomichev 
> 
> Hi Stanislav Fomichev,
> 
> Thanks for your review comments.
> I notice that you have two emails:
> s...@google.com & stfomic...@gmail.com
> 
> Which one I should use in the suggested-by tag?

google.com should be bouncing now. s...@fomichev.me is preferred.

[Intel-wired-lan] [PATCH net-next 0/3] udp_tunnel: remove rtnl_lock dependency

2025-05-20 Thread Stanislav Fomichev

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.

Cc: Michael Chan 

Stanislav Fomichev (3):
  net: ASSERT_RTNL remove netif_set_real_num_{rx,tx}_queues
  udp_tunnel: remove rtnl_lock dependency
  Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 42 ---
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/netdevsim.h |  1 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 --
 include/net/udp_tunnel.h  |  9 ++--
 net/core/dev.c|  2 -
 net/ipv4/udp_tunnel_nic.c | 33 +++
 net/ipv4/udp_tunnel_stub.c|  2 +
 17 files changed, 34 insertions(+), 89 deletions(-)

-- 
2.49.0

[Intel-wired-lan] [PATCH net-next 2/3] udp_tunnel: remove rtnl_lock dependency

2025-05-20 Thread Stanislav Fomichev

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cc: Michael Chan 
Suggested-by: Jakub Kicinski 
Signed-off-by: Stanislav Fomichev 
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  6 ++--
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/netdevsim.h |  1 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 ---
 include/net/udp_tunnel.h  |  9 +++--
 net/ipv4/udp_tunnel_nic.c | 33 ---
 net/ipv4/udp_tunnel_stub.c|  2 ++
 16 files changed, 27 insertions(+), 58 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index f522ca8ff66b..fa7e5adf9804 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10219,8 +10219,7 @@ static int bnx2x_udp_tunnel_sync(struct net_device 
*netdev, unsigned int table)
 
 static const struct udp_tunnel_nic_info bnx2x_udp_tunnels = {
.sync_table = bnx2x_udp_tunnel_sync,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d5495762c945..a3dadde65b8d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15573,8 +15573,7 @@ static int bnxt_udp_tunnel_unset_port(struct net_device 
*netdev, unsigned int ta
 static const struct udp_tunnel_nic_info bnxt_udp_tunnels = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
@@ -15582,8 +15581,7 @@ static const struct udp_tunnel_nic_info 
bnxt_udp_tunnels = {
 }, bnxt_udp_tunnels_p7 = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 3d2e21592119..f49400ba9729 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4031,8 +4031,7 @@ static int be_vxlan_unset_port(struct net_device *netdev, 
unsigned int table,
 static const struct udp_tunnel_nic_info be_udp_tunnels = {
.set_port   = be_vxlan_set_port,
.unset_port = be_vxlan_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN, },
},
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 120d68654e3f..3d3da9d15348 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15891,7 +15891,6 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
pf->udp_tunnel_nic.set_port = i40e_udp_tunnel_set_port;
pf->udp_tunnel_nic.unset_port = i40e_udp_tunnel_unset_port;
-   pf->udp_tunnel_nic.flags = UDP_TUNNEL

[Intel-wired-lan] [PATCH net-next 3/3] Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

2025-05-20 Thread Stanislav Fomichev

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 +--
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a3dadde65b8d..1da208c36572 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14055,28 +14055,13 @@ static void bnxt_unlock_sp(struct bnxt *bp)
netdev_unlock(bp->dev);
 }
 
-/* Same as bnxt_lock_sp() with additional rtnl_lock */
-static void bnxt_rtnl_lock_sp(struct bnxt *bp)
-{
-   clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   rtnl_lock();
-   netdev_lock(bp->dev);
-}
-
-static void bnxt_rtnl_unlock_sp(struct bnxt *bp)
-{
-   set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   netdev_unlock(bp->dev);
-   rtnl_unlock();
-}
-
 /* Only called from bnxt_sp_task() */
 static void bnxt_reset(struct bnxt *bp, bool silent)
 {
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (test_bit(BNXT_STATE_OPEN, &bp->state))
bnxt_reset_task(bp, silent);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 /* Only called from bnxt_sp_task() */
@@ -14084,9 +14069,9 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 {
int i;
 
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
return;
}
/* Disable and flush TPA before resetting the RX ring */
@@ -14125,7 +14110,7 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
}
if (bp->flags & BNXT_FLAG_TPA)
bnxt_set_tpa(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 static void bnxt_fw_fatal_close(struct bnxt *bp)
@@ -15017,17 +15002,15 @@ static void bnxt_fw_reset_task(struct work_struct 
*work)
bp->fw_reset_state = BNXT_FW_RESET_STATE_OPENING;
fallthrough;
case BNXT_FW_RESET_STATE_OPENING:
-   while (!rtnl_trylock()) {
+   while (!netdev_trylock(bp->dev)) {
bnxt_queue_fw_reset_work(bp, HZ / 10);
return;
}
-   netdev_lock(bp->dev);
rc = bnxt_open(bp->dev);
if (rc) {
netdev_err(bp->dev, "bnxt_open() failed during FW 
reset\n");
bnxt_fw_reset_abort(bp, rc);
netdev_unlock(bp->dev);
-   rtnl_unlock();
goto ulp_start;
}
 
@@ -15047,7 +15030,6 @@ static void bnxt_fw_reset_task(struct work_struct *work)
bnxt_dl_health_fw_status_update(bp, true);
}
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, 0);
bnxt_reenable_sriov(bp);
netdev_lock(bp->dev);
@@ -15996,7 +15978,7 @@ static int bnxt_queue_start(struct net_device *dev, 
void *qmem, int idx)
   rc);
napi_enable_locked(&bnapi->napi);
bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
-   netif_close(dev);
+   bnxt_reset_task(bp, true);
return rc;
 }
 
@@ -16812,7 +16794,6 @@ static int bnxt_resume(struct device *device)
struct bnxt *bp = netdev_priv(dev);
int rc = 0;
 
-   rtnl_lock();
netdev_lock(dev);
rc = pci_enable_device(bp->pdev);
if (rc) {
@@ -16857,7 +16838,6 @@ static int bnxt_resume(struct device *device)
 
 resume_exit:
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, rc);
if (!rc)
bnxt_reenable_sriov(bp);
@@ -17023,7 +17003,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
int err;
 
netdev_info(bp->dev, "PCI Slot Resume\n");
-   rtnl_lock();
netdev_lock(netdev);
 
err = bnxt_hwrm_func_qcaps(bp);
@@ -17041,7 +17020,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
netif_device_attach(netdev);
 
netdev_unlock(netdev);
-   rtnl_unlock();
bnxt_ulp_start(bp, err);
if (!err)
bnxt_reenable_sriov(bp);
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next 1/3] net: ASSERT_RTNL remove netif_set_real_num_{rx, tx}_queues

2025-05-20 Thread Stanislav Fomichev

Existing netdev_ops_assert_locked takes care of asserting either
netdev lock or RTNL.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index fccf2167b235..5ea718036921 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,7 +3179,6 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 
if (dev->reg_state == NETREG_REGISTERED ||
dev->reg_state == NETREG_UNREGISTERING) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
@@ -3229,7 +3228,6 @@ int netif_set_real_num_rx_queues(struct net_device *dev, 
unsigned int rxq)
return -EINVAL;
 
if (dev->reg_state == NETREG_REGISTERED) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
-- 
2.49.0

Re: [Intel-wired-lan] [PATCH net-next 2/3] udp_tunnel: remove rtnl_lock dependency

2025-05-21 Thread Stanislav Fomichev

On 05/21, Jakub Kicinski wrote:
> On Tue, 20 May 2025 13:36:13 -0700 Stanislav Fomichev wrote:
> > Drivers that are using ops lock and don't depend on RTNL lock
> > still need to manage it because udp_tunnel's RTNL dependency.
> > Introduce new udp_tunnel_nic_lock and use it instead of
> > rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
> > udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
> > grab udp_tunnel_nic_lock mutex and might sleep).
> 
> There is a netdevsim-based test for this that needs to be fixed up.

Oh, I did not see that one, let me try to find and run it.

> > diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
> > index 2df3b8344eb5..7f5537fdf2c9 100644
> > --- a/include/net/udp_tunnel.h
> > +++ b/include/net/udp_tunnel.h
> > @@ -221,19 +221,17 @@ static inline void udp_tunnel_encap_enable(struct 
> > sock *sk)
> >  #define UDP_TUNNEL_NIC_MAX_TABLES  4
> >  
> >  enum udp_tunnel_nic_info_flags {
> > -   /* Device callbacks may sleep */
> > -   UDP_TUNNEL_NIC_INFO_MAY_SLEEP   = BIT(0),
> 
> Could we use a different lock for sleeping and non-sleeping drivers?

We can probably do it if we reorder the locks (as you ask/suggest
below). Overall, I'm not sure I understand why we want to have two
paths here. If we can do everything via work queue, why have a separate
path for the non-sleepable callback? (more code -> more bugs)

> > @@ -554,11 +543,11 @@ static void __udp_tunnel_nic_reset_ntf(struct 
> > net_device *dev)
> > struct udp_tunnel_nic *utn;
> > unsigned int i, j;
> >  
> > -   ASSERT_RTNL();
> > +   mutex_lock(&udp_tunnel_nic_lock);
> >  
> > utn = dev->udp_tunnel_nic;
> 
> utn and info's lifetimes are tied to the lifetime of the device
> I think their existence can remain protected by the external locks

SG, will move the lock down a bit.

> > if (!utn)
> > -   return;
> > +   goto unlock;
> >  
> > utn->need_sync = false;
> > for (i = 0; i < utn->n_tables; i++)
> 
> > -   rtnl_lock();
> > +   mutex_lock(&udp_tunnel_nic_lock);
> > utn->work_pending = 0;
> > __udp_tunnel_nic_device_sync(utn->dev, utn);
> >  
> > -   if (utn->need_replay)
> > +   if (utn->need_replay) {
> > +   rtnl_lock();
> > udp_tunnel_nic_replay(utn->dev, utn);
> > -   rtnl_unlock();
> > +   rtnl_unlock();
> > +   }
> > +   mutex_unlock(&udp_tunnel_nic_lock);
> >  }
> 
> What's the lock ordering between the new lock and rtnl lock?

>From ops-locked, we'll get: ops->tunnel_lock (__udp_tunnel_nic_reset_ntf)
>From non-ops locked, we'll get: rtnl->tunnel_lock

I see your point, we need to do rtnl->tunnel_lock here as well.

> BTW the lock could live in utn, right? We can't use the instance
> lock because of sharing, but we could put the lock in utn?

I was thinking that there is some global state besides udp_tunnel_nic,
but I don't see any, will move the lock, thanks!

Re: [Intel-wired-lan] [PATCH net-next 1/3] net: ASSERT_RTNL remove netif_set_real_num_{rx, tx}_queues

2025-05-21 Thread Stanislav Fomichev

On 05/21, Loktionov, Aleksandr wrote:
> 
> 
> > -Original Message-
> > From: Intel-wired-lan  On Behalf
> > Of Stanislav Fomichev
> > Sent: Tuesday, May 20, 2025 10:36 PM
> > To: net...@vger.kernel.org
> > Cc: da...@davemloft.net; eduma...@google.com; k...@kernel.org;
> > pab...@redhat.com; skall...@marvell.com; mani...@marvell.com;
> > andrew+net...@lunn.ch; michael.c...@broadcom.com;
> > pavan.che...@broadcom.com; ajit.khapa...@broadcom.com;
> > sriharsha.basavapa...@broadcom.com; somnath.ko...@broadcom.com;
> > Nguyen, Anthony L ; Kitszel, Przemyslaw
> > ; tar...@nvidia.com; sae...@nvidia.com;
> > louis.pe...@corigine.com; shsha...@marvell.com; GR-Linux-NIC-
> > d...@marvell.com; ecree.xil...@gmail.com; ho...@kernel.org;
> > dsah...@kernel.org; ruanjin...@huawei.com; mh...@redhat.com;
> > stfomic...@gmail.com; linux-ker...@vger.kernel.org; intel-wired-
> > l...@lists.osuosl.org; linux-r...@vger.kernel.org; oss-
> > driv...@corigine.com; linux-net-driv...@amd.com; l...@kernel.org
> > Subject: [Intel-wired-lan] [PATCH net-next 1/3] net: ASSERT_RTNL
> > remove netif_set_real_num_{rx, tx}_queues
> > 
> Can you consider more explicit title like:
> net: remove redundant ASSERT_RTNL() in queue setup functions
> ?
> 
> > Existing netdev_ops_assert_locked takes care of asserting either
> > netdev lock or RTNL.
> > 
> I'd recommend rephrasing like:
> The existing netdev_ops_assert_locked() already asserts that either
> the RTNL lock or the per-device lock is held, making the explicit
> ASSERT_RTNL() redundant.

Sure, will do, thanks!

[Intel-wired-lan] [PATCH net-next v5 3/6] udp_tunnel: remove rtnl_lock dependency

2025-06-16 Thread Stanislav Fomichev

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan 
Suggested-by: Jakub Kicinski 
Signed-off-by: Stanislav Fomichev 
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  6 +-
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/udp_tunnels.c   |  4 -
 include/net/udp_tunnel.h  | 87 ++-
 net/core/dev.c|  2 +
 net/ipv4/udp_tunnel_core.c| 16 ++--
 net/ipv4/udp_tunnel_nic.c | 78 +
 16 files changed, 142 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c9a1a1d504c0..3ee4b848ef53 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10219,8 +10219,7 @@ static int bnx2x_udp_tunnel_sync(struct net_device 
*netdev, unsigned int table)
 
 static const struct udp_tunnel_nic_info bnx2x_udp_tunnels = {
.sync_table = bnx2x_udp_tunnel_sync,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 869580b6f70d..7946586802af 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15573,8 +15573,7 @@ static int bnxt_udp_tunnel_unset_port(struct net_device 
*netdev, unsigned int ta
 static const struct udp_tunnel_nic_info bnxt_udp_tunnels = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
@@ -15582,8 +15581,7 @@ static const struct udp_tunnel_nic_info 
bnxt_udp_tunnels = {
 }, bnxt_udp_tunnels_p7 = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 3d2e21592119..f49400ba9729 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4031,8 +4031,7 @@ static int be_vxlan_unset_port(struct net_device *netdev, 
unsigned int table,
 static const struct udp_tunnel_nic_info be_udp_tunnels = {
.set_port   = be_vxlan_set_port,
.unset_port = be_vxlan_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+

[Intel-wired-lan] [PATCH net-next v5 4/6] net: remove redundant ASSERT_RTNL() in queue setup functions

2025-06-16 Thread Stanislav Fomichev

The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 43f56b44f351..df1678b1fe24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,7 +3179,6 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 
if (dev->reg_state == NETREG_REGISTERED ||
dev->reg_state == NETREG_UNREGISTERING) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
@@ -3229,7 +3228,6 @@ int netif_set_real_num_rx_queues(struct net_device *dev, 
unsigned int rxq)
return -EINVAL;
 
if (dev->reg_state == NETREG_REGISTERED) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v5 0/6] udp_tunnel: remove rtnl_lock dependency

2025-06-16 Thread Stanislav Fomichev

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.

v5:
- remove unused variable (l...@intel.com)

v4:
- grab lock in more places, specifically netlink and notifiers (Jakub)
- convert geneve and vxlan notifiers to (sleepable) rtnl lock

v3:
- fix 2 test failures (Jakub NIPA)

v2:
- move the lock into udp_tunnel_nic (Jakub)
- reorder the lock ordering (Jakub)
- move udp_ports_sleep removal into separate patch and update the test
(Jakub)

Cc: Michael Chan 

Stanislav Fomichev (6):
  geneve: rely on rtnl lock in geneve_offload_rx_ports
  vxlan: drop sock_lock
  udp_tunnel: remove rtnl_lock dependency
  net: remove redundant ASSERT_RTNL() in queue setup functions
  netdevsim: remove udp_ports_sleep
  Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 42 ++---
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/geneve.c  |  7 +-
 drivers/net/netdevsim/netdevsim.h |  2 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 ---
 drivers/net/vxlan/vxlan_core.c| 35 +++-
 drivers/net/vxlan/vxlan_private.h |  2 +-
 drivers/net/vxlan/vxlan_vnifilter.c   | 18 ++--
 include/net/udp_tunnel.h  | 87 ++-
 net/core/dev.c|  4 +-
 net/ipv4/udp_tunnel_core.c| 16 ++--
 net/ipv4/udp_tunnel_nic.c | 78 +
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +
 22 files changed, 176 insertions(+), 172 deletions(-)

-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v5 1/6] geneve: rely on rtnl lock in geneve_offload_rx_ports

2025-06-16 Thread Stanislav Fomichev

udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).

Signed-off-by: Stanislav Fomichev 
---
 drivers/net/geneve.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index ffc15a432689..9efedc6758bf 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -41,6 +41,7 @@ MODULE_PARM_DESC(log_ecn_error, "Log packets received with 
corrupted ECN");
 /* per-network namespace private data for this module */
 struct geneve_net {
struct list_headgeneve_list;
+   /* sock_list is protected by rtnl lock */
struct list_headsock_list;
 };
 
@@ -1179,8 +1180,9 @@ static void geneve_offload_rx_ports(struct net_device 
*dev, bool push)
struct geneve_net *gn = net_generic(net, geneve_net_id);
struct geneve_sock *gs;
 
-   rcu_read_lock();
-   list_for_each_entry_rcu(gs, &gn->sock_list, list) {
+   ASSERT_RTNL();
+
+   list_for_each_entry(gs, &gn->sock_list, list) {
if (push) {
udp_tunnel_push_rx_port(dev, gs->sock,
UDP_TUNNEL_TYPE_GENEVE);
@@ -1189,7 +1191,6 @@ static void geneve_offload_rx_ports(struct net_device 
*dev, bool push)
UDP_TUNNEL_TYPE_GENEVE);
}
}
-   rcu_read_unlock();
 }
 
 /* Initialize the device structure. */
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v5 6/6] Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

2025-06-16 Thread Stanislav Fomichev

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 +--
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 7946586802af..b359ef4b78a9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14055,28 +14055,13 @@ static void bnxt_unlock_sp(struct bnxt *bp)
netdev_unlock(bp->dev);
 }
 
-/* Same as bnxt_lock_sp() with additional rtnl_lock */
-static void bnxt_rtnl_lock_sp(struct bnxt *bp)
-{
-   clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   rtnl_lock();
-   netdev_lock(bp->dev);
-}
-
-static void bnxt_rtnl_unlock_sp(struct bnxt *bp)
-{
-   set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   netdev_unlock(bp->dev);
-   rtnl_unlock();
-}
-
 /* Only called from bnxt_sp_task() */
 static void bnxt_reset(struct bnxt *bp, bool silent)
 {
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (test_bit(BNXT_STATE_OPEN, &bp->state))
bnxt_reset_task(bp, silent);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 /* Only called from bnxt_sp_task() */
@@ -14084,9 +14069,9 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 {
int i;
 
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
return;
}
/* Disable and flush TPA before resetting the RX ring */
@@ -14125,7 +14110,7 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
}
if (bp->flags & BNXT_FLAG_TPA)
bnxt_set_tpa(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 static void bnxt_fw_fatal_close(struct bnxt *bp)
@@ -15017,17 +15002,15 @@ static void bnxt_fw_reset_task(struct work_struct 
*work)
bp->fw_reset_state = BNXT_FW_RESET_STATE_OPENING;
fallthrough;
case BNXT_FW_RESET_STATE_OPENING:
-   while (!rtnl_trylock()) {
+   while (!netdev_trylock(bp->dev)) {
bnxt_queue_fw_reset_work(bp, HZ / 10);
return;
}
-   netdev_lock(bp->dev);
rc = bnxt_open(bp->dev);
if (rc) {
netdev_err(bp->dev, "bnxt_open() failed during FW 
reset\n");
bnxt_fw_reset_abort(bp, rc);
netdev_unlock(bp->dev);
-   rtnl_unlock();
goto ulp_start;
}
 
@@ -15047,7 +15030,6 @@ static void bnxt_fw_reset_task(struct work_struct *work)
bnxt_dl_health_fw_status_update(bp, true);
}
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, 0);
bnxt_reenable_sriov(bp);
netdev_lock(bp->dev);
@@ -15996,7 +15978,7 @@ static int bnxt_queue_start(struct net_device *dev, 
void *qmem, int idx)
   rc);
napi_enable_locked(&bnapi->napi);
bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
-   netif_close(dev);
+   bnxt_reset_task(bp, true);
return rc;
 }
 
@@ -16812,7 +16794,6 @@ static int bnxt_resume(struct device *device)
struct bnxt *bp = netdev_priv(dev);
int rc = 0;
 
-   rtnl_lock();
netdev_lock(dev);
rc = pci_enable_device(bp->pdev);
if (rc) {
@@ -16857,7 +16838,6 @@ static int bnxt_resume(struct device *device)
 
 resume_exit:
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, rc);
if (!rc)
bnxt_reenable_sriov(bp);
@@ -17023,7 +17003,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
int err;
 
netdev_info(bp->dev, "PCI Slot Resume\n");
-   rtnl_lock();
netdev_lock(netdev);
 
err = bnxt_hwrm_func_qcaps(bp);
@@ -17041,7 +17020,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
netif_device_attach(netdev);
 
netdev_unlock(netdev);
-   rtnl_unlock();
bnxt_ulp_start(bp, err);
if (!err)
bnxt_reenable_sriov(bp);
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v5 2/6] vxlan: drop sock_lock

2025-06-16 Thread Stanislav Fomichev

We won't be able to sleep soon in vxlan_offload_rx_ports and won't be
able to grab sock_lock. Instead of having separate spinlock to
manage sockets, rely on rtnl lock. This is similar to how geneve
manages its sockets.

Signed-off-by: Stanislav Fomichev 
---
 drivers/net/vxlan/vxlan_core.c  | 35 -
 drivers/net/vxlan/vxlan_private.h   |  2 +-
 drivers/net/vxlan/vxlan_vnifilter.c | 18 ++-
 3 files changed, 22 insertions(+), 33 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 97792de896b7..9d7249caf137 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1485,21 +1485,18 @@ static enum skb_drop_reason vxlan_snoop(struct 
net_device *dev,
 
 static bool __vxlan_sock_release_prep(struct vxlan_sock *vs)
 {
-   struct vxlan_net *vn;
+   ASSERT_RTNL();
 
if (!vs)
return false;
if (!refcount_dec_and_test(&vs->refcnt))
return false;
 
-   vn = net_generic(sock_net(vs->sock->sk), vxlan_net_id);
-   spin_lock(&vn->sock_lock);
hlist_del_rcu(&vs->hlist);
udp_tunnel_notify_del_rx_port(vs->sock,
  (vs->flags & VXLAN_F_GPE) ?
  UDP_TUNNEL_TYPE_VXLAN_GPE :
  UDP_TUNNEL_TYPE_VXLAN);
-   spin_unlock(&vn->sock_lock);
 
return true;
 }
@@ -2847,26 +2844,23 @@ static void vxlan_cleanup(struct timer_list *t)
 
 static void vxlan_vs_del_dev(struct vxlan_dev *vxlan)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
+   ASSERT_RTNL();
 
-   spin_lock(&vn->sock_lock);
hlist_del_init_rcu(&vxlan->hlist4.hlist);
 #if IS_ENABLED(CONFIG_IPV6)
hlist_del_init_rcu(&vxlan->hlist6.hlist);
 #endif
-   spin_unlock(&vn->sock_lock);
 }
 
 static void vxlan_vs_add_dev(struct vxlan_sock *vs, struct vxlan_dev *vxlan,
 struct vxlan_dev_node *node)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
__be32 vni = vxlan->default_dst.remote_vni;
 
+   ASSERT_RTNL();
+
node->vxlan = vxlan;
-   spin_lock(&vn->sock_lock);
hlist_add_head_rcu(&node->hlist, vni_head(vs, vni));
-   spin_unlock(&vn->sock_lock);
 }
 
 /* Setup stats when device is created */
@@ -3291,9 +3285,10 @@ static void vxlan_offload_rx_ports(struct net_device 
*dev, bool push)
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
unsigned int i;
 
-   spin_lock(&vn->sock_lock);
+   ASSERT_RTNL();
+
for (i = 0; i < PORT_HASH_SIZE; ++i) {
-   hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
+   hlist_for_each_entry(vs, &vn->sock_list[i], hlist) {
unsigned short type;
 
if (vs->flags & VXLAN_F_GPE)
@@ -3307,7 +3302,6 @@ static void vxlan_offload_rx_ports(struct net_device 
*dev, bool push)
udp_tunnel_drop_rx_port(dev, vs->sock, type);
}
}
-   spin_unlock(&vn->sock_lock);
 }
 
 /* Initialize the device structure. */
@@ -3537,12 +3531,13 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
  __be16 port, u32 flags,
  int ifindex)
 {
-   struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
struct socket *sock;
unsigned int h;
struct udp_tunnel_sock_cfg tunnel_cfg;
 
+   ASSERT_RTNL();
+
vs = kzalloc(sizeof(*vs), GFP_KERNEL);
if (!vs)
return ERR_PTR(-ENOMEM);
@@ -3560,13 +3555,11 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
refcount_set(&vs->refcnt, 1);
vs->flags = (flags & VXLAN_F_RCV_FLAGS);
 
-   spin_lock(&vn->sock_lock);
hlist_add_head_rcu(&vs->hlist, vs_head(net, port));
udp_tunnel_notify_add_rx_port(sock,
  (vs->flags & VXLAN_F_GPE) ?
  UDP_TUNNEL_TYPE_VXLAN_GPE :
  UDP_TUNNEL_TYPE_VXLAN);
-   spin_unlock(&vn->sock_lock);
 
/* Mark socket as an encapsulation socket. */
memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
@@ -3590,26 +3583,27 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
 
 static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
bool metadata = vxlan->cfg.flags & VXLAN_F_COLLECT_METADATA;
struct vxlan_sock

[Intel-wired-lan] [PATCH net-next v5 5/6] netdevsim: remove udp_ports_sleep

2025-06-16 Thread Stanislav Fomichev

Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/netdevsim/netdevsim.h |  2 --
 drivers/net/netdevsim/udp_tunnels.c   |  8 ---
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +--
 3 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/net/netdevsim/netdevsim.h 
b/drivers/net/netdevsim/netdevsim.h
index d04401f0bdf7..511ed72a93ce 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -131,7 +131,6 @@ struct netdevsim {
struct nsim_macsec macsec;
struct {
u32 inject_error;
-   u32 sleep;
u32 __ports[2][NSIM_UDP_TUNNEL_N_PORTS];
u32 (*ports)[NSIM_UDP_TUNNEL_N_PORTS];
struct dentry *ddir;
@@ -342,7 +341,6 @@ struct nsim_dev {
bool ipv4_only;
bool shared;
bool static_iana_vxlan;
-   u32 sleep;
} udp_ports;
struct nsim_dev_psample *psample;
u16 esw_mode;
diff --git a/drivers/net/netdevsim/udp_tunnels.c 
b/drivers/net/netdevsim/udp_tunnels.c
index 10cbbf1c584b..89fff76e51cf 100644
--- a/drivers/net/netdevsim/udp_tunnels.c
+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -18,9 +18,6 @@ nsim_udp_tunnel_set_port(struct net_device *dev, unsigned int 
table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
-
if (!ret) {
if (ns->udp_ports.ports[table][entry]) {
WARN(1, "entry already in use\n");
@@ -47,8 +44,6 @@ nsim_udp_tunnel_unset_port(struct net_device *dev, unsigned 
int table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
if (!ret) {
u32 val = be16_to_cpu(ti->port) << 16 | ti->type;
 
@@ -170,7 +165,6 @@ int nsim_udp_tunnels_info_create(struct nsim_dev *nsim_dev,
   GFP_KERNEL);
if (!info)
return -ENOMEM;
-   ns->udp_ports.sleep = nsim_dev->udp_ports.sleep;
 
if (nsim_dev->udp_ports.sync_all) {
info->set_port = NULL;
@@ -213,6 +207,4 @@ void nsim_udp_tunnels_debugfs_create(struct nsim_dev 
*nsim_dev)
&nsim_dev->udp_ports.shared);
debugfs_create_bool("udp_ports_static_iana_vxlan", 0600, nsim_dev->ddir,
&nsim_dev->udp_ports.static_iana_vxlan);
-   debugfs_create_u32("udp_ports_sleep", 0600, nsim_dev->ddir,
-  &nsim_dev->udp_ports.sleep);
 }
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh 
b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 92c2f0376c08..4c859ecdad94 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -266,7 +266,6 @@ for port in 0 1; do
echo $NSIM_ID > /sys/bus/netdevsim/new_device
 else
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
echo 1 > $NSIM_DEV_SYS/new_port
 fi
 NSIM_NETDEV=`get_netdev_name old_netdevs`
@@ -350,23 +349,11 @@ old_netdevs=$(ls /sys/class/net)
 port=0
 echo $NSIM_ID > /sys/bus/netdevsim/new_device
 echo 0 > $NSIM_DEV_SYS/del_port
-echo 1000 > $NSIM_DEV_DFS/udp_ports_sleep
 echo 0 > $NSIM_DEV_SYS/new_port
 NSIM_NETDEV=`get_netdev_name old_netdevs`
 
 msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
-new_vxlan vxlan0 1 $NSIM_NETDEV
-
-modprobe -r vxlan
-modprobe -r udp_tunnel
-
-msg="remove tunnels"
-exp0=( 0 0 0 0 )
-check_tables
-
-msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
+exp0=( `mke 1 1` 0 0 0 )
 new_vxlan vxlan0 1 $NSIM_NETDEV
 
 exp0=( 0 0 0 0 )
@@ -428,7 +415,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -486,7 +472,6 @@ echo 1 > $NSIM_DEV_DFS/udp_ports_sync_all
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -543,7 +528,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_on

Re: [Intel-wired-lan] [PATCH net-next v2 0/4] udp_tunnel: remove rtnl_lock dependency

2025-06-09 Thread Stanislav Fomichev

On 06/09, Jakub Kicinski wrote:
> On Mon,  9 Jun 2025 09:25:37 -0700 Stanislav Fomichev wrote:
> > Recently bnxt had to grow back a bunch of rtnl dependencies because
> > of udp_tunnel's infra. Add separate (global) mutext to protect
> > udp_tunnel state.
> 
> Appears to break the selftest, unfortunately:
> https://netdev.bots.linux.dev/contest.html?test=udp-tunnel-nic-sh&branch=net-next-2025-06-09--21-00

Argh, should have run it locally first :-(
Looks like there is a test that sets up pretty high sleep time (1 sec)
and expects entry to not appear during next 'ethtool --show-tunnels' run.

Gonna double check and remove the case if my understanding is correct.
Don't think there is much value in keeping the debugfs knob just for the
sake of this test? LMK if you disagree; otherwise gonna repost tomorrow.

[Intel-wired-lan] [PATCH net-next v2 0/4] udp_tunnel: remove rtnl_lock dependency

2025-06-09 Thread Stanislav Fomichev

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.

v2:
- move the lock into udp_tunnel_nic (Jakub)
- reorder the lock ordering (Jakub)
- move udp_ports_sleep removal into separate patch and update the test
  (Jakub)

Cc: Michael Chan 

Stanislav Fomichev (4):
  udp_tunnel: remove rtnl_lock dependency
  net: remove redundant ASSERT_RTNL() in queue setup functions
  netdevsim: remove udp_ports_sleep
  Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 42 ---
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/netdevsim.h |  2 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 --
 include/net/udp_tunnel.h  |  8 ++--
 net/core/dev.c|  2 -
 net/ipv4/udp_tunnel_nic.c | 30 +++--
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 10 -
 17 files changed, 31 insertions(+), 97 deletions(-)

-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v2 1/4] udp_tunnel: remove rtnl_lock dependency

2025-06-09 Thread Stanislav Fomichev

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cc: Michael Chan 
Suggested-by: Jakub Kicinski 
Signed-off-by: Stanislav Fomichev 
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  6 ++--
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/udp_tunnels.c   |  4 ---
 include/net/udp_tunnel.h  |  8 ++---
 net/ipv4/udp_tunnel_nic.c | 30 +--
 14 files changed, 24 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index f522ca8ff66b..fa7e5adf9804 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10219,8 +10219,7 @@ static int bnx2x_udp_tunnel_sync(struct net_device 
*netdev, unsigned int table)
 
 static const struct udp_tunnel_nic_info bnx2x_udp_tunnels = {
.sync_table = bnx2x_udp_tunnel_sync,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d5495762c945..a3dadde65b8d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15573,8 +15573,7 @@ static int bnxt_udp_tunnel_unset_port(struct net_device 
*netdev, unsigned int ta
 static const struct udp_tunnel_nic_info bnxt_udp_tunnels = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
@@ -15582,8 +15581,7 @@ static const struct udp_tunnel_nic_info 
bnxt_udp_tunnels = {
 }, bnxt_udp_tunnels_p7 = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 3d2e21592119..f49400ba9729 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4031,8 +4031,7 @@ static int be_vxlan_unset_port(struct net_device *netdev, 
unsigned int table,
 static const struct udp_tunnel_nic_info be_udp_tunnels = {
.set_port   = be_vxlan_set_port,
.unset_port = be_vxlan_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN, },
},
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 120d68654e3f..3d3da9d15348 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15891,7 +15891,6 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
pf->udp_tunnel_nic.set_port = i40e_udp_tunnel_set_port;
pf->udp_tunnel_nic.unset_port = i40e_udp_tunnel_unset_port;
-   pf->udp_tunnel_nic.flags = UDP_TUNNEL_NIC_INFO_MAY_SLEEP;
pf->udp_tunnel_nic.shared = &pf->udp_tunnel_shared;
pf->ud

[Intel-wired-lan] [PATCH net-next v2 2/4] net: remove redundant ASSERT_RTNL() in queue setup functions

2025-06-09 Thread Stanislav Fomichev

The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be97c440ecd5..72997636b8ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,7 +3179,6 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 
if (dev->reg_state == NETREG_REGISTERED ||
dev->reg_state == NETREG_UNREGISTERING) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
@@ -3229,7 +3228,6 @@ int netif_set_real_num_rx_queues(struct net_device *dev, 
unsigned int rxq)
return -EINVAL;
 
if (dev->reg_state == NETREG_REGISTERED) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v2 3/4] netdevsim: remove udp_ports_sleep

2025-06-09 Thread Stanislav Fomichev

Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/netdevsim/netdevsim.h  |  2 --
 drivers/net/netdevsim/udp_tunnels.c|  8 
 .../selftests/drivers/net/netdevsim/udp_tunnel_nic.sh  | 10 --
 3 files changed, 20 deletions(-)

diff --git a/drivers/net/netdevsim/netdevsim.h 
b/drivers/net/netdevsim/netdevsim.h
index d04401f0bdf7..511ed72a93ce 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -131,7 +131,6 @@ struct netdevsim {
struct nsim_macsec macsec;
struct {
u32 inject_error;
-   u32 sleep;
u32 __ports[2][NSIM_UDP_TUNNEL_N_PORTS];
u32 (*ports)[NSIM_UDP_TUNNEL_N_PORTS];
struct dentry *ddir;
@@ -342,7 +341,6 @@ struct nsim_dev {
bool ipv4_only;
bool shared;
bool static_iana_vxlan;
-   u32 sleep;
} udp_ports;
struct nsim_dev_psample *psample;
u16 esw_mode;
diff --git a/drivers/net/netdevsim/udp_tunnels.c 
b/drivers/net/netdevsim/udp_tunnels.c
index 10cbbf1c584b..89fff76e51cf 100644
--- a/drivers/net/netdevsim/udp_tunnels.c
+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -18,9 +18,6 @@ nsim_udp_tunnel_set_port(struct net_device *dev, unsigned int 
table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
-
if (!ret) {
if (ns->udp_ports.ports[table][entry]) {
WARN(1, "entry already in use\n");
@@ -47,8 +44,6 @@ nsim_udp_tunnel_unset_port(struct net_device *dev, unsigned 
int table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
if (!ret) {
u32 val = be16_to_cpu(ti->port) << 16 | ti->type;
 
@@ -170,7 +165,6 @@ int nsim_udp_tunnels_info_create(struct nsim_dev *nsim_dev,
   GFP_KERNEL);
if (!info)
return -ENOMEM;
-   ns->udp_ports.sleep = nsim_dev->udp_ports.sleep;
 
if (nsim_dev->udp_ports.sync_all) {
info->set_port = NULL;
@@ -213,6 +207,4 @@ void nsim_udp_tunnels_debugfs_create(struct nsim_dev 
*nsim_dev)
&nsim_dev->udp_ports.shared);
debugfs_create_bool("udp_ports_static_iana_vxlan", 0600, nsim_dev->ddir,
&nsim_dev->udp_ports.static_iana_vxlan);
-   debugfs_create_u32("udp_ports_sleep", 0600, nsim_dev->ddir,
-  &nsim_dev->udp_ports.sleep);
 }
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh 
b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 92c2f0376c08..8c5fe7bdf1ce 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -266,7 +266,6 @@ for port in 0 1; do
echo $NSIM_ID > /sys/bus/netdevsim/new_device
 else
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
echo 1 > $NSIM_DEV_SYS/new_port
 fi
 NSIM_NETDEV=`get_netdev_name old_netdevs`
@@ -350,7 +349,6 @@ old_netdevs=$(ls /sys/class/net)
 port=0
 echo $NSIM_ID > /sys/bus/netdevsim/new_device
 echo 0 > $NSIM_DEV_SYS/del_port
-echo 1000 > $NSIM_DEV_DFS/udp_ports_sleep
 echo 0 > $NSIM_DEV_SYS/new_port
 NSIM_NETDEV=`get_netdev_name old_netdevs`
 
@@ -428,7 +426,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -486,7 +483,6 @@ echo 1 > $NSIM_DEV_DFS/udp_ports_sync_all
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -543,7 +539,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -573,7 +568,6 @@ echo 1 > $NSIM_DEV_DFS/udp_ports_ipv4_only
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -634,7 +628,6 @@ echo 0 > $NSIM_DEV_SYS/

[Intel-wired-lan] [PATCH net-next v2 4/4] Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

2025-06-09 Thread Stanislav Fomichev

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 +--
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a3dadde65b8d..1da208c36572 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14055,28 +14055,13 @@ static void bnxt_unlock_sp(struct bnxt *bp)
netdev_unlock(bp->dev);
 }
 
-/* Same as bnxt_lock_sp() with additional rtnl_lock */
-static void bnxt_rtnl_lock_sp(struct bnxt *bp)
-{
-   clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   rtnl_lock();
-   netdev_lock(bp->dev);
-}
-
-static void bnxt_rtnl_unlock_sp(struct bnxt *bp)
-{
-   set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   netdev_unlock(bp->dev);
-   rtnl_unlock();
-}
-
 /* Only called from bnxt_sp_task() */
 static void bnxt_reset(struct bnxt *bp, bool silent)
 {
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (test_bit(BNXT_STATE_OPEN, &bp->state))
bnxt_reset_task(bp, silent);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 /* Only called from bnxt_sp_task() */
@@ -14084,9 +14069,9 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 {
int i;
 
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
return;
}
/* Disable and flush TPA before resetting the RX ring */
@@ -14125,7 +14110,7 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
}
if (bp->flags & BNXT_FLAG_TPA)
bnxt_set_tpa(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 static void bnxt_fw_fatal_close(struct bnxt *bp)
@@ -15017,17 +15002,15 @@ static void bnxt_fw_reset_task(struct work_struct 
*work)
bp->fw_reset_state = BNXT_FW_RESET_STATE_OPENING;
fallthrough;
case BNXT_FW_RESET_STATE_OPENING:
-   while (!rtnl_trylock()) {
+   while (!netdev_trylock(bp->dev)) {
bnxt_queue_fw_reset_work(bp, HZ / 10);
return;
}
-   netdev_lock(bp->dev);
rc = bnxt_open(bp->dev);
if (rc) {
netdev_err(bp->dev, "bnxt_open() failed during FW 
reset\n");
bnxt_fw_reset_abort(bp, rc);
netdev_unlock(bp->dev);
-   rtnl_unlock();
goto ulp_start;
}
 
@@ -15047,7 +15030,6 @@ static void bnxt_fw_reset_task(struct work_struct *work)
bnxt_dl_health_fw_status_update(bp, true);
}
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, 0);
bnxt_reenable_sriov(bp);
netdev_lock(bp->dev);
@@ -15996,7 +15978,7 @@ static int bnxt_queue_start(struct net_device *dev, 
void *qmem, int idx)
   rc);
napi_enable_locked(&bnapi->napi);
bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
-   netif_close(dev);
+   bnxt_reset_task(bp, true);
return rc;
 }
 
@@ -16812,7 +16794,6 @@ static int bnxt_resume(struct device *device)
struct bnxt *bp = netdev_priv(dev);
int rc = 0;
 
-   rtnl_lock();
netdev_lock(dev);
rc = pci_enable_device(bp->pdev);
if (rc) {
@@ -16857,7 +16838,6 @@ static int bnxt_resume(struct device *device)
 
 resume_exit:
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, rc);
if (!rc)
bnxt_reenable_sriov(bp);
@@ -17023,7 +17003,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
int err;
 
netdev_info(bp->dev, "PCI Slot Resume\n");
-   rtnl_lock();
netdev_lock(netdev);
 
err = bnxt_hwrm_func_qcaps(bp);
@@ -17041,7 +17020,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
netif_device_attach(netdev);
 
netdev_unlock(netdev);
-   rtnl_unlock();
bnxt_ulp_start(bp, err);
if (!err)
bnxt_reenable_sriov(bp);
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v4 2/6] vxlan: drop sock_lock

2025-06-13 Thread Stanislav Fomichev

We won't be able to sleep soon in vxlan_offload_rx_ports and won't be
able to grab sock_lock. Instead of having separate spinlock to
manage sockets, rely on rtnl lock. This is similar to how geneve
manages its sockets.

Signed-off-by: Stanislav Fomichev 
---
 drivers/net/vxlan/vxlan_core.c  | 34 +
 drivers/net/vxlan/vxlan_private.h   |  2 +-
 drivers/net/vxlan/vxlan_vnifilter.c | 18 ++-
 3 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/drivers/net/vxlan/vxlan_core.c b/drivers/net/vxlan/vxlan_core.c
index 97792de896b7..01362e98325c 100644
--- a/drivers/net/vxlan/vxlan_core.c
+++ b/drivers/net/vxlan/vxlan_core.c
@@ -1487,19 +1487,19 @@ static bool __vxlan_sock_release_prep(struct vxlan_sock 
*vs)
 {
struct vxlan_net *vn;
 
+   ASSERT_RTNL();
+
if (!vs)
return false;
if (!refcount_dec_and_test(&vs->refcnt))
return false;
 
vn = net_generic(sock_net(vs->sock->sk), vxlan_net_id);
-   spin_lock(&vn->sock_lock);
hlist_del_rcu(&vs->hlist);
udp_tunnel_notify_del_rx_port(vs->sock,
  (vs->flags & VXLAN_F_GPE) ?
  UDP_TUNNEL_TYPE_VXLAN_GPE :
  UDP_TUNNEL_TYPE_VXLAN);
-   spin_unlock(&vn->sock_lock);
 
return true;
 }
@@ -2847,26 +2847,23 @@ static void vxlan_cleanup(struct timer_list *t)
 
 static void vxlan_vs_del_dev(struct vxlan_dev *vxlan)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
+   ASSERT_RTNL();
 
-   spin_lock(&vn->sock_lock);
hlist_del_init_rcu(&vxlan->hlist4.hlist);
 #if IS_ENABLED(CONFIG_IPV6)
hlist_del_init_rcu(&vxlan->hlist6.hlist);
 #endif
-   spin_unlock(&vn->sock_lock);
 }
 
 static void vxlan_vs_add_dev(struct vxlan_sock *vs, struct vxlan_dev *vxlan,
 struct vxlan_dev_node *node)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
__be32 vni = vxlan->default_dst.remote_vni;
 
+   ASSERT_RTNL();
+
node->vxlan = vxlan;
-   spin_lock(&vn->sock_lock);
hlist_add_head_rcu(&node->hlist, vni_head(vs, vni));
-   spin_unlock(&vn->sock_lock);
 }
 
 /* Setup stats when device is created */
@@ -3291,9 +3288,10 @@ static void vxlan_offload_rx_ports(struct net_device 
*dev, bool push)
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
unsigned int i;
 
-   spin_lock(&vn->sock_lock);
+   ASSERT_RTNL();
+
for (i = 0; i < PORT_HASH_SIZE; ++i) {
-   hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
+   hlist_for_each_entry(vs, &vn->sock_list[i], hlist) {
unsigned short type;
 
if (vs->flags & VXLAN_F_GPE)
@@ -3307,7 +3305,6 @@ static void vxlan_offload_rx_ports(struct net_device 
*dev, bool push)
udp_tunnel_drop_rx_port(dev, vs->sock, type);
}
}
-   spin_unlock(&vn->sock_lock);
 }
 
 /* Initialize the device structure. */
@@ -3537,12 +3534,13 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
  __be16 port, u32 flags,
  int ifindex)
 {
-   struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
struct socket *sock;
unsigned int h;
struct udp_tunnel_sock_cfg tunnel_cfg;
 
+   ASSERT_RTNL();
+
vs = kzalloc(sizeof(*vs), GFP_KERNEL);
if (!vs)
return ERR_PTR(-ENOMEM);
@@ -3560,13 +3558,11 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
refcount_set(&vs->refcnt, 1);
vs->flags = (flags & VXLAN_F_RCV_FLAGS);
 
-   spin_lock(&vn->sock_lock);
hlist_add_head_rcu(&vs->hlist, vs_head(net, port));
udp_tunnel_notify_add_rx_port(sock,
  (vs->flags & VXLAN_F_GPE) ?
  UDP_TUNNEL_TYPE_VXLAN_GPE :
  UDP_TUNNEL_TYPE_VXLAN);
-   spin_unlock(&vn->sock_lock);
 
/* Mark socket as an encapsulation socket. */
memset(&tunnel_cfg, 0, sizeof(tunnel_cfg));
@@ -3590,26 +3586,27 @@ static struct vxlan_sock *vxlan_socket_create(struct 
net *net, bool ipv6,
 
 static int __vxlan_sock_add(struct vxlan_dev *vxlan, bool ipv6)
 {
-   struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
bool metadata = vxlan->cfg.flags & VXLAN_F_COLLECT_METADATA;
struct vxlan_sock *vs = NULL;
struct vxlan_dev_node *node;
int l3mdev_index = 0;
 
+

[Intel-wired-lan] [PATCH net-next v4 1/6] geneve: rely on rtnl lock in geneve_offload_rx_ports

2025-06-13 Thread Stanislav Fomichev

udp_tunnel_push_rx_port will grab mutex in the next patch so
we can't use rcu. geneve_offload_rx_ports is called
from geneve_netdevice_event for NETDEV_UDP_TUNNEL_PUSH_INFO and
NETDEV_UDP_TUNNEL_DROP_INFO which both have ASSERT_RTNL.
Entries are added to and removed from the sock_list under rtnl
lock as well (when adding or removing a tunneling device).

Signed-off-by: Stanislav Fomichev 
---
 drivers/net/geneve.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index ffc15a432689..9efedc6758bf 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -41,6 +41,7 @@ MODULE_PARM_DESC(log_ecn_error, "Log packets received with 
corrupted ECN");
 /* per-network namespace private data for this module */
 struct geneve_net {
struct list_headgeneve_list;
+   /* sock_list is protected by rtnl lock */
struct list_headsock_list;
 };
 
@@ -1179,8 +1180,9 @@ static void geneve_offload_rx_ports(struct net_device 
*dev, bool push)
struct geneve_net *gn = net_generic(net, geneve_net_id);
struct geneve_sock *gs;
 
-   rcu_read_lock();
-   list_for_each_entry_rcu(gs, &gn->sock_list, list) {
+   ASSERT_RTNL();
+
+   list_for_each_entry(gs, &gn->sock_list, list) {
if (push) {
udp_tunnel_push_rx_port(dev, gs->sock,
UDP_TUNNEL_TYPE_GENEVE);
@@ -1189,7 +1191,6 @@ static void geneve_offload_rx_ports(struct net_device 
*dev, bool push)
UDP_TUNNEL_TYPE_GENEVE);
}
}
-   rcu_read_unlock();
 }
 
 /* Initialize the device structure. */
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v4 0/6] udp_tunnel: remove rtnl_lock dependency

2025-06-13 Thread Stanislav Fomichev

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.

v4:
- grab lock in more places, specifically netlink and notifiers (Jakub)
- convert geneve and vxlan notifiers to (sleepable) rtnl lock

v3:
- fix 2 test failures (Jakub NIPA)

v2:
- move the lock into udp_tunnel_nic (Jakub)
- reorder the lock ordering (Jakub)
- move udp_ports_sleep removal into separate patch and update the test
(Jakub)

Cc: Michael Chan 

Stanislav Fomichev (6):
  geneve: rely on rtnl lock in geneve_offload_rx_ports
  vxlan: drop sock_lock
  udp_tunnel: remove rtnl_lock dependency
  net: remove redundant ASSERT_RTNL() in queue setup functions
  netdevsim: remove udp_ports_sleep
  Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 42 ++---
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/geneve.c  |  7 +-
 drivers/net/netdevsim/netdevsim.h |  2 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 ---
 drivers/net/vxlan/vxlan_core.c| 34 
 drivers/net/vxlan/vxlan_private.h |  2 +-
 drivers/net/vxlan/vxlan_vnifilter.c   | 18 ++--
 include/net/udp_tunnel.h  | 87 ++-
 net/core/dev.c|  4 +-
 net/ipv4/udp_tunnel_core.c| 16 ++--
 net/ipv4/udp_tunnel_nic.c | 78 +
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +
 22 files changed, 177 insertions(+), 170 deletions(-)

-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v4 3/6] udp_tunnel: remove rtnl_lock dependency

2025-06-13 Thread Stanislav Fomichev

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cover more places in v4:

- netlink
  - udp_tunnel_notify_add_rx_port (ndo_open)
- triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_notify_del_rx_port (ndo_stop)
- triggers udp_tunnel_nic_device_sync_work
  - udp_tunnel_get_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
  - udp_tunnel_drop_rx_info (__netdev_update_features)
- triggers NETDEV_UDP_TUNNEL_DROP_INFO
  - udp_tunnel_nic_reset_ntf (ndo_open)

- notifiers
  - udp_tunnel_nic_netdevice_event, depending on the event:
- triggers NETDEV_UDP_TUNNEL_PUSH_INFO
- triggers NETDEV_UDP_TUNNEL_DROP_INFO

- ethnl_tunnel_info_reply_size
- udp_tunnel_nic_set_port_priv (two intel drivers)

Cc: Michael Chan 
Suggested-by: Jakub Kicinski 
Signed-off-by: Stanislav Fomichev 
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  6 +-
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 -
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/udp_tunnels.c   |  4 -
 include/net/udp_tunnel.h  | 87 ++-
 net/core/dev.c|  2 +
 net/ipv4/udp_tunnel_core.c| 16 ++--
 net/ipv4/udp_tunnel_nic.c | 78 +
 16 files changed, 142 insertions(+), 73 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c9a1a1d504c0..3ee4b848ef53 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10219,8 +10219,7 @@ static int bnx2x_udp_tunnel_sync(struct net_device 
*netdev, unsigned int table)
 
 static const struct udp_tunnel_nic_info bnx2x_udp_tunnels = {
.sync_table = bnx2x_udp_tunnel_sync,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 869580b6f70d..7946586802af 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15573,8 +15573,7 @@ static int bnxt_udp_tunnel_unset_port(struct net_device 
*netdev, unsigned int ta
 static const struct udp_tunnel_nic_info bnxt_udp_tunnels = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
@@ -15582,8 +15581,7 @@ static const struct udp_tunnel_nic_info 
bnxt_udp_tunnels = {
 }, bnxt_udp_tunnels_p7 = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 3d2e21592119..f49400ba9729 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4031,8 +4031,7 @@ static int be_vxlan_unset_port(struct net_device *netdev, 
unsigned int table,
 static const struct udp_tunnel_nic_info be_udp_tunnels = {
.set_port   = be_vxlan_set_port,
.unset_port = be_vxlan_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+

[Intel-wired-lan] [PATCH net-next v4 6/6] Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

2025-06-13 Thread Stanislav Fomichev

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 +--
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 7946586802af..b359ef4b78a9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14055,28 +14055,13 @@ static void bnxt_unlock_sp(struct bnxt *bp)
netdev_unlock(bp->dev);
 }
 
-/* Same as bnxt_lock_sp() with additional rtnl_lock */
-static void bnxt_rtnl_lock_sp(struct bnxt *bp)
-{
-   clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   rtnl_lock();
-   netdev_lock(bp->dev);
-}
-
-static void bnxt_rtnl_unlock_sp(struct bnxt *bp)
-{
-   set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   netdev_unlock(bp->dev);
-   rtnl_unlock();
-}
-
 /* Only called from bnxt_sp_task() */
 static void bnxt_reset(struct bnxt *bp, bool silent)
 {
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (test_bit(BNXT_STATE_OPEN, &bp->state))
bnxt_reset_task(bp, silent);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 /* Only called from bnxt_sp_task() */
@@ -14084,9 +14069,9 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 {
int i;
 
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
return;
}
/* Disable and flush TPA before resetting the RX ring */
@@ -14125,7 +14110,7 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
}
if (bp->flags & BNXT_FLAG_TPA)
bnxt_set_tpa(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 static void bnxt_fw_fatal_close(struct bnxt *bp)
@@ -15017,17 +15002,15 @@ static void bnxt_fw_reset_task(struct work_struct 
*work)
bp->fw_reset_state = BNXT_FW_RESET_STATE_OPENING;
fallthrough;
case BNXT_FW_RESET_STATE_OPENING:
-   while (!rtnl_trylock()) {
+   while (!netdev_trylock(bp->dev)) {
bnxt_queue_fw_reset_work(bp, HZ / 10);
return;
}
-   netdev_lock(bp->dev);
rc = bnxt_open(bp->dev);
if (rc) {
netdev_err(bp->dev, "bnxt_open() failed during FW 
reset\n");
bnxt_fw_reset_abort(bp, rc);
netdev_unlock(bp->dev);
-   rtnl_unlock();
goto ulp_start;
}
 
@@ -15047,7 +15030,6 @@ static void bnxt_fw_reset_task(struct work_struct *work)
bnxt_dl_health_fw_status_update(bp, true);
}
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, 0);
bnxt_reenable_sriov(bp);
netdev_lock(bp->dev);
@@ -15996,7 +15978,7 @@ static int bnxt_queue_start(struct net_device *dev, 
void *qmem, int idx)
   rc);
napi_enable_locked(&bnapi->napi);
bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
-   netif_close(dev);
+   bnxt_reset_task(bp, true);
return rc;
 }
 
@@ -16812,7 +16794,6 @@ static int bnxt_resume(struct device *device)
struct bnxt *bp = netdev_priv(dev);
int rc = 0;
 
-   rtnl_lock();
netdev_lock(dev);
rc = pci_enable_device(bp->pdev);
if (rc) {
@@ -16857,7 +16838,6 @@ static int bnxt_resume(struct device *device)
 
 resume_exit:
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, rc);
if (!rc)
bnxt_reenable_sriov(bp);
@@ -17023,7 +17003,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
int err;
 
netdev_info(bp->dev, "PCI Slot Resume\n");
-   rtnl_lock();
netdev_lock(netdev);
 
err = bnxt_hwrm_func_qcaps(bp);
@@ -17041,7 +17020,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
netif_device_attach(netdev);
 
netdev_unlock(netdev);
-   rtnl_unlock();
bnxt_ulp_start(bp, err);
if (!err)
bnxt_reenable_sriov(bp);
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v4 5/6] netdevsim: remove udp_ports_sleep

2025-06-13 Thread Stanislav Fomichev

Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/netdevsim/netdevsim.h |  2 --
 drivers/net/netdevsim/udp_tunnels.c   |  8 ---
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +--
 3 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/net/netdevsim/netdevsim.h 
b/drivers/net/netdevsim/netdevsim.h
index d04401f0bdf7..511ed72a93ce 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -131,7 +131,6 @@ struct netdevsim {
struct nsim_macsec macsec;
struct {
u32 inject_error;
-   u32 sleep;
u32 __ports[2][NSIM_UDP_TUNNEL_N_PORTS];
u32 (*ports)[NSIM_UDP_TUNNEL_N_PORTS];
struct dentry *ddir;
@@ -342,7 +341,6 @@ struct nsim_dev {
bool ipv4_only;
bool shared;
bool static_iana_vxlan;
-   u32 sleep;
} udp_ports;
struct nsim_dev_psample *psample;
u16 esw_mode;
diff --git a/drivers/net/netdevsim/udp_tunnels.c 
b/drivers/net/netdevsim/udp_tunnels.c
index 10cbbf1c584b..89fff76e51cf 100644
--- a/drivers/net/netdevsim/udp_tunnels.c
+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -18,9 +18,6 @@ nsim_udp_tunnel_set_port(struct net_device *dev, unsigned int 
table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
-
if (!ret) {
if (ns->udp_ports.ports[table][entry]) {
WARN(1, "entry already in use\n");
@@ -47,8 +44,6 @@ nsim_udp_tunnel_unset_port(struct net_device *dev, unsigned 
int table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
if (!ret) {
u32 val = be16_to_cpu(ti->port) << 16 | ti->type;
 
@@ -170,7 +165,6 @@ int nsim_udp_tunnels_info_create(struct nsim_dev *nsim_dev,
   GFP_KERNEL);
if (!info)
return -ENOMEM;
-   ns->udp_ports.sleep = nsim_dev->udp_ports.sleep;
 
if (nsim_dev->udp_ports.sync_all) {
info->set_port = NULL;
@@ -213,6 +207,4 @@ void nsim_udp_tunnels_debugfs_create(struct nsim_dev 
*nsim_dev)
&nsim_dev->udp_ports.shared);
debugfs_create_bool("udp_ports_static_iana_vxlan", 0600, nsim_dev->ddir,
&nsim_dev->udp_ports.static_iana_vxlan);
-   debugfs_create_u32("udp_ports_sleep", 0600, nsim_dev->ddir,
-  &nsim_dev->udp_ports.sleep);
 }
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh 
b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 92c2f0376c08..4c859ecdad94 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -266,7 +266,6 @@ for port in 0 1; do
echo $NSIM_ID > /sys/bus/netdevsim/new_device
 else
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
echo 1 > $NSIM_DEV_SYS/new_port
 fi
 NSIM_NETDEV=`get_netdev_name old_netdevs`
@@ -350,23 +349,11 @@ old_netdevs=$(ls /sys/class/net)
 port=0
 echo $NSIM_ID > /sys/bus/netdevsim/new_device
 echo 0 > $NSIM_DEV_SYS/del_port
-echo 1000 > $NSIM_DEV_DFS/udp_ports_sleep
 echo 0 > $NSIM_DEV_SYS/new_port
 NSIM_NETDEV=`get_netdev_name old_netdevs`
 
 msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
-new_vxlan vxlan0 1 $NSIM_NETDEV
-
-modprobe -r vxlan
-modprobe -r udp_tunnel
-
-msg="remove tunnels"
-exp0=( 0 0 0 0 )
-check_tables
-
-msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
+exp0=( `mke 1 1` 0 0 0 )
 new_vxlan vxlan0 1 $NSIM_NETDEV
 
 exp0=( 0 0 0 0 )
@@ -428,7 +415,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -486,7 +472,6 @@ echo 1 > $NSIM_DEV_DFS/udp_ports_sync_all
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -543,7 +528,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_on

[Intel-wired-lan] [PATCH net-next v4 4/6] net: remove redundant ASSERT_RTNL() in queue setup functions

2025-06-13 Thread Stanislav Fomichev

The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 43f56b44f351..df1678b1fe24 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,7 +3179,6 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 
if (dev->reg_state == NETREG_REGISTERED ||
dev->reg_state == NETREG_UNREGISTERING) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
@@ -3229,7 +3228,6 @@ int netif_set_real_num_rx_queues(struct net_device *dev, 
unsigned int rxq)
return -EINVAL;
 
if (dev->reg_state == NETREG_REGISTERED) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
-- 
2.49.0

Re: [Intel-wired-lan] [PATCH iwl-net] e1000: Move cancel_work_sync to avoid deadlock

2025-05-30 Thread Stanislav Fomichev

On 05/30, Joe Damato wrote:
> Previously, e1000_down called cancel_work_sync for the e1000 reset task
> (via e1000_down_and_stop), which takes RTNL.
> 
> As reported by users and syzbot, a deadlock is possible due to lock
> inversion in the following scenario:
> 
> CPU 0:
>   - RTNL is held
>   - e1000_close
>   - e1000_down
>   - cancel_work_sync (takes the work queue mutex)
>   - e1000_reset_task
> 
> CPU 1:
>   - process_one_work (takes the work queue mutex)
>   - e1000_reset_task (takes RTNL)

nit: as Jakub mentioned in another thread, it seems more about the
flush_work waiting for the reset_task to complete rather than
wq mutexes (which are fake)?

CPU 0:
  - RTNL is held
  - e1000_close
  - e1000_down
  - cancel_work_sync
  - __flush_work
  - 

CPU 1:
  - process_one_work
  - e1000_reset_task (takes RTNL)
  - 

The fix looks good!

Acked-by: Stanislav Fomichev

Re: [Intel-wired-lan] [PATCH net-next v3 1/4] udp_tunnel: remove rtnl_lock dependency

2025-06-11 Thread Stanislav Fomichev

On 06/11, Jakub Kicinski wrote:
> On Tue, 10 Jun 2025 10:15:19 -0700 Stanislav Fomichev wrote:
> > Drivers that are using ops lock and don't depend on RTNL lock
> > still need to manage it because udp_tunnel's RTNL dependency.
> > Introduce new udp_tunnel_nic_lock and use it instead of
> > rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
> > udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
> > grab udp_tunnel_nic_lock mutex and might sleep).
> 
> There are multiple entry points to this code, basically each member of
> struct udp_tunnel_nic_ops and the netdev notifiers. In this patch only
> reset and work are locked. I'm a bit confused as to what is the new
> lock protecting :S

I though that most of the callers are from do_setlink and there we have
rtnl and we grab rtnl+lock during the sync. But that doesn't
address the suspend/resume vs do_setlink race, that's true :-(

Did not look deep into the notifiers, assuming they are a way to push
the info down to the devices (under rtnl) plus trigger the sync work,
will take a closer look.

[Intel-wired-lan] [PATCH net-next v3 4/4] Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

2025-06-10 Thread Stanislav Fomichev

This reverts commit 325eb217e41fa14f307c7cc702bd18d0bb38fe84.

udp_tunnel infra doesn't need RTNL, should be safe to get back
to only netdev instance lock.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 36 +--
 1 file changed, 7 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index a3dadde65b8d..1da208c36572 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -14055,28 +14055,13 @@ static void bnxt_unlock_sp(struct bnxt *bp)
netdev_unlock(bp->dev);
 }
 
-/* Same as bnxt_lock_sp() with additional rtnl_lock */
-static void bnxt_rtnl_lock_sp(struct bnxt *bp)
-{
-   clear_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   rtnl_lock();
-   netdev_lock(bp->dev);
-}
-
-static void bnxt_rtnl_unlock_sp(struct bnxt *bp)
-{
-   set_bit(BNXT_STATE_IN_SP_TASK, &bp->state);
-   netdev_unlock(bp->dev);
-   rtnl_unlock();
-}
-
 /* Only called from bnxt_sp_task() */
 static void bnxt_reset(struct bnxt *bp, bool silent)
 {
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (test_bit(BNXT_STATE_OPEN, &bp->state))
bnxt_reset_task(bp, silent);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 /* Only called from bnxt_sp_task() */
@@ -14084,9 +14069,9 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
 {
int i;
 
-   bnxt_rtnl_lock_sp(bp);
+   bnxt_lock_sp(bp);
if (!test_bit(BNXT_STATE_OPEN, &bp->state)) {
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
return;
}
/* Disable and flush TPA before resetting the RX ring */
@@ -14125,7 +14110,7 @@ static void bnxt_rx_ring_reset(struct bnxt *bp)
}
if (bp->flags & BNXT_FLAG_TPA)
bnxt_set_tpa(bp, true);
-   bnxt_rtnl_unlock_sp(bp);
+   bnxt_unlock_sp(bp);
 }
 
 static void bnxt_fw_fatal_close(struct bnxt *bp)
@@ -15017,17 +15002,15 @@ static void bnxt_fw_reset_task(struct work_struct 
*work)
bp->fw_reset_state = BNXT_FW_RESET_STATE_OPENING;
fallthrough;
case BNXT_FW_RESET_STATE_OPENING:
-   while (!rtnl_trylock()) {
+   while (!netdev_trylock(bp->dev)) {
bnxt_queue_fw_reset_work(bp, HZ / 10);
return;
}
-   netdev_lock(bp->dev);
rc = bnxt_open(bp->dev);
if (rc) {
netdev_err(bp->dev, "bnxt_open() failed during FW 
reset\n");
bnxt_fw_reset_abort(bp, rc);
netdev_unlock(bp->dev);
-   rtnl_unlock();
goto ulp_start;
}
 
@@ -15047,7 +15030,6 @@ static void bnxt_fw_reset_task(struct work_struct *work)
bnxt_dl_health_fw_status_update(bp, true);
}
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, 0);
bnxt_reenable_sriov(bp);
netdev_lock(bp->dev);
@@ -15996,7 +15978,7 @@ static int bnxt_queue_start(struct net_device *dev, 
void *qmem, int idx)
   rc);
napi_enable_locked(&bnapi->napi);
bnxt_db_nq_arm(bp, &cpr->cp_db, cpr->cp_raw_cons);
-   netif_close(dev);
+   bnxt_reset_task(bp, true);
return rc;
 }
 
@@ -16812,7 +16794,6 @@ static int bnxt_resume(struct device *device)
struct bnxt *bp = netdev_priv(dev);
int rc = 0;
 
-   rtnl_lock();
netdev_lock(dev);
rc = pci_enable_device(bp->pdev);
if (rc) {
@@ -16857,7 +16838,6 @@ static int bnxt_resume(struct device *device)
 
 resume_exit:
netdev_unlock(bp->dev);
-   rtnl_unlock();
bnxt_ulp_start(bp, rc);
if (!rc)
bnxt_reenable_sriov(bp);
@@ -17023,7 +17003,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
int err;
 
netdev_info(bp->dev, "PCI Slot Resume\n");
-   rtnl_lock();
netdev_lock(netdev);
 
err = bnxt_hwrm_func_qcaps(bp);
@@ -17041,7 +17020,6 @@ static void bnxt_io_resume(struct pci_dev *pdev)
netif_device_attach(netdev);
 
netdev_unlock(netdev);
-   rtnl_unlock();
bnxt_ulp_start(bp, err);
if (!err)
bnxt_reenable_sriov(bp);
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v3 0/4] udp_tunnel: remove rtnl_lock dependency

2025-06-10 Thread Stanislav Fomichev

Recently bnxt had to grow back a bunch of rtnl dependencies because
of udp_tunnel's infra. Add separate (global) mutext to protect
udp_tunnel state.

v3:
- fix 2 test failures (Jakub NIPA)

v2:
- move the lock into udp_tunnel_nic (Jakub)
- reorder the lock ordering (Jakub)
- move udp_ports_sleep removal into separate patch and update the test
  (Jakub)

Cc: Michael Chan 

Stanislav Fomichev (4):
  udp_tunnel: remove rtnl_lock dependency
  net: remove redundant ASSERT_RTNL() in queue setup functions
  netdevsim: remove udp_ports_sleep
  Revert "bnxt_en: bring back rtnl_lock() in the bnxt_open() path"

 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 42 ---
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/netdevsim.h |  2 -
 drivers/net/netdevsim/udp_tunnels.c   | 12 --
 include/net/udp_tunnel.h  |  8 ++--
 net/core/dev.c|  2 -
 net/ipv4/udp_tunnel_nic.c | 30 +++--
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +-
 17 files changed, 32 insertions(+), 109 deletions(-)

-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v3 2/4] net: remove redundant ASSERT_RTNL() in queue setup functions

2025-06-10 Thread Stanislav Fomichev

The existing netdev_ops_assert_locked() already asserts that either
the RTNL lock or the per-device lock is held, making the explicit
ASSERT_RTNL() redundant.

Cc: Michael Chan 
Signed-off-by: Stanislav Fomichev 
---
 net/core/dev.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index be97c440ecd5..72997636b8ec 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3179,7 +3179,6 @@ int netif_set_real_num_tx_queues(struct net_device *dev, 
unsigned int txq)
 
if (dev->reg_state == NETREG_REGISTERED ||
dev->reg_state == NETREG_UNREGISTERING) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = netdev_queue_update_kobjects(dev, dev->real_num_tx_queues,
@@ -3229,7 +3228,6 @@ int netif_set_real_num_rx_queues(struct net_device *dev, 
unsigned int rxq)
return -EINVAL;
 
if (dev->reg_state == NETREG_REGISTERED) {
-   ASSERT_RTNL();
netdev_ops_assert_locked(dev);
 
rc = net_rx_queue_update_kobjects(dev, dev->real_num_rx_queues,
-- 
2.49.0

[Intel-wired-lan] [PATCH net-next v3 1/4] udp_tunnel: remove rtnl_lock dependency

2025-06-10 Thread Stanislav Fomichev

Drivers that are using ops lock and don't depend on RTNL lock
still need to manage it because udp_tunnel's RTNL dependency.
Introduce new udp_tunnel_nic_lock and use it instead of
rtnl_lock. Drop non-UDP_TUNNEL_NIC_INFO_MAY_SLEEP mode from
udp_tunnel infra (udp_tunnel_nic_device_sync_work needs to
grab udp_tunnel_nic_lock mutex and might sleep).

Cc: Michael Chan 
Reviewed-by: Cosmin Ratiu 
Suggested-by: Jakub Kicinski 
Signed-off-by: Stanislav Fomichev 
---
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c  |  3 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  6 ++--
 drivers/net/ethernet/emulex/benet/be_main.c   |  3 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 -
 drivers/net/ethernet/intel/ice/ice_main.c |  1 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c|  3 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  3 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |  3 +-
 .../net/ethernet/qlogic/qede/qede_filter.c|  3 --
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |  1 -
 drivers/net/ethernet/sfc/ef10.c   |  1 -
 drivers/net/netdevsim/udp_tunnels.c   |  4 ---
 include/net/udp_tunnel.h  |  8 ++---
 net/ipv4/udp_tunnel_nic.c | 30 +--
 14 files changed, 24 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index f522ca8ff66b..fa7e5adf9804 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -10219,8 +10219,7 @@ static int bnx2x_udp_tunnel_sync(struct net_device 
*netdev, unsigned int table)
 
 static const struct udp_tunnel_nic_info bnx2x_udp_tunnels = {
.sync_table = bnx2x_udp_tunnel_sync,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index d5495762c945..a3dadde65b8d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -15573,8 +15573,7 @@ static int bnxt_udp_tunnel_unset_port(struct net_device 
*netdev, unsigned int ta
 static const struct udp_tunnel_nic_info bnxt_udp_tunnels = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
@@ -15582,8 +15581,7 @@ static const struct udp_tunnel_nic_info 
bnxt_udp_tunnels = {
 }, bnxt_udp_tunnels_p7 = {
.set_port   = bnxt_udp_tunnel_set_port,
.unset_port = bnxt_udp_tunnel_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN,  },
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_GENEVE, },
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c 
b/drivers/net/ethernet/emulex/benet/be_main.c
index 3d2e21592119..f49400ba9729 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -4031,8 +4031,7 @@ static int be_vxlan_unset_port(struct net_device *netdev, 
unsigned int table,
 static const struct udp_tunnel_nic_info be_udp_tunnels = {
.set_port   = be_vxlan_set_port,
.unset_port = be_vxlan_unset_port,
-   .flags  = UDP_TUNNEL_NIC_INFO_MAY_SLEEP |
- UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
+   .flags  = UDP_TUNNEL_NIC_INFO_OPEN_ONLY,
.tables = {
{ .n_entries = 1, .tunnel_types = UDP_TUNNEL_TYPE_VXLAN, },
},
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c 
b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 120d68654e3f..3d3da9d15348 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15891,7 +15891,6 @@ static int i40e_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 
pf->udp_tunnel_nic.set_port = i40e_udp_tunnel_set_port;
pf->udp_tunnel_nic.unset_port = i40e_udp_tunnel_unset_port;
-   pf->udp_tunnel_nic.flags = UDP_TUNNEL_NIC_INFO_MAY_SLEEP;
pf->udp_tunnel_nic.shared = &pf->

[Intel-wired-lan] [PATCH net-next v3 3/4] netdevsim: remove udp_ports_sleep

2025-06-10 Thread Stanislav Fomichev

Now that there is only one path in udp_tunnel, there is no need to
have udp_ports_sleep knob. Remove it and adjust the test.

Cc: Michael Chan 
Reviewed-by: Aleksandr Loktionov 
Signed-off-by: Stanislav Fomichev 
---
 drivers/net/netdevsim/netdevsim.h |  2 --
 drivers/net/netdevsim/udp_tunnels.c   |  8 ---
 .../drivers/net/netdevsim/udp_tunnel_nic.sh   | 23 +--
 3 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/drivers/net/netdevsim/netdevsim.h 
b/drivers/net/netdevsim/netdevsim.h
index d04401f0bdf7..511ed72a93ce 100644
--- a/drivers/net/netdevsim/netdevsim.h
+++ b/drivers/net/netdevsim/netdevsim.h
@@ -131,7 +131,6 @@ struct netdevsim {
struct nsim_macsec macsec;
struct {
u32 inject_error;
-   u32 sleep;
u32 __ports[2][NSIM_UDP_TUNNEL_N_PORTS];
u32 (*ports)[NSIM_UDP_TUNNEL_N_PORTS];
struct dentry *ddir;
@@ -342,7 +341,6 @@ struct nsim_dev {
bool ipv4_only;
bool shared;
bool static_iana_vxlan;
-   u32 sleep;
} udp_ports;
struct nsim_dev_psample *psample;
u16 esw_mode;
diff --git a/drivers/net/netdevsim/udp_tunnels.c 
b/drivers/net/netdevsim/udp_tunnels.c
index 10cbbf1c584b..89fff76e51cf 100644
--- a/drivers/net/netdevsim/udp_tunnels.c
+++ b/drivers/net/netdevsim/udp_tunnels.c
@@ -18,9 +18,6 @@ nsim_udp_tunnel_set_port(struct net_device *dev, unsigned int 
table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
-
if (!ret) {
if (ns->udp_ports.ports[table][entry]) {
WARN(1, "entry already in use\n");
@@ -47,8 +44,6 @@ nsim_udp_tunnel_unset_port(struct net_device *dev, unsigned 
int table,
ret = -ns->udp_ports.inject_error;
ns->udp_ports.inject_error = 0;
 
-   if (ns->udp_ports.sleep)
-   msleep(ns->udp_ports.sleep);
if (!ret) {
u32 val = be16_to_cpu(ti->port) << 16 | ti->type;
 
@@ -170,7 +165,6 @@ int nsim_udp_tunnels_info_create(struct nsim_dev *nsim_dev,
   GFP_KERNEL);
if (!info)
return -ENOMEM;
-   ns->udp_ports.sleep = nsim_dev->udp_ports.sleep;
 
if (nsim_dev->udp_ports.sync_all) {
info->set_port = NULL;
@@ -213,6 +207,4 @@ void nsim_udp_tunnels_debugfs_create(struct nsim_dev 
*nsim_dev)
&nsim_dev->udp_ports.shared);
debugfs_create_bool("udp_ports_static_iana_vxlan", 0600, nsim_dev->ddir,
&nsim_dev->udp_ports.static_iana_vxlan);
-   debugfs_create_u32("udp_ports_sleep", 0600, nsim_dev->ddir,
-  &nsim_dev->udp_ports.sleep);
 }
diff --git a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh 
b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
index 92c2f0376c08..4c859ecdad94 100755
--- a/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
+++ b/tools/testing/selftests/drivers/net/netdevsim/udp_tunnel_nic.sh
@@ -266,7 +266,6 @@ for port in 0 1; do
echo $NSIM_ID > /sys/bus/netdevsim/new_device
 else
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
echo 1 > $NSIM_DEV_SYS/new_port
 fi
 NSIM_NETDEV=`get_netdev_name old_netdevs`
@@ -350,23 +349,11 @@ old_netdevs=$(ls /sys/class/net)
 port=0
 echo $NSIM_ID > /sys/bus/netdevsim/new_device
 echo 0 > $NSIM_DEV_SYS/del_port
-echo 1000 > $NSIM_DEV_DFS/udp_ports_sleep
 echo 0 > $NSIM_DEV_SYS/new_port
 NSIM_NETDEV=`get_netdev_name old_netdevs`
 
 msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
-new_vxlan vxlan0 1 $NSIM_NETDEV
-
-modprobe -r vxlan
-modprobe -r udp_tunnel
-
-msg="remove tunnels"
-exp0=( 0 0 0 0 )
-check_tables
-
-msg="create VxLANs"
-exp0=( 0 0 0 0 ) # sleep is longer than out wait
+exp0=( `mke 1 1` 0 0 0 )
 new_vxlan vxlan0 1 $NSIM_NETDEV
 
 exp0=( 0 0 0 0 )
@@ -428,7 +415,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -486,7 +472,6 @@ echo 1 > $NSIM_DEV_DFS/udp_ports_sync_all
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_only
-   echo 1 > $NSIM_DEV_DFS/udp_ports_sleep
 fi
 
 echo $port > $NSIM_DEV_SYS/new_port
@@ -543,7 +528,6 @@ echo 0 > $NSIM_DEV_SYS/del_port
 for port in 0 1; do
 if [ $port -ne 0 ]; then
echo 1 > $NSIM_DEV_DFS/udp_ports_open_on

Re: [Intel-wired-lan] [PATCH net-next 1/2] stmmac: xsk: fix underflow of budget in zerocopy mode

2025-07-21 Thread Stanislav Fomichev

On 07/21, Jason Xing wrote:
> From: Jason Xing 
> 
> The issue can happen when the budget number of descs are consumed. As
> long as the budget is decreased to zero, it will again go into
> while (budget-- > 0) statement and get decreased by one, so the
> underflow issue can happen. It will lead to returning true whereas the
> expected value should be false.
> 
> In this case where all the budget are used up, it means zc function
> should return false to let the poll run again because normally we
> might have more data to process.
> 
> Fixes: 132c32ee5bc0 ("net: stmmac: Add TX via XDP zero-copy socket")
> Signed-off-by: Jason Xing 
> ---
>  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
> b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index f350a6662880..ea5541f9e9a6 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -2596,7 +2596,7 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv 
> *priv, u32 queue, u32 budget)
>  
>   budget = min(budget, stmmac_tx_avail(priv, queue));
>  
> - while (budget-- > 0) {
> + while (budget > 0) {

There is a continue on line 2621. Should we do 'for (; budget > 0; budget--)'
instead? And maybe the same for ixgbe [0]?

0: 
https://lore.kernel.org/netdev/20250720091123.474-3-kerneljasonx...@gmail.com/

54 matches

Mail list logo