Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On Thu, Jan 14, 2021 at 5:39 AM Jason Wang wrote:
>
>
> On 2021/1/13 10:33 PM, Willem de Bruijn wrote:
> > On Tue, Jan 12, 2021 at 11:11 PM Jason Wang wrote:
> >>
> >> On 2021/1/13 7:47 AM, Willem de Bruijn wrote:
> >>> On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
> >>> wrote:
> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
> wrote:
> > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
> > wrote:
> >> Existing TUN module is able to use provided "steering eBPF" to
> >> calculate per-packet hash and derive the destination queue to
> >> place the packet to. The eBPF uses mapped configuration data
> >> containing a key for hash calculation and indirection table
> >> with array of queues' indices.
> >>
> >> This series of patches adds support for virtio-net hash reporting
> >> feature as defined in virtio specification. It extends the TUN module
> >> and the "steering eBPF" as follows:
> >>
> >> Extended steering eBPF calculates the hash value and hash type, keeps
> >> hash value in the skb->hash and returns index of destination virtqueue
> >> and the type of the hash. TUN module keeps returned hash type in
> >> (currently unused) field of the skb.
> >> skb->__unused renamed to 'hash_report_type'.
> >>
> >> When TUN module is called later to allocate and fill the virtio-net
> >> header and push it to destination virtqueue it populates the hash
> >> and the hash type into virtio-net header.
> >>
> >> VHOST driver is made aware of respective virtio-net feature that
> >> extends the virtio-net header to report the hash value and hash report
> >> type.
> Comment from Willem de Bruijn:
> >
> > Skbuff fields are in short supply. I don't think we need to add one
> > just for this narrow path entirely internal to the tun device.
> >
> We understand that and try to minimize the impact by using an already
> existing unused field of skb.
> >>> Not anymore. It was repurposed as a flags field very recently.
> >>>
> >>> This use case is also very narrow in scope. And a very short path from
> >>> data producer to consumer. So I don't think it needs to claim scarce
> >>> bits in the skb.
> >>>
> >>> tun_ebpf_select_queue stores the field, tun_put_user reads it and
> >>> converts it to the virtio_net_hdr in the descriptor.
> >>>
> >>> tun_ebpf_select_queue is called from .ndo_select_queue. Storing the
> >>> field in skb->cb is fragile, as in theory some code could overwrite
> >>> that field between ndo_select_queue and
> >>> ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
> >>> control again. But in practice, I don't believe anything does.
> >>>
> >>> Alternatively an existing skb field that is used only on disjoint
> >>> datapaths, such as ingress-only, could be viable.
> >>
> >> A question here. We had metadata support in XDP for cooperation between
> >> eBPF programs. Do we have something similar in the skb?
> >>
> >> E.g in the RSS, if we want to pass some metadata information between
> >> eBPF program and the logic that generates the vnet header (either hard
> >> logic in the kernel or another eBPF program). Is there any way that can
> >> avoid the possible conflicts of qdiscs?
> > Not that I am aware of. The closest thing is cb[].
> >
> > It'll have to alias a field like that, that is known unused for the given
> > path.
>
> Right, we need to make sure cb is not used by other ones. I'm not sure
> how hard to achieve that, considering Qemu installs the eBPF program but it
> doesn't deal with networking configurations.
> >
> >
> > One other approach that has been used within linear call stacks is out
> > of band. Like percpu variables softnet_data.xmit.more and
> > mirred_rec_level. But that is perhaps a bit overwrought for this use
> > case.
>
> Yes, and if we go that way then eBPF turns out to be a burden since we
> need to invent helpers to access those auxiliary data structures. It
> would be better then to hard-code the RSS in the kernel.
> >
> >
> > Instead, you could just run the flow_dissector in tun_put_user if the
> > feature is negotiated. Indeed, the flow dissector seems more apt to me
> > than BPF here. Note that the flow dissector internally can be
> > overridden by a BPF program if the admin so chooses.
>
> While this set of patches is related to hash delivery in the virtio-net
> packet in general,
> it was prepared in the context of the RSS feature implementation as defined
> in the virtio spec [1]
> In case of RSS it is not enough to run the flow_dissector in
> tun_put_user:
> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
> hash type and queue index
> according to the (mapped) parameters (key, hash types, indirection
> table) received from the guest.
> >>> TUNSETSTEERINGEBPF was added to support more diverse queue selection
> >>> than the default in
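
For reference, the TUNSETSTEERINGEBPF ioctl mentioned above takes the fd of a
loaded BPF program; a minimal userspace sketch (error handling elided; tun_fd
and prog_fd are assumed to be an open tap device and a loaded steering
program):

#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Sketch: attach a previously loaded steering eBPF program to a tap
 * device. The program's return value selects the destination queue
 * for each transmitted packet. */
static int attach_steering_prog(int tun_fd, int prog_fd)
{
	return ioctl(tun_fd, TUNSETSTEERINGEBPF, &prog_fd);
}

The extension discussed in this thread would additionally have the program
compute the hash and report its type, which is where the question of where to
carry that type between .ndo_select_queue and tun_put_user comes from.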
[PATCH net-next 4/5] selftests: mlxsw: sch_red_core: Drop two unused variables
From: Petr Machata

These variables are cut'n'pasted from other functions in the file and not
actually used.

Signed-off-by: Petr Machata
Signed-off-by: Ido Schimmel
---
 tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
index b0cb1aaffdda..d439ee0eaa8a 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
@@ -551,10 +551,8 @@ do_drop_test()
 	local trigger=$1; shift
 	local subtest=$1; shift
 	local fetch_counter=$1; shift
-	local backlog
 	local base
 	local now
-	local pct
 
 	RET=0
-- 
2.29.2
[PATCH net-next 2/5] mlxsw: spectrum_trap: Add ecn_mark trap
From: Petr Machata Register with devlink a new buffer drop trap, ecn_mark. Since Spectrum-1 does not support the ecn_mark trap, do it only for Spectrum-2 and above. Reviewed-by: Jiri Pirko Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c index 4ef12e3e021a..45cf6c027cac 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c @@ -51,6 +51,8 @@ enum { enum { /* Packet was early dropped. */ MLXSW_SP_MIRROR_REASON_INGRESS_WRED = 9, + /* Packet was ECN marked. */ + MLXSW_SP_MIRROR_REASON_EGRESS_ECN = 13, }; static int mlxsw_sp_rx_listener(struct mlxsw_sp *mlxsw_sp, struct sk_buff *skb, @@ -1760,6 +1762,13 @@ mlxsw_sp2_trap_items_arr[] = { }, .is_source = true, }, + { + .trap = MLXSW_SP_TRAP_BUFFER_DROP(ECN_MARK), + .listeners_arr = { + MLXSW_SP_RXL_BUFFER_DISCARD(EGRESS_ECN), + }, + .is_source = true, + }, }; static int -- 2.29.2
[PATCH net-next 1/5] devlink: Add ecn_mark trap
From: Petr Machata

Add the packet trap that can report packets that were ECN marked due to
RED AQM.

Signed-off-by: Amit Cohen
Reviewed-by: Jiri Pirko
Signed-off-by: Petr Machata
Signed-off-by: Ido Schimmel
---
 Documentation/networking/devlink/devlink-trap.rst | 4
 include/net/devlink.h                             | 3 +++
 net/core/devlink.c                                | 1 +
 3 files changed, 8 insertions(+)

diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index d875f3e1e9cf..c6026b44617e 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -480,6 +480,10 @@ be added to the following table:
      - ``drop``
      - Traps packets that the device decided to drop in case they hit a
        blackhole nexthop
+   * - ``ecn_mark``
+     - ``drop``
+     - Traps ECN-capable packets that were marked with the CE (Congestion
+       Experienced) code point by the RED algorithm instead of being dropped

 Driver-specific Packet Traps

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f466819cc477..dd0c0b8fba6e 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -836,6 +836,7 @@ enum devlink_trap_generic_id {
 	DEVLINK_TRAP_GENERIC_ID_GTP_PARSING,
 	DEVLINK_TRAP_GENERIC_ID_ESP_PARSING,
 	DEVLINK_TRAP_GENERIC_ID_BLACKHOLE_NEXTHOP,
+	DEVLINK_TRAP_GENERIC_ID_ECN_MARK,
 
 	/* Add new generic trap IDs above */
 	__DEVLINK_TRAP_GENERIC_ID_MAX,
@@ -1061,6 +1062,8 @@ enum devlink_trap_group_generic_id {
 	"esp_parsing"
 #define DEVLINK_TRAP_GENERIC_NAME_BLACKHOLE_NEXTHOP \
 	"blackhole_nexthop"
+#define DEVLINK_TRAP_GENERIC_NAME_ECN_MARK \
+	"ecn_mark"
 
 #define DEVLINK_TRAP_GROUP_GENERIC_NAME_L2_DROPS \
 	"l2_drops"
diff --git a/net/core/devlink.c b/net/core/devlink.c
index ee828e4b1007..f86688bfad46 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -9508,6 +9508,7 @@ static const struct devlink_trap devlink_trap_generic[] = {
 	DEVLINK_TRAP(GTP_PARSING, DROP),
 	DEVLINK_TRAP(ESP_PARSING, DROP),
 	DEVLINK_TRAP(BLACKHOLE_NEXTHOP, DROP),
+	DEVLINK_TRAP(ECN_MARK, DROP),
 };
 
 #define DEVLINK_TRAP_GROUP(_id) \
-- 
2.29.2
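
As a usage sketch once a driver registers the trap (the device address, and
the output shown, are illustrative):

$ devlink trap set pci/0000:03:00.0 trap ecn_mark action trap
$ devlink trap show pci/0000:03:00.0 trap ecn_mark
pci/0000:03:00.0:
  name ecn_mark type drop generic true action trap group buffer_drops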
[PATCH net-next 3/5] mlxsw: spectrum_qdisc: Offload RED qevent mark
From: Petr Machata The RED "mark" qevent can be offloaded under the same exact conditions as the RED "tail_drop" qevent. Therefore recognize its binding type in the TC_SETUP_BLOCK handler and translate to the right SPAN trigger. The corresponding hardware trigger, MLXSW_SP_SPAN_TRIGGER_ECN, is considered "egress", unlike the previously-offloaded _EARLY_DROP. Add a helper to spectrum_span, mlxsw_sp_span_trigger_is_ingress(), to classify triggers to ingress and egress. Pass result of this instead of hardcoding true when calling mlxsw_sp_span_analyzed_port_get()/_put(). Signed-off-by: Petr Machata Reviewed-by: Jiri Pirko Signed-off-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 ++ drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 2 ++ .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 14 +++--- .../net/ethernet/mellanox/mlxsw/spectrum_span.c | 16 .../net/ethernet/mellanox/mlxsw/spectrum_span.h | 1 + 5 files changed, 32 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 1650d9852b5b..e93b96837a44 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -992,6 +992,8 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port, return mlxsw_sp_setup_tc_block_clsact(mlxsw_sp_port, f, false); case FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP: return mlxsw_sp_setup_tc_block_qevent_early_drop(mlxsw_sp_port, f); + case FLOW_BLOCK_BINDER_TYPE_RED_MARK: + return mlxsw_sp_setup_tc_block_qevent_mark(mlxsw_sp_port, f); default: return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index a6956cfc9cb1..0e5fd6a73da1 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -1109,6 +1109,8 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, struct tc_fifo_qopt_offload *p); int mlxsw_sp_setup_tc_block_qevent_early_drop(struct mlxsw_sp_port *mlxsw_sp_port, struct flow_block_offload *f); +int mlxsw_sp_setup_tc_block_qevent_mark(struct mlxsw_sp_port *mlxsw_sp_port, + struct flow_block_offload *f); /* spectrum_fid.c */ bool mlxsw_sp_fid_is_dummy(struct mlxsw_sp *mlxsw_sp, u16 fid_index); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index fd672c6c9133..5cd8854e6070 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -1327,6 +1327,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, const struct mlxsw_sp_span_agent_parms *agent_parms, int *p_span_id) { + bool ingress = mlxsw_sp_span_trigger_is_ingress(qevent_binding->span_trigger); struct mlxsw_sp_port *mlxsw_sp_port = qevent_binding->mlxsw_sp_port; struct mlxsw_sp_span_trigger_parms trigger_parms = {}; int span_id; @@ -1336,7 +1337,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, if (err) return err; - err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port, true); + err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port, ingress); if (err) goto err_analyzed_port_get; @@ -1358,7 +1359,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, mlxsw_sp_span_agent_unbind(mlxsw_sp, qevent_binding->span_trigger, mlxsw_sp_port, &trigger_parms); err_agent_bind: - mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, true); + 
mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, ingress); err_analyzed_port_get: mlxsw_sp_span_agent_put(mlxsw_sp, span_id); return err; @@ -1368,6 +1369,7 @@ static void mlxsw_sp_qevent_span_deconfigure(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_qevent_binding *qevent_binding, int span_id) { + bool ingress = mlxsw_sp_span_trigger_is_ingress(qevent_binding->span_trigger); struct mlxsw_sp_port *mlxsw_sp_port = qevent_binding->mlxsw_sp_port; struct mlxsw_sp_span_trigger_parms trigger_parms = { .span_id = span_id, @@ -1377,7 +1379,7 @@ static void mlxsw_sp_qevent_span_deconfigure(struct mlxsw_sp *mlxsw_sp, qevent_binding->tclass_num); mlxsw_sp_span_agent_unbind(mlxsw_sp, qevent_binding->span_trigger, mlxsw_sp_port, &trigger_pa
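
For context, a sketch of how this offload is exercised from userspace
(interface name and block number illustrative; the parameters mirror the
selftests later in this series): the "mark" qevent is bound to a filter block
when the RED qdisc is created, and filters on that block then run on
ECN-marked packets:

# Bind block 10 to the RED "mark" qevent; ECN mode is required for
# marking to happen at all.
tc qdisc add dev swp3 parent 10:7 handle 107: red \
	limit 1000000 min 500000 max 500001 probability 1.0 \
	avpkt 8000 burst 63 ecn qevent mark block 10
# Trap ECN-marked packets to the CPU; mlxsw offloads this via the
# MLXSW_SP_SPAN_TRIGGER_ECN egress trigger added above.
tc filter add block 10 matchall skip_sw action trap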
[PATCH v4 net-next] net: bridge: check vlan with eth_type_vlan() method
From: Menglong Dong Replace some checks for ETH_P_8021Q and ETH_P_8021AD with eth_type_vlan(). Signed-off-by: Menglong Dong --- v4: - remove unnecessary brackets. v3: - fix compile warning in br_vlan_set_proto() by casting 'val' to be16. v2: - use eth_type_vlan() in br_validate() and __br_vlan_set_proto() too. --- net/bridge/br_forward.c | 3 +-- net/bridge/br_netlink.c | 12 +++- net/bridge/br_vlan.c| 2 +- 3 files changed, 5 insertions(+), 12 deletions(-) diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index e28ffadd1371..6e9b049ae521 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -39,8 +39,7 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb br_drop_fake_rtable(skb); if (skb->ip_summed == CHECKSUM_PARTIAL && - (skb->protocol == htons(ETH_P_8021Q) || -skb->protocol == htons(ETH_P_8021AD))) { + eth_type_vlan(skb->protocol)) { int depth; if (!__vlan_get_protocol(skb, skb->protocol, &depth)) diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 49700ce0e919..762f273802cd 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -1096,15 +1096,9 @@ static int br_validate(struct nlattr *tb[], struct nlattr *data[], return 0; #ifdef CONFIG_BRIDGE_VLAN_FILTERING - if (data[IFLA_BR_VLAN_PROTOCOL]) { - switch (nla_get_be16(data[IFLA_BR_VLAN_PROTOCOL])) { - case htons(ETH_P_8021Q): - case htons(ETH_P_8021AD): - break; - default: - return -EPROTONOSUPPORT; - } - } + if (data[IFLA_BR_VLAN_PROTOCOL] && + !eth_type_vlan(nla_get_be16(data[IFLA_BR_VLAN_PROTOCOL]))) + return -EPROTONOSUPPORT; if (data[IFLA_BR_VLAN_DEFAULT_PVID]) { __u16 defpvid = nla_get_u16(data[IFLA_BR_VLAN_DEFAULT_PVID]); diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index 701cad646b20..bb2909738518 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -917,7 +917,7 @@ int __br_vlan_set_proto(struct net_bridge *br, __be16 proto) int br_vlan_set_proto(struct net_bridge *br, unsigned long val) { - if (val != ETH_P_8021Q && val != ETH_P_8021AD) + if (!eth_type_vlan(htons(val))) return -EPROTONOSUPPORT; return __br_vlan_set_proto(br, htons(val)); -- 2.30.0
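
For reference, the helper being substituted is a small inline from
include/linux/if_vlan.h, roughly:

static inline bool eth_type_vlan(__be16 ethertype)
{
	switch (ethertype) {
	case htons(ETH_P_8021Q):
	case htons(ETH_P_8021AD):
		return true;
	default:
		return false;
	}
}

It compares in network byte order, which is why br_vlan_set_proto() above has
to wrap the host-order 'val' in htons() before calling it.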
[PATCH mlx5-next v3 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors
From: Leon Romanovsky Implement ability to configure MSI-X for the SR-IOV VFs. Signed-off-by: Leon Romanovsky --- .../net/ethernet/mellanox/mlx5/core/main.c| 1 + .../ethernet/mellanox/mlx5/core/mlx5_core.h | 1 + .../net/ethernet/mellanox/mlx5/core/sriov.c | 38 +++ 3 files changed, 40 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 8269cfbfc69d..334b3b5077c5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1647,6 +1647,7 @@ static struct pci_driver mlx5_core_driver = { .shutdown = shutdown, .err_handler= &mlx5_err_handler, .sriov_configure = mlx5_core_sriov_configure, + .sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count, }; static void mlx5_core_verify_params(void) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h index 5babb4434a87..8a2523d2d43a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h @@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev); int mlx5_sriov_attach(struct mlx5_core_dev *dev); void mlx5_sriov_detach(struct mlx5_core_dev *dev); int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs); +int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count); int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id); int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id); int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c index f0ec86a1c8a6..febb7a5ec72d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c @@ -144,6 +144,7 @@ mlx5_device_disable_sriov(struct mlx5_core_dev *dev, int num_vfs, bool clear_vf) static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs) { struct mlx5_core_dev *dev = pci_get_drvdata(pdev); + u32 num_vf_msix; int err; err = mlx5_device_enable_sriov(dev, num_vfs); @@ -152,6 +153,8 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs) return err; } + num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix); + pci_sriov_set_vf_total_msix(pdev, num_vf_msix); err = pci_enable_sriov(pdev, num_vfs); if (err) { mlx5_core_warn(dev, "pci_enable_sriov failed : %d\n", err); @@ -187,6 +190,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs) return err ? err : num_vfs; } +int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count) +{ + struct pci_dev *pf = pci_physfn(vf); + struct mlx5_core_sriov *sriov; + struct mlx5_core_dev *dev; + int num_vf_msix, id; + + dev = pci_get_drvdata(pf); + num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix); + if (!num_vf_msix) + return -EOPNOTSUPP; + + if (!msix_vec_count) + msix_vec_count = + mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf)); + + sriov = &dev->priv.sriov; + + /* Reversed translation of PCI VF function number to the internal +* function_id, which exists in the name of virtfn symlink. 
+*/ + for (id = 0; id < pci_num_vf(pf); id++) { + if (!sriov->vfs_ctx[id].enabled) + continue; + + if (vf->devfn == pci_iov_virtfn_devfn(pf, id)) + break; + } + + if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled) + return -EINVAL; + + return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count); +} + int mlx5_sriov_attach(struct mlx5_core_dev *dev) { if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev)) -- 2.29.2
[PATCH mlx5-next v3 2/5] PCI: Add SR-IOV sysfs entry to read total number of dynamic MSI-X vectors
From: Leon Romanovsky

Some SR-IOV capable devices provide the ability to configure a specific
number of MSI-X vectors on their VFs before a driver is probed on the VF.
In order to make management easy, provide a new read-only sysfs file that
returns the total number of MSI-X vectors available for configuration.

cat /sys/bus/pci/devices/.../sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors available for consumption by the VFs

Signed-off-by: Leon Romanovsky
---
 Documentation/ABI/testing/sysfs-bus-pci | 14 ++
 drivers/pci/iov.c                       | 12
 drivers/pci/pci.h                       | 3 +++
 include/linux/pci.h                     | 2 ++
 4 files changed, 31 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 34a8c6bcde70..530c249cc3da 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -395,3 +395,17 @@ Description:
 		The file is writable if the PF is bound to a driver that
 		set sriov_vf_total_msix > 0 and there is no driver bound
 		to the VF.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns the total number of MSI-X vectors that can be
+		configured on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - the total number of vectors the VFs can consume,
+		 * = 0 - the feature is not supported
+
+		If no SR-IOV VFs are enabled, this value will return 0.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b8d7a3b26f87..6660632b7a14 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
 	return count;
 }
 
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%u\n", pdev->sriov->vf_total_msix);
+}
+
 static DEVICE_ATTR_RO(sriov_totalvfs);
 static DEVICE_ATTR_RW(sriov_numvfs);
 static DEVICE_ATTR_RO(sriov_offset);
 static DEVICE_ATTR_RO(sriov_stride);
 static DEVICE_ATTR_RO(sriov_vf_device);
 static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
 
 static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_totalvfs.attr,
@@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_stride.attr,
 	&dev_attr_sriov_vf_device.attr,
 	&dev_attr_sriov_drivers_autoprobe.attr,
+	&dev_attr_sriov_vf_total_msix.attr,
 	NULL,
 };
 
@@ -659,6 +670,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
 	iov->num_VFs = 0;
+	iov->vf_total_msix = 0;
 	pci_iov_set_numvfs(dev, 0);
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a275018beb2f..444e03ce9c97 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -327,6 +327,9 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	u32		vf_total_msix;	/* Total number of MSI-X vectors the VFs
+					 * can consume
+					 */
 };
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6be96d468eda..c950513558b8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2075,6 +2075,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
 int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
 
 /* Arch may override these (weak) */
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
@@ -2115,6 +2116,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
 { return 0; }
 static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
+static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
-- 
2.29.2
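
A quick usage sketch (the PF address and the value shown are illustrative;
the value is whatever the PF driver passed to pci_sriov_set_vf_total_msix()):

$ cat /sys/bus/pci/devices/0000:08:00.0/sriov_vf_total_msix
2048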
[PATCH] net/qla3xxx: switch from 'pci_' to 'dma_' API
The wrappers in include/linux/pci-dma-compat.h should go away. The patch has been generated with the coccinelle script below and has been hand modified to replace GFP_ with a correct flag. It has been compile tested. When memory is allocated in 'ql_alloc_net_req_rsp_queues()' GFP_KERNEL can be used because it is only called from 'ql_alloc_mem_resources()' which already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see below) When memory is allocated in 'ql_alloc_buffer_queues()' GFP_KERNEL can be used because this flag is already used just a few line above. When memory is allocated in 'ql_alloc_small_buffers()' GFP_KERNEL can be used because it is only called from 'ql_alloc_mem_resources()' which already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see above) When memory is allocated in 'ql_alloc_mem_resources()' GFP_KERNEL can be used because this function already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see above) While at it, use 'dma_set_mask_and_coherent()' instead of 'dma_set_mask()/ dma_set_coherent_mask()' in order to slightly simplify code. @@ @@ -PCI_DMA_BIDIRECTIONAL +DMA_BIDIRECTIONAL @@ @@ -PCI_DMA_TODEVICE +DMA_TO_DEVICE @@ @@ -PCI_DMA_FROMDEVICE +DMA_FROM_DEVICE @@ @@ -PCI_DMA_NONE +DMA_NONE @@ expression e1, e2, e3; @@ -pci_alloc_consistent(e1, e2, e3) +dma_alloc_coherent(&e1->dev, e2, e3, GFP_) @@ expression e1, e2, e3; @@ -pci_zalloc_consistent(e1, e2, e3) +dma_alloc_coherent(&e1->dev, e2, e3, GFP_) @@ expression e1, e2, e3, e4; @@ -pci_free_consistent(e1, e2, e3, e4) +dma_free_coherent(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_map_single(e1, e2, e3, e4) +dma_map_single(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_unmap_single(e1, e2, e3, e4) +dma_unmap_single(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4, e5; @@ -pci_map_page(e1, e2, e3, e4, e5) +dma_map_page(&e1->dev, e2, e3, e4, e5) @@ expression e1, e2, e3, e4; @@ -pci_unmap_page(e1, e2, e3, e4) +dma_unmap_page(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_map_sg(e1, e2, e3, e4) +dma_map_sg(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_unmap_sg(e1, e2, e3, e4) +dma_unmap_sg(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_single_for_cpu(e1, e2, e3, e4) +dma_sync_single_for_cpu(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_single_for_device(e1, e2, e3, e4) +dma_sync_single_for_device(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_sg_for_cpu(e1, e2, e3, e4) +dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_sg_for_device(e1, e2, e3, e4) +dma_sync_sg_for_device(&e1->dev, e2, e3, e4) @@ expression e1, e2; @@ -pci_dma_mapping_error(e1, e2) +dma_mapping_error(&e1->dev, e2) @@ expression e1, e2; @@ -pci_set_dma_mask(e1, e2) +dma_set_mask(&e1->dev, e2) @@ expression e1, e2; @@ -pci_set_consistent_dma_mask(e1, e2) +dma_set_coherent_mask(&e1->dev, e2) Signed-off-by: Christophe JAILLET --- If needed, see post from Christoph Hellwig on the kernel-janitors ML: https://marc.info/?l=kernel-janitors&m=158745678307186&w=4 --- drivers/net/ethernet/qlogic/qla3xxx.c | 196 -- 1 file changed, 87 insertions(+), 109 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c b/drivers/net/ethernet/qlogic/qla3xxx.c index 27740c027681..214e347097a7 100644 --- a/drivers/net/ethernet/qlogic/qla3xxx.c +++ b/drivers/net/ethernet/qlogic/qla3xxx.c @@ -315,12 +315,11 @@ static void ql_release_to_lrg_buf_free_list(struct ql3_adapter *qdev, * 
buffer */ skb_reserve(lrg_buf_cb->skb, QL_HEADER_SPACE); - map = pci_map_single(qdev->pdev, + map = dma_map_single(&qdev->pdev->dev, lrg_buf_cb->skb->data, -qdev->lrg_buffer_len - -QL_HEADER_SPACE, -PCI_DMA_FROMDEVICE); - err = pci_dma_mapping_error(qdev->pdev, map); +qdev->lrg_buffer_len - QL_HEADER_SPACE, +DMA_FROM_DEVICE); + err = dma_mapping_error(&qdev->pdev->dev, map); if (err) { netdev_err(qdev->ndev, "PCI mapping failed with error: %d\n", @@ -1802,13 +1801,12 @@ static int ql_populate_free_queue(struct ql3_adapter *qdev) * first buffer */ skb_reserve(lrg_buf_cb->skb, QL_HEADER_SPAC
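
For illustration, the combined helper mentioned in the commit message
collapses the usual two-step mask setup into one call; a sketch of the
standard probe-time pattern (not literal code from this driver):

/* Try 64-bit DMA, fall back to 32-bit; one call sets both the
 * streaming and the coherent mask. Returns 0 on success. */
if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) &&
    dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
	dev_err(&pdev->dev, "No usable DMA configuration\n");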
[PATCH mlx5-next v3 4/5] net/mlx5: Dynamically assign MSI-X vectors count
From: Leon Romanovsky

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device. Statically assigning an
amount of MSI-X vectors doesn't allow the newly created VF to be fully
utilized, because the future load and configuration under which that VF
will be used are not known to the device.

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors. Such a change immediately increases the amount of MSI-X vectors
for a system with @ VFs from 12 vectors per-VF to 32 vectors per-VF.

Before this patch:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-

Signed-off-by: Leon Romanovsky
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   | 5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
 4 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..8269cfbfc69d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..2a35888fcff0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }
 
+/**
+ * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X vectors
+ *				       to be set
+ * @dev: PF to work on
+ * @num_vfs: Number of VFs was asked when SR-IOV was enabled
+ **/
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	/* Limit maximum number of MSI-X to leave some of them free in the
+	 * pool and ready to be assigned by the users without need to resize
+	 * other VFs.
+	 */
+	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
+ * @dev: PF to work on
+ * @function_id: Internal PCI VF function id
+ * @msix_vec_count: Number of MSI-X to set
+ **/
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+
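
To illustrate the default policy with made-up numbers: assuming
num_vf_msix = 2048, num_vfs = 64, min_msix = 2 and max_msix = 64, the
expression evaluates as max(min(2048 / 64, 64 / 2), 2) = max(min(32, 32), 2)
= 32 vectors per VF, which matches the Count=32 visible in the lspci output
above, while deliberately keeping half of the per-VF maximum in the pool for
later manual assignment.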
[PATCH mlx5-next v3 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
From: Leon Romanovsky

Extend the PCI sysfs interface with a new callback that allows
configuring the number of MSI-X vectors for a specific SR-IOV VF. This
is needed to optimize the performance of newly bound devices by
allocating the number of vectors based on the administrator's knowledge
of the targeted VM.

This function is applicable for SR-IOV VFs because such devices allocate
their MSI-X table before they will run on the VMs and the HW can't guess
the right number of vectors, so the HW allocates them statically and
equally.

The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count file will be
seen for the VFs and is writable as long as a driver is not bound to the
VF.

The values accepted are:
 * > 0 - this will be the number reported by the VF's MSI-X capability
 * < 0 - not valid
 * = 0 - will reset to the device default value

Signed-off-by: Leon Romanovsky
---
 Documentation/ABI/testing/sysfs-bus-pci | 20 ++
 drivers/pci/iov.c                       | 82 +
 drivers/pci/msi.c                       | 47 ++
 drivers/pci/pci-sysfs.c                 | 1 +
 drivers/pci/pci.h                       | 2 +
 include/linux/pci.h                     | 3 +
 6 files changed, 155 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..34a8c6bcde70 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,23 @@ Description:
 		The value comes from the PCI kernel device state and can be
 		one of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_msix_count
+Date:		December 2020
+Contact:	Leon Romanovsky
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator's knowledge of the targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be the number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		set sriov_vf_total_msix > 0 and there is no driver bound
+		to the VF.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..b8d7a3b26f87 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL(pci_iov_virtfn_devfn);
 
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -426,6 +427,68 @@ const struct attribute_group sriov_dev_attr_group = {
 	.is_visible = sriov_attrs_are_visible,
 };
 
+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int count = pci_msix_vec_count(pdev);
+
+	if (count < 0)
+		return count;
+
+	return sprintf(buf, "%d\n", count);
+}
+
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_vf_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(sriov_vf_msix_count);
+#endif
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *vf_dev;
+
+	if (dev_is_pf(dev))
+		return 0;
+
+	vf_dev = to_pci_dev(dev);
+	if (!vf_dev->msix_cap)
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+};
+
 int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
 	return 0;
@@ -1054,6 +1117,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYM
[PATCH mlx5-next v3 0/5] Dynamically assign MSI-X vectors count
From: Leon Romanovsky

Changelog

v3:
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide the sysfs entry if the device doesn't
   support MSI-X.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about
   missing kdoc description.
 * Split differently error print in mlx5 driver to avoid checkpatch
   warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-l...@kernel.org
 * Patch 1:
   * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
   * Added PF and VF device locks during set MSI-X call to protect from
     parallel driver bind/unbind operations.
   * Removed extra checks when reading sriov_vf_msix, because users will
     be able to distinguish between supported/not supported by looking on
     sriov_vf_total_msix count.
   * Changed all occurrences of "numb" to be "count"
   * Changed returned error from EOPNOTSUPP to be EBUSY if user tries to
     set MSI-X count after driver already bound to the VF.
   * Added extra comment in pci_set_msix_vec_count() to emphasize that
     driver should not be bound.
 * Patch 2:
   * Changed vf_total_msix from int to be u32 and updated function
     signatures accordingly.
   * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-l...@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support
   write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-l...@kernel.org

Hi,

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device. Statically assigning an
amount of MSI-X vectors doesn't allow the newly created VF to be fully
utilized, because the future load and configuration under which that VF
will be used are not known to the device. The VFs are created on the
hypervisor and forwarded to VMs that have different properties (for
example, the number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before the VF is initialized and bound to the driver.

Before this series:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/\:08\:00.2/sriov_vf_msix_count

After this series:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (5):
  PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV
    VFs
  PCI: Add SR-IOV sysfs entry to read total number of dynamic MSI-X
    vectors
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow to the users to configure number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       | 34 +++
 .../net/ethernet/mellanox/mlx5/core/main.c    | 5 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   | 6 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 ++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 51 +-
 drivers/pci/iov.c                             | 94 +++
 drivers/pci/msi.c                             | 47 ++
 drivers/pci/pci-sysfs.c                       | 1 +
 drivers/pci/pci.h                             | 5 +
 include/linux/mlx5/mlx5_ifc.h                 | 11 ++-
 include/linux/pci.h                           | 5 +
 11 files changed, 328 insertions(+), 3 deletions(-)

-- 
2.29.2
[PATCH net-next 5/5] selftests: mlxsw: RED: Add selftests for the mark qevent
From: Petr Machata Add do_mark_test(), which is to do_ecn_test() like do_drop_test() is to do_red_test(): meant to test that actions on the RED mark qevent block are offloaded, and executed on ECN-marked packets. Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel --- .../drivers/net/mlxsw/sch_red_core.sh | 82 +++ .../drivers/net/mlxsw/sch_red_ets.sh | 74 +++-- 2 files changed, 151 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh index d439ee0eaa8a..6c2ddf76b4b8 100644 --- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh @@ -544,6 +544,51 @@ do_mc_backlog_test() log_test "TC $((vlan - 10)): Qdisc reports MC backlog" } +do_mark_test() +{ + local vlan=$1; shift + local limit=$1; shift + local subtest=$1; shift + local fetch_counter=$1; shift + local should_fail=$1; shift + local base + + RET=0 + + start_tcp_traffic $h1.$vlan $(ipaddr 1 $vlan) $(ipaddr 3 $vlan) $h3_mac tos=0x01 + + # Create a bit of a backlog and observe no mirroring due to marks. + qevent_rule_install_$subtest + + build_backlog $vlan $((2 * limit / 3)) tcp tos=0x01 >/dev/null + + base=$($fetch_counter) + count=$(busywait 1100 until_counter_is ">= $((base + 1))" $fetch_counter) + check_fail $? "Spurious packets ($base -> $count) observed without buffer pressure" + + # Above limit, everything should be mirrored, we should see lots of + # packets. + build_backlog $vlan $((3 * limit / 2)) tcp tos=0x01 >/dev/null + busywait_for_counter 1100 +1 \ +$fetch_counter > /dev/null + check_err_fail "$should_fail" $? "ECN-marked packets $subtest'd" + + # When the rule is uninstalled, there should be no mirroring. + qevent_rule_uninstall_$subtest + busywait_for_counter 1100 +10 \ +$fetch_counter > /dev/null + check_fail $? 
"Spurious packets observed after uninstall" + + if ((should_fail)); then + log_test "TC $((vlan - 10)): marked packets not $subtest'd" + else + log_test "TC $((vlan - 10)): marked packets $subtest'd" + fi + + stop_traffic + sleep 1 +} + do_drop_test() { local vlan=$1; shift @@ -626,6 +671,22 @@ do_drop_mirror_test() tc filter del dev $h2 ingress pref 1 handle 101 flower } +do_mark_mirror_test() +{ + local vlan=$1; shift + local limit=$1; shift + + tc filter add dev $h2 ingress pref 1 handle 101 prot ip \ + flower skip_sw ip_proto tcp \ + action drop + + do_mark_test "$vlan" "$limit" mirror \ +qevent_counter_fetch_mirror \ +$(: should_fail=)0 + + tc filter del dev $h2 ingress pref 1 handle 101 flower +} + qevent_rule_install_trap() { tc filter add block 10 pref 1234 handle 102 matchall skip_sw \ @@ -653,3 +714,24 @@ do_drop_trap_test() do_drop_test "$vlan" "$limit" "$trap_name" trap \ "qevent_counter_fetch_trap $trap_name" } + +do_mark_trap_test() +{ + local vlan=$1; shift + local limit=$1; shift + local should_fail=$1; shift + + do_mark_test "$vlan" "$limit" trap \ +"qevent_counter_fetch_trap ecn_mark" \ +"$should_fail" +} + +do_mark_trap_test_pass() +{ + do_mark_trap_test "$@" 0 +} + +do_mark_trap_test_fail() +{ + do_mark_trap_test "$@" 1 +} diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh index 3f007c5f8361..566cad32f346 100755 --- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh @@ -9,6 +9,8 @@ ALL_TESTS=" mc_backlog_test red_mirror_test red_trap_test + ecn_trap_test + ecn_mirror_test " : ${QDISC:=ets} source sch_red_core.sh @@ -21,28 +23,60 @@ source sch_red_core.sh BACKLOG1=20 BACKLOG2=50 -install_qdisc() +install_root_qdisc() { - local -a args=("$@") - tc qdisc add dev $swp3 root handle 10: $QDISC \ bands 8 priomap 7 6 5 4 3 2 1 0 +} + +install_qdisc_tc0() +{ + local -a args=("$@") + tc qdisc add dev $swp3 parent 10:8 handle 108: red \ limit 100 min $BACKLOG1 max $((BACKLOG1 + 1)) \ probability 1.0 avpkt 8000 burst 38 "${args[@]}" +} + +install_qdisc_tc1() +{ + local -a args=("$@") + tc qdisc add dev $swp3 parent 10:7 handle 107: red \ limit 100 min $BACKLOG2 max $((BACKLOG2 + 1)) \ probability 1.0 avpkt 8000 burst 63 "${args[@]}" +} + +install_qdisc() +{ + install_root_qdisc + install_qdisc_tc0 "$@" + inst
[PATCH mlx5-next v3 3/5] net/mlx5: Add dynamic MSI-X capabilities bits
From: Leon Romanovsky These new fields declare the number of MSI-X vectors that is possible to allocate on the VF through PF configuration. Value must be in range defined by min_dynamic_vf_msix_table_size and max_dynamic_vf_msix_table_size. The driver should continue to query its MSI-X table through PCI configuration header. Signed-off-by: Leon Romanovsky --- include/linux/mlx5/mlx5_ifc.h | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 8c5d5fe58051..7ac614bc592a 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -1656,7 +1656,16 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_at_6e0[0x10]; u8 sf_base_id[0x10]; - u8 reserved_at_700[0x80]; + u8 reserved_at_700[0x8]; + u8 num_total_dynamic_vf_msix[0x18]; + u8 reserved_at_720[0x14]; + u8 dynamic_msix_table_size[0xc]; + u8 reserved_at_740[0xc]; + u8 min_dynamic_vf_msix_table_size[0x4]; + u8 reserved_at_750[0x4]; + u8 max_dynamic_vf_msix_table_size[0xc]; + + u8 reserved_at_760[0x20]; u8 vhca_tunnel_commands[0x40]; u8 reserved_at_7c0[0x40]; }; -- 2.29.2
[PATCH net] net: lapb: Add locking to the lapb module
In the lapb module, the timers may run concurrently with other code in this module, and there is currently no locking to prevent the code from racing on "struct lapb_cb". This patch adds locking to prevent racing. 1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and "spin_unlock_bh" to APIs, timer functions and notifier functions. 2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us able to ask running timers to abort; Modify "lapb_stop_t1timer" and "lapb_stop_t2timer" to make them able to abort running timers; Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer", and "lapb_start_t1timer", "lapb_start_t2timer". 3. In lapb_unregister, change "lapb_stop_t1timer" and "lapb_stop_t2timer" to "del_timer_sync" to make sure all running timers have exited. 4. In lapb_device_event, replace the "lapb_disconnect_request" call with the content of "lapb_disconnect_request", to avoid trying to hold the lock twice. When doing this, "lapb_start_t1timer" is removed because I don't think it's necessary to start the timer when "NETDEV_GOING_DOWN". Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: Martin Schiller Signed-off-by: Xie He --- include/net/lapb.h| 2 ++ net/lapb/lapb_iface.c | 54 +++ net/lapb/lapb_timer.c | 30 3 files changed, 78 insertions(+), 8 deletions(-) diff --git a/include/net/lapb.h b/include/net/lapb.h index ccc3d1f020b0..eee73442a1ba 100644 --- a/include/net/lapb.h +++ b/include/net/lapb.h @@ -92,6 +92,7 @@ struct lapb_cb { unsigned short n2, n2count; unsigned short t1, t2; struct timer_list t1timer, t2timer; + boolt1timer_stop, t2timer_stop; /* Internal control information */ struct sk_buff_head write_queue; @@ -103,6 +104,7 @@ struct lapb_cb { struct lapb_frame frmr_data; unsigned char frmr_type; + spinlock_t lock; refcount_t refcnt; }; diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c index 40961889e9c0..ad6197381d05 100644 --- a/net/lapb/lapb_iface.c +++ b/net/lapb/lapb_iface.c @@ -122,6 +122,8 @@ static struct lapb_cb *lapb_create_cb(void) timer_setup(&lapb->t1timer, NULL, 0); timer_setup(&lapb->t2timer, NULL, 0); + lapb->t1timer_stop = true; + lapb->t2timer_stop = true; lapb->t1 = LAPB_DEFAULT_T1; lapb->t2 = LAPB_DEFAULT_T2; @@ -129,6 +131,8 @@ static struct lapb_cb *lapb_create_cb(void) lapb->mode= LAPB_DEFAULT_MODE; lapb->window = LAPB_DEFAULT_WINDOW; lapb->state = LAPB_STATE_0; + + spin_lock_init(&lapb->lock); refcount_set(&lapb->refcnt, 1); out: return lapb; @@ -178,8 +182,8 @@ int lapb_unregister(struct net_device *dev) goto out; lapb_put(lapb); - lapb_stop_t1timer(lapb); - lapb_stop_t2timer(lapb); + del_timer_sync(&lapb->t1timer); + del_timer_sync(&lapb->t2timer); lapb_clear_queues(lapb); @@ -201,6 +205,8 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + parms->t1 = lapb->t1 / HZ; parms->t2 = lapb->t2 / HZ; parms->n2 = lapb->n2; @@ -219,6 +225,7 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) else parms->t2timer = (lapb->t2timer.expires - jiffies) / HZ; + spin_unlock_bh(&lapb->lock); lapb_put(lapb); rc = LAPB_OK; out: @@ -234,6 +241,8 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_INVALUE; if (parms->t1 < 1 || parms->t2 < 1 || parms->n2 < 1) goto out_put; @@ -256,6 +265,7 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) rc = 
LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -270,6 +280,8 @@ int lapb_connect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_OK; if (lapb->state == LAPB_STATE_1) goto out_put; @@ -285,6 +297,7 @@ int lapb_connect_request(struct net_device *dev) rc = LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -299,6 +312,8 @@ int lapb_disconnect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + switch (lapb->state) { case LAPB_STATE_0: rc = LAPB_NOTCONNECTED; @@ -330,6 +345,7 @@ int l
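
In rough sketch form, the stop/expiry interplay that points 1 and 2 of the
commit message describe (abridged for illustration; not the literal diff):

void lapb_stop_t1timer(struct lapb_cb *lapb)
{
	lapb->t1timer_stop = true;	/* ask a running timer to abort */
	del_timer(&lapb->t1timer);
}

static void lapb_t1timer_expiry(struct timer_list *t)
{
	struct lapb_cb *lapb = from_timer(lapb, t, t1timer);

	spin_lock_bh(&lapb->lock);
	if (lapb->t1timer_stop) {	/* raced with a stop request: bail out */
		spin_unlock_bh(&lapb->lock);
		return;
	}
	/* ... normal T1 expiry handling, now under lapb->lock ... */
	spin_unlock_bh(&lapb->lock);
}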
Re: [PATCH v2 net-next 19/21] net/mlx5e: NVMEoTCP, data-path for DDP offload
On 16/01/2021 6:57, David Ahern wrote: > I have not had time to review this version of the patches, but this > patch seems very similar to 13 of 15 from v1 and you did not respond to > my question on it ... > > On 1/14/21 8:10 AM, Boris Pismenny wrote: >> diff --git >> a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> new file mode 100644 >> index ..f446b5d56d64 >> --- /dev/null >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> @@ -0,0 +1,243 @@ >> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB >> +/* Copyright (c) 2021 Mellanox Technologies. */ >> + >> +#include "en_accel/nvmeotcp_rxtx.h" >> +#include "en_accel/nvmeotcp.h" >> +#include >> + >> +#define MLX5E_TC_FLOW_ID_MASK 0x00ff >> +static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue, >> + struct mlx5e_cqe128 *cqe128) >> +{ >> +const struct tcp_ddp_ulp_ops *ulp_ops; >> +u32 seq; >> + >> +seq = be32_to_cpu(cqe128->resync_tcp_sn); >> +ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops; >> +if (ulp_ops && ulp_ops->resync_request) >> +ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ); >> +} >> + >> +static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue >> *queue) >> +{ >> +struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid]; >> + >> +queue->ccoff += nqe->sgl[queue->ccsglidx].length; >> +queue->ccoff_inner = 0; >> +queue->ccsglidx++; >> +} >> + >> +static inline void >> +mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb, >> +struct mlx5e_nvmeotcp_queue *queue, >> +struct nvmeotcp_queue_entry *nqe, u32 fragsz) >> +{ >> +dma_sync_single_for_cpu(&netdev->dev, >> +nqe->sgl[queue->ccsglidx].offset + >> queue->ccoff_inner, >> +fragsz, DMA_FROM_DEVICE); >> +page_ref_inc(compound_head(sg_page(&nqe->sgl[queue->ccsglidx]))); >> +// XXX: consider reducing the truesize, as no new memory is consumed >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +sg_page(&nqe->sgl[queue->ccsglidx]), >> +nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner, >> +fragsz, >> +fragsz); >> +} >> + >> +static struct sk_buff* >> +mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue, >> + struct sk_buff *skb, skb_frag_t *org_frags, >> + int org_nr_frags, int frag_index) >> +{ >> +struct mlx5e_priv *priv = queue->priv; >> + >> +while (org_nr_frags != frag_index) { >> +if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { >> +dev_kfree_skb_any(skb); >> +return NULL; >> +} >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +skb_frag_page(&org_frags[frag_index]), >> +skb_frag_off(&org_frags[frag_index]), >> +skb_frag_size(&org_frags[frag_index]), >> +skb_frag_size(&org_frags[frag_index])); >> +page_ref_inc(skb_frag_page(&org_frags[frag_index])); >> +frag_index++; >> +} >> +return skb; >> +} >> + >> +static struct sk_buff* >> +mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff >> *skb, >> + int offset, int len) >> +{ >> +struct mlx5e_priv *priv = queue->priv; >> + >> +if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { >> +dev_kfree_skb_any(skb); >> +return NULL; >> +} >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +virt_to_page(skb->data), >> +offset, >> +len, >> +len); >> +page_ref_inc(virt_to_page(skb->data)); >> +return skb; >> +} >> + >> +static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb, >> + skb_frag_t *org_frags, >> + int *frag_index, >> + int remaining) >> +{ >> +unsigned int frag_size; >> +int nr_frags; >> + 
>> +/* skip @remaining bytes in frags */ >> +*frag_index = 0; >> +while (remaining) { >> +frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]); >> +if (frag_size > remaining) { >> +skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index], >> + remaining); >> +skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index], >> + remaining); >> +remaining = 0; >> +} else { >> +remaining -= fra
[PATCH net-next 0/5] mlxsw: Add support for RED qevent "mark"
From: Ido Schimmel

Petr says:

The RED qdisc currently supports two qevents: "early_drop" and "mark".
The filters added to the block bound to the "early_drop" qevent are
executed on packets for which the RED algorithm decides that they should
be early-dropped. The "mark" filters are similarly executed on ECT
packets that are marked as ECN-CE (Congestion Experienced).

A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
later, provided that the classifier used is "matchall", that the action
used is either "trap" or "mirred", and a handful of further limitations.
This patchset similarly offloads "mark" filters.

Patch set overview:

Patches #1 and #2 add the trap, under which packets will be reported to
the CPU, if the qevent filter uses the action "trap".

Patch #3 then recognizes FLOW_BLOCK_BINDER_TYPE_RED_MARK as a binder
type, and offloads the attached filters similarly to _EARLY_DROP.

Patch #4 cleans up some unused variables in a selftest, and patch #5
adds a new selftest for the RED "mark" qevent offload.

Petr Machata (5):
  devlink: Add ecn_mark trap
  mlxsw: spectrum_trap: Add ecn_mark trap
  mlxsw: spectrum_qdisc: Offload RED qevent mark
  selftests: mlxsw: sch_red_core: Drop two unused variables
  selftests: mlxsw: RED: Add selftests for the mark qevent

 .../networking/devlink/devlink-trap.rst       | 4 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 2 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    | 2 +
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 14 +++-
 .../ethernet/mellanox/mlxsw/spectrum_span.c   | 16
 .../ethernet/mellanox/mlxsw/spectrum_span.h   | 1 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 9 ++
 include/net/devlink.h                         | 3 +
 net/core/devlink.c                            | 1 +
 .../drivers/net/mlxsw/sch_red_core.sh         | 84 ++-
 .../drivers/net/mlxsw/sch_red_ets.sh          | 74 ++--
 11 files changed, 200 insertions(+), 10 deletions(-)

-- 
2.29.2
[PATCH rdma-rc] Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"
From: Parav Pandit This reverts commit fbdd0049d98d44914fc57d4b91f867f4996c787b. Due to commit in fixes tag, netdevice events were received only in one net namespace of mlx5_core_dev. Due to this when netdevice events arrive in net namespace other than net namespace of mlx5_core_dev, they are missed. This results in empty GID table due to RDMA device being detached from its net device. Hence, revert back to receive netdevice events in all net namespaces to restore back RDMA functionality in non init_net net namespace. Fixes: fbdd0049d98d ("RDMA/mlx5: Fix devlink deadlock on net namespace deletion") Signed-off-by: Parav Pandit Signed-off-by: Leon Romanovsky --- drivers/infiniband/hw/mlx5/main.c | 6 ++ .../net/ethernet/mellanox/mlx5/core/lib/mlx5.h | 5 + include/linux/mlx5/driver.h| 18 -- 3 files changed, 7 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index ed13c1bb031e..36f8ae4fe619 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -3335,8 +3335,7 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) int err; dev->port[port_num].roce.nb.notifier_call = mlx5_netdev_event; - err = register_netdevice_notifier_net(mlx5_core_net(dev->mdev), - &dev->port[port_num].roce.nb); + err = register_netdevice_notifier(&dev->port[port_num].roce.nb); if (err) { dev->port[port_num].roce.nb.notifier_call = NULL; return err; @@ -3348,8 +3347,7 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) static void mlx5_remove_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) { if (dev->port[port_num].roce.nb.notifier_call) { - unregister_netdevice_notifier_net(mlx5_core_net(dev->mdev), - &dev->port[port_num].roce.nb); + unregister_netdevice_notifier(&dev->port[port_num].roce.nb); dev->port[port_num].roce.nb.notifier_call = NULL; } } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h index 3a9fa629503f..d046db7bb047 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h @@ -90,4 +90,9 @@ int mlx5_create_encryption_key(struct mlx5_core_dev *mdev, u32 key_type, u32 *p_key_id); void mlx5_destroy_encryption_key(struct mlx5_core_dev *mdev, u32 key_id); +static inline struct net *mlx5_core_net(struct mlx5_core_dev *dev) +{ + return devlink_net(priv_to_devlink(dev)); +} + #endif diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 4901b4fadabb..c4939b28ceed 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -1210,22 +1210,4 @@ static inline bool mlx5_is_roce_enabled(struct mlx5_core_dev *dev) return val.vbool; } -/** - * mlx5_core_net - Provide net namespace of the mlx5_core_dev - * @dev: mlx5 core device - * - * mlx5_core_net() returns the net namespace of mlx5 core device. - * This can be called only in below described limited context. - * (a) When a devlink instance for mlx5_core is registered and - * when devlink reload operation is disabled. - * or - * (b) during devlink reload reload_down() and reload_up callbacks - * where it is ensured that devlink instance's net namespace is - * stable. - */ -static inline struct net *mlx5_core_net(struct mlx5_core_dev *dev) -{ - return devlink_net(priv_to_devlink(dev)); -} - #endif /* MLX5_DRIVER_H */ -- 2.29.2
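
For context, the two registration APIs at the heart of this revert, as a
minimal contrast (nb stands for the roce notifier_block above):

/* Per-namespace variant used by the reverted commit: nb only sees
 * events for netdevs in the given namespace, so events arriving in
 * other namespaces were missed. */
register_netdevice_notifier_net(mlx5_core_net(dev->mdev), &nb);

/* Global variant restored by this revert: nb sees netdev events
 * from all network namespaces. */
register_netdevice_notifier(&nb);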
[PATCH net-next] net: core: init every ctl_table in netns_core_table
From: Menglong Dong

For now, there is only one element in netns_core_table, and it is initialized directly in sysctl_core_net_init(). To make this more flexible, initialize every element in a loop, just like ipv4_sysctl_init_net() does.

Signed-off-by: Menglong Dong
---
 net/core/sysctl_net_core.c | 12
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d86d8d11cfe4..966d976dee84 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -606,15 +606,19 @@ static __net_init int sysctl_core_net_init(struct net *net)
 	tbl = netns_core_table;
 	if (!net_eq(net, &init_net)) {
+		int i;
+
 		tbl = kmemdup(tbl, sizeof(netns_core_table), GFP_KERNEL);
 		if (tbl == NULL)
 			goto err_dup;
 
-		tbl[0].data = &net->core.sysctl_somaxconn;
+		/* Update the variables to point into the current struct net */
+		for (i = 0; i < ARRAY_SIZE(netns_core_table) - 1; i++) {
+			tbl[i].data += (void *)net - (void *)&init_net;
 
-		/* Don't export any sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns) {
-			tbl[0].procname = NULL;
+			/* Don't export any sysctls to unprivileged users */
+			if (net->user_ns != &init_user_ns)
+				tbl[i].procname = NULL;
 		}
 	}
--
2.30.0
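To illustrate the rebasing idiom used above: each .data pointer initially points at a field of init_net, so adding the byte offset between the new struct net and init_net retargets it at the same field of the new namespace. A minimal stand-alone sketch of the idea (illustrative types only, not the kernel's; written with char * arithmetic so it is standard C, whereas the kernel does the same thing on void *, a GNU C extension):

#include <stdio.h>
#include <stddef.h>

/* Illustrative stand-ins for struct net and a sysctl table entry. */
struct net {
	int sysctl_somaxconn;
};

struct ctl_entry {
	const char *procname;
	void *data;		/* initially points into init_net */
};

struct net init_net  = { .sysctl_somaxconn = 128 };
struct net other_net = { .sysctl_somaxconn = 4096 };

int main(void)
{
	struct ctl_entry tbl[] = {
		{ "somaxconn", &init_net.sysctl_somaxconn },
	};
	size_t i;

	/* Rebase each .data pointer from init_net into other_net,
	 * the same arithmetic as the loop in sysctl_core_net_init(). */
	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
		tbl[i].data = (char *)tbl[i].data +
			      ((char *)&other_net - (char *)&init_net);

	printf("%d\n", *(int *)tbl[0].data);	/* prints 4096 */
	return 0;
}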
ibmvnic: Race condition in remove callback
Hello,

while working on some cleanup I stumbled over a problem in the ibmvnic's remove callback. Since commit 7d7195a026ba ("ibmvnic: Do not process device remove during device reset") there is the following code in the remove callback:

static int ibmvnic_remove(struct vio_dev *dev)
{
	...
	spin_lock_irqsave(&adapter->state_lock, flags);
	if (test_bit(0, &adapter->resetting)) {
		spin_unlock_irqrestore(&adapter->state_lock, flags);
		return -EBUSY;
	}

	adapter->state = VNIC_REMOVING;
	spin_unlock_irqrestore(&adapter->state_lock, flags);

	flush_work(&adapter->ibmvnic_reset);
	flush_delayed_work(&adapter->ibmvnic_delayed_reset);
	...
}

Unfortunately returning -EBUSY doesn't work as intended: the return value of this function is ignored[1], and the device is considered unbound by the device core (shortly) after ibmvnic_remove() returns.

While looking into fixing that I noticed a worse problem: if ibmvnic_reset() (e.g. called by the tx_timeout callback) calls

	schedule_work(&adapter->ibmvnic_reset);

just after the work queue is flushed above, the problem that 7d7195a026ba intends to fix will trigger, resulting in a use-after-free. Also, ibmvnic_reset() checks adapter->state without holding the lock, which might be racy, too.

Best regards
Uwe

[1] vio_bus_remove (in arch/powerpc/platforms/pseries/vio.c) records the return value and passes it on. But the driver core doesn't care about the return value (see __device_release_driver() in drivers/base/dd.c calling dev->bus->remove()).

--
Pengutronix e.K.           | Uwe Kleine-König            |
Industrial Linux Solutions | https://www.pengutronix.de/ |
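For what it's worth, one conceivable direction (only a sketch, untested, and abbreviating the driver's existing enqueue logic) would be to let ibmvnic_reset() take state_lock as well, so that checking the adapter state and queuing the work become atomic with respect to the removal path above:

/* Untested sketch: refuse to queue reset work once removal has begun,
 * doing the state check and the enqueue under the same lock that the
 * remove callback takes before flushing the work queues. */
static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
			 enum ibmvnic_reset_reason reason)
{
	unsigned long flags;
	int ret = 0;

	spin_lock_irqsave(&adapter->state_lock, flags);
	if (adapter->state == VNIC_REMOVING ||
	    adapter->state == VNIC_REMOVED) {
		ret = -EBUSY;
		goto out_unlock;
	}

	/* ... build and queue the rwi entry as the driver does today ... */
	schedule_work(&adapter->ibmvnic_reset);

out_unlock:
	spin_unlock_irqrestore(&adapter->state_lock, flags);
	return ret;
}

With something like that, nothing can be scheduled after the flush_work()/flush_delayed_work() calls in ibmvnic_remove(), closing the window described above.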
Re: [PATCH net-next V2] net: ks8851: Fix mixed module/builtin build
On Sat, Jan 16, 2021 at 10:41 PM Heiner Kallweit wrote:
> >> It seems unlikely that a system uses both, the parallel *and* the SPI
> >> variant of the ks8851. So the additional memory necessary because of
> >> code duplication wouldn't matter in practice.
> >
> > I have a board with both options populated on my desk, sorry.
>
> Making the common part a separate module shouldn't be that hard.
> AFAICS it would just take:
> - export 4 functions from common
> - extend Kconfig
> - extend Makefile
> One similar configuration that comes to my mind and could be used as
> template is SPI_FSL_LIB.

There is no need to even change Kconfig, just simplify the Makefile to:

obj-$(CONFIG_KS8851) += ks8851_common.o ks8851_spi.o
obj-$(CONFIG_KS8851_MLL) += ks8851_common.o ks8851_par.o

This will do the right thing and build ks8851_common into vmlinux if at least one of the two front-ends is built-in, and otherwise build it as a loadable module if there is another module using it.

Arnd
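If the separate-module route were chosen instead, the common part would only need to export its handful of entry points. A sketch (assuming probe/remove/suspend/resume are the four functions Heiner counts; the exact names would follow whatever the common code actually exposes):

/* ks8851_common.c: sketch of the exports a standalone common module
 * would need so the SPI and parallel front-ends can link against it. */
EXPORT_SYMBOL_GPL(ks8851_probe_common);
EXPORT_SYMBOL_GPL(ks8851_remove_common);
EXPORT_SYMBOL_GPL(ks8851_suspend);
EXPORT_SYMBOL_GPL(ks8851_resume);

Either way, the front-ends stay thin and the register-access logic lives in one place.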
[PATCH net-next] net: core: Namespace-ify sysctl_wmem_default and sysctl_rmem_default
From: Menglong Dong

For now, sysctl_wmem_default and sysctl_rmem_default are globally unified. That's not convenient in some cases; for example, when using docker and trying to control the default udp socket receive buffer for each container. For that reason, make sysctl_wmem_default and sysctl_rmem_default per-namespace.

Signed-off-by: Menglong Dong
---
 include/net/netns/core.h   |  2 ++
 include/net/sock.h         |  3 ---
 net/core/net_namespace.c   |  2 ++
 net/core/sock.c            |  6 ++
 net/core/sysctl_net_core.c | 32
 net/ipv4/ip_output.c       |  2 +-
 6 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 36c2d998a43c..317b47df6d08 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -9,6 +9,8 @@ struct netns_core {
 	/* core sysctls */
 	struct ctl_table_header	*sysctl_hdr;
 
+	int	sysctl_wmem_default;
+	int	sysctl_rmem_default;
 	int	sysctl_somaxconn;
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/sock.h b/include/net/sock.h
index bdc4323ce53c..b846a6d24459 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2653,9 +2653,6 @@ extern __u32 sysctl_rmem_max;
 extern int sysctl_tstamp_allow_data;
 extern int sysctl_optmem_max;
 
-extern __u32 sysctl_wmem_default;
-extern __u32 sysctl_rmem_default;
-
 DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
 
 static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2ef3b4557f40..eb4ea99131d6 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -374,6 +374,8 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 
 static int __net_init net_defaults_init_net(struct net *net)
 {
+	net->core.sysctl_rmem_default = SK_RMEM_MAX;
+	net->core.sysctl_wmem_default = SK_WMEM_MAX;
 	net->core.sysctl_somaxconn = SOMAXCONN;
 	return 0;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index bbcd4b97eddd..2421e4ea1915 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -270,8 +270,6 @@ __u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
 EXPORT_SYMBOL(sysctl_wmem_max);
 __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
 EXPORT_SYMBOL(sysctl_rmem_max);
-__u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
-__u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
@@ -2970,8 +2968,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	timer_setup(&sk->sk_timer, NULL, 0);
 
 	sk->sk_allocation	=	GFP_KERNEL;
-	sk->sk_rcvbuf		=	sysctl_rmem_default;
-	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_rcvbuf		=	sock_net(sk)->core.sysctl_rmem_default;
+	sk->sk_sndbuf		=	sock_net(sk)->core.sysctl_wmem_default;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 966d976dee84..5c1c75e42a09 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -326,22 +326,6 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &min_rcvbuf,
 	},
-	{
-		.procname	= "wmem_default",
-		.data		= &sysctl_wmem_default,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &min_sndbuf,
-	},
-	{
-		.procname	= "rmem_default",
-		.data		= &sysctl_rmem_default,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &min_rcvbuf,
-	},
 	{
 		.procname	= "dev_weight",
 		.data		= &weight_p,
@@ -584,6 +568,22 @@ static struct ctl_table netns_core_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.proc_handler	= proc_dointvec_minmax
 	},
+	{
+		.procname	= "wmem_default",
+		.data		= &init_net.core.sysctl_wmem_default,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_sndbuf,
+	},
+	{
+		.procname	= "rmem_default",
+		.data		= &init_net.core.sysctl_rmem_default,
+		.maxlen		= sizeof(int),
+		.mode
[PATCH net-next] net: core: Namespace-ify sysctl_rmem_max and sysctl_wmem_max
From: Menglong Dong

For now, sysctl_wmem_max and sysctl_rmem_max are globally unified. That's not convenient in some cases; for example, when using docker and trying to control the maximum udp socket receive buffer for each container. For that reason, make sysctl_wmem_max and sysctl_rmem_max per-namespace.

Signed-off-by: Menglong Dong
---
 include/net/netns/core.h        |  2 ++
 include/net/sock.h              |  3 ---
 net/core/filter.c               |  4 ++--
 net/core/net_namespace.c        |  2 ++
 net/core/sock.c                 | 12
 net/core/sysctl_net_core.c      | 32
 net/ipv4/tcp_output.c           |  2 +-
 net/netfilter/ipvs/ip_vs_sync.c |  4 ++--
 8 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 317b47df6d08..b4aecac6e8ce 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -11,6 +11,8 @@ struct netns_core {
 
 	int	sysctl_wmem_default;
 	int	sysctl_rmem_default;
+	int	sysctl_wmem_max;
+	int	sysctl_rmem_max;
 	int	sysctl_somaxconn;
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/sock.h b/include/net/sock.h
index b846a6d24459..f6b0f2c482ad 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2647,9 +2647,6 @@ void sk_get_meminfo(const struct sock *sk, u32 *meminfo);
 #define SK_WMEM_MAX	(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 #define SK_RMEM_MAX	(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 
-extern __u32 sysctl_wmem_max;
-extern __u32 sysctl_rmem_max;
-
 extern int sysctl_tstamp_allow_data;
 extern int sysctl_optmem_max;
diff --git a/net/core/filter.c b/net/core/filter.c
index 255aeee72402..3dca58f6c40c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4717,13 +4717,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		/* Only some socketops are supported */
 		switch (optname) {
 		case SO_RCVBUF:
-			val = min_t(u32, val, sysctl_rmem_max);
+			val = min_t(u32, val, sock_net(sk)->core.sysctl_rmem_max);
 			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 			WRITE_ONCE(sk->sk_rcvbuf,
 				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
 			break;
 		case SO_SNDBUF:
-			val = min_t(u32, val, sysctl_wmem_max);
+			val = min_t(u32, val, sock_net(sk)->core.sysctl_wmem_max);
 			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
 			WRITE_ONCE(sk->sk_sndbuf,
 				   max_t(int, val * 2, SOCK_MIN_SNDBUF));
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index eb4ea99131d6..552e3c5b2a41 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -376,6 +376,8 @@ static int __net_init net_defaults_init_net(struct net *net)
 {
 	net->core.sysctl_rmem_default = SK_RMEM_MAX;
 	net->core.sysctl_wmem_default = SK_WMEM_MAX;
+	net->core.sysctl_rmem_max = SK_RMEM_MAX;
+	net->core.sysctl_wmem_max = SK_WMEM_MAX;
 	net->core.sysctl_somaxconn = SOMAXCONN;
 	return 0;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 2421e4ea1915..eb7eaaa840ce 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -265,12 +265,6 @@ static struct lock_class_key af_wlock_keys[AF_MAX];
 static struct lock_class_key af_elock_keys[AF_MAX];
 static struct lock_class_key af_kern_callback_keys[AF_MAX];
 
-/* Run time adjustable parameters. */
-__u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
-EXPORT_SYMBOL(sysctl_wmem_max);
-__u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
-EXPORT_SYMBOL(sysctl_rmem_max);
-
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
@@ -877,7 +871,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		val = min_t(u32, val, sysctl_wmem_max);
+		val = min_t(u32, val, sock_net(sk)->core.sysctl_wmem_max);
set_sndbuf:
 		/* Ensure val * 2 fits into an int, to prevent max_t()
 		 * from treating it as a negative value.
@@ -909,7 +903,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
+		__sock_set_rcvbuf(sk,
+				  min_t(u32, val,
+					sock_net(sk)->core.sysctl_rmem_max));
 		break;
 
 	case SO_RCVBUFFORCE:
diff -
Re: [PATCH net-next V2 1/8] net: netdevice: Add operation ndo_sk_get_slave
On 1/17/2021 4:51 AM, Jakub Kicinski wrote: On Thu, 14 Jan 2021 20:01:28 +0200 Tariq Toukan wrote: ndo_sk_get_slave returns the slave that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest slave netdevice. Please don't add new uses of the word "slave" outside of the bond, and preferably even there. I'll fix this.
Re: [PATCH net-next V2 5/8] net/bonding: Implement TLS TX device offload
On 1/17/2021 4:54 AM, Jakub Kicinski wrote:

On Thu, 14 Jan 2021 20:01:32 +0200 Tariq Toukan wrote:

As the bond interface is being bypassed by the TLS module, interacting directly against the slaves, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the current offload status based on the logic under bond_sk_check().

In that case why set it in ->hw_features? IIRC features set only in ->features but not ->hw_features show up to userspace as "fixed", which I gather is what we want here, no?

On one hand, by showing "off [Fixed]" we might hide the fact that the bond driver now does support the TLS offload feature; you simply need to choose the proper mode/xmit_policy. On the other hand, as toggling the feature flag has no impact at all, I don't see a point in opening it for toggling. So yeah, I'll fix.

+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	bond_dev->hw_features |= BOND_TLS_FEATURES;
+	if (bond_sk_check(bond))
+		bond_dev->features |= BOND_TLS_FEATURES;
+#endif
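To spell out the "fixed" behaviour under discussion: ethtool can only toggle bits that the driver advertises in dev->hw_features; a bit present in dev->features alone is reported as "[fixed]". A sketch of the two registration styles (an illustrative fragment, not any specific driver's code):

/* Toggleable: advertised in hw_features, so "ethtool -K" may flip it. */
dev->hw_features |= NETIF_F_HW_TLS_TX;
dev->features    |= NETIF_F_HW_TLS_TX;	/* current state: on */

/* Driver-controlled only: not in hw_features, shown as "on [fixed]";
 * only driver code such as ndo_fix_features() can change it, which is
 * what the bonding series ends up relying on. */
dev->features |= NETIF_F_HW_TLS_TX;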
Re: [PATCH v2 net-next 01/14] net: mscc: ocelot: allow offloading of bridge on top of LAG
On Sat, Jan 16, 2021 at 05:26:23PM -0800, Jakub Kicinski wrote: > On Sat, 16 Jan 2021 02:59:30 +0200 Vladimir Oltean wrote: > > From: Vladimir Oltean > > > > Commit 7afb3e575e5a ("net: mscc: ocelot: don't handle netdev events for > > other netdevs") was too aggressive, and it made ocelot_netdevice_event > > react only to network interface events emitted for the ocelot switch > > ports. > > > > In fact, only the PRECHANGEUPPER should have had that check. > > > > When we ignore all events that are not for us, we miss the fact that the > > upper of the LAG changes, and the bonding interface gets enslaved to a > > bridge. This is an operation we could offload under certain conditions. > > I see the commit in question is in net, perhaps worth spelling out why > this is not a fix? Perhaps add some "in the future" to the last > sentence if it's the case that this will only matter with the following > patches applied? It is a fix. However, so is patch 13/14 "net: mscc: ocelot: rebalance LAGs on link up/down events", but I didn't see an easy way to backport that. Honestly the reasons why I did not attempt to split this series into a part for "net" and one for "net-next" are: (a) It would unnecessarily complicate my work for felix DSA, where this is considered a new feature as opposed to ocelot switchdev where it was supposedly already working (although.. not quite, due to the lack of rebalancing, a link down would throw off the LAG). I don't really think that anybody was seriously using LAG offload on ocelot so far. (b) Even if I were to split this patch, it can only be trivially backported as far as commit 9c90eea310f8 ("net: mscc: ocelot: move net_device related functions to ocelot_net.c") from June 2020 anyway. (c) I cannot test the mscc_ocelot.ko switchdev driver with traffic, since I don't have the hardware (I just have a local patch that I keep rebasing on top of net-next which makes me able to at least probe it and access its registers on a different switch revision, but the traffic I/O procedure there is completely different). So I can not really confirm what is the state I'm leaving the mscc_ocelot driver in, for stable kernels. At least now, I've made the entry points into the control code path very similar to those of DSA, and I've exercised the switchdev driver in blind (without traffic), so I have a bit more confidence that it should work. (d) Had the AUTOSEL guys picked up this patch, I would have probably had no objection (since my belief is that there's nothing to break and nothing to fix in stable kernels). That being said, if we want to engage in a rigid demonstration of procedures, sure we can do that. I have other patches anyway to fill the pipeline until "net" is merged back into "net-next" :)
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Jakub,

On 17/01/2021 01:46, Jakub Kicinski wrote:

On Sat, 9 Jan 2021 23:00:21 -0800 Pravin B Shelar wrote:

Following patch add support for flow based tunneling API to send and recv GTP tunnel packet over tunnel metadata API. This would allow this device integration with OVS or eBPF using flow based tunneling APIs.

Signed-off-by: Pravin B Shelar

Applied, thanks!

This patch hasn't received any ACKs from either the maintainers or anyone else providing review. The following issues remain unaddressed after review:

i) the patch contains several logically separate changes that would be better served as smaller patches
ii) functionality like the handling of end markers has been introduced without further explanation
iii) symmetry between the handling of GTPv0 and GTPv1 has been unnecessarily broken
iv) there are no available userspace tools to allow for testing this functionality

I have requested that this patch be reworked into a series of smaller changes. That would allow:

i) reasonable review
ii) the possibility to explain _why_ things are being done in the patch comment where this isn't obvious (like the handling of end markers)
iii) the chance to do a reasonable rebase of other ongoing work onto this patch (series): this one patch is invasive and difficult to rebase onto

I'm not sure what the hurry is to get this patch into mainline. Large and complicated patches like this take time to review; please revert this and allow that process to happen.

Thanks,
Jonas
[PATCH v1] ipv4: add ipv4_is_multicast() check in ip_mc_leave_group()
From: Yingjie Wang

There is no ipv4_is_multicast() check in ip_mc_leave_group() to verify that imr->imr_multiaddr.s_addr is a multicast address; if it is not, an error may result. Some callers of ip_mc_leave_group(), such as do_ip_setsockopt(), don't check whether the address is multicast before calling it. So I suggest adding an ipv4_is_multicast() check to ip_mc_leave_group() to prevent this from happening.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yingjie Wang
---
 net/ipv4/igmp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 7b272bbed2b4..1b6f91271cfd 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2248,6 +2248,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 	u32 ifindex;
 	int ret = -EADDRNOTAVAIL;
 
+	if (!ipv4_is_multicast(group))
+		return -EINVAL;
+
 	ASSERT_RTNL();
 
 	in_dev = ip_mc_find_dev(net, imr);
--
2.7.4
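For context, ipv4_is_multicast() is a cheap range check on the class D block (224.0.0.0/4), defined in include/linux/in.h along these lines:

/* True when addr falls within 224.0.0.0/4, i.e. the top nibble is 0xe. */
static inline bool ipv4_is_multicast(__be32 addr)
{
	return (addr & htonl(0xf0000000)) == htonl(0xe0000000);
}

so the added check rejects, for example, a unicast address passed in via IP_DROP_MEMBERSHIP before any interface lookup is attempted.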
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Pravin, Again, this patch is too big and contains too many disparate changes to allow for easy review. Please break this up into a series of logical changes for easier review. On 10/01/2021 08:00, Pravin B Shelar wrote: Following patch add support for flow based tunneling API to send and recv GTP tunnel packet over tunnel metadata API. This would allow this device integration with OVS or eBPF using flow based tunneling APIs. Signed-off-by: Pravin B Shelar --- v4-v5: - coding style changes v3-v4: - add check for non-zero dst port v2-v3: - Fixed coding style - changed IFLA_GTP_FD1 to optional param for LWT dev. v1-v2: - Fixed according to comments from Jonas Bonn drivers/net/gtp.c | 527 + include/uapi/linux/gtp.h | 12 + include/uapi/linux/if_link.h | 1 + include/uapi/linux/if_tunnel.h | 1 + tools/include/uapi/linux/if_link.h | 1 + 5 files changed, 398 insertions(+), 144 deletions(-) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 4c04e271f184..851364314ecc 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -21,6 +21,7 @@ #include #include +#include #include #include #include @@ -73,6 +74,9 @@ struct gtp_dev { unsigned inthash_size; struct hlist_head *tid_hash; struct hlist_head *addr_hash; + /* Used by LWT tunnel. */ + boolcollect_md; + struct socket *collect_md_sock; }; Can't you use the 'sock' member of sk1u as the reference this socket? static unsigned int gtp_net_id __read_mostly; @@ -179,33 +183,121 @@ static bool gtp_check_ms(struct sk_buff *skb, struct pdp_ctx *pctx, return false; } -static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, - unsigned int hdrlen, unsigned int role) +static int gtp_set_tun_dst(struct gtp_dev *gtp, struct sk_buff *skb, + unsigned int hdrlen, u8 gtp_version, + __be64 tid, u8 flags) { - if (!gtp_check_ms(skb, pctx, hdrlen, role)) { - netdev_dbg(pctx->dev, "No PDP ctx for this MS\n"); - return 1; + struct metadata_dst *tun_dst; + int opts_len = 0; + + if (unlikely(flags & GTP1_F_MASK)) + opts_len = sizeof(struct gtpu_metadata); + + tun_dst = udp_tun_rx_dst(skb, gtp->sk1u->sk_family, TUNNEL_KEY, tid, opts_len); + if (!tun_dst) { + netdev_dbg(gtp->dev, "Failed to allocate tun_dst"); + goto err; } + netdev_dbg(gtp->dev, "attaching metadata_dst to skb, gtp ver %d hdrlen %d\n", + gtp_version, hdrlen); + if (unlikely(opts_len)) { + struct gtpu_metadata *opts; + struct gtp1_header *gtp1; + + opts = ip_tunnel_info_opts(&tun_dst->u.tun_info); + gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct udphdr)); + opts->ver = GTP_METADATA_V1; + opts->flags = gtp1->flags; + opts->type = gtp1->type; + netdev_dbg(gtp->dev, "recved control pkt: flag %x type: %d\n", + opts->flags, opts->type); + tun_dst->u.tun_info.key.tun_flags |= TUNNEL_GTPU_OPT; + tun_dst->u.tun_info.options_len = opts_len; + skb->protocol = htons(0x); /* Unknown */ + } /* Get rid of the GTP + UDP headers. 
*/ if (iptunnel_pull_header(skb, hdrlen, skb->protocol, -!net_eq(sock_net(pctx->sk), dev_net(pctx->dev - return -1; +!net_eq(sock_net(gtp->sk1u), dev_net(gtp->dev { + gtp->dev->stats.rx_length_errors++; + goto err; + } + + skb_dst_set(skb, &tun_dst->dst); + return 0; +err: + return -1; +} + +static int gtp_rx(struct gtp_dev *gtp, struct sk_buff *skb, + unsigned int hdrlen, u8 gtp_version, unsigned int role, + __be64 tid, u8 flags, u8 type) +{ + if (ip_tunnel_collect_metadata() || gtp->collect_md) { + int err; + + err = gtp_set_tun_dst(gtp, skb, hdrlen, gtp_version, tid, flags); + if (err) + goto err; + } else { + struct pdp_ctx *pctx; + + if (flags & GTP1_F_MASK) + hdrlen += 4; - netdev_dbg(pctx->dev, "forwarding packet from GGSN to uplink\n"); + if (type != GTP_TPDU) + return 1; + + if (gtp_version == GTP_V0) + pctx = gtp0_pdp_find(gtp, be64_to_cpu(tid)); + else + pctx = gtp1_pdp_find(gtp, be64_to_cpu(tid)); + if (!pctx) { + netdev_dbg(gtp->dev, "No PDP ctx to decap skb=%p\n", skb); +
Re: [PATCH net-next 1/1] stmmac: intel: change all EHL/TGL to auto detect phy addr
On Sat, Jan 16, 2021 at 04:59:14PM -0800, Jakub Kicinski wrote: > On Sat, 16 Jan 2021 10:12:21 +0100 Jan Kiszka wrote: > > On 06.11.20 10:43, Wong Vee Khee wrote: > > > From: Voon Weifeng > > > > > > Set all EHL/TGL phy_addr to -1 so that the driver will automatically > > > detect it at run-time by probing all the possible 32 addresses. > > > > > > Signed-off-by: Voon Weifeng > > > Signed-off-by: Wong Vee Khee > > > > This fixes PHY detection on one of our EHL-based boards. Can this also > > be applied to stable 5.10? > > Sure. > > Greg, we'd like to request a backport of the following commit to 5.10. > > commit bff6f1db91e330d7fba56f815cdbc412c75fe163 > Author: Voon Weifeng > Date: Fri Nov 6 17:43:41 2020 +0800 > > stmmac: intel: change all EHL/TGL to auto detect phy addr > > Set all EHL/TGL phy_addr to -1 so that the driver will automatically > detect it at run-time by probing all the possible 32 addresses. > > Signed-off-by: Voon Weifeng > Signed-off-by: Wong Vee Khee > Link: > https://lore.kernel.org/r/20201106094341.4241-1-vee.khee.w...@intel.com > Signed-off-by: Jakub Kicinski > > > It's relatively small, and Jan reports it makes his boards detect the > PHY. The change went in via -next and into Linus's tree during the 5.11 > merge window. Now queued up, thanks. greg k-h
[PATCH net-next V3 5/8] net/bonding: Implement TLS TX device offload
Implement TLS TX device offload for bonding interfaces. This allows kTLS sockets running on a bond to benefit from the device offload on capable lower devices. To allow a simple and fast maintenance of the TLS context in SW and lower devices, we bind the TLS socket to a specific lower dev. To achieve a behavior similar to SW kTLS, we support only balance-xor and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced in bond_sk_check(), done in a previous patch. For the above configuration, the SW implementation keeps picking the same exact lower dev for all the socket's SKBs. The device offload behaves similarly, making the decision once at the connection creation. Per socket, the TLS module should work directly with the lowest netdev in chain, to call the tls_dev_ops operations. As the bond interface is being bypassed by the TLS module, interacting directly against the lower devs, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the current offload status based on the logic under bond_sk_check(). Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_main.c| 29 + drivers/net/bonding/bond_options.c | 27 +-- include/net/bonding.h | 2 ++ 3 files changed, 56 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 09524f99c753..539c6bc218df 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -83,6 +83,9 @@ #include #include #include +#if IS_ENABLED(CONFIG_TLS_DEVICE) +#include +#endif #include "bonding_priv.h" @@ -1225,6 +1228,13 @@ static netdev_features_t bond_fix_features(struct net_device *dev, netdev_features_t mask; struct slave *slave; +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (bond_sk_check(bond)) + features |= BOND_TLS_FEATURES; + else + features &= ~BOND_TLS_FEATURES; +#endif + mask = features; features &= ~NETIF_F_ONE_FOR_ALL; @@ -4647,6 +4657,16 @@ static struct net_device *bond_sk_get_lower_dev(struct net_device *dev, return lower; } +#if IS_ENABLED(CONFIG_TLS_DEVICE) +static netdev_tx_t bond_tls_device_xmit(struct bonding *bond, struct sk_buff *skb, + struct net_device *dev) +{ + if (likely(bond_get_slave_by_dev(bond, tls_get_ctx(skb->sk)->netdev))) + return bond_dev_queue_xmit(bond, skb, tls_get_ctx(skb->sk)->netdev); + return bond_tx_drop(dev, skb); +} +#endif + static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct bonding *bond = netdev_priv(dev); @@ -4655,6 +4675,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev !bond_slave_override(bond, skb)) return NETDEV_TX_OK; +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (skb->sk && tls_is_sk_tx_device_offloaded(skb->sk)) + return bond_tls_device_xmit(bond, skb, dev); +#endif + switch (BOND_MODE(bond)) { case BOND_MODE_ROUNDROBIN: return bond_xmit_roundrobin(skb, dev); @@ -4855,6 +4880,10 @@ void bond_setup(struct net_device *bond_dev) if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) bond_dev->features |= BOND_XFRM_FEATURES; #endif /* CONFIG_XFRM_OFFLOAD */ +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (bond_sk_check(bond)) + bond_dev->features |= BOND_TLS_FEATURES; +#endif } /* Destroy a bonding device. 
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 7f0ad97926de..8fcbf7f9c7b2 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -758,6 +758,19 @@ static bool bond_set_xfrm_features(struct bonding *bond) return true; } +static bool bond_set_tls_features(struct bonding *bond) +{ + if (!IS_ENABLED(CONFIG_TLS_DEVICE)) + return false; + + if (bond_sk_check(bond)) + bond->dev->wanted_features |= BOND_TLS_FEATURES; + else + bond->dev->wanted_features &= ~BOND_TLS_FEATURES; + + return true; +} + static int bond_option_mode_set(struct bonding *bond, const struct bond_opt_value *newval) { @@ -784,9 +797,15 @@ static int bond_option_mode_set(struct bonding *bond, bond->params.arp_validate = BOND_ARP_VALIDATE_NONE; bond->params.mode = newval->value; - if (bond->dev->reg_state == NETREG_REGISTERED) - if (bond_set_xfrm_features(bond)) + if (bond->dev->reg_state == NETREG_REGISTERED) { + bool update = false; + + update |= bond_set_xfrm_features(bond); + update |= bond_set_tl
[PATCH net-next V3 8/8] net/tls: Except bond interface from some TLS checks
In the tls_dev_event handler, ignore the tlsdev_ops requirement for bond interfaces; these ops do not exist there, as the interaction is done directly with the lower device. Also, make the validate function pass when it is called with the upper bond interface.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 net/tls/tls_device.c          | 2 ++
 net/tls/tls_device_fallback.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 75ceea0a41bf..d9cd229aa111 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1329,6 +1329,8 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
 	switch (event) {
 	case NETDEV_REGISTER:
 	case NETDEV_FEAT_CHANGE:
+		if (netif_is_bond_master(dev))
+			return NOTIFY_DONE;
 		if ((dev->features & NETIF_F_HW_TLS_RX) &&
 		    !dev->tlsdev_ops->tls_dev_resync)
 			return NOTIFY_BAD;
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index d946817ed065..cacf040872c7 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -424,7 +424,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
 				      struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	if (dev == tls_get_ctx(sk)->netdev)
+	if (dev == tls_get_ctx(sk)->netdev || netif_is_bond_master(dev))
 		return skb;
 
 	return tls_sw_fallback(sk, skb);
--
2.21.0
[PATCH net-next V3 7/8] net/tls: Device offload to use lowest netdevice in chain
Do not call the tls_dev_ops of upper devices. Instead, ask them for the proper lowest device and communicate with it directly. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- net/tls/tls_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index f7fb7d2c1de1..75ceea0a41bf 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -113,7 +113,7 @@ static struct net_device *get_netdev_for_sock(struct sock *sk) struct net_device *netdev = NULL; if (likely(dst)) { - netdev = dst->dev; + netdev = netdev_sk_get_lowest_dev(dst->dev, sk); dev_hold(netdev); } -- 2.21.0
[PATCH net-next V3 6/8] net/bonding: Declare TLS RX device offload support
Following the description in the previous patch (for TX):

As the bond interface is being bypassed by the TLS module, interacting directly against the lower devs, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the offload status based on the logic under bond_sk_check().

Here we just declare RX device offload support, and expose it via the NETIF_F_HW_TLS_RX flag.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 include/net/bonding.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/bonding.h b/include/net/bonding.h
index 97fbec02df2d..019e998d944a 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -89,7 +89,7 @@
 #define BOND_XFRM_FEATURES (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | \
 			    NETIF_F_GSO_ESP)
 
-#define BOND_TLS_FEATURES (NETIF_F_HW_TLS_TX)
+#define BOND_TLS_FEATURES (NETIF_F_HW_TLS_TX | NETIF_F_HW_TLS_RX)
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 extern atomic_t netpoll_block_tx;
--
2.21.0
[PATCH net-next V3 4/8] net/bonding: Take update_features call out of XFRM function
In preparation for more cases that call netdev_update_features(). While here, move the features logic to the stage where struct bond is already updated, and pass it as the only parameter to function bond_set_xfrm_features(). Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_options.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index a4e4e15f574d..7f0ad97926de 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -745,17 +745,17 @@ const struct bond_option *bond_opt_get(unsigned int option) return &bond_opts[option]; } -static void bond_set_xfrm_features(struct net_device *bond_dev, u64 mode) +static bool bond_set_xfrm_features(struct bonding *bond) { if (!IS_ENABLED(CONFIG_XFRM_OFFLOAD)) - return; + return false; - if (mode == BOND_MODE_ACTIVEBACKUP) - bond_dev->wanted_features |= BOND_XFRM_FEATURES; + if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) + bond->dev->wanted_features |= BOND_XFRM_FEATURES; else - bond_dev->wanted_features &= ~BOND_XFRM_FEATURES; + bond->dev->wanted_features &= ~BOND_XFRM_FEATURES; - netdev_update_features(bond_dev); + return true; } static int bond_option_mode_set(struct bonding *bond, @@ -780,13 +780,14 @@ static int bond_option_mode_set(struct bonding *bond, if (newval->value == BOND_MODE_ALB) bond->params.tlb_dynamic_lb = 1; - if (bond->dev->reg_state == NETREG_REGISTERED) - bond_set_xfrm_features(bond->dev, newval->value); - /* don't cache arp_validate between modes */ bond->params.arp_validate = BOND_ARP_VALIDATE_NONE; bond->params.mode = newval->value; + if (bond->dev->reg_state == NETREG_REGISTERED) + if (bond_set_xfrm_features(bond)) + netdev_update_features(bond->dev); + return 0; } -- 2.21.0
[PATCH net-next V3 2/8] net/bonding: Take IP hash logic into a helper
Hash logic on L3 will be used in a downstream patch for one more use case. Move it into a helper for better code reuse.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 drivers/net/bonding/bond_main.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ad5192ee1845..759ad22b7279 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3541,6 +3541,16 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	return true;
 }
 
+static u32 bond_ip_hash(u32 hash, struct flow_keys *flow)
+{
+	hash ^= (__force u32)flow_get_u32_dst(flow) ^
+		(__force u32)flow_get_u32_src(flow);
+	hash ^= (hash >> 16);
+	hash ^= (hash >> 8);
+	/* discard lowest hash bit to deal with the common even ports pattern */
+	return hash >> 1;
+}
+
 /**
  * bond_xmit_hash - generate a hash value based on the xmit policy
  * @bond: bonding device
@@ -3571,12 +3581,8 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 		else
 			memcpy(&hash, &flow.ports.ports, sizeof(hash));
 	}
-	hash ^= (__force u32)flow_get_u32_dst(&flow) ^
-		(__force u32)flow_get_u32_src(&flow);
-	hash ^= (hash >> 16);
-	hash ^= (hash >> 8);
 
-	return hash >> 1;
+	return bond_ip_hash(hash, &flow);
 }
 
 /*---------------- Device entry points ----------------*/
--
2.21.0
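As a quick sanity check of what the helper computes, here is a stand-alone rendition (user-space C with illustrative addresses and port seed; not the driver code itself, which takes a struct flow_keys):

#include <stdint.h>
#include <stdio.h>

/* Stand-alone rendition of bond_ip_hash(): XOR the L3 addresses into
 * the L4 seed, fold the high bits down, and drop the lowest bit so the
 * common even/odd port pattern does not bias slave selection. */
static uint32_t bond_ip_hash(uint32_t hash, uint32_t src, uint32_t dst)
{
	hash ^= dst ^ src;
	hash ^= (hash >> 16);
	hash ^= (hash >> 8);
	return hash >> 1;
}

int main(void)
{
	uint32_t ports = 0x1f90c350;	/* sport/dport packed as the seed */
	uint32_t h = bond_ip_hash(ports, 0xc0000201u /* 192.0.2.1 */,
				  0xc6336407u /* 198.51.100.7 */);

	printf("hash %#x -> slave %u of 4\n", (unsigned)h, (unsigned)(h % 4));
	return 0;
}

The same fold is reused by the socket-based hashing added later in the series, which is the point of extracting it.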
[PATCH net-next V3 3/8] net/bonding: Implement ndo_sk_get_lower_dev
Add ndo_sk_get_lower_dev() implementation for bond interfaces. Support only for the cases where the socket's and SKBs' hash yields identical value for the whole connection lifetime. Here we restrict it to L3+4 sockets only, with xmit_hash_policy==LAYER34 and bond modes xor/802.3ad. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_main.c | 93 + include/net/bonding.h | 2 + 2 files changed, 95 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 759ad22b7279..09524f99c753 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -301,6 +301,19 @@ netdev_tx_t bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, return dev_queue_xmit(skb); } +bool bond_sk_check(struct bonding *bond) +{ + switch (BOND_MODE(bond)) { + case BOND_MODE_8023AD: + case BOND_MODE_XOR: + if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34) + return true; + fallthrough; + default: + return false; + } +} + /*-- VLAN ---*/ /* In the following 2 functions, bond_vlan_rx_add_vid and bond_vlan_rx_kill_vid, @@ -4555,6 +4568,85 @@ static struct net_device *bond_xmit_get_slave(struct net_device *master_dev, return NULL; } +static void bond_sk_to_flow(struct sock *sk, struct flow_keys *flow) +{ + switch (sk->sk_family) { +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + if (sk->sk_ipv6only || + ipv6_addr_type(&sk->sk_v6_daddr) != IPV6_ADDR_MAPPED) { + flow->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; + flow->addrs.v6addrs.src = inet6_sk(sk)->saddr; + flow->addrs.v6addrs.dst = sk->sk_v6_daddr; + break; + } + fallthrough; +#endif + default: /* AF_INET */ + flow->control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS; + flow->addrs.v4addrs.src = inet_sk(sk)->inet_rcv_saddr; + flow->addrs.v4addrs.dst = inet_sk(sk)->inet_daddr; + break; + } + + flow->ports.src = inet_sk(sk)->inet_sport; + flow->ports.dst = inet_sk(sk)->inet_dport; +} + +/** + * bond_sk_hash_l34 - generate a hash value based on the socket's L3 and L4 fields + * @sk: socket to use for headers + * + * This function will extract the necessary field from the socket and use + * them to generate a hash based on the LAYER34 xmit_policy. + * Assumes that sk is a TCP or UDP socket. + */ +static u32 bond_sk_hash_l34(struct sock *sk) +{ + struct flow_keys flow; + u32 hash; + + bond_sk_to_flow(sk, &flow); + + /* L4 */ + memcpy(&hash, &flow.ports.ports, sizeof(hash)); + /* L3 */ + return bond_ip_hash(hash, &flow); +} + +static struct net_device *__bond_sk_get_lower_dev(struct bonding *bond, + struct sock *sk) +{ + struct bond_up_slave *slaves; + struct slave *slave; + unsigned int count; + u32 hash; + + slaves = rcu_dereference(bond->usable_slaves); + count = slaves ? 
READ_ONCE(slaves->count) : 0; + if (unlikely(!count)) + return NULL; + + hash = bond_sk_hash_l34(sk); + slave = slaves->arr[hash % count]; + + return slave->dev; +} + +static struct net_device *bond_sk_get_lower_dev(struct net_device *dev, + struct sock *sk) +{ + struct bonding *bond = netdev_priv(dev); + struct net_device *lower = NULL; + + rcu_read_lock(); + if (bond_sk_check(bond)) + lower = __bond_sk_get_lower_dev(bond, sk); + rcu_read_unlock(); + + return lower; +} + static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct bonding *bond = netdev_priv(dev); @@ -4691,6 +4783,7 @@ static const struct net_device_ops bond_netdev_ops = { .ndo_fix_features = bond_fix_features, .ndo_features_check = passthru_features_check, .ndo_get_xmit_slave = bond_xmit_get_slave, + .ndo_sk_get_lower_dev = bond_sk_get_lower_dev, }; static const struct device_type bond_type = { diff --git a/include/net/bonding.h b/include/net/bonding.h index adc3da776970..21497193c4a4 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -265,6 +265,8 @@ struct bond_vlan_tag { unsigned short vlan_id; }; +bool bond_sk_check(struct bonding *bond); + /** * Returns NULL if the net_device does not belong to any of the bond's slaves * -- 2.21.0
[PATCH net-next V3 0/8] TLS device offload for Bond
Hi,

This series opens TX and RX TLS device offload for bond interfaces. This allows bond interfaces to benefit from capable lower devices.

We add a new ndo_sk_get_lower_dev() to be used to get the lower dev that corresponds to a given socket. The TLS module uses it to interact directly with the lowest device in the chain, and invoke the control operations in tlsdev_ops. This means that the bond interface doesn't have its own struct tlsdev_ops instance and derived logic/callbacks.

To keep simple track of the HW and SW TLS contexts, we bind each socket to a specific lower device for the socket's whole lifetime. This is logically valid (and similar to the SW kTLS behavior) in the following bond configuration, so we restrict the offload support to it: ((mode == balance-xor) or (mode == 802.3ad)) and xmit_hash_policy == layer3+4.

In this design, the TLS TX/RX offload feature flags of the bond device are independent from the lower devices. They reflect the current feature state, but are not directly controllable. This is because the bond driver is bypassed by the call to ndo_sk_get_lower_dev(), without it knowing who the caller is. The bond TLS feature flags are set/cleared only according to the configuration of the mode and xmit_hash_policy.

The bypass applies only to the control flow. Packets in the fast path still go through the bond logic.

The design here differs from the xfrm/ipsec offload, where the bond driver has its own copy of struct xfrmdev_ops and callbacks.

Regards,
Tariq

V3:
- Use "lower device" instead of "slave".
- Make TLS TX/RX device offload feature flags non-controllable [Fixed].

V2:
- Declare RX support.
- Enhance the feature flags logic.
- Slight modifications for bond_set_xfrm_features().

RFC:
- New design for the tlsdev_ops calls, introduce and use ndo_sk_get_slave() to interact directly with the slave netdev.
- Remove bond copy of tlsdev_ops callbacks.
- In TLS module: Use netdev_sk_get_lowest_dev(), give exceptions to some checks to allow bond support.

Tariq Toukan (8):
  net: netdevice: Add operation ndo_sk_get_lower_dev
  net/bonding: Take IP hash logic into a helper
  net/bonding: Implement ndo_sk_get_lower_dev
  net/bonding: Take update_features call out of XFRM function
  net/bonding: Implement TLS TX device offload
  net/bonding: Declare TLS RX device offload support
  net/tls: Device offload to use lowest netdevice in chain
  net/tls: Except bond interface from some TLS checks

 drivers/net/bonding/bond_main.c    | 138 +++++++++++++++++++++++++++--
 drivers/net/bonding/bond_options.c |  42 +++++++--
 include/linux/netdevice.h          |   4 +
 include/net/bonding.h              |   4 +
 net/core/dev.c                     |  33 +++++++
 net/tls/tls_device.c               |   4 +-
 net/tls/tls_device_fallback.c      |   2 +-
 7 files changed, 211 insertions(+), 16 deletions(-)

--
2.21.0
[PATCH net-next V3 1/8] net: netdevice: Add operation ndo_sk_get_lower_dev
ndo_sk_get_lower_dev returns the lower netdev that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest one in chain. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- include/linux/netdevice.h | 4 net/core/dev.c| 33 + 2 files changed, 37 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5b949076ed23..02dcef4d66e2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1398,6 +1398,8 @@ struct net_device_ops { struct net_device* (*ndo_get_xmit_slave)(struct net_device *dev, struct sk_buff *skb, bool all_slaves); + struct net_device* (*ndo_sk_get_lower_dev)(struct net_device *dev, + struct sock *sk); netdev_features_t (*ndo_fix_features)(struct net_device *dev, netdev_features_t features); int (*ndo_set_features)(struct net_device *dev, @@ -2858,6 +2860,8 @@ int init_dummy_netdev(struct net_device *dev); struct net_device *netdev_get_xmit_slave(struct net_device *dev, struct sk_buff *skb, bool all_slaves); +struct net_device *netdev_sk_get_lowest_dev(struct net_device *dev, + struct sock *sk); struct net_device *dev_get_by_index(struct net *net, int ifindex); struct net_device *__dev_get_by_index(struct net *net, int ifindex); struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex); diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1ae192..6b90520a01b1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8105,6 +8105,39 @@ struct net_device *netdev_get_xmit_slave(struct net_device *dev, } EXPORT_SYMBOL(netdev_get_xmit_slave); +static struct net_device *netdev_sk_get_lower_dev(struct net_device *dev, + struct sock *sk) +{ + const struct net_device_ops *ops = dev->netdev_ops; + + if (!ops->ndo_sk_get_lower_dev) + return NULL; + return ops->ndo_sk_get_lower_dev(dev, sk); +} + +/** + * netdev_sk_get_lowest_dev - Get the lowest device in chain given device and socket + * @dev: device + * @sk: the socket + * + * %NULL is returned if no lower device is found. + */ + +struct net_device *netdev_sk_get_lowest_dev(struct net_device *dev, + struct sock *sk) +{ + struct net_device *lower; + + lower = netdev_sk_get_lower_dev(dev, sk); + while (lower) { + dev = lower; + lower = netdev_sk_get_lower_dev(dev, sk); + } + + return dev; +} +EXPORT_SYMBOL(netdev_sk_get_lowest_dev); + static void netdev_adjacent_add_links(struct net_device *dev) { struct netdev_adjacent *iter; -- 2.21.0
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > For additional robustness in the face of Hyper-V errors or malicious > > > behavior, validate all values that originate from packets that Hyper-V > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > off the end of an array, or subvert an existing validation via integer > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > memory that has not been zeroed out. > > > > > > Reported-by: Juan Vazquez > > > Signed-off-by: Andrea Parri (Microsoft) > > > Cc: "David S. Miller" > > > Cc: Jakub Kicinski > > > Cc: Alexei Starovoitov > > > Cc: Daniel Borkmann > > > Cc: Andrii Nakryiko > > > Cc: Martin KaFai Lau > > > Cc: Song Liu > > > Cc: Yonghong Song > > > Cc: John Fastabend > > > Cc: KP Singh > > > Cc: netdev@vger.kernel.org > > > Cc: b...@vger.kernel.org > > > --- > > > Applies to 5.11-rc3 (and hyperv-next). > > > > So this is for hyperv-next or should we take it via netdev trees? > > No preference, either way is good for me. To be clear: There is no dependency on any patch in hyperv-next, right? That's my understanding, but I would like to confirm it. Wei. > > Thanks, > Andrea
[PATCH net] net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be logically done when RXCSUM offload is off. Fixes: 14136564c8ee ("net: Add TLS RX offload feature") Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- Documentation/networking/tls-offload.rst | 3 +++ net/core/dev.c | 5 + 2 files changed, 8 insertions(+) Hi, Please queue to -stable >= v4.19. Thanks, Tariq diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst index 9af3334d9ad0..5f0dea3d571e 100644 --- a/Documentation/networking/tls-offload.rst +++ b/Documentation/networking/tls-offload.rst @@ -534,3 +534,6 @@ offload. Hence, TLS TX device feature flag requires TX csum offload being set. Disabling the latter implies clearing the former. Disabling TX checksum offload should not affect old connections, and drivers should make sure checksum calculation does not break for them. +Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user +does not want to enable RX csum offload, TLS RX device feature is disabled +as well. diff --git a/net/core/dev.c b/net/core/dev.c index c360bb5367e2..a979b86dbacd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9672,6 +9672,11 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, } } + if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM)) { + netdev_dbg(dev, "Dropping TLS RX HW offload feature since no RXCSUM feature.\n"); + features &= ~NETIF_F_HW_TLS_RX; + } + return features; } -- 2.21.0
Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request_module()
On 2021-01-16 7:56 p.m., Cong Wang wrote:

From: Cong Wang

tcf_action_init_1() loads tc action modules automatically with request_module() after parsing the tc action names, and it drops the RTNL lock and re-holds it before and after request_module(). This causes a lot of trouble, as discovered by syzbot, because we can be in the middle of batch initializations when we create an array of tc actions.

One of the problems is deadlock:

CPU 0                                   CPU 1
rtnl_lock();
for (...) {
  tcf_action_init_1();
    -> rtnl_unlock();
    -> request_module();
                                        rtnl_lock();
                                        for (...) {
                                          tcf_action_init_1();
                                            -> tcf_idr_check_alloc();
                                            // Insert one action into idr,
                                            // but it is not committed until
                                            // tcf_idr_insert_many(), then drop
                                            // the RTNL lock in the _next_
                                            // iteration
                                            -> rtnl_unlock();
    -> rtnl_lock();
    -> a_o->init();
      -> tcf_idr_check_alloc();
      // Now waiting for the same index
      // to be committed
                                            -> request_module();
                                            -> rtnl_lock()
                                            // Now waiting for RTNL lock
                                        }
                                        rtnl_unlock();
}
rtnl_unlock();

This is not easy to solve; we can move the request_module() before this loop and pre-load all the modules we need for this netlink message, and then do the rest of the initializations. So the loop breaks down into two now:

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
	struct tc_action_ops *a_o;

	a_o = tc_action_load_ops(name, tb[i]...);
	ops[i - 1] = a_o;
}

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
	act = tcf_action_init_1(ops[i - 1]...);
}

Although this looks serious, it has only been reported by syzbot, so it seems hard for humans to trigger. And given the size of this patch, I'd suggest it go to net-next and not be backported to stable.

This patch has been tested by syzbot and tested with tdc.py by me.

LGTM. Initially I was worried about performance impact but I found nothing observable. We need to add a tdc test for batch (I can share how I did batch testing at the next meet).

Tested-by: Jamal Hadi Salim
Acked-by: Jamal Hadi Salim

cheers,
jamal
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Jonas, Jakub and others,

On Sun, Jan 17, 2021 at 02:23:52PM +0100, Jonas Bonn wrote:
> This patch hasn't received any ACKs from either the maintainers or anyone
> else providing review. The following issues remain unaddressed after
> review: [...]

Full ACK from my point of view. The patch is so massive that I, as the original co-author and co-maintainer of the GTP kernel module, have problems understanding what it is doing at all. Furthermore, I am actually wondering whether there is any commonality between the existing use cases and whatever the modified gtp.ko is trying to achieve, up to the point of questioning whether it makes sense to have both functionalities in the same driver/module at all.

> I'm not sure what the hurry is to get this patch into mainline. Large and
> complicated patches like this take time to review; please revert this and
> allow that process to happen.

Also acknowledged and supported from my side.

--
- Harald Welte          http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH net-next 7/7] net: ipa: allow arbitrary number of interconnects
On 1/16/21 9:12 PM, Jakub Kicinski wrote: On Fri, 15 Jan 2021 06:50:50 -0600 Alex Elder wrote: Currently we assume that the IPA hardware has exactly three interconnects. But that won't be guaranteed for all platforms, so allow any number of interconnects to be specified in the configuration data. For each platform, define an array of interconnect data entries (still associated with the IPA clock structure), and record the number of entries initialized in that array. Loop over all entries in this array when initializing, enabling, disabling, or tearing down the set of interconnects. With this change we no longer need the ipa_interconnect_id enumerated type, so get rid of it. Okay, all the platforms supported as of the end of the series still have 3 interconnects, or there is no upstream user of this functionality, if you will. What's the story? The short answer is that there is another version of IPA that has four interconnects instead of three. (A DTS file for it is in "sdxprairie.dtsi" in the Qualcomm "downstream" code.) I hope to have that version supported this year, but it's not my top priority right now. Generalizing the interconnect definitions as done in this series improves the driver, but you're right, it is technically not required at this time. And some more background: The upstream IPA driver is derived from the Qualcomm downstream code published at codeaurora.org. The downstream driver is huge (it's well over 100,000 lines of code) and it supports lots of IPA versions and some features that are not present upstream. In order to have any hope of getting upstream support for the IPA hardware, the downstream driver functionality was reduced, removing support for filtering, routing, and NAT. I spent many months refactoring and reworking that code to make it more "upstreamable," and eventually posted the result for review. Now that there is an upstream driver, a long term goal is to add back functionality that got removed, matching the most important features and hardware support found in the downstream code. So in cases like this, even though this feature is not yet required, adding it now lays groundwork to make later work easier. Everything I do with the upstream IPA driver is aimed at in-tree support for additional IPA features and hardware versions. So even if an improvement isn't required *now*, there is at least a general plan to add support "soon" for something that will need it. Beyond even that, there are some things I intend to do that will improve the driver, even if they aren't technically required near-term. For example, I'd like to dynamically allocate endpoints based on what's needed, rather than having the driver support any number of them up to a maximum fixed at build time. Probably a longer response than you needed, but I thought it would help to provide more background. Besides, you *did* ask for "the story..." -Alex
Re: [EXT] Re: [PATCH net-next 1/3] qede: add netpoll support for qede driver
> On Fri, 15 Jan 2021 01:06:08 -0800 Bhaskar Upadhaya wrote:
>> Add net poll controller support to transmit kernel printks
>> over UDP
>
> Why do you need this patch? A couple of years back netpoll was taught
> how to pull NAPIs by itself, and all you do is schedule NAPIs.
>
> All the driver should do is to make sure that when napi is called
> with a budget of 0 it only processes Tx completions, not Rx traffic.

Hi Jakub,

Thanks for the hint, we were not aware of that. I see our driver may not handle a zero budget accordingly. Will check.

But then, all this means .ndo_poll_controller is basically deprecated?

Regards,
Igor
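For readers following along, the zero-budget pattern Jakub describes looks roughly like this (a sketch with hypothetical ring and helper names, not actual qede code):

/* Sketch of a budget-aware NAPI poll: Tx completions are always
 * reaped (netpoll invokes poll with budget == 0), while Rx is only
 * touched when there is budget to spend. */
struct my_ring {
	struct napi_struct napi;
	/* ... Tx/Rx queue state ... */
};

static void my_reap_tx_completions(struct my_ring *ring);
static int my_process_rx(struct my_ring *ring, int budget);

static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_ring *ring = container_of(napi, struct my_ring, napi);
	int work_done = 0;

	my_reap_tx_completions(ring);	/* safe even at budget 0 */

	if (budget) {
		work_done = my_process_rx(ring, budget);
		if (work_done < budget)
			napi_complete_done(napi, work_done);
	}

	return work_done;
}

With this in place a driver needs no .ndo_poll_controller at all; netpoll schedules the NAPI instances itself.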
Re: [EXT] Re: [PATCH net-next 3/3] qede: set default per queue rx/tx usecs coalescing parameters
>> Here we do the initialization of coalescing values on load. >> Although the default device values are the same - explicit >> config is better visible. > > Can you also make the driver store the settings across ifdown / ifup > and allow the settings to be checked while the interface is down, > while at it? Good suggestion, we'll implement that. Thanks, Igor
Re: [RFC net-next 1/7] net: bridge: switchdev: Refactor br_switchdev_fdb_notify
On Sat, Jan 16, 2021 at 02:25:09AM +0100, Tobias Waldekranz wrote: > Instead of having to add more and more arguments to > br_switchdev_fdb_call_notifiers, get rid of it and build the info > struct directly in br_switchdev_fdb_notify. > > Signed-off-by: Tobias Waldekranz > --- Reviewed-by: Vladimir Oltean
Re: [PATCH 17/18] net: iosm: readme file
Sorry about being much too late into this discussion. I don't have the bandwidth to read netdev anymore, and just stumbled across this now.

Andrew Lunn writes:
> So, this is what all the Ethernet nonsense is all about. You have a
> session ID you need to somehow represent to user space. And you
> decided to use VLANs. But to use VLANs, you need an Ethernet
> header. So you added a bogus Ethernet header.

Actually, the original reasoning was the other way around. The bogus ethernet header was added because I had seen the 3G modem vendors do that for a few years already, in the modem firmware. And I didn't think enough about it to realize that it was a really bad idea, or even that it was something I could change. Or should change.

I cannot blame the MBIM session to VLAN mapping idea on anyone else. As far as I can remember, that was just something that popped up in my head while working on the cdc_mbim driver. But it came as a consequence of already having the bogus ethernet header. And I didn't really understand that I could define a new wwan subsystem with new device types. I thought I had to use whatever was there already. I was young and stupid. Now I'm not that young anymore ;-)

Never ever imagined that this would be replicated in another driver, though. That doesn't really make much sense. We have learned by now, haven't we?

This subject has been discussed a few times in the past, and Johannes' summary is my understanding as well: "I don't think anyone likes that"

The DSS mapping sucks even more than the IPS mapping, BTW. I don't think there are any real users? Not that I know of, at least. DSS is much better implemented as some per-session character device, as requested by numerous people for years. Sorry for not listening. Looks like it is too late now.

> Is any of this VLAN stuff required by MBIM?

No. It's my fault and mine alone.

> I suggest you throw away the pretence this is an Ethernet device. It
> is not.

I completely agree. I wish I had gone for simple raw-ip devices in both qmi_wwan and cdc_mbim. But qmi_wwan got them later, so there is already support for such things in wwan userspace.

Bjørn
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sun, Jan 17, 2021 at 03:10:32PM +, Wei Liu wrote: > On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > > For additional robustness in the face of Hyper-V errors or malicious > > > > behavior, validate all values that originate from packets that Hyper-V > > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > > off the end of an array, or subvert an existing validation via integer > > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > > memory that has not been zeroed out. > > > > > > > > Reported-by: Juan Vazquez > > > > Signed-off-by: Andrea Parri (Microsoft) > > > > Cc: "David S. Miller" > > > > Cc: Jakub Kicinski > > > > Cc: Alexei Starovoitov > > > > Cc: Daniel Borkmann > > > > Cc: Andrii Nakryiko > > > > Cc: Martin KaFai Lau > > > > Cc: Song Liu > > > > Cc: Yonghong Song > > > > Cc: John Fastabend > > > > Cc: KP Singh > > > > Cc: netdev@vger.kernel.org > > > > Cc: b...@vger.kernel.org > > > > --- > > > > Applies to 5.11-rc3 (and hyperv-next). > > > > > > So this is for hyperv-next or should we take it via netdev trees? > > > > No preference, either way is good for me. > > To be clear: There is no dependency on any patch in hyperv-next, right? > > That's my understanding, but I would like to confirm it. Well, I wrote that this *applies* to hyperv-next... but that's indeed the only 'dependency' I can think of. Hope this helps. Thanks, Andrea
How to port rtl8xxxu to new device...?
Hi all, I'm currently trying to port the rtl8xxxu driver to RTL8822u chips (one of which I own), but I'm finding it hard to figure out where to start and I figured that some wiser minds could maybe help steer me in the right direction. I've managed to get as far as reading and writing to various registers as part of the device setup, but the dongle appears to crash at the downloading firmware stage. At the moment it just feels like I'm changing random bits and pieces to see what happens. I'm using this downstream driver for reference: https://github.com/RinCat/RTL88x2BU-Linux-Driver Does anyone know of any documentation or anything that could be useful for working on this? Best, Alex
[PATCH] arcnet: fix macro name when DEBUG is defined
From: Tom Rix When DEBUG is defined this error occurs drivers/net/arcnet/com20020_cs.c:70:15: error: ‘com20020_REG_W_ADDR_HI’ undeclared (first use in this function); did you mean ‘COM20020_REG_W_ADDR_HI’? ioaddr, com20020_REG_W_ADDR_HI); ^~ From reviewing the context, the suggestion is what is meant. Fixes: 0fec65130b9f ("arcnet: com20020: Use arcnet_ routines") Signed-off-by: Tom Rix --- drivers/net/arcnet/com20020_cs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/arcnet/com20020_cs.c b/drivers/net/arcnet/com20020_cs.c index cf607ffcf358..81223f6bebcc 100644 --- a/drivers/net/arcnet/com20020_cs.c +++ b/drivers/net/arcnet/com20020_cs.c @@ -67,7 +67,7 @@ static void regdump(struct net_device *dev) /* set up the address register */ count = 0; arcnet_outb((count >> 8) | RDDATAflag | AUTOINCflag, - ioaddr, com20020_REG_W_ADDR_HI); + ioaddr, COM20020_REG_W_ADDR_HI); arcnet_outb(count & 0xff, ioaddr, COM20020_REG_W_ADDR_LO); for (count = 0; count < 256 + 32; count++) { -- 2.27.0
Re: [PATCH] arcnet: fix macro name when DEBUG is defined
On Sun, 2021-01-17 at 10:15 -0800, t...@redhat.com wrote: > From: Tom Rix > > When DEBUG is defined this error occurs > > drivers/net/arcnet/com20020_cs.c:70:15: error: ‘com20020_REG_W_ADDR_HI’ > undeclared (first use in this function); > did you mean ‘COM20020_REG_W_ADDR_HI’? > ioaddr, com20020_REG_W_ADDR_HI); > ^~ > > From reviewing the context, the suggestion is what is meant. > > Fixes: 0fec65130b9f ("arcnet: com20020: Use arcnet_ routines") Nice find thanks, especially seeing as how this hasn't been tested or compiled in 5+ years... commit 0fec65130b9f11a73d74f47025491f97f82ba070 Author: Joe Perches Date: Tue May 5 10:06:06 2015 -0700 Acked-by: Joe Perches > Signed-off-by: Tom Rix > --- > drivers/net/arcnet/com20020_cs.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/arcnet/com20020_cs.c > b/drivers/net/arcnet/com20020_cs.c > index cf607ffcf358..81223f6bebcc 100644 > --- a/drivers/net/arcnet/com20020_cs.c > +++ b/drivers/net/arcnet/com20020_cs.c > @@ -67,7 +67,7 @@ static void regdump(struct net_device *dev) > /* set up the address register */ > count = 0; > arcnet_outb((count >> 8) | RDDATAflag | AUTOINCflag, > - ioaddr, com20020_REG_W_ADDR_HI); > + ioaddr, COM20020_REG_W_ADDR_HI); > arcnet_outb(count & 0xff, ioaddr, COM20020_REG_W_ADDR_LO); > > > for (count = 0; count < 256 + 32; count++) {
[PATCH] net: hns: fix variable used when DEBUG is defined
From: Tom Rix When DEBUG is defined this error occurs drivers/net/ethernet/hisilicon/hns/hns_enet.c:1505:36: error: ‘struct net_device’ has no member named ‘ae_handle’; did you mean ‘rx_handler’? assert(skb->queue_mapping < ndev->ae_handle->q_num); ^ ae_handle is an element of struct hns_nic_priv, so change ndev to priv. Fixes: b5996f11ea54 ("net: add Hisilicon Network Subsystem basic ethernet support") Signed-off-by: Tom Rix --- drivers/net/ethernet/hisilicon/hns/hns_enet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c index 858cb293152a..5d7824d2b4d4 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c @@ -1502,7 +1502,7 @@ static netdev_tx_t hns_nic_net_xmit(struct sk_buff *skb, { struct hns_nic_priv *priv = netdev_priv(ndev); - assert(skb->queue_mapping < ndev->ae_handle->q_num); + assert(skb->queue_mapping < priv->ae_handle->q_num); return hns_nic_net_xmit_hw(ndev, skb, &tx_ring_data(priv, skb->queue_mapping)); -- 2.27.0
Re: [RFC net-next 2/7] net: bridge: switchdev: Include local flag in FDB notifications
Hi Tobias, On Sat, Jan 16, 2021 at 02:25:10AM +0100, Tobias Waldekranz wrote: > Some switchdev drivers, notably DSA, ignore all dynamically learned > address notifications (!added_by_user) as these are autonomously added > by the switch. Previously, such a notification was indistinguishable > from a local address notification. Include a local bit in the > notification so that the two classes can be discriminated. > > This allows DSA-like devices to add local addresses to the hardware > FDB (with the CPU as the destination), thereby avoiding flows towards > the CPU being flooded by the switch as unknown unicast. > > Signed-off-by: Tobias Waldekranz > --- In an ideal world, the BR_FDB_LOCAL bit of an FDB entry is what you would probably want to use as an indication that the packet must be delivered upstream by the hardware, considering that this is what the software data path does:

br_handle_frame_finish:
	if (test_bit(BR_FDB_LOCAL, &dst->flags))
		return br_pass_frame_up(skb);

However, we are not in an ideal world, but in a cacophony of nonsensical flags that must be passed to the 'bridge fdb add' command. For example, I noticed this usage pattern in your patch 6/7:

   br0
  /   \
swp0   dummy0

$ bridge fdb add 02:00:de:ad:00:01 dev dummy0 vlan 1 master

Do you know that this command doesn't do what you think it does (assuming that 02:00:de:ad:00:01 is not the MAC address of dummy0)? The command you wrote will add a _local_ FDB entry on dummy0. I tried it on a DSA switch interface (swp0):

$ bridge fdb add 02:00:de:ad:00:01 dev swp0 vlan 1 master
[ 3162.165561] rtnl_fdb_add: addr 02:00:de:ad:00:01 vid 1 ndm_state NUD_NOARP|NUD_PERMANENT
[ 3162.172487] fdb_add_entry: fdb 02:00:de:ad:00:01 state NUD_NOARP|NUD_PERMANENT, fdb_to_nud NUD_REACHABLE, flags 0x0
[ 3162.180515] mscc_felix :00:00.5 swp0: local fdb: 02:00:de:ad:00:01 vid 1

So, after your patches, this command syntax will stop adding a front-facing FDB entry on swp0. It will add a CPU-facing FDB entry instead. Do you know why the bridge neighbour state is NUD_NOARP|NUD_PERMANENT in rtnl_fdb_add? Well, because iproute2 set it that way:

	/* Assume permanent */
	if (!(req.ndm.ndm_state & (NUD_PERMANENT | NUD_REACHABLE)))
		req.ndm.ndm_state |= NUD_PERMANENT;

See iproute2 commit 0849e60a10d0 ("bridge: manage VXLAN based forwarding tables"). I know so little about VXLAN's use of the bridge command that I cannot tell why it was decided to "assume permanent" (which seems to have changed the default behavior for everybody). Otherwise said, even if not mentioned in the man page, the default FDB entry type is NUD_PERMANENT (which, in short, means a "local" entry; see a more detailed explanation at the end). The man page just says:

	bridge fdb add - add a new fdb entry
	This command creates a new fdb entry.
	LLADDR the Ethernet MAC address.
	dev DEV the interface to which this address is associated.
	local - is a local permanent fdb entry
	static - is a static (no arp) fdb entry

which is utterly misleading and useless. It does not say: (a) what a local FDB entry is, or (b) that if neither "local" nor "static"/"dynamic" is specified, "local" is the default. This already creates pretty bad premises. You would have needed to explicitly add "static" to your command. Not only you, but in fact also thousands of other people who already have switchdev deployments using the 'bridge fdb' command. So, in short, if everybody with switchdev hardware had used the 'bridge fdb' command correctly so far, your series would have been great. But in fact, nobody did (me included).
So we need to be more creative. For example, there's that annoying "self" flag. As far as I understand, there is zero reason for a DSA driver to use the "self" flag, since that means "bypass the bridge FDB and just call the .ndo_fdb_add of the device driver, which in the case of DSA is dsa_legacy_fdb_add". Instead you would just use the "master" flag, which makes the operation propagate through br_fdb_add, so the software FDB has a chance to be updated. Except that there's no one preventing you from using the 'self' and 'master' flags together. Which means that the FDB entry would be offloaded to the device twice: once through the SWITCHDEV_FDB_ADD_TO_DEVICE notification emitted by br_fdb_add, and once through dsa_legacy_fdb_add. Correct me if I'm wrong, but I'm thinking that you may have missed this detail that bridge fdb addresses are implicitly 'local' because you also used some 'self master' commands, and the "to-CPU" address installed through switchdev was immediately overwritten by the correct one installed through dsa_legacy_fdb_add. So I think there is a very real issue in that FDB entries with the is_local bit were never specified to switchdev thus far, and now suddenly are. I'm sorry, but what you're saying in the commit message, that "!added_by_user ha
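To summarize the syntax point from the review above for readers: an entry that is meant to point at a front-panel port must carry the "static" keyword, because omitting it implicitly creates a "local" (host-destined) entry. A hedged example of the presumably intended command:

$ bridge fdb add 02:00:de:ad:00:01 dev swp0 vlan 1 master static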
Re: [PATCH] rtnetlink.7: Remove IPv4 from description
Hi Alex and Pali On 1/16/21 4:04 PM, Alejandro Colomar wrote: > From: Pali Rohár > > rtnetlink is not only used for IPv4 > > Signed-off-by: Pali Rohár > Signed-off-by: Alejandro Colomar Thanks. Patch applied. Cheers, Michael > --- > man7/rtnetlink.7 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/man7/rtnetlink.7 b/man7/rtnetlink.7 > index cd6809320..aec005ff9 100644 > --- a/man7/rtnetlink.7 > +++ b/man7/rtnetlink.7 > @@ -13,7 +13,7 @@ > .\" > .TH RTNETLINK 7 2020-06-09 "Linux" "Linux Programmer's Manual" > .SH NAME > -rtnetlink \- Linux IPv4 routing socket > +rtnetlink \- Linux routing socket > .SH SYNOPSIS > .nf > .B #include > -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 5:46 AM Jonas Bonn wrote: > > Hi Pravin, > > Again, this patch is too big and contains too many disparate changes to > allow for easy review. Please break this up into a series of logical > changes for easier review. > > On 10/01/2021 08:00, Pravin B Shelar wrote: > > Following patch add support for flow based tunneling API > > to send and recv GTP tunnel packet over tunnel metadata API. > > This would allow this device integration with OVS or eBPF using > > flow based tunneling APIs. > > > > Signed-off-by: Pravin B Shelar > > --- > > v4-v5: > > - coding style changes > > v3-v4: > > - add check for non-zero dst port > > v2-v3: > > - Fixed coding style > > - changed IFLA_GTP_FD1 to optional param for LWT dev. > > v1-v2: > > - Fixed according to comments from Jonas Bonn > > > > drivers/net/gtp.c | 527 + > > include/uapi/linux/gtp.h | 12 + > > include/uapi/linux/if_link.h | 1 + > > include/uapi/linux/if_tunnel.h | 1 + > > tools/include/uapi/linux/if_link.h | 1 + > > 5 files changed, 398 insertions(+), 144 deletions(-) > > > > diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c > > index 4c04e271f184..851364314ecc 100644 > > --- a/drivers/net/gtp.c > > +++ b/drivers/net/gtp.c > > @@ -21,6 +21,7 @@ > > #include > > #include > > > > +#include > > #include > > #include > > #include > > @@ -73,6 +74,9 @@ struct gtp_dev { > > unsigned inthash_size; > > struct hlist_head *tid_hash; > > struct hlist_head *addr_hash; > > + /* Used by LWT tunnel. */ > > + boolcollect_md; > > + struct socket *collect_md_sock; > > }; > > Can't you use the 'sock' member of sk1u as the reference this socket? As discussed earlier I use it to destroy the LWT socket in the cleanup code path. I need to test a solution to avoid this reference. I can send an incremental patch if that works. > > > > > static unsigned int gtp_net_id __read_mostly; > > @@ -179,33 +183,121 @@ static bool gtp_check_ms(struct sk_buff *skb, struct > > pdp_ctx *pctx, > > return false; > > } > > > > -static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, > > - unsigned int hdrlen, unsigned int role) > > +static int gtp_set_tun_dst(struct gtp_dev *gtp, struct sk_buff *skb, > > +unsigned int hdrlen, u8 gtp_version, > > +__be64 tid, u8 flags) > > { > > - if (!gtp_check_ms(skb, pctx, hdrlen, role)) { > > - netdev_dbg(pctx->dev, "No PDP ctx for this MS\n"); > > - return 1; > > + struct metadata_dst *tun_dst; > > + int opts_len = 0; > > + > > + if (unlikely(flags & GTP1_F_MASK)) > > + opts_len = sizeof(struct gtpu_metadata); > > + > > + tun_dst = udp_tun_rx_dst(skb, gtp->sk1u->sk_family, TUNNEL_KEY, tid, > > opts_len); > > + if (!tun_dst) { > > + netdev_dbg(gtp->dev, "Failed to allocate tun_dst"); > > + goto err; > > } > > > > + netdev_dbg(gtp->dev, "attaching metadata_dst to skb, gtp ver %d > > hdrlen %d\n", > > +gtp_version, hdrlen); > > + if (unlikely(opts_len)) { > > + struct gtpu_metadata *opts; > > + struct gtp1_header *gtp1; > > + > > + opts = ip_tunnel_info_opts(&tun_dst->u.tun_info); > > + gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct > > udphdr)); > > + opts->ver = GTP_METADATA_V1; > > + opts->flags = gtp1->flags; > > + opts->type = gtp1->type; > > + netdev_dbg(gtp->dev, "recved control pkt: flag %x type: %d\n", > > +opts->flags, opts->type); > > + tun_dst->u.tun_info.key.tun_flags |= TUNNEL_GTPU_OPT; > > + tun_dst->u.tun_info.options_len = opts_len; > > + skb->protocol = htons(0x); /* Unknown */ > > + } > > /* Get rid of the GTP + UDP headers. 
*/ > > if (iptunnel_pull_header(skb, hdrlen, skb->protocol, > > - !net_eq(sock_net(pctx->sk), > > dev_net(pctx->dev > > - return -1; > > + !net_eq(sock_net(gtp->sk1u), > > dev_net(gtp->dev { > > + gtp->dev->stats.rx_length_errors++; > > + goto err; > > + } > > + > > + skb_dst_set(skb, &tun_dst->dst); > > + return 0; > > +err: > > + return -1; > > +} > > + > > +static int gtp_rx(struct gtp_dev *gtp, struct sk_buff *skb, > > + unsigned int hdrlen, u8 gtp_version, unsigned int role, > > + __be64 tid, u8 flags, u8 type) > > +{ > > + if (ip_tunnel_collect_metadata() || gtp->collect_md) { > > + int err; > > + > > + err = gtp_set_tun_dst(gtp, skb, hdrlen, gtp_version, tid, > > flags); > > + if (err) > > + goto err; > > + }
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 5:25 AM Jonas Bonn wrote: > > Hi Jakub, > > On 17/01/2021 01:46, Jakub Kicinski wrote: > > On Sat, 9 Jan 2021 23:00:21 -0800 Pravin B Shelar wrote: > >> Following patch add support for flow based tunneling API > >> to send and recv GTP tunnel packet over tunnel metadata API. > >> This would allow this device integration with OVS or eBPF using > >> flow based tunneling APIs. > >> > >> Signed-off-by: Pravin B Shelar > > > > Applied, thanks! > > > > This patch hasn't received any ACK's from either the maintainers or > anyone else providing review. The following issues remain unaddressed > after review: > This patch was first sent out on Dec 10 and you responded on Dec 11. I think there was a reasonable amount of time given for reviews. > i) the patch contains several logically separate changes that would be > better served as smaller patches Given that this patch adds a single feature, LWT support, I sent a single patch. I am not sure how much benefit there would be in dividing it into smaller patches. In future I will be more careful about dividing patches, since that is THE objection you have presented here. > ii) functionality like the handling of end markers has been introduced > without further explanation This is the first time you are raising this question. End marker handling is introduced so that these packets can be handled in a single LWT device. This way the control plane can use a single device to program all GTP user-plane functionality. > iii) symmetry between the handling of GTPv0 and GTPv1 has been > unnecessarily broken This was already discussed in a previous review: once we add LWT support for v0, we can get symmetry between v1 and v0. At this point there is no use case for v0 in LWT, so I do not see the point of introducing that support. > iv) there are no available userspace tools to allow for testing this > functionality > This is not true. I have mentioned and provided open source tools multiple times in the review thread: Patch for iproute2 to add support for LWT GTP devices: https://github.com/pshelar/iproute2/commit/d6e99f8342672e6e9ce0b71e153296f8e2b41cfc OVS support with integration test: https://github.com/pshelar/ovs/blob/6ec6a2a86adc56c7c9dcab7b3a7b70bb6dad35c9/tests/system-layer3-tunnels.at#L158 > I have requested that this patch be reworked into a series of smaller > changes. That would allow: > > i) reasonable review > ii) the possibility to explain _why_ things are being done in the patch > comment where this isn't obvious (like the handling of end markers) > iii) the chance to do a reasonable rebase of other ongoing work onto > this patch (series): this one patch is invasive and difficult to rebase > onto > > I'm not sure what the hurry is to get this patch into mainline. Large > and complicated patches like this take time to review; please revert > this and allow that process to happen. > Rather than reverting this patch, I can address the comments you have posted. Those can be fixed by minor incremental patches. Let me know if you find any regression introduced by this patch set; I can quickly fix it. Thanks, Pravin.
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 7:31 AM Harald Welte wrote: > > Hi Jonas, Jakub and others, > > On Sun, Jan 17, 2021 at 02:23:52PM +0100, Jonas Bonn wrote: > > This patch hasn't received any ACK's from either the maintainers or anyone > > else providing review. The following issues remain unaddressed after > > review: > > [...] > > Full ACK from my point of view. The patch is so massive that I > as the original co-author and co-maintainer of the GTP kernel module > have problems understanding what it is doing at all. Furthermore, > I am actually wondering if there is any commonality between the existing > use cases and whatever the modified gtp.ko is trying to achieve. Up to > the point on whether or not it makes sense to have both functionalities > in the same driver/module at all. > This is not modifying existing functionality. This patch adds an LWT tunneling API; existing functionality remains the same. Let me know if you find any regression, and I can fix it. LWT is a well-known method for implementing scalable tunneling, which most of the tunneling modules (GENEVE, GRE, VXLAN, etc.) in the Linux kernel already support. If we separate gtp.ko from its LWT implementation, we will need to duplicate a bunch of existing code as well as the code that Jonas is adding to improve performance using the UDP tunnel offloading APIs. I don't think that is the right approach. Existing tunneling modules also use the unified-module approach to implement both traditional and LWT-based tunnel devices. > > I'm not sure what the hurry is to get this patch into mainline. Large and > > complicated patches like this take time to review; please revert this and > > allow that process to happen. > > Also acknowledged and supported from my side. > > -- > - Harald Welte  http://laforge.gnumonks.org/ > > "Privacy in residential applications is a desirable marketing option." > (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH 0/2] net: dsa: mv88e6xxx: fix vlan filtering for 6250
Hi Rasmus, On Sat, Jan 16, 2021 at 03:39:34AM +0100, Rasmus Villemoes wrote: > I finally managed to figure out why enabling VLAN filtering on the > 6250 broke all (ingressing) traffic, > cf. > https://lore.kernel.org/netdev/6424c14e-bd25-2a06-cf0b-f1a07f9a3...@prevas.dk/ > . > > The first patch is the minimal fix and for net, while the second one > is a little cleanup for net-next. > > Rasmus Villemoes (2): > net: dsa: mv88e6xxx: also read STU state in mv88e6250_g1_vtu_getnext > net: dsa: mv88e6xxx: use mv88e6185_g1_vtu_getnext() for the 6250 It's strange to put a patch for net and one for net-next in the same series. Nobody will keep a note for you to apply the second patch after net has been merged back into net-next. So if you want to keep the two-patch approach, you'd have to send just the "net" patch now, and the "net-next" patch later. But is there any reason why you don't just apply the second patch to "net"?
Re: [PATCH v2 net-next 1/1] net: dsa: hellcreek: Add TAPRIO offloading support
On Sat, Jan 16, 2021 at 01:49:22PM +0100, Kurt Kanzenbach wrote: > The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic > schedules may be configured individually on each front port. Each port has > eight > egress queues. The traffic is mapped to a traffic class respectively via the > PCP > field of a VLAN tagged frame. > > The TAPRIO Qdisc already implements that. Therefore, this interface can simply > be reused. Add .port_setup_tc() accordingly. > > The activation of a schedule on a port is split into two parts: > > * Programming the necessary gate control list (GCL) > * Setup delayed work for starting the schedule > > The hardware supports starting a schedule up to eight seconds in the future. > The > TAPRIO interface provides an absolute base time. Therefore, periodic delayed > work is leveraged to check whether a schedule may be started or not. > > Signed-off-by: Kurt Kanzenbach > --- Reviewed-by: Vladimir Oltean
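As a rough illustration of the delayed activation described in the commit message above, the periodic check could look like this sketch; the demo_* names and the one-second re-check interval are assumptions, not taken from the driver:

static void demo_taprio_work(struct work_struct *work)
{
	struct demo_port *port = container_of(work, struct demo_port,
					      taprio_work.work);
	ktime_t now = ktime_get_real();	/* clock domain is an assumption */

	/* The hardware can only arm a start time up to 8 s ahead. The gate
	 * control list itself was already programmed when the offload was
	 * accepted.
	 */
	if (ktime_sub(port->base_time, now) <= 8 * NSEC_PER_SEC) {
		demo_start_schedule_hw(port);
		return;
	}

	/* Base time is still too far out; check again later. */
	schedule_delayed_work(&port->taprio_work, HZ);
}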
RE: [PATCH bpf-next] xsk: build skb by page
Xuan Zhuo wrote: > This patch is used to construct skb based on page to save memory copy > overhead. > > This has one problem: > > We construct the skb by fill the data page as a frag into the skb. In > this way, the linear space is empty, and the header information is also > in the frag, not in the linear space, which is not allowed for some > network cards. For example, Mellanox Technologies MT27710 Family > [ConnectX-4 Lx] will get the following error message: > > mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, > opcode 0xd, syndrome 0x1, vendor syndrome 0x68 > : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 > WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 > : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 > 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb > > I also tried to use build_skb to construct skb, but because of the > existence of skb_shinfo, it must be behind the linear space, so this > method is not working. We can't put skb_shinfo on desc->addr, it will be > exposed to users, this is not safe. > > Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the > network card supports the header information of the packet in the frag > and not in the linear space. > > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data:
>
> size     64       512      1024     1500
> copy     1916747  1775988  1600203  1440054
> page     1974058  1953655  1945463  1904478
> percent  3.0%     10.0%    21.58%   32.3%

Looks like a good perf bump. Some easy suggestions below. > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > + struct xdp_desc *desc, int *err) > +{ Passing an 'int *err' here is ugly IMO; use the ERR_PTR/PTR_ERR macros and roll it into the return value, or maybe use the out: pattern used in the kernel. Just doing direct returns like now, but with ERR_PTR(), would also be fine. > + struct sk_buff *skb ; struct sk_buff *skb = NULL; err = -ENOMEM; > + > + if (xs->dev->features & NETIF_F_SKB_NO_LINEAR) { > + skb = xsk_build_skb_zerocopy(xs, desc); > + if (unlikely(!skb)) { goto out > + *err = -ENOMEM; > + return NULL; > + } > + } else { > + char *buffer; > + u64 addr; > + u32 len; > + int err; > + > + len = desc->len; > + skb = sock_alloc_send_skb(&xs->sk, len, 1, &err); > + if (unlikely(!skb)) { goto out; > + *err = -ENOMEM; > + return NULL; > + } > + > + skb_put(skb, len); > + addr = desc->addr; > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); > + err = skb_store_bits(skb, 0, buffer, len); > + > + if (unlikely(err)) { > + kfree_skb(skb); err = -EINVAL; goto out > + *err = -EINVAL; > + return NULL; > + } > + } > + > + skb->dev = xs->dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; out: kfree_skb(skb) return ERR_PTR(err); > +} > + Otherwise looks good, thanks.
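Putting John's suggestions together, the reworked helper could look roughly like the following sketch. This is an interpretation of the review comments, not the patch author's actual respin:

static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
				     struct xdp_desc *desc)
{
	struct sk_buff *skb = NULL;
	int err = -ENOMEM;

	if (xs->dev->features & NETIF_F_SKB_NO_LINEAR) {
		skb = xsk_build_skb_zerocopy(xs, desc);
		if (unlikely(!skb))
			goto out;
	} else {
		u32 len = desc->len;
		char *buffer;

		skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
		if (unlikely(!skb))
			goto out;

		skb_put(skb, len);
		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
		err = skb_store_bits(skb, 0, buffer, len);
		if (unlikely(err)) {
			err = -EINVAL;
			goto out;
		}
	}

	skb->dev = xs->dev;
	skb->priority = xs->sk.sk_priority;
	skb->mark = xs->sk.sk_mark;
	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
	skb->destructor = xsk_destruct_skb;

	return skb;

out:
	kfree_skb(skb);		/* kfree_skb(NULL) is a safe no-op */
	return ERR_PTR(err);
}

The caller would then test the result with IS_ERR()/PTR_ERR() instead of a separate error out-parameter.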
[PATCH v2 bpf-next 0/4] libbpf: BTF dumper support for typed data
Add a libbpf dumper function that supports dumping a representation of data passed in using the BTF id associated with the data in a manner similar to the bpf_snprintf_btf helper. Default output format is identical to that dumped by bpf_snprintf_btf(), for example a "struct sk_buff" representation would look like this: (struct sk_buff){ (union){ (struct){ .next = (struct sk_buff *)0x, .prev = (struct sk_buff *)0x, (union){ .dev = (struct net_device *)0x, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... Patches 1 and 2 make functions available that are needed during dump operations. Patch 3 implements the dump functionality in a manner similar to that in kernel/bpf/btf.c, but with a view to fitting into libbpf more naturally. For example, rather than using flags, boolean dump options are used to control output. Patch 4 is a selftest that utilizes a dump printf function to snprintf the dump output to a string for comparison with expected output. Tests deliberately mirror those in snprintf_btf helper test to keep output consistent. Changes since RFC [1] - The initial approach explored was to share the kernel code with libbpf using #defines to paper over the different needs; however it makes more sense to try and fit in with libbpf code style for maintenance. A comment in the code points at the implementation in kernel/bpf/btf.c and notes that any issues found in it should be fixed there or vice versa; mirroring the tests should help with this also (Andrii) [1] https://lore.kernel.org/bpf/1610386373-24162-1-git-send-email-alan.magu...@oracle.com/T/#t Alan Maguire (4): libbpf: add btf_has_size() and btf_int() inlines libbpf: make skip_mods_and_typedefs available internally in libbpf libbpf: BTF dumper support for typed data selftests/bpf: add dump type data tests to btf dump tests tools/lib/bpf/btf.h | 36 + tools/lib/bpf/btf_dump.c | 974 ++ tools/lib/bpf/libbpf.c| 4 +- tools/lib/bpf/libbpf.map | 5 + tools/lib/bpf/libbpf_internal.h | 2 + tools/testing/selftests/bpf/prog_tests/btf_dump.c | 233 ++ 6 files changed, 1251 insertions(+), 3 deletions(-) -- 1.8.3.1
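A hedged sketch of how a consumer might drive the proposed API, based on the signatures shown in this series (the callback and setup mirror the selftest in patch 4; error handling is illustrative):

#include <stdio.h>
#include <stdarg.h>
#include <errno.h>
#include <bpf/btf.h>

static void dump_printf(void *ctx, const char *fmt, va_list args)
{
	vfprintf(stdout, fmt, args);
}

int dump_typed(struct btf *btf, const char *type_name, void *data)
{
	struct btf_dump_emit_type_data_opts dopts = {
		.sz = sizeof(dopts),
		.compact = true,	/* omit newlines/indentation */
	};
	struct btf_dump_opts opts = { .ctx = NULL };
	struct btf_dump *d;
	__s32 id;
	int err;

	d = btf_dump__new(btf, NULL, &opts, dump_printf);
	if (!d)
		return -ENOMEM;

	id = btf__find_by_name(btf, type_name);	/* e.g. "sk_buff" */
	if (id < 0) {
		err = id;
		goto out;
	}

	err = btf_dump__emit_type_data(d, id, &dopts, data);
out:
	btf_dump__free(d);
	return err;
}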
[PATCH v2 bpf-next 2/4] libbpf: make skip_mods_and_typedefs available internally in libbpf
btf_dump.c will need it for type-based data display. Signed-off-by: Alan Maguire --- tools/lib/bpf/libbpf.c | 4 +--- tools/lib/bpf/libbpf_internal.h | 2 ++ 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2abbc38..4ef84e1 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -73,8 +73,6 @@ #define __printf(a, b) __attribute__((format(printf, a, b))) static struct bpf_map *bpf_object__add_map(struct bpf_object *obj); -static const struct btf_type * -skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id); static int __base_pr(enum libbpf_print_level level, const char *format, va_list args) @@ -1885,7 +1883,7 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict) return 0; } -static const struct btf_type * +const struct btf_type * skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id) { const struct btf_type *t = btf__type_by_id(btf, id); diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 969d0ac..c25d2df 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -108,6 +108,8 @@ static inline void *libbpf_reallocarray(void *ptr, size_t nmemb, size_t size) void *btf_add_mem(void **data, size_t *cap_cnt, size_t elem_sz, size_t cur_cnt, size_t max_cnt, size_t add_cnt); int btf_ensure_mem(void **data, size_t *cap_cnt, size_t elem_sz, size_t need_cnt); +const struct btf_type *skip_mods_and_typedefs(const struct btf *btf, __u32 id, + __u32 *res_id); static inline bool libbpf_validate_opts(const char *opts, size_t opts_sz, size_t user_sz, -- 1.8.3.1
[PATCH v2 bpf-next 4/4] selftests/bpf: add dump type data tests to btf dump tests
Test various type data dumping operations by comparing expected format with the dumped string; an snprintf-style printf function is used to record the string dumped. Signed-off-by: Alan Maguire --- tools/testing/selftests/bpf/prog_tests/btf_dump.c | 233 ++ 1 file changed, 233 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c index c60091e..262561f4 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c @@ -232,6 +232,237 @@ void test_btf_dump_incremental(void) btf__free(btf); } +#define STRSIZE 2048 +#define EXPECTED_STRSIZE 256 + +void btf_dump_snprintf(void *ctx, const char *fmt, va_list args) +{ + char *s = ctx, new[STRSIZE]; + + vsnprintf(new, STRSIZE, fmt, args); + strncat(s, new, STRSIZE); + vfprintf(ctx, fmt, args); +} + +/* skip "enum "/"struct " prefixes */ +#define SKIP_PREFIX(_typestr, _prefix) \ + do {\ + if (strstr(_typestr, _prefix) == _typestr) \ + _typestr += strlen(_prefix) + 1;\ + } while (0) + +int btf_dump_data(struct btf *btf, struct btf_dump *d, + char *ptrtype, __u64 flags, void *ptr, + char *str, char *expectedval) +{ + struct btf_dump_emit_type_data_opts opts = { 0 }; + int ret = 0, cmp; + __s32 type_id; + + opts.sz = sizeof(opts); + opts.compact = true; + if (flags & BTF_F_NONAME) + opts.noname = true; + if (flags & BTF_F_ZERO) + opts.zero = true; + SKIP_PREFIX(ptrtype, "enum"); + SKIP_PREFIX(ptrtype, "struct"); + SKIP_PREFIX(ptrtype, "union"); + type_id = btf__find_by_name(btf, ptrtype); + if (CHECK(type_id <= 0, "find type id", + "no '%s' in BTF: %d\n", ptrtype, type_id)) { + ret = -ENOENT; + goto err; + } + str[0] = '\0'; + ret = btf_dump__emit_type_data(d, type_id, &opts, ptr); + if (CHECK(ret < 0, "btf_dump__emit_type_data", + "failed: %d\n", ret)) + goto err; + + cmp = strncmp(str, expectedval, EXPECTED_STRSIZE); + if (CHECK(cmp, "ensure expected/actual match", + "'%s' does not match expected '%s': %d\n", + str, expectedval, cmp)) + ret = -EFAULT; + +err: + if (ret) + btf_dump__free(d); + return ret; +} + +#define TEST_BTF_DUMP_DATA(_b, _d, _str, _type, _flags, _expected, ...) \ + do {\ + char _expectedval[EXPECTED_STRSIZE] = _expected;\ + char __ptrtype[64] = #_type;\ + char *_ptrtype = (char *)__ptrtype; \ + static _type _ptrdata = __VA_ARGS__;\ + void *_ptr = &_ptrdata; \ + \ + if (btf_dump_data(_b, _d, _ptrtype, _flags, _ptr, \ + _str, _expectedval)) \ + return; \ + } while (0) + +/* Use where expected data string matches its stringified declaration */ +#define TEST_BTF_DUMP_DATA_C(_b, _d, _str, _type, _opts, ...) \ + TEST_BTF_DUMP_DATA(_b, _d, _str, _type, _opts, \ + "(" #_type ")" #__VA_ARGS__, __VA_ARGS__) + +void test_btf_dump_data(void) +{ + struct btf *btf = libbpf_find_kernel_btf(); + char str[STRSIZE]; + struct btf_dump_opts opts = { .ctx = str }; + struct btf_dump *d; + + if (CHECK(!btf, "get kernel BTF", "no kernel BTF found")) + return; + + d = btf_dump__new(btf, NULL, &opts, btf_dump_snprintf); + + if (CHECK(!d, "new dump", "could not create BTF dump")) + return; + + /* Verify type display for various types. 
*/ + + /* simple int */ + TEST_BTF_DUMP_DATA_C(btf, d, str, int, 0, 1234); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "1234", 1234); + + /* zero value should be printed at toplevel */ + TEST_BTF_DUMP_DATA(btf, d, str, int, 0, "(int)0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_ZERO, "(int)0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME | BTF_F_ZERO, + "0", 0); + TEST_BTF_DUMP_DATA_C(btf, d, str, int, 0, -4567); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "-4567", -456
[PATCH v2 bpf-next 3/4] libbpf: BTF dumper support for typed data
Add a BTF dumper for typed data, so that the user can dump a typed version of the data provided. The API is int btf_dump__emit_type_data(struct btf_dump *d, __u32 id, const struct btf_dump_emit_type_data_opts *opts, void *data); ...where the id is the BTF id of the data pointed to by the "void *" argument; for example the BTF id of "struct sk_buff" for a "struct skb *" data pointer. Options supported are - a starting indent level (indent_lvl) - a set of boolean options to control dump display, similar to those used for BPF helper bpf_snprintf_btf(). Options are - compact : omit newlines and other indentation - noname: omit member names - zero: show zero-value members Default output format is identical to that dumped by bpf_snprintf_btf(), for example a "struct sk_buff" representation would look like this: struct sk_buff){ (union){ (struct){ .next = (struct sk_buff *)0x, .prev = (struct sk_buff *)0x, (union){ .dev = (struct net_device *)0x, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... Signed-off-by: Alan Maguire --- tools/lib/bpf/btf.h | 17 + tools/lib/bpf/btf_dump.c | 974 +++ tools/lib/bpf/libbpf.map | 5 + 3 files changed, 996 insertions(+) diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 0c48f2e..7937124 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -180,6 +180,23 @@ struct btf_dump_emit_type_decl_opts { btf_dump__emit_type_decl(struct btf_dump *d, __u32 id, const struct btf_dump_emit_type_decl_opts *opts); + +struct btf_dump_emit_type_data_opts { + /* size of this struct, for forward/backward compatibility */ + size_t sz; + int indent_level; + /* below match "show" flags for bpf_show_snprintf() */ + bool compact; + bool noname; + bool zero; +}; +#define btf_dump_emit_type_data_opts__last_field zero + +LIBBPF_API int +btf_dump__emit_type_data(struct btf_dump *d, __u32 id, +const struct btf_dump_emit_type_data_opts *opts, +void *data); + /* * A set of helpers for easier BTF types handling */ diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index 2f9d685..04d604f 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include #include #include #include @@ -19,14 +21,31 @@ #include "libbpf.h" #include "libbpf_internal.h" +#define BITS_PER_BYTE 8 +#define BITS_PER_U128 (sizeof(__u64) * BITS_PER_BYTE * 2) +#define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1) +#define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK) +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3) +#define BITS_ROUNDUP_BYTES(bits) \ + (BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits)) + static const char PREFIXES[] = "\t\t\t\t\t\t\t\t\t\t\t\t\t"; static const size_t PREFIX_CNT = sizeof(PREFIXES) - 1; + static const char *pfx(int lvl) { return lvl >= PREFIX_CNT ? PREFIXES : &PREFIXES[PREFIX_CNT - lvl]; } +static const char SPREFIXES[] = " "; +static const size_t SPREFIX_CNT = sizeof(SPREFIXES) - 1; + +static const char *spfx(int lvl) +{ + return lvl >= SPREFIX_CNT ? SPREFIXES : &SPREFIXES[SPREFIX_CNT - lvl]; +} + enum btf_dump_type_order_state { NOT_ORDERED, ORDERING, @@ -53,6 +72,49 @@ struct btf_dump_type_aux_state { __u8 referenced: 1; }; +#define BTF_DUMP_DATA_MAX_NAME_LEN 256 + +/* + * Common internal data for BTF type data dump operations. + * + * The implementation here is similar to that in kernel/bpf/btf.c + * that supports the bpf_snprintf_btf() helper, so any bugs in + * type data dumping here are likely in that code also. 
+ * + * One challenge with showing nested data is we want to skip 0-valued + * data, but in order to figure out whether a nested object is all zeros + * we need to walk through it. As a result, we need to make two passes + * when handling structs, unions and arrays; the first pass simply looks + * for nonzero data, while the second actually does the display. The first + * pass is signalled by state.depth_check being set, and if we + * encounter a non-zero value we set state.depth_to_show to the depth + * at which we encountered it. When we have completed the first pass, + * we will know if anything needs to be displayed if + * state.depth_to_show > state.depth. See btf_dump_emit_[struct,array]_data() + * for the implementation of this. + * + */ +struct btf_dump_data { + bool compact; + bool noname; + bool zero; + __u8 indent_lvl;/* base indent level */ + /* below are used during iteration */ + struct { + __u8 dept
[PATCH v2 bpf-next 1/4] libbpf: add btf_has_size() and btf_int() inlines
BTF type data dumping will use them in later patches, and they are useful generally when handling BTF data. Signed-off-by: Alan Maguire --- tools/lib/bpf/btf.h | 19 +++ 1 file changed, 19 insertions(+) diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 1237bcd..0c48f2e 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -294,6 +294,20 @@ static inline bool btf_is_datasec(const struct btf_type *t) return btf_kind(t) == BTF_KIND_DATASEC; } +static inline bool btf_has_size(const struct btf_type *t) +{ + switch (BTF_INFO_KIND(t->info)) { + case BTF_KIND_INT: + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + case BTF_KIND_ENUM: + case BTF_KIND_DATASEC: + return true; + default: + return false; + } +} + static inline __u8 btf_int_encoding(const struct btf_type *t) { return BTF_INT_ENCODING(*(__u32 *)(t + 1)); @@ -309,6 +323,11 @@ static inline __u8 btf_int_bits(const struct btf_type *t) return BTF_INT_BITS(*(__u32 *)(t + 1)); } +static inline __u32 btf_int(const struct btf_type *t) +{ + return *(__u32 *)(t + 1); +} + static inline struct btf_array *btf_array(const struct btf_type *t) { return (struct btf_array *)(t + 1); -- 1.8.3.1
[PATCH net v2] net: lapb: Add locking to the lapb module
In the lapb module, the timers may run concurrently with other code in this module, and there is currently no locking to prevent the code from racing on "struct lapb_cb". This patch adds locking to prevent racing. 1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and "spin_unlock_bh" to APIs, timer functions and notifier functions. 2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us able to ask running timers to abort; Modify "lapb_stop_t1timer" and "lapb_stop_t2timer" to make them able to abort running timers; Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer", and "lapb_start_t1timer", "lapb_start_t2timer". 3. In lapb_unregister, change "lapb_stop_t1timer" and "lapb_stop_t2timer" to "del_timer_sync" to make sure all running timers have exited. 4. In lapb_device_event, replace the "lapb_disconnect_request" call with the content of "lapb_disconnect_request", to avoid trying to hold the lock twice. When I do this, I removed "lapb_start_t1timer" because I don't think it's necessary to start the timer when "NETDEV_GOING_DOWN". Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: Martin Schiller Signed-off-by: Xie He --- include/net/lapb.h| 2 ++ net/lapb/lapb_iface.c | 56 +++ net/lapb/lapb_timer.c | 30 +++ 3 files changed, 80 insertions(+), 8 deletions(-) diff --git a/include/net/lapb.h b/include/net/lapb.h index ccc3d1f020b0..eee73442a1ba 100644 --- a/include/net/lapb.h +++ b/include/net/lapb.h @@ -92,6 +92,7 @@ struct lapb_cb { unsigned short n2, n2count; unsigned short t1, t2; struct timer_list t1timer, t2timer; + boolt1timer_stop, t2timer_stop; /* Internal control information */ struct sk_buff_head write_queue; @@ -103,6 +104,7 @@ struct lapb_cb { struct lapb_frame frmr_data; unsigned char frmr_type; + spinlock_t lock; refcount_t refcnt; }; diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c index 40961889e9c0..45f332607685 100644 --- a/net/lapb/lapb_iface.c +++ b/net/lapb/lapb_iface.c @@ -122,6 +122,8 @@ static struct lapb_cb *lapb_create_cb(void) timer_setup(&lapb->t1timer, NULL, 0); timer_setup(&lapb->t2timer, NULL, 0); + lapb->t1timer_stop = true; + lapb->t2timer_stop = true; lapb->t1 = LAPB_DEFAULT_T1; lapb->t2 = LAPB_DEFAULT_T2; @@ -129,6 +131,8 @@ static struct lapb_cb *lapb_create_cb(void) lapb->mode= LAPB_DEFAULT_MODE; lapb->window = LAPB_DEFAULT_WINDOW; lapb->state = LAPB_STATE_0; + + spin_lock_init(&lapb->lock); refcount_set(&lapb->refcnt, 1); out: return lapb; @@ -178,8 +182,8 @@ int lapb_unregister(struct net_device *dev) goto out; lapb_put(lapb); - lapb_stop_t1timer(lapb); - lapb_stop_t2timer(lapb); + del_timer_sync(&lapb->t1timer); + del_timer_sync(&lapb->t2timer); lapb_clear_queues(lapb); @@ -201,6 +205,8 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + parms->t1 = lapb->t1 / HZ; parms->t2 = lapb->t2 / HZ; parms->n2 = lapb->n2; @@ -219,6 +225,7 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) else parms->t2timer = (lapb->t2timer.expires - jiffies) / HZ; + spin_unlock_bh(&lapb->lock); lapb_put(lapb); rc = LAPB_OK; out: @@ -234,6 +241,8 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_INVALUE; if (parms->t1 < 1 || parms->t2 < 1 || parms->n2 < 1) goto out_put; @@ -256,6 +265,7 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) rc = 
LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -270,6 +280,8 @@ int lapb_connect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_OK; if (lapb->state == LAPB_STATE_1) goto out_put; @@ -285,6 +297,7 @@ int lapb_connect_request(struct net_device *dev) rc = LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -299,6 +312,8 @@ int lapb_disconnect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + switch (lapb->state) { case LAPB_STATE_0: rc = LAPB_NOTCONNECTED; @@ -330,6 +345,7 @@ int lapb
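To illustrate item 2 of the changelog, here is a condensed sketch of the stop-flag pattern (simplified; the real patch's handlers differ in detail):

static void lapb_stop_t1timer(struct lapb_cb *lapb)
{
	/* Ask a handler that has already fired to abort once it gets
	 * the lock.
	 */
	lapb->t1timer_stop = true;
	del_timer(&lapb->t1timer);
}

static void lapb_t1timer_expiry(struct timer_list *t)
{
	struct lapb_cb *lapb = from_timer(lapb, t, t1timer);

	spin_lock_bh(&lapb->lock);
	if (lapb->t1timer_stop) {
		/* lapb_stop_t1timer() ran while we were waiting for the
		 * lock; do not act on stale state.
		 */
		spin_unlock_bh(&lapb->lock);
		return;
	}

	/* ...normal T1 expiry handling, all under lapb->lock... */

	spin_unlock_bh(&lapb->lock);
}

lapb_start_t1timer() would correspondingly clear t1timer_stop before re-arming the timer.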
[PATCH iproute2 1/2] vrf: print BPF log buffer if bpf_program_load fails
Necessary to understand what is going on when bpf_program_load fails Signed-off-by: Luca Boccassi --- ip/ipvrf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ip/ipvrf.c b/ip/ipvrf.c index 42779e5c..91578031 100644 --- a/ip/ipvrf.c +++ b/ip/ipvrf.c @@ -278,8 +278,8 @@ static int vrf_configure_cgroup(const char *path, int ifindex) */ prog_fd = prog_load(ifindex); if (prog_fd < 0) { - fprintf(stderr, "Failed to load BPF prog: '%s'\n", - strerror(errno)); + fprintf(stderr, "Failed to load BPF prog: '%s'\n%s", + strerror(errno), bpf_log_buf); if (errno != EPERM) { fprintf(stderr, -- 2.29.2
[PATCH iproute2 2/2] vrf: fix ip vrf exec with libbpf
The size of bpf_insn is passed to bpf_load_program instead of the number of elements as it expects, so ip vrf exec fails with: $ sudo ip link add vrf-blue type vrf table 10 $ sudo ip link set dev vrf-blue up $ sudo ip/ip vrf exec vrf-blue ls Failed to load BPF prog: 'Invalid argument' last insn is not an exit or jmp processed 0 insns (limit 100) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 Kernel compiled with CGROUP_BPF enabled? https://bugs.debian.org/980046 Reported-by: Emmanuel DECAEN Signed-off-by: Luca Boccassi --- lib/bpf_glue.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/bpf_glue.c b/lib/bpf_glue.c index fa609bfe..d00a0dc1 100644 --- a/lib/bpf_glue.c +++ b/lib/bpf_glue.c @@ -14,7 +14,8 @@ int bpf_program_load(enum bpf_prog_type type, const struct bpf_insn *insns, size_t size_log) { #ifdef HAVE_LIBBPF - return bpf_load_program(type, insns, size_insns, license, 0, log, size_log); + return bpf_load_program(type, insns, size_insns / sizeof(struct bpf_insn), + license, 0, log, size_log); #else return bpf_prog_load_dev(type, insns, size_insns, license, 0, log, size_log); #endif -- 2.29.2
RE: [PATCHv14 bpf-next 1/6] bpf: run devmap xdp_prog on flush instead of bulk enqueue
Hangbin Liu wrote: > From: Jesper Dangaard Brouer > > This changes the devmap XDP program support to run the program when the > bulk queue is flushed instead of before the frame is enqueued. This has > a couple of benefits: > > - It "sorts" the packets by destination devmap entry, and then runs the > same BPF program on all the packets in sequence. This ensures that we > keep the XDP program and destination device properties hot in I-cache. > > - It makes the multicast implementation simpler because it can just > enqueue packets using bq_enqueue() without having to deal with the > devmap program at all. > > The drawback is that if the devmap program drops the packet, the enqueue > step is redundant. However, arguably this is mostly visible in a > micro-benchmark, and with more mixed traffic the I-cache benefit should > win out. The performance impact of just this patch is as follows: > > Using xdp_redirect_map (with a 2nd xdp_prog patch[1]) in sample/bpf and sending > pkts via the pktgen cmd: > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 > -s 64 > > There is about +/- 0.1M deviation for native testing; the performance > improved for the base case, but dropped back some with an xdp devmap prog attached.
>
> Version          | Test                        | Generic | Native | Native + 2nd xdp_prog
> 5.10 rc6         | xdp_redirect_map i40e->i40e |  2.0M   |  9.1M  |  8.0M
> 5.10 rc6         | xdp_redirect_map i40e->veth |  1.7M   | 11.0M  |  9.7M
> 5.10 rc6 + patch | xdp_redirect_map i40e->i40e |  2.0M   |  9.5M  |  7.5M
> 5.10 rc6 + patch | xdp_redirect_map i40e->veth |  1.7M   | 11.6M  |  9.1M
>
> [1] > https://patchwork.ozlabs.org/project/netdev/patch/20201208120159.2278277-1-liuhang...@gmail.com/ > > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Hangbin Liu > > -- > v14: no update, only rebase the code > v13: pass in xdp_prog through __xdp_enqueue() > v2-v12: no such patch > --- > kernel/bpf/devmap.c | 115 +++- > 1 file changed, 72 insertions(+), 43 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index f6e9c68afdd4..84fe15950e44 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue { > struct list_head flush_node; > struct net_device *dev; > struct net_device *dev_rx; > + struct bpf_prog *xdp_prog; > unsigned int count; > }; > > @@ -327,40 +328,92 @@ bool dev_map_can_have_prog(struct bpf_map *map) > return false; > } > > +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, > + struct xdp_frame **frames, int n, > + struct net_device *dev) > +{ > + struct xdp_txq_info txq = { .dev = dev }; > + struct xdp_buff xdp; > + int i, nframes = 0; > + > + for (i = 0; i < n; i++) { > + struct xdp_frame *xdpf = frames[i]; > + u32 act; > + int err; > + > + xdp_convert_frame_to_buff(xdpf, &xdp); Hi, a slightly higher level question about the design. How come we have to bounce the frame back and forth between an xdp_buff and an xdp_frame? Seems a bit wasteful. > + xdp.txq = &txq; > + > + act = bpf_prog_run_xdp(xdp_prog, &xdp); > + switch (act) { > + case XDP_PASS: > + err = xdp_update_frame_from_buff(&xdp, xdpf); xdp_update_frame_from_buff will then convert it back from the xdp_buff?
struct xdp_buff {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	struct xdp_rxq_info *rxq;
	struct xdp_txq_info *txq;
	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
};

struct xdp_frame {
	void *data;
	u16 len;
	u16 headroom;
	u32 metasize:8;
	u32 frame_sz:24;
	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
	 * while mem info is valid on remote CPU.
	 */
	struct xdp_mem_info mem;
	struct net_device *dev_rx; /* used by cpumap */
};

It looks like we could embed xdp_buff in xdp_frame and then keep the metadata at the end. Because you are working on performance here, wdyt? <- @Jesper as well. > + if (unlikely(err < 0)) > + xdp_return_frame_rx_napi(xdpf); > + else > + frames[nframes++] = xdpf; > + break; > + default: > + bpf_warn_invalid_xdp_action(act); > + fallthrough; > + case XDP_ABORTED: > + trace_xdp_exception(dev, xdp_prog, act); > + fallthrough; > + case XDP_DROP: > + xdp_return_frame_rx_napi(xdpf); > + break; > + } > + } > + return n - nframes; /* d
RE: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On 2021-01-16 02:12, Alexander Lobakin wrote: > From: Dongseok Yi > Date: Fri, 15 Jan 2021 22:20:35 +0900 > > > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > > forwarding. Only the header of head_skb from ip_finish_output_gso -> > > skb_gso_segment is updated but following frag_skbs are not updated. > > > > A call path skb_mac_gso_segment -> inet_gso_segment -> > > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > > does not try to update UDP/IP header of the segment list but copy > > only the MAC header. > > > > Update dport, daddr and checksums of each skb of the segment list > > in __udp_gso_segment_list. It covers both SNAT and DNAT. > > > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > > Signed-off-by: Dongseok Yi > > --- > > v1: > > Steffen Klassert said, there could be 2 options. > > https://lore.kernel.org/patchwork/patch/1362257/ > > I was trying to write a quick fix, but it was not easy to forward > > the segmented list. Currently, assuming DNAT only. > > > > v2: > > Per Steffen Klassert request, move the procedure from > > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > > > To Alexander Lobakin, I've checked your email late. Just use this > > patch as a reference. It supports SNAT too, but does not support IPv6 > > yet. I cannot make IPv6 header changes in __udp_gso_segment_list because > > the file is in the IPv4 directory. > > I used another approach, tried to make fraglist GRO closer to plain > in terms of checksumming, as it is confusing to me why GSO packet > should have CHECKSUM_UNNECESSARY. Just let Netfilter do its mangling, > and then use classic UDP GSO magic at the end of segmentation. > I also see the idea of explicit comparing and editing of IP and UDP > headers right in __udp_gso_segment_list() rather unacceptable. If I understand UDP GRO fraglist correctly, it keeps the length of each skb of the fraglist. But your approach might change the lengths by gso_size. What if each skb of the fraglist had different lengths? For CHECKSUM_UNNECESSARY, the GROed head_skb might have an invalid checksum. But finally, the fraglist will be segmented to queue to sk_receive_queue with head_skb. We could pass the GROed head_skb with CHECKSUM_UNNECESSARY. > > Dongseok, Steffen, please test this WIP diff and tell if this one > works for you, so I could clean up the code and make a patch. > For me, it works now in any configurations, with and without > checksum/GSO/fraglist offload. 
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index c1a6f262636a..646a42e88e83 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -3674,6 +3674,7 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, >unsigned int offset) > { > struct sk_buff *list_skb = skb_shinfo(skb)->frag_list; > + unsigned int doffset = skb->data - skb_mac_header(skb); > unsigned int tnl_hlen = skb_tnl_header_len(skb); > unsigned int delta_truesize = 0; > unsigned int delta_len = 0; > @@ -3681,7 +3682,7 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, > struct sk_buff *nskb, *tmp; > int err; > > - skb_push(skb, -skb_network_offset(skb) + offset); > + skb_push(skb, doffset); > > skb_shinfo(skb)->frag_list = NULL; > > @@ -3716,12 +3717,11 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, > delta_len += nskb->len; > delta_truesize += nskb->truesize; > > - skb_push(nskb, -skb_network_offset(nskb) + offset); > + skb_push(nskb, skb_headroom(nskb) - skb_headroom(skb)); > > skb_release_head_state(nskb); > - __copy_skb_header(nskb, skb); > + __copy_skb_header(nskb, skb); > > - skb_headers_offset_update(nskb, skb_headroom(nskb) - > skb_headroom(skb)); > skb_copy_from_linear_data_offset(skb, -tnl_hlen, >nskb->data - tnl_hlen, >offset + tnl_hlen); > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index ff39e94781bf..61665fcd8c85 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -190,13 +190,58 @@ EXPORT_SYMBOL(skb_udp_tunnel_segment); > static struct sk_buff *__udp_gso_segment_list(struct sk_buff *skb, > netdev_features_t features) > { > - unsigned int mss = skb_shinfo(skb)->gso_size; > + struct sk_buff *seg; > + struct udphdr *uh; > + unsigned int mss; > + __be16 newlen; > + __sum16 check; > + > + mss = skb_shinfo(skb)->gso_size; > + if (skb->len <= sizeof(*uh) + mss) > + return ERR_PTR(-EINVAL); > > - skb = skb_segment_list(skb, features, skb_mac_header_len(skb)); > + skb_pull(skb, sizeof(*uh)); > + > + skb = skb_segment_list(skb, features, skb->data - skb_mac_header(skb)); > if (IS_ERR(skb)) >
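For reference, here is a minimal sketch of the per-segment fixup that Dongseok's v2 changelog above describes, in the DNAT direction: propagate the NATed head skb's destination address/port into each fraglist segment with incremental checksum updates. The function name and hook point are illustrative, and a zero UDP checksum (checksum disabled) would need special-casing:

static void demo_fixup_fraglist_nat(struct sk_buff *head)
{
	__be32 daddr = ip_hdr(head)->daddr;
	__be16 dport = udp_hdr(head)->dest;
	struct sk_buff *seg;

	skb_walk_frags(head, seg) {
		struct iphdr *iph = ip_hdr(seg);
		struct udphdr *uh = udp_hdr(seg);

		if (uh->dest != dport) {
			inet_proto_csum_replace2(&uh->check, seg, uh->dest,
						 dport, false);
			uh->dest = dport;
		}
		if (iph->daddr != daddr) {
			/* The UDP pseudo-header covers the address too. */
			inet_proto_csum_replace4(&uh->check, seg, iph->daddr,
						 daddr, true);
			csum_replace4(&iph->check, iph->daddr, daddr);
			iph->daddr = daddr;
		}
	}
}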
RE: [PATCHv14 bpf-next 3/6] xdp: add a new helper for dev map multicast support
Hangbin Liu wrote: > This patch is for xdp multicast support, which has been discussed > before[0]. The goal is to be able to implement an OVS-like data plane in > XDP, i.e., a software switch that can forward XDP frames to multiple ports. > > To achieve this, an application needs to specify a group of interfaces > to forward a packet to. It is also common to want to exclude one or more > physical interfaces from the forwarding operation - e.g., to forward a > packet to all interfaces in the multicast group except the interface it > arrived on. While this could be done simply by adding more groups, this > quickly leads to a combinatorial explosion in the number of groups an > application has to maintain. > > To avoid the combinatorial explosion, we propose to include the ability > to specify an "exclude group" as part of the forwarding operation. This > needs to be a group (instead of just a single port index), because a > physical interface can be part of a logical grouping, such as a bond > device. > > Thus, the logical forwarding operation becomes a "set difference" > operation, i.e. "forward to all ports in group A that are not also in > group B". This series implements such an operation using device maps to > represent the groups. This means that the XDP program specifies two > device maps, one containing the list of netdevs to redirect to, and the > other containing the exclude list. > > To achieve this, I re-implement a new helper bpf_redirect_map_multi() > to accept two maps, the forwarding map and exclude map. The forwarding > map could be DEVMAP or DEVMAP_HASH, but the exclude map *must* be > DEVMAP_HASH to get better performance. If users don't want to use an exclude > map and simply want to stop redirecting back to the ingress device, they > can use the flag BPF_F_EXCLUDE_INGRESS. > > As both bpf_xdp_redirect_map() and this new helper are using struct > bpf_redirect_info, I add a new ex_map and set tgt_value to NULL in the > new helper to distinguish it from bpf_xdp_redirect_map(). > > Also I keep the general data path in net/core/filter.c and the native data > path in kernel/bpf/devmap.c so we can use direct calls to get better > performance. [...] > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 0cf3976ce77c..0e6468cd0ab9 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -164,6 +164,7 @@ void xdp_warn(const char *msg, const char *func, const > int line); > #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__) > > struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp); > +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf); > > static inline > void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index a1ad32456f89..ecf5d117b96a 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3830,6 +3830,27 @@ union bpf_attr { > * Return > * A pointer to a struct socket on success or NULL if the file is > * not a socket. > + * > + * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, > u64 flags) > + * Description > + * This is a multicast implementation for XDP redirect. It will > + * redirect the packet to ALL the interfaces in *map*, but > + * exclude the interfaces in *ex_map*. > + * > + * The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or > + * BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be > + * BPF_MAP_TYPE_DEVMAP_HASH to get better performance. 
It would be good to add a note that ex_map _must_ be keyed by ifindex for the helper to work. It's the obvious way to key a hashmap, but not required IIRC. > + * > + * Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*, > + * which additionally excludes the current ingress device. > + * > + * See also bpf_redirect_map() as a unicast implementation, > + * which supports redirecting packet to a specific ifindex > + * in the map. As both helpers use struct bpf_redirect_info > + * to store the redirect info, we will use a NULL tgt_value > + * to distinguish multicast and unicast redirecting. > + * Return > + * **XDP_REDIRECT** on success, or **XDP_ABORTED** on error. > */ [...] > + > +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx, > + struct bpf_map *map, struct bpf_map *ex_map, > + u32 flags) > +{ > + struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL; > + struct xdp_frame *xdpf, *nxdpf; > + bool last_one = false; > + int ex_ifindex; > + u32 key, next_key; > + > + ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0; > + > + /* Find first available obj */ > + obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex); > + if (!obj) > + return 0
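For illustration, a minimal XDP program using the proposed helper could look like the sketch below (map names and sizes are hypothetical; the exclude map is keyed by ifindex as suggested above; bpf_redirect_map_multi() is the helper added by this series, not an existing mainline API):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 32);
	__type(key, __u32);
	__type(value, __u32);
} forward_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(max_entries, 32);
	__type(key, __u32);	/* keyed by ifindex, per the note above */
	__type(value, __u32);
} exclude_map SEC(".maps");

SEC("xdp")
int xdp_multicast(struct xdp_md *ctx)
{
	/* Forward to every port in forward_map except the ports in
	 * exclude_map and the ingress device itself.
	 */
	return bpf_redirect_map_multi(&forward_map, &exclude_map,
				      BPF_F_EXCLUDE_INGRESS);
}

char _license[] SEC("license") = "GPL";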
Re: [PATCH v3 1/3] bpf,x64: pad NOPs to make images converge more easily
On Fri, Jan 15, 2021 at 08:04:06AM -0800, Alexei Starovoitov wrote: > On Fri, Jan 15, 2021 at 1:41 AM Gary Lin wrote: > > > > On Thu, Jan 14, 2021 at 10:37:33PM -0800, Alexei Starovoitov wrote: > > > On Thu, Jan 14, 2021 at 1:54 AM Gary Lin wrote: > > > > * pass to emit the final image. > > > > */ > > > > - for (pass = 0; pass < 20 || image; pass++) { > > > > - proglen = do_jit(prog, addrs, image, oldproglen, &ctx); > > > > + for (pass = 0; pass < MAX_PASSES || image; pass++) { > > > > + if (!padding && pass >= PADDING_PASSES) > > > > + padding = true; > > > > + proglen = do_jit(prog, addrs, image, oldproglen, &ctx, > > > > padding); > > > > > > I'm struggling to reconcile the discussion we had before holidays with > > > the discussion you guys had in v2: > > > > > > >> What is the rationale for the latter when JIT is called again for > > > >> subprog to fill in relative > > > >> call locations? > > > >> > > > > Hmm, my thinking was that we only enable padding for those programs > > > > which are already padded before. But, you're right. For the programs > > > > converging without padding, enabling padding won't change the final > > > > image, so it's safe to always set "padding" to true for the extra pass. > > > > > > > > Will remove the "padded" flag in v3. > > > > > > I'm not following why "enabling padding won't change the final image" > > > is correct. > > > Say the subprog image converges without padding. > > > Then for subprog we call JIT again. > > > Now extra_pass==true and padding==true. > > > The JITed image will be different. > > Actually no. > > > > > The test in patch 3 should have caught it, but it didn't, > > > because it checks for a subprog that needed padding. > > > The extra_pass needs to emit insns exactly in the right spots. > > > Otherwise jump targets will be incorrect. > > > The saved addrs[] array is crucial. > > > If extra_pass emits different things the instruction starts won't align > > > to places where addrs[] expects them to be. > > > > > When calculating padding bytes, if the image already converges, the > > emitted instruction size just matches (addrs[i] - addrs[i-1]), so > > emit_nops() emits 0 bytes, and the image doesn't change. > > I see. You're right. That's very tricky. > > The patch set doesn't apply cleanly. > Could you please rebase and add a detailed comment about this logic? > > Also please add comments why we check: > nops != 0 && nops != 4 > nops != 0 && nops != 2 && nops != 5 > nops != 0 && nops != 3 > None of it is obvious. Sure, I'll add comments for them. > > Does your single test cover all combinations of numbers? The test case only covers the NOP JUMP for nops == 0 and nops == 2. I have to figure out how to create a large enough program to trigger the transition from imm32 jump to imm8 jump. Gary Lin
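To make the convergence argument concrete, the padding step amounts to something like the following sketch (illustrative names, not the literal patch code; emit_nops() is the helper this series adds to the x86 JIT):

/* Pad the instruction just emitted for insn i out to the size that
 * addrs[] recorded in the previous pass. Once the image has
 * converged, ins_size already equals addrs[i] - addrs[i - 1], so
 * nops is 0 and nothing is emitted, which is why enabling padding
 * for the extra pass cannot change an already-converged image.
 */
static void pad_to_prev_size(u8 **pprog, int ins_size, int *addrs, int i)
{
	int nops = (addrs[i] - addrs[i - 1]) - ins_size;

	if (nops > 0)
		emit_nops(pprog, nops);
}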
[PATCH v3 12/21] net: stmmac: dwmac-sun8i: Prepare for second EMAC clock register
The Allwinner H616 SoC has two EMAC controllers, with the second one being tied to the internal PHY, but also using a separate EMAC clock register. To tell the driver about which clock register to use, we add a parameter to our syscon phandle. The driver will use this value as an index into the regmap, so that we can address more than the first register, if needed. Signed-off-by: Andre Przywara --- drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c index 58e0511badba..00c10ec7b693 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c @@ -1129,6 +1129,8 @@ static int sun8i_dwmac_probe(struct platform_device *pdev) struct stmmac_priv *priv; struct net_device *ndev; struct regmap *regmap; + struct reg_field syscon_field; + u32 syscon_idx = 0; ret = stmmac_get_platform_resources(pdev, &stmmac_res); if (ret) @@ -1190,8 +1192,12 @@ static int sun8i_dwmac_probe(struct platform_device *pdev) return ret; } - gmac->regmap_field = devm_regmap_field_alloc(dev, regmap, - *gmac->variant->syscon_field); + syscon_field = *gmac->variant->syscon_field; + ret = of_property_read_u32_index(pdev->dev.of_node, "syscon", 1, +&syscon_idx); + if (!ret) + syscon_field.reg += syscon_idx * sizeof(u32); + gmac->regmap_field = devm_regmap_field_alloc(dev, regmap, syscon_field); if (IS_ERR(gmac->regmap_field)) { ret = PTR_ERR(gmac->regmap_field); dev_err(dev, "Unable to map syscon register: %d\n", ret); @@ -1263,6 +1269,8 @@ static const struct of_device_id sun8i_dwmac_match[] = { .data = &emac_variant_a64 }, { .compatible = "allwinner,sun50i-h6-emac", .data = &emac_variant_h6 }, + { .compatible = "allwinner,sun50i-h616-emac", + .data = &emac_variant_h6 }, { } }; MODULE_DEVICE_TABLE(of, sun8i_dwmac_match); -- 2.17.5
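On the DT side, the extra cell in the syscon phandle then selects which EMAC clock register to use; a fragment for the second EMAC could look like this (node name and unit address are illustrative, not taken from this series):

emac1: ethernet@5030000 {
	compatible = "allwinner,sun50i-h616-emac";
	/* The second cell is read via of_property_read_u32_index(..., 1, ...)
	 * above and offsets the regmap field by one u32 register. */
	syscon = <&syscon 1>;
};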
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On 2021/1/17 下午3:57, Yuri Benditovich wrote: [...] One other approach that has been used within linear call stacks is out of band. Like percpu variables softnet_data.xmit.more and mirred_rec_level. But that is perhaps a bit overwrought for this use case. Yes, and if we go that way then eBPF turns out to be a burden since we need to invent helpers to access those auxiliary data structures.
It would be better then to hard-code the RSS in the kernel. Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated. Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses. While this set of patches relates to hash delivery in the virtio-net packet in general, it was prepared in the context of the RSS feature implementation as defined in the virtio spec [1]. In the case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest. TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash. What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used. In datapaths where the NIC has already computed the four-tuple hash and stored i
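For reference, the header this thread keeps circling around is the hash-report extension defined by the virtio spec, and the producer side in tun_put_user would reduce to something like the sketch below (the struct layout follows the spec; the helper and the skb hash_report_type field are this series' proposal, not existing kernel API):

/* Extended virtio-net header defined by the hash report feature. */
struct virtio_net_hdr_v1_hash {
	struct virtio_net_hdr_v1 hdr;
	__le32 hash_value;	/* e.g. taken from skb->hash */
	__le16 hash_report;	/* e.g. VIRTIO_NET_HASH_REPORT_TCPv6 */
	__le16 padding;
};

/* Sketch: what tun_put_user would do with the values the steering
 * program stored earlier (hash in skb->hash, hash type in the skb
 * field proposed by this series).
 */
static void tun_fill_hash_report(const struct sk_buff *skb,
				 struct virtio_net_hdr_v1_hash *hdr)
{
	hdr->hash_value  = cpu_to_le32(skb->hash);
	hdr->hash_report = cpu_to_le16(skb->hash_report_type);
}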
Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky wrote: > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote: > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky wrote: > > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote: > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote: > > > > > > > > > That said, it only works at the driver level. So if the firmware is > > > > > the one that is having to do this it also occurred to me that if this > > > > > update happened on FLR that would probably be preferred. > > > > > > > > FLR is not free, I'd prefer not to require it just for some > > > > philosophical reason. > > > > > > > > > Since the mlx5 already supports devlink I don't see any reason why the > > > > > driver couldn't be extended to also support the devlink resource > > > > > interface and apply it to interrupts. > > > > > > > > So you are OK with the PF changing the VF as long as it is devlink not > > > > sysfs? Seems rather arbitrary? > > > > > > > > Leon knows best, but if I recall devlink becomes wonky when the VF > > > > driver doesn't provide a devlink instance. How does it do reload of a > > > > VF then? > > > > > > > > I think you end up with essentially the same logic as presented here > > > > with sysfs. > > > > > > The reasons why I decided to go with sysfs are: > > > 1. This MSI-X table size change is applicable to ALL devices in the world, > > > and not only netdev. > > > > In the PCI world MSI-X table size is a read only value. That is why I > > am pushing back on this as a PCI interface. > > And it stays read-only. Only if you come at it directly. What this is adding is a back door that is visible as a part of the VF sysfs. > > > > > 2. This is purely PCI field and apply equally with same logic to all > > > subsystems and not to netdev only. > > > > Again, calling this "purely PCI" is the sort of wording that has me > > concerned. I would prefer it if we avoid that wording. There is much > > more to this than just modifying the table size field. The firmware is > > having to shift resources between devices and this potentially has an > > effect on the entire part, not just one VF. > > It is internal to HW implementation, dumb device can solve it differently. That is my point. I am worried about "dumb devices" that may follow. I would like to see the steps that should be taken to prevent these sorts of things called out specifically. Basically this isn't just modifying the PCIe config space, it is actually resizing the PBA and MSI-X table. > > > > > 3. The sysfs interface is the standard way of configuring PCI/core, not > > > devlink. > > > > This isn't PCI core that is being configured. It is the firmware for > > the device. You are working with resources that are shared between > > multiple functions. > > I'm ensuring that "lspci -vv .." will work correctly after such change. > It is PCI core responsibility. The current code doesn't work on anything with a driver loaded on it. In addition the messaging provided is fairly minimal, which results in an interface that will be difficult to understand when it doesn't work. In addition there is currently only one piece of hardware that works with this interface, which is the mlx5. My concern is that this is adding overhead to all VFs that will not be used by most SR-IOV capable devices. In my view it would make much more sense to have a top-down approach instead of bottom-up, where the PF is registering interfaces for the VFs.
If you want yet another compromise, I would be much happier with the PF registering the sysfs interfaces on the VFs rather than the VFs registering the interface and hoping the PF supports it. At least with that you are guaranteed the PF will respond to the interface when it is registered. > > > > > 4. This is how orchestration software provisions VFs already. It fits > > > real world usage of SR-IOV, not the artificial one that is proposed during > > > the discussion. > > > > What do you mean this is how they are doing it already? Do you have > > something out-of-tree and that is why you are fighting to keep the > > sysfs? If so that isn't a valid argument. > > I have Kubernetes and OpenStack, indeed they are not part of the kernel tree. > They already use the sriov_driver_autoprobe sysfs knob to disable autobind > before even starting. They configure MACs and bind VFs through sysfs/netlink > already. For them, the read/write of sysfs that is going to be bound to > the already created VM with known CPU properties, fits perfectly. By that argument the same could be said about netlink. What I don't get is why it is okay to configure the MAC through netlink but suddenly when we are talking about interrupts it is out of the question. As far as the binding goes, that is the driver interface, which is more or less grandfathered in anyway, as there aren't too many ways to deal with them and there isn't an alternate interface for the drivers to define support. > > > > > So the idea to
Re: [PATCH bpf 1/2] samples/bpf: Set flag __SANE_USERSPACE_TYPES__ for MIPS to fix build warnings
On 01/14/2021 01:12 AM, Yonghong Song wrote: On 1/13/21 2:57 AM, Tiezhu Yang wrote: MIPS needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select 'int-ll64.h' in arch/mips/include/uapi/asm/types.h and avoid compile warnings when printing __u64 with %llu, %llx or %lld. Could you mention which command produces the following warning? make M=samples/bpf printf("0x%02x : %llu\n", key, value); ~~~^ ~ %lu printf("%s/%llx;", sym->name, addr); ~~~^ %lx printf(";%s %lld\n", key->waker, count); ~~~^ ~ %ld Signed-off-by: Tiezhu Yang --- samples/bpf/Makefile | 4 tools/include/linux/types.h | 3 +++ 2 files changed, 7 insertions(+) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 26fc96c..27de306 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -183,6 +183,10 @@ BPF_EXTRA_CFLAGS := $(ARM_ARCH_SELECTOR) TPROGS_CFLAGS += $(ARM_ARCH_SELECTOR) endif +ifeq ($(ARCH), mips) +TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__ +endif + This change looks okay based on the description in arch/mips/include/uapi/asm/types.h ''' /* * We don't use int-l64.h for the kernel anymore but still use it for * userspace to avoid code changes. * * However, some user programs (e.g. perf) may not want this. They can * flag __SANE_USERSPACE_TYPES__ to get int-ll64.h here. */ ''' TPROGS_CFLAGS += -Wall -O2 TPROGS_CFLAGS += -Wmissing-prototypes TPROGS_CFLAGS += -Wstrict-prototypes diff --git a/tools/include/linux/types.h b/tools/include/linux/types.h index 154eb4e..e9c5a21 100644 --- a/tools/include/linux/types.h +++ b/tools/include/linux/types.h @@ -6,7 +6,10 @@ #include #include +#ifndef __SANE_USERSPACE_TYPES__ #define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */ +#endif What problem does this patch fix? If we add "TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__" in samples/bpf/Makefile, the following error appears: Auto-detecting system features: ...libelf: [ on ] ... zlib: [ on ] ... bpf: [ OFF ] BPF API too old make[3]: *** [Makefile:293: bpfdep] Error 1 make[2]: *** [Makefile:156: all] Error 2 With #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h, the above error is gone. If this header is used, you can just change the comment from "PPC64" to "PPC64/MIPS", right? If we include <linux/types.h> in the source files which have compile warnings when printing __u64 with %llu, %llx or %lld, it has no effect, because it actually includes usr/include/linux/types.h instead of tools/include/linux/types.h; the include directories in samples/bpf/Makefile are searched in order, and -I./usr/include comes before -I./tools/include. So I think defining __SANE_USERSPACE_TYPES__ for MIPS in samples/bpf/Makefile is proper; at the same time, adding #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h can avoid the build error and has no side effect. I will send v2 later and mention in the commit message that this is MIPS related. Thanks, Tiezhu + #include #include
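The underlying type mismatch is easy to reproduce in a few lines (a minimal illustration, not code from the patch):

#include <linux/types.h>
#include <stdio.h>

int main(void)
{
	__u64 value = 42;

	/* With the MIPS userspace default of int-l64.h, __u64 is
	 * 'unsigned long', so %llu triggers a -Wformat warning.
	 * Defining __SANE_USERSPACE_TYPES__ before <linux/types.h>
	 * selects int-ll64.h instead, making __u64 'unsigned long
	 * long' and this printf clean.
	 */
	printf("%llu\n", value);
	return 0;
}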
Re: [PATCH iproute2] iplink: work around rtattr length limits for IFLA_VFINFO_LIST
On 1/16/21 6:21 PM, Jakub Kicinski wrote: > > I wonder. There is something inherently risky about making > a precedent for user space depending on invalid kernel output. > > _If_ we want to fix the kernel, IMO we should only fix the kernel. > IMHO this is a kernel bug that should be fixed. An easy fix is to check for the overflow in nla_nest_end and return an error. Sadly, the nla_nest_end return code is ignored and backporting any change to fix that will be a nightmare. A warning will identify places that need to be fixed. We can at least catch and fix this overflow, which is by far the primary known victim of the rollover.
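For concreteness, the check being suggested would look something like this sketch (illustrative only; today's nla_nest_end() just returns the total message length, and callers largely ignore the return value):

/* Close a nest but fail instead of letting nla_len wrap:
 * nlattr::nla_len is a __u16, so a nest longer than USHRT_MAX
 * silently overflows in the current code.
 */
static int nla_nest_end_checked(struct sk_buff *skb, struct nlattr *start)
{
	int len = skb_tail_pointer(skb) - (unsigned char *)start;

	if (len > USHRT_MAX)
		return -EMSGSIZE;

	start->nla_len = len;
	return skb->len;
}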
Re: [PATCH iproute2 2/2] vrf: fix ip vrf exec with libbpf
On 1/17/21 3:54 PM, Luca Boccassi wrote: > The size of bpf_insn is passed to bpf_load_program instead of the number > of elements as it expects, so ip vrf exec fails with: > > $ sudo ip link add vrf-blue type vrf table 10 > $ sudo ip link set dev vrf-blue up > $ sudo ip/ip vrf exec vrf-blue ls > Failed to load BPF prog: 'Invalid argument' > last insn is not an exit or jmp > processed 0 insns (limit 100) max_states_per_insn 0 total_states 0 > peak_states 0 mark_read 0 > Kernel compiled with CGROUP_BPF enabled? > > https://bugs.debian.org/980046 > > Reported-by: Emmanuel DECAEN > > Signed-off-by: Luca Boccassi > --- > lib/bpf_glue.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern
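For context, the bug is a bytes-vs-count mix-up: libbpf's bpf_load_program() takes the number of instructions, not sizeof() of the array. A sketch of the corrected call (the trivial program and variable names are illustrative, not the exact ipvrf code):

/* A trivial "allow" cgroup/sock program. */
struct bpf_insn prog[] = {
	BPF_MOV64_IMM(BPF_REG_0, 1),	/* r0 = 1 (allow) */
	BPF_EXIT_INSN(),
};
char log_buf[4096];
/* Passing sizeof(prog) here is the bug this patch fixes. */
size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

int fd = bpf_load_program(BPF_PROG_TYPE_CGROUP_SOCK, prog, insns_cnt,
			  "GPL", 0, log_buf, sizeof(log_buf));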
Re: [PATCH iproute2 1/2] vrf: print BPF log buffer if bpf_program_load fails
On 1/17/21 3:54 PM, Luca Boccassi wrote: > Necessary to understand what is going on when bpf_program_load fails > > Signed-off-by: Luca Boccassi > --- > ip/ipvrf.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH iproute2-next v2 0/7] dcb: Support APP, DCBX objects
On 1/1/21 5:03 PM, Petr Machata wrote: > Add support to the dcb tool for the following two DCB objects: > > - APP, which allows configuration of traffic prioritization rules based on > several possible packet headers. > > - DCBX, which is a 1-byte bitfield of flags that configure whether the DCBX > protocol is implemented in the device or in the host, and which version > of the protocol should be used. > > Patch #1 adds a new helper for finding a name of a given dsfield value. > This is useful for APP DSCP-to-priority rules, which can use human-readable > DSCP names. > > Patches #2, #3 and #4 extend existing interfaces for, respectively, parsing > of the X:Y mappings, for setting a DCB object, and for getting a DCB > object. > > In patch #5, support for the command line argument -N / --Numeric is > added. The APP tool later uses it to decide whether to format DSCP values > as human-readable strings or as plain numbers. > > Patches #6 and #7 add the subtools themselves and their man pages. > applied to iproute2-next.
[PATCH net-next] net/sched: cls_flower add CT_FLAGS_INVALID flag support
From: wenxu This patch adds the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag so that the ct_state can match invalid conntrack state. Signed-off-by: wenxu --- include/linux/skbuff.h | 4 ++-- include/net/sch_generic.h| 1 + include/uapi/linux/pkt_cls.h | 1 + net/core/dev.c | 2 ++ net/core/flow_dissector.c| 13 + net/sched/act_ct.c | 1 + net/sched/cls_flower.c | 6 +- 7 files changed, 21 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c9568cf..e22ccf0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1353,8 +1353,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, skb_flow_dissect_ct(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, - u16 *ctinfo_map, - size_t mapsize); + u16 *ctinfo_map, size_t mapsize, + bool post_ct); void skb_flow_dissect_tunnel_info(const struct sk_buff *skb, struct flow_dissector *flow_dissector, diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 639e465..e7bee99 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -388,6 +388,7 @@ struct qdisc_skb_cb { #define QDISC_CB_PRIV_LEN 20 unsigned char data[QDISC_CB_PRIV_LEN]; u16 mru; + bool post_ct; }; typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv); diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index ee95f42..709668e 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -591,6 +591,7 @@ enum { TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED = 1 << 1, /* Part of an existing connection. */ TCA_FLOWER_KEY_CT_FLAGS_RELATED = 1 << 2, /* Related to an established connection. */ TCA_FLOWER_KEY_CT_FLAGS_TRACKED = 1 << 3, /* Conntrack has occurred. */ + TCA_FLOWER_KEY_CT_FLAGS_INVALID = 1 << 4, /* Conntrack is invalid. */ }; enum { diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1..9dce3f7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3878,6 +3878,7 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) /* qdisc_skb_cb(skb)->pkt_len was already set by the caller.
*/ qdisc_skb_cb(skb)->mru = 0; + qdisc_skb_cb(skb)->post_ct = false; mini_qdisc_bstats_cpu_update(miniq, skb); switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) { @@ -4960,6 +4961,7 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) qdisc_skb_cb(skb)->pkt_len = skb->len; qdisc_skb_cb(skb)->mru = 0; + qdisc_skb_cb(skb)->post_ct = false; skb->tc_at_ingress = 1; mini_qdisc_bstats_cpu_update(miniq, skb); diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 2d70ded..c565c7a 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -237,9 +237,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, void skb_flow_dissect_ct(const struct sk_buff *skb, struct flow_dissector *flow_dissector, - void *target_container, - u16 *ctinfo_map, - size_t mapsize) + void *target_container, u16 *ctinfo_map, + size_t mapsize, bool post_ct) { #if IS_ENABLED(CONFIG_NF_CONNTRACK) struct flow_dissector_key_ct *key; @@ -251,13 +250,19 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, return; ct = nf_ct_get(skb, &ctinfo); - if (!ct) + if (!ct && !post_ct) return; key = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CT, target_container); + if (!ct) { + key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED | + TCA_FLOWER_KEY_CT_FLAGS_INVALID; + return; + } + if (ctinfo < mapsize) key->ct_state = ctinfo_map[ctinfo]; #if IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index 83a5c67..b344207 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -1030,6 +1030,7 @@ static int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, out: tcf_action_update_bstats(&c->common, skb); + qdisc_skb_cb(skb)->post_ct = true; if (defrag) qdisc_skb_cb(skb)->pkt_len = skb->len; return retval; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 1319986..e8653d9 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -305,6 +305,9 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
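Once iproute2 grows matching support, the new bit would be exercised with something like the following command (illustrative; it assumes the tc flower ct_state keyword for the new flag is 'inv'):

tc filter add dev eth0 ingress protocol ip flower ct_state +trk+inv action drop

i.e. drop packets that went through conntrack but could not be associated with a valid connection.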
[PATCH] net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
Bus 003 Device 009: ID 1e2d:006f Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 2.00 bDeviceClass 239 Miscellaneous Device bDeviceSubClass 2 ? bDeviceProtocol 1 Interface Association bMaxPacketSize064 idVendor 0x1e2d idProduct 0x006f bcdDevice0.00 iManufacturer 3 Cinterion Wireless Modules iProduct2 PLSx3 iSerial 4 fa3c1419 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 303 bNumInterfaces 9 bConfigurationValue 1 iConfiguration 1 Cinterion Configuration bmAttributes 0xe0 Self Powered Remote Wakeup MaxPower 500mA Interface Association: bLength 8 bDescriptorType11 bFirstInterface 0 bInterfaceCount 2 bFunctionClass 2 Communications bFunctionSubClass 2 Abstract (modem) bFunctionProtocol 1 AT-commands (v.25ter) iFunction 0 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber0 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 2 Communications bInterfaceSubClass 2 Abstract (modem) bInterfaceProtocol 1 AT-commands (v.25ter) iInterface 0 CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x02 line coding and serial state CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 1 CDC Union: bMasterInterface0 bSlaveInterface 1 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x81 EP 1 IN bmAttributes3 Transfer TypeInterrupt Synch Type None Usage Type Data wMaxPacketSize 0x0040 1x 64 bytes bInterval 5 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber1 bAlternateSetting 0 bNumEndpoints 2 bInterfaceClass10 CDC Data bInterfaceSubClass 0 Unused bInterfaceProtocol 0 iInterface 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x82 EP 2 IN bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0200 1x 512 bytes bInterval 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x01 EP 1 OUT bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0200 1x 512 bytes bInterval 0 Interface Association: bLength 8 bDescriptorType11 bFirstInterface 2 bInterfaceCount 2 bFunctionClass 2 Communications bFunctionSubClass 2 Abstract (modem) bFunctionProtocol 1 AT-commands (v.25ter) iFunction 0 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber2 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 2 Communications bInterfaceSubClass 2 Abstract (modem) bInterfaceProtocol 1 AT-commands (v.25ter) iInterface 0 CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x02 line coding and serial state CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 3 CDC Union: bMasterInterface2 bSlaveInterface 3 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x83 EP 3 IN bmAttributes3 Transfer TypeInterrupt Synch Type None Usage Type Data wMaxPacketSize 0x0040 1x 64 bytes bInterval 5 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber3 bAlternateSett
[PATCH net] tcp: Fix potential use-after-free due to double kfree().
When receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full socket into ehash and sets ireq_opt to NULL. Otherwise, tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full socket. The commit 01770a1661657 ("tcp: fix race condition when creating child sockets from syncookies") added a new path, in which more than one core creates full sockets for the same SYN cookie. Currently, the core which loses the race frees the full socket without resetting inet_opt, so that both sock_put() and reqsk_put() call kfree() for the same memory:

sock_put
  sk_free
    __sk_free
      sk_destruct
        __sk_destruct
          sk->sk_destruct/inet_sock_destruct
            kfree(rcu_dereference_protected(inet->inet_opt, 1));

reqsk_put
  reqsk_free
    __reqsk_free
      req->rsk_ops->destructor/tcp_v4_reqsk_destructor
        kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));

Calling kmalloc() between the double kfree() can lead to use-after-free, so this patch fixes it by setting inet_opt to NULL before sock_put(). As a side note, this kind of issue does not happen for IPv6. This is because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts, which correspond to ireq_opt in IPv4. Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies") CC: Ricardo Dias Signed-off-by: Kuniyuki Iwashima Reviewed-by: Benjamin Herrenschmidt --- net/ipv4/tcp_ipv4.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 58207c7769d0..87eb614dab27 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1595,6 +1595,8 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, tcp_move_syn(newtp, req); ireq->ireq_opt = NULL; } else { + newinet->inet_opt = NULL; + if (!req_unhash && found_dup_sk) { /* This code path should only be executed in the * syncookie case only @@ -1602,8 +1604,6 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, bh_unlock_sock(newsk); sock_put(newsk); newsk = NULL; - } else { - newinet->inet_opt = NULL; } } return newsk; -- 2.17.2 (Apple Git-113)
Re: [PATCH net-next v2 0/7] virtio-net support xdp socket zero copy xmit
On 2021/1/16 上午10:59, Xuan Zhuo wrote: XDP socket is an excellent kernel-bypass network transmission framework. The zero copy feature of xsk (XDP socket) needs to be supported by the driver; the performance of zero copy is very good. mlx5 and Intel ixgbe already support this feature. This patch set allows virtio-net to support xsk's zerocopy xmit feature. xsk's zerocopy rx involves major changes to virtio-net, and I hope to submit it after this patch set is accepted. Compared with other drivers, virtio-net does not directly obtain the DMA address, so I first obtain the xsk page, and then pass the page to virtio. When recycling the sent packets, we previously had to distinguish between skb and xdp; now we have to distinguish between skb, xdp and xsk, so the second patch solves this problem first. The last four patches are used to support xsk zerocopy in virtio-net: 1. support xsk enable/disable 2. realize the function of xsk packet sending 3. implement xsk wakeup callback 4. set xsk completed when packet sent done

Performance testing: a UDP packet tool implemented on the xsk interface vs sockperf (kernel UDP):

xsk zero copy in virtio-net:
CPU     PPS      MSGSIZE
28.7%   3833857  64
38.5%   3689491  512
38.9%   2787096  1456

Some questions on the results: 1) What's the setup on the vhost? 2) What's the setup of the mitigation in both host and guest? 3) Any analysis of the possible bottleneck via perf or other tools? Thanks

xsk without zero copy in virtio-net:
CPU     PPS      MSGSIZE
100%    1916747  64
100%    1775988  512
100%    1440054  1456

sockperf:
CPU     PPS      MSGSIZE
100%    713274   64
100%    701024   512
100%    695832   1456

Xuan Zhuo (7):
  xsk: support get page for drv
  virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
  xsk, virtio-net: prepare for support xsk zerocopy xmit
  virtio-net, xsk: support xsk enable/disable
  virtio-net, xsk: realize the function of xsk packet sending
  virtio-net, xsk: implement xsk wakeup callback
  virtio-net, xsk: set xsk completed when packet sent done

 drivers/net/virtio_net.c    | 559 +++-
 include/linux/netdevice.h   |   1 +
 include/net/xdp_sock_drv.h  |  10 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk_buff_pool.c     |  10 +-
 5 files changed, 523 insertions(+), 58 deletions(-)

-- 1.8.3.1
Re: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On Fri, Jan 15, 2021 at 05:12:33PM +0000, Alexander Lobakin wrote: > From: Dongseok Yi > Date: Fri, 15 Jan 2021 22:20:35 +0900 > > > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > > forwarding. Only the header of head_skb from ip_finish_output_gso -> > > skb_gso_segment is updated but following frag_skbs are not updated. > > > > A call path skb_mac_gso_segment -> inet_gso_segment -> > > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > > does not try to update UDP/IP header of the segment list but copy > > only the MAC header. > > > > Update dport, daddr and checksums of each skb of the segment list > > in __udp_gso_segment_list. It covers both SNAT and DNAT. > > > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > > Signed-off-by: Dongseok Yi > > --- > > v1: > > Steffen Klassert said, there could be 2 options. > > https://lore.kernel.org/patchwork/patch/1362257/ > > I was trying to write a quick fix, but it was not easy to forward > > the segmented list. Currently, assuming DNAT only. > > > > v2: > > Per Steffen Klassert's request, move the procedure from > > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > > > To Alexander Lobakin, I've checked your email late. Just use this > > patch as a reference. It supports SNAT too, but does not support IPv6 > > yet. I cannot make IPv6 header changes in __udp_gso_segment_list because > > the file is in the IPv4 directory. > > I used another approach, tried to make fraglist GRO closer to plain > in terms of checksumming, as it is confusing to me why a GSO packet > should have CHECKSUM_UNNECESSARY. This is intentional. With fraglist GRO, we don't mangle packets in the standard (non NAT) case. So the checksum is still correct after segmentation. That is one reason why it has good forwarding performance when software segmentation is needed. Checksumming touches the whole packet and has a lot of overhead, so it is helpful to avoid it whenever possible. We should find a way to do the checksum only when we really need it. I.e. only if the headers of the head skb changed.
Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
On 2021/1/16 上午10:59, Xuan Zhuo wrote: If xsk is supported, a new kind of ptr will be encountered while freeing the old ptrs. To distinguish ctx sent by XDP_TX from ctx sent by xsk, a struct is added here: virtnet_xdp_type.type is used to distinguish the different ctx kinds, and virtnet_xdp_type.offset records the offset between the "true ctx" and virtnet_xdp_type. The newly added virtnet_xsk_hdr will be used for xsk. Signed-off-by: Xuan Zhuo Any reason that you can't simply encode the type in the pointer itself as we used to do? #define VIRTIO_XSK_FLAG BIT(1) ? --- drivers/net/virtio_net.c | 75 ++-- 1 file changed, 60 insertions(+), 15 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e637..e707c31 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -94,6 +94,22 @@ struct virtnet_rq_stats { u64 kicks; }; +enum { + XDP_TYPE_XSK, + XDP_TYPE_TX, +}; + +struct virtnet_xdp_type { + int offset:24; + unsigned type:8; +}; + +struct virtnet_xsk_hdr { + struct virtnet_xdp_type type; + struct virtio_net_hdr_mrg_rxbuf hdr; + u32 len; +}; + #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m) #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m) @@ -251,14 +267,19 @@ static bool is_xdp_frame(void *ptr) return (unsigned long)ptr & VIRTIO_XDP_FLAG; } -static void *xdp_to_ptr(struct xdp_frame *ptr) +static void *xdp_to_ptr(struct virtnet_xdp_type *ptr) { return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); } -static struct xdp_frame *ptr_to_xdp(void *ptr) +static struct virtnet_xdp_type *ptr_to_xtype(void *ptr) +{ + return (struct virtnet_xdp_type *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); +} + +static void *xtype_get_ptr(struct virtnet_xdp_type *xdptype) { - return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); + return (char *)xdptype + xdptype->offset; } /* Converting between virtqueue no. and kernel tx/rx queue no. @@ -459,11 +480,16 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, struct xdp_frame *xdpf) { struct virtio_net_hdr_mrg_rxbuf *hdr; + struct virtnet_xdp_type *xdptype; int err; - if (unlikely(xdpf->headroom < vi->hdr_len)) + if (unlikely(xdpf->headroom < vi->hdr_len + sizeof(*xdptype))) return -EOVERFLOW; + xdptype = (struct virtnet_xdp_type *)(xdpf + 1); + xdptype->offset = (char *)xdpf - (char *)xdptype; + xdptype->type = XDP_TYPE_TX; + /* Make room for virtqueue hdr (also change xdpf->headroom?) */ xdpf->data -= vi->hdr_len; /* Zero header and leave csum up to XDP layers */ @@ -473,7 +499,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, sg_init_one(sq->sg, xdpf->data, xdpf->len); - err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf), + err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdptype), GFP_ATOMIC); if (unlikely(err)) return -ENOSPC; /* Caller handle free/refcnt */ @@ -523,8 +549,11 @@ static int virtnet_xdp_xmit(struct net_device *dev, /* Free up any pending old buffers before queueing new ones.
*/ while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); + struct virtnet_xdp_type *xtype; + struct xdp_frame *frame; + xtype = ptr_to_xtype(ptr); + frame = xtype_get_ptr(xtype); bytes += frame->len; xdp_return_frame(frame); } else { @@ -1373,24 +1402,34 @@ static int virtnet_receive(struct receive_queue *rq, int budget, static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) { - unsigned int len; unsigned int packets = 0; unsigned int bytes = 0; - void *ptr; + unsigned int len; + struct virtnet_xdp_type *xtype; + struct xdp_frame*frame; + struct virtnet_xsk_hdr *xskhdr; + struct sk_buff *skb; + void*ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; + skb = ptr; pr_debug("Sent skb %p\n", skb); bytes += skb->len; napi_consume_skb(skb, in_napi); } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); + xtype = ptr_to_xtype(ptr); - bytes += frame->len; - xdp_return_fra
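For reference, the alternative suggested above (encoding the type in the pointer's low bits instead of a side struct) would simply mirror the driver's existing VIRTIO_XDP_FLAG trick, roughly as follows (a sketch of the suggestion, not code from this series):

#define VIRTIO_XDP_FLAG	BIT(0)	/* existing: ptr is an xdp_frame */
#define VIRTIO_XSK_FLAG	BIT(1)	/* suggested: ptr is an xsk tx buffer */

static bool is_xsk_ptr(void *ptr)
{
	return (unsigned long)ptr & VIRTIO_XSK_FLAG;
}

static void *xsk_to_ptr(struct virtnet_xsk_hdr *hdr)
{
	return (void *)((unsigned long)hdr | VIRTIO_XSK_FLAG);
}

static struct virtnet_xsk_hdr *ptr_to_xsk(void *ptr)
{
	return (struct virtnet_xsk_hdr *)((unsigned long)ptr & ~VIRTIO_XSK_FLAG);
}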
Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote: > On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky wrote: > > > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote: > > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky wrote: > > > > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote: > > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote: > > > > > > > > > > > That said, it only works at the driver level. So if the firmware is > > > > > > the one that is having to do this it also occurred to me that if this > > > > > > update happened on FLR that would probably be preferred. > > > > > > > > > > FLR is not free, I'd prefer not to require it just for some > > > > > philosophical reason. > > > > > > > > > > > Since the mlx5 already supports devlink I don't see any reason why > > > > > > the > > > > > > driver couldn't be extended to also support the devlink resource > > > > > > interface and apply it to interrupts. > > > > > > > > > > So you are OK with the PF changing the VF as long as it is devlink not > > > > > sysfs? Seems rather arbitrary? > > > > > > > > > > Leon knows best, but if I recall devlink becomes wonky when the VF > > > > > driver doesn't provide a devlink instance. How does it do reload of a > > > > > VF then? > > > > > > > > > > I think you end up with essentially the same logic as presented here > > > > > with sysfs. > > > > > > > > The reasons why I decided to go with sysfs are: > > > > 1. This MSI-X table size change is applicable to ALL devices in the > > > > world, > > > > and not only netdev. > > > > > > In the PCI world MSI-X table size is a read only value. That is why I > > > am pushing back on this as a PCI interface. > > > > And it stays read-only. > > Only if you come at it directly. What this is adding is a back door > that is visible as a part of the VF sysfs. > > > > > > > > 2. This is purely PCI field and apply equally with same logic to all > > > > subsystems and not to netdev only. > > > > > > Again, calling this "purely PCI" is the sort of wording that has me > > > concerned. I would prefer it if we avoid that wording. There is much > > > more to this than just modifying the table size field. The firmware is > > > having to shift resources between devices and this potentially has an > > > effect on the entire part, not just one VF. > > > > It is internal to HW implementation, dumb device can solve it differently. > > That is my point. I am worried about "dumb devices" that may follow. I > would like to see the steps that should be taken to prevent these sorts > of things called out specifically. Basically this isn't just modifying > the PCIe config space, it is actually resizing the PBA and MSI-X > table. Exactly - the last line is what a dumb device can implement differently. The request is simple: configure the MSI-X table size to be the new size. > > > > > > > > 3. The sysfs interface is the standard way of configuring PCI/core, not > > > > devlink. > > > > > > This isn't PCI core that is being configured. It is the firmware for > > > the device. You are working with resources that are shared between > > > multiple functions. > > > > I'm ensuring that "lspci -vv .." will work correctly after such change. > > It is PCI core responsibility. > > The current code doesn't work on anything with a driver loaded on it. The problem is that no one cares about this case, because in contrast to other devices, which usually operate in the hypervisor and are probed during boot, the VFs are used differently.
They run in VMs, are probed there, and are (usually) not needed in the hypervisor. A driver reload would make sense if the PF MSI-X table was changed. > In addition the messaging provided is fairly minimal, which results in > an interface that will be difficult to understand when it doesn't > work. I'm fond of simple interfaces: 0, EBUSY and EINVAL are a common way to inform the user. We must remember that this interface is for a low-level PCI property and is needed for expert users who need to squeeze the maximum for their VMs out of an expensive high-speed network card that supports SR-IOV. According to eBay, the CX6 card costs between 1000 and 1700 USD, not really home equipment. > In addition there is currently only one piece of hardware that > works with this interface, which is the mlx5. It is not different from any other feature; someone has to be first. This has a very clear purpose, is well scoped, and it is understandable when and why it is needed. The kernel is full of devices and features that exist in one device only. > My concern is that this is > adding overhead to all VFs that will not be used by most SR-IOV > capable devices. In my view it would make much more sense to have a > top-down approach instead of bottom-up, where the PF is registering > interfaces for the VFs. > > If you want yet another compromise, I would be much happier with the PF > registering the sysfs interfaces on the VFs rather than the VFs > registering the interface and hopi
RE: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On 2021-01-18 15:37, Steffen Klassert wrote: > [...] > We should find a way to do the checksum only when we really > need it. I.e. only if the headers of the head skb changed. It would not be easy to detect if the skb is mangled by netfilter. I think the v2 patch has little impact on performance. Can you suggest another version? If not, I can make v3 including a fix for the 80-column warning.
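One conceivable shape for "checksum only when the headers of the head skb changed" is to snapshot the relevant fields at GRO time and compare them before segmentation. Purely an illustration of the idea; the saved-header struct is hypothetical, not something proposed in this thread:

/* Hypothetical: fields saved from the head skb when the fraglist
 * GRO chain was built.
 */
struct udp_saved_hdr {
	__be32 saddr, daddr;
	__be16 source, dest;
};

/* Recompute per-segment UDP/IP checksums only if NAT (or anything
 * else) rewrote the head skb between GRO and segmentation.
 */
static bool udp_hdr_was_mangled(const struct sk_buff *head,
				const struct udp_saved_hdr *saved)
{
	const struct iphdr *iph = ip_hdr(head);
	const struct udphdr *uh = udp_hdr(head);

	return iph->saddr != saved->saddr || iph->daddr != saved->daddr ||
	       uh->source != saved->source || uh->dest != saved->dest;
}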
Re: [PATCH net-next 0/4] i40e: small improvements on XDP path
On Thu, Jan 14, 2021 at 3:34 PM Cristian Dumitrescu wrote: > > This patchset introduces some small and straightforward improvements > to the Intel i40e driver XDP path. Each improvement is fully described > in its associated patch. > > Cristian Dumitrescu (4): > i40e: remove unnecessary memory writes of the next to clean pointer > i40e: remove unnecessary cleaned_count updates > i40e: remove the redundant buffer info updates > i40e: consolidate handling of XDP program actions > > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 149 +++-- > 1 file changed, 79 insertions(+), 70 deletions(-) > > -- > 2.25.1 > Thank you for these cleanups, Cristian! For the series: Acked-by: Magnus Karlsson