[PATCH net-next 2/7] selftests: mlxsw: Test RIF's reference count when joining a LAG

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

Test that the reference count of a router interface (RIF) configured for
a LAG is incremented / decremented when ports join / leave the LAG. Use
the offload indication on routes configured on the RIF to understand if
it was created / destroyed.

The test fails without the previous patch.

Signed-off-by: Ido Schimmel 
---
 .../selftests/drivers/net/mlxsw/rtnetlink.sh  | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/rtnetlink.sh 
b/tools/testing/selftests/drivers/net/mlxsw/rtnetlink.sh
index a2eff5f58209..ed346da5d3cb 100755
--- a/tools/testing/selftests/drivers/net/mlxsw/rtnetlink.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/rtnetlink.sh
@@ -22,6 +22,7 @@ ALL_TESTS="
duplicate_vlans_test
vlan_rif_refcount_test
subport_rif_refcount_test
+   subport_rif_lag_join_test
vlan_dev_deletion_test
lag_unlink_slaves_test
lag_dev_deletion_test
@@ -440,6 +441,48 @@ subport_rif_refcount_test()
ip link del dev bond1
 }
 
+subport_rif_lag_join_test()
+{
+   # Test that the reference count of a RIF configured for a LAG is
+   # incremented / decremented when ports join / leave the LAG. We use the
+   # offload indication on routes configured on the RIF to understand if
+   # it was created / destroyed
+   RET=0
+
+   ip link add name bond1 type bond mode 802.3ad
+   ip link set dev $swp1 down
+   ip link set dev $swp2 down
+   ip link set dev $swp1 master bond1
+   ip link set dev $swp2 master bond1
+
+   ip link set dev bond1 up
+   ip -6 address add 2001:db8:1::1/64 dev bond1
+
+   busywait "$TIMEOUT" wait_for_offload \
+   ip -6 route get fibmatch 2001:db8:1::2 dev bond1
+   check_err $? "subport rif was not created on lag device"
+
+   ip link set dev $swp1 nomaster
+   busywait "$TIMEOUT" wait_for_offload \
+   ip -6 route get fibmatch 2001:db8:1::2 dev bond1
+   check_err $? "subport rif of lag device was destroyed after removing 
one port"
+
+   ip link set dev $swp1 master bond1
+   ip link set dev $swp2 nomaster
+   busywait "$TIMEOUT" wait_for_offload \
+   ip -6 route get fibmatch 2001:db8:1::2 dev bond1
+   check_err $? "subport rif of lag device was destroyed after re-adding a 
port and removing another"
+
+   ip link set dev $swp1 nomaster
+   busywait "$TIMEOUT" not wait_for_offload \
+   ip -6 route get fibmatch 2001:db8:1::2 dev bond1
+   check_err $? "subport rif of lag device was not destroyed when should"
+
+   log_test "subport rif lag join"
+
+   ip link del dev bond1
+}
+
 vlan_dev_deletion_test()
 {
# Test that VLAN devices are correctly deleted / unlinked when enslaved
-- 
2.28.0



[PATCH net-next 0/7] mlxsw: Misc updates

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

This patch set contains various updates for mlxsw in various areas.

Patch #1 fixes a corner case in router interface (RIF) configuration.
Targeted at net-next since this is not a regression. Patch #2 adds a
test case.

Patch #3 enables tracing of EMAD events via 'devlink:devlink_hwmsg'
tracepoint, in addition to the existing request / response EMAD
messages. It enables a more complete logging of all the exchanged
hardware messages.

Patches #4-#5 suppress "WARNING use flexible-array member instead"
coccinelle warnings.

Patch #6 bumps the minimum firmware version enforced by the driver.

Patch #7 is a small refactoring in IPinIP code.

Ido Schimmel (5):
  mlxsw: spectrum: Apply RIF configuration when joining a LAG
  selftests: mlxsw: Test RIF's reference count when joining a LAG
  mlxsw: core: Trace EMAD events
  mlxsw: spectrum_mr: Use flexible-array member instead of zero-length
array
  mlxsw: core_acl: Use an array instead of a struct with a zero-length
array

Jiri Pirko (1):
  mlxsw: spectrum_router: Reduce mlxsw_sp_ipip_fib_entry_op_gre4()

Petr Machata (1):
  mlxsw: spectrum: Bump minimum FW version to xx.2008.2018

 drivers/net/ethernet/mellanox/mlxsw/core.c|  7 +++
 .../mellanox/mlxsw/core_acl_flex_keys.c   | 26 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c| 23 +++--
 .../net/ethernet/mellanox/mlxsw/spectrum.h|  4 ++
 .../ethernet/mellanox/mlxsw/spectrum_ipip.c   | 45 ++---
 .../ethernet/mellanox/mlxsw/spectrum_ipip.h   |  8 +--
 .../net/ethernet/mellanox/mlxsw/spectrum_mr.c |  2 +-
 .../ethernet/mellanox/mlxsw/spectrum_router.c | 49 +++
 .../ethernet/mellanox/mlxsw/spectrum_router.h |  4 --
 .../selftests/drivers/net/mlxsw/rtnetlink.sh  | 43 
 10 files changed, 129 insertions(+), 82 deletions(-)

-- 
2.28.0



[PATCH net-next 3/7] mlxsw: core: Trace EMAD events

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

Currently, mlxsw triggers the 'devlink:devlink_hwmsg' tracepoint
whenever a request is sent to the device and whenever a response is
received from it. However, the tracepoint is not triggered when an event
(e.g., port up / down) is received from the device.

Also trace EMAD events in order to log a more complete picture of all
the exchanged hardware messages.

Signed-off-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c 
b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 630109f139a0..c67825a68a26 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -160,6 +160,7 @@ struct mlxsw_rx_listener_item {
 
 struct mlxsw_event_listener_item {
struct list_head list;
+   struct mlxsw_core *mlxsw_core;
struct mlxsw_event_listener el;
void *priv;
 };
@@ -2171,11 +2172,16 @@ static void mlxsw_core_event_listener_func(struct 
sk_buff *skb, u8 local_port,
   void *priv)
 {
struct mlxsw_event_listener_item *event_listener_item = priv;
+   struct mlxsw_core *mlxsw_core;
struct mlxsw_reg_info reg;
char *payload;
char *reg_tlv;
char *op_tlv;
 
+   mlxsw_core = event_listener_item->mlxsw_core;
+   trace_devlink_hwmsg(priv_to_devlink(mlxsw_core), true, 0,
+   skb->data, skb->len);
+
mlxsw_emad_tlv_parse(skb);
op_tlv = mlxsw_emad_op_tlv(skb);
reg_tlv = mlxsw_emad_reg_tlv(skb);
@@ -2225,6 +2231,7 @@ int mlxsw_core_event_listener_register(struct mlxsw_core 
*mlxsw_core,
el_item = kmalloc(sizeof(*el_item), GFP_KERNEL);
if (!el_item)
return -ENOMEM;
+   el_item->mlxsw_core = mlxsw_core;
el_item->el = *el;
el_item->priv = priv;
 
-- 
2.28.0



[PATCH net-next 1/7] mlxsw: spectrum: Apply RIF configuration when joining a LAG

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

In case a router interface (RIF) is configured for a LAG, make sure its
configuration is applied on the new LAG member.

Signed-off-by: Ido Schimmel 
Reviewed-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum.c| 17 --
 .../net/ethernet/mellanox/mlxsw/spectrum.h|  4 +++
 .../ethernet/mellanox/mlxsw/spectrum_router.c | 31 ---
 3 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 385eb3c3b362..65e8407f4646 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3595,7 +3595,8 @@ static int mlxsw_sp_port_lag_index_get(struct mlxsw_sp 
*mlxsw_sp,
 }
 
 static int mlxsw_sp_port_lag_join(struct mlxsw_sp_port *mlxsw_sp_port,
- struct net_device *lag_dev)
+ struct net_device *lag_dev,
+ struct netlink_ext_ack *extack)
 {
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
struct mlxsw_sp_upper *lag;
@@ -3631,8 +3632,20 @@ static int mlxsw_sp_port_lag_join(struct mlxsw_sp_port 
*mlxsw_sp_port,
if (mlxsw_sp_port->default_vlan->fid)
mlxsw_sp_port_vlan_router_leave(mlxsw_sp_port->default_vlan);
 
+   /* Join a router interface configured on the LAG, if exists */
+   err = mlxsw_sp_port_vlan_router_join(mlxsw_sp_port->default_vlan,
+lag_dev, extack);
+   if (err)
+   goto err_router_join;
+
return 0;
 
+err_router_join:
+   lag->ref_count--;
+   mlxsw_sp_port->lagged = 0;
+   mlxsw_core_lag_mapping_clear(mlxsw_sp->core, lag_id,
+mlxsw_sp_port->local_port);
+   mlxsw_sp_lag_col_port_remove(mlxsw_sp_port, lag_id);
 err_col_port_add:
if (!lag->ref_count)
mlxsw_sp_lag_destroy(mlxsw_sp, lag_id);
@@ -3997,7 +4010,7 @@ static int mlxsw_sp_netdevice_port_upper_event(struct 
net_device *lower_dev,
} else if (netif_is_lag_master(upper_dev)) {
if (info->linking) {
err = mlxsw_sp_port_lag_join(mlxsw_sp_port,
-upper_dev);
+upper_dev, extack);
} else {

mlxsw_sp_port_lag_col_dist_disable(mlxsw_sp_port);
mlxsw_sp_port_lag_leave(mlxsw_sp_port,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index ce26cc41831f..6092243a69cb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -656,6 +656,10 @@ mlxsw_sp_netdevice_ipip_ul_event(struct mlxsw_sp *mlxsw_sp,
 struct net_device *l3_dev,
 unsigned long event,
 struct netdev_notifier_info *info);
+int
+mlxsw_sp_port_vlan_router_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
+  struct net_device *l3_dev,
+  struct netlink_ext_ack *extack);
 void
 mlxsw_sp_port_vlan_router_leave(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan);
 void mlxsw_sp_rif_destroy_by_dev(struct mlxsw_sp *mlxsw_sp,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 85223647fdb6..20b141f02145 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -7697,9 +7697,9 @@ static void mlxsw_sp_rif_subport_put(struct mlxsw_sp_rif 
*rif)
 }
 
 static int
-mlxsw_sp_port_vlan_router_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
-  struct net_device *l3_dev,
-  struct netlink_ext_ack *extack)
+__mlxsw_sp_port_vlan_router_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
+struct net_device *l3_dev,
+struct netlink_ext_ack *extack)
 {
struct mlxsw_sp_port *mlxsw_sp_port = mlxsw_sp_port_vlan->mlxsw_sp_port;
struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port->mlxsw_sp;
@@ -7764,6 +7764,27 @@ __mlxsw_sp_port_vlan_router_leave(struct 
mlxsw_sp_port_vlan *mlxsw_sp_port_vlan)
mlxsw_sp_rif_subport_put(rif);
 }
 
+int
+mlxsw_sp_port_vlan_router_join(struct mlxsw_sp_port_vlan *mlxsw_sp_port_vlan,
+  struct net_device *l3_dev,
+  struct netlink_ext_ack *extack)
+{
+   struct mlxsw_sp *mlxsw_sp = mlxsw_sp_port_vlan->mlxsw_sp_port->mlxsw_sp;
+   struct mlxsw_sp_rif *rif;
+   int err = 0;
+
+   mutex_lock(&mlxsw_sp->router-

[PATCH net-next 5/7] mlxsw: core_acl: Use an array instead of a struct with a zero-length array

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

Suppresses the following coccinelle warning:

drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c:139:3-7:
WARNING use flexible-array member instead

Signed-off-by: Ido Schimmel 
Reviewed-by: Jiri Pirko 
---
 .../mellanox/mlxsw/core_acl_flex_keys.c   | 26 ---
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
index 9f6905fa6b47..f1b09c2f9eda 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -133,10 +133,8 @@ mlxsw_afk_key_info_find(struct mlxsw_afk *mlxsw_afk,
 }
 
 struct mlxsw_afk_picker {
-   struct {
-   DECLARE_BITMAP(element, MLXSW_AFK_ELEMENT_MAX);
-   unsigned int total;
-   } hits[0];
+   DECLARE_BITMAP(element, MLXSW_AFK_ELEMENT_MAX);
+   unsigned int total;
 };
 
 static void mlxsw_afk_picker_count_hits(struct mlxsw_afk *mlxsw_afk,
@@ -154,8 +152,8 @@ static void mlxsw_afk_picker_count_hits(struct mlxsw_afk 
*mlxsw_afk,
 
elinst = &block->instances[j];
if (elinst->element == element) {
-   __set_bit(element, picker->hits[i].element);
-   picker->hits[i].total++;
+   __set_bit(element, picker[i].element);
+   picker[i].total++;
}
}
}
@@ -169,13 +167,13 @@ static void mlxsw_afk_picker_subtract_hits(struct 
mlxsw_afk *mlxsw_afk,
int i;
int j;
 
-   memcpy(&hits_element, &picker->hits[block_index].element,
+   memcpy(&hits_element, &picker[block_index].element,
   sizeof(hits_element));
 
for (i = 0; i < mlxsw_afk->blocks_count; i++) {
for_each_set_bit(j, hits_element, MLXSW_AFK_ELEMENT_MAX) {
-   if (__test_and_clear_bit(j, picker->hits[i].element))
-   picker->hits[i].total--;
+   if (__test_and_clear_bit(j, picker[i].element))
+   picker[i].total--;
}
}
 }
@@ -188,8 +186,8 @@ static int mlxsw_afk_picker_most_hits_get(struct mlxsw_afk 
*mlxsw_afk,
int i;
 
for (i = 0; i < mlxsw_afk->blocks_count; i++) {
-   if (picker->hits[i].total > most_hits) {
-   most_hits = picker->hits[i].total;
+   if (picker[i].total > most_hits) {
+   most_hits = picker[i].total;
most_index = i;
}
}
@@ -206,7 +204,7 @@ static int mlxsw_afk_picker_key_info_add(struct mlxsw_afk 
*mlxsw_afk,
if (key_info->blocks_count == mlxsw_afk->max_blocks)
return -EINVAL;
 
-   for_each_set_bit(element, picker->hits[block_index].element,
+   for_each_set_bit(element, picker[block_index].element,
 MLXSW_AFK_ELEMENT_MAX) {
key_info->element_to_block[element] = key_info->blocks_count;
mlxsw_afk_element_usage_add(&key_info->elusage, element);
@@ -224,11 +222,9 @@ static int mlxsw_afk_picker(struct mlxsw_afk *mlxsw_afk,
 {
struct mlxsw_afk_picker *picker;
enum mlxsw_afk_element element;
-   size_t alloc_size;
int err;
 
-   alloc_size = sizeof(picker->hits[0]) * mlxsw_afk->blocks_count;
-   picker = kzalloc(alloc_size, GFP_KERNEL);
+   picker = kcalloc(mlxsw_afk->blocks_count, sizeof(*picker), GFP_KERNEL);
if (!picker)
return -ENOMEM;
 
-- 
2.28.0



[PATCH net-next 7/7] mlxsw: spectrum_router: Reduce mlxsw_sp_ipip_fib_entry_op_gre4()

2020-12-06 Thread Ido Schimmel
From: Jiri Pirko 

Turned out that mlxsw_sp_ipip_fib_entry_op_gre4() does not need to
figure out the IP address and virtual router id. Those are exactly
the same as in the fib_entry it is called for. So just use that and
reduce mlxsw_sp_ipip_fib_entry_op_gre4() function to only call
mlxsw_sp_ipip_fib_entry_op_gre4_rtdp() make the ipip decap op
code similar to nve.

Signed-off-by: Jiri Pirko 
Reviewed-by: Petr Machata 
Signed-off-by: Ido Schimmel 
---
 .../ethernet/mellanox/mlxsw/spectrum_ipip.c   | 45 ++-
 .../ethernet/mellanox/mlxsw/spectrum_ipip.h   |  8 +---
 .../ethernet/mellanox/mlxsw/spectrum_router.c | 18 +---
 .../ethernet/mellanox/mlxsw/spectrum_router.h |  4 --
 4 files changed, 19 insertions(+), 56 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
index 089d99535f9e..6ccca39bae84 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.c
@@ -142,9 +142,9 @@ mlxsw_sp_ipip_nexthop_update_gre4(struct mlxsw_sp 
*mlxsw_sp, u32 adj_index,
 }
 
 static int
-mlxsw_sp_ipip_fib_entry_op_gre4_rtdp(struct mlxsw_sp *mlxsw_sp,
-u32 tunnel_index,
-struct mlxsw_sp_ipip_entry *ipip_entry)
+mlxsw_sp_ipip_decap_config_gre4(struct mlxsw_sp *mlxsw_sp,
+   struct mlxsw_sp_ipip_entry *ipip_entry,
+   u32 tunnel_index)
 {
u16 rif_index = mlxsw_sp_ipip_lb_rif_index(ipip_entry->ol_lb);
u16 ul_rif_id = mlxsw_sp_ipip_lb_ul_rif_id(ipip_entry->ol_lb);
@@ -180,43 +180,6 @@ mlxsw_sp_ipip_fib_entry_op_gre4_rtdp(struct mlxsw_sp 
*mlxsw_sp,
return mlxsw_reg_write(mlxsw_sp->core, MLXSW_REG(rtdp), rtdp_pl);
 }
 
-static int
-mlxsw_sp_ipip_fib_entry_op_gre4_do(struct mlxsw_sp *mlxsw_sp,
-  const struct mlxsw_sp_router_ll_ops *ll_ops,
-  struct mlxsw_sp_fib_entry_op_ctx *op_ctx,
-  u32 dip, u8 prefix_len, u16 ul_vr_id,
-  enum mlxsw_sp_fib_entry_op op,
-  u32 tunnel_index,
-  struct mlxsw_sp_fib_entry_priv *priv)
-{
-   ll_ops->fib_entry_pack(op_ctx, MLXSW_SP_L3_PROTO_IPV4, op, ul_vr_id,
-  prefix_len, (unsigned char *) &dip, priv);
-   ll_ops->fib_entry_act_ip2me_tun_pack(op_ctx, tunnel_index);
-   return mlxsw_sp_fib_entry_commit(mlxsw_sp, op_ctx, ll_ops);
-}
-
-static int mlxsw_sp_ipip_fib_entry_op_gre4(struct mlxsw_sp *mlxsw_sp,
-  const struct mlxsw_sp_router_ll_ops 
*ll_ops,
-  struct mlxsw_sp_fib_entry_op_ctx 
*op_ctx,
-  struct mlxsw_sp_ipip_entry 
*ipip_entry,
-  enum mlxsw_sp_fib_entry_op op, u32 
tunnel_index,
-  struct mlxsw_sp_fib_entry_priv *priv)
-{
-   u16 ul_vr_id = mlxsw_sp_ipip_lb_ul_vr_id(ipip_entry->ol_lb);
-   __be32 dip;
-   int err;
-
-   err = mlxsw_sp_ipip_fib_entry_op_gre4_rtdp(mlxsw_sp, tunnel_index,
-  ipip_entry);
-   if (err)
-   return err;
-
-   dip = mlxsw_sp_ipip_netdev_saddr(MLXSW_SP_L3_PROTO_IPV4,
-ipip_entry->ol_dev).addr4;
-   return mlxsw_sp_ipip_fib_entry_op_gre4_do(mlxsw_sp, ll_ops, op_ctx, 
be32_to_cpu(dip),
- 32, ul_vr_id, op, 
tunnel_index, priv);
-}
-
 static bool mlxsw_sp_ipip_tunnel_complete(enum mlxsw_sp_l3proto proto,
  const struct net_device *ol_dev)
 {
@@ -332,7 +295,7 @@ static const struct mlxsw_sp_ipip_ops 
mlxsw_sp_ipip_gre4_ops = {
.dev_type = ARPHRD_IPGRE,
.ul_proto = MLXSW_SP_L3_PROTO_IPV4,
.nexthop_update = mlxsw_sp_ipip_nexthop_update_gre4,
-   .fib_entry_op = mlxsw_sp_ipip_fib_entry_op_gre4,
+   .decap_config = mlxsw_sp_ipip_decap_config_gre4,
.can_offload = mlxsw_sp_ipip_can_offload_gre4,
.ol_loopback_config = mlxsw_sp_ipip_ol_loopback_config_gre4,
.ol_netdev_change = mlxsw_sp_ipip_ol_netdev_change_gre4,
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
index d32702cb6ab4..87bef9880e5e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_ipip.h
@@ -50,13 +50,9 @@ struct mlxsw_sp_ipip_ops {
(*ol_loopback_config)(struct mlxsw_sp *mlxsw_sp,
  const struct net_device *ol_dev);
 
-   int (*fib_entry_op)(struct mlxsw_sp *mlxsw_sp,
-   const struct mlxsw_sp_router_ll_ops *ll_ops,

[PATCH net-next 4/7] mlxsw: spectrum_mr: Use flexible-array member instead of zero-length array

2020-12-06 Thread Ido Schimmel
From: Ido Schimmel 

Suppresses the following coccinelle warning:

drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c:18:15-19: WARNING use 
flexible-array member instead

Signed-off-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
index 47eb751a2570..7846a21555ef 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
@@ -15,7 +15,7 @@ struct mlxsw_sp_mr {
struct list_head table_list;
struct mutex table_list_lock; /* Protects table_list */
 #define MLXSW_SP_MR_ROUTES_COUNTER_UPDATE_INTERVAL 5000 /* ms */
-   unsigned long priv[0];
+   unsigned long priv[];
/* priv has to be always the last item */
 };
 
-- 
2.28.0



[PATCH net-next 6/7] mlxsw: spectrum: Bump minimum FW version to xx.2008.2018

2020-12-06 Thread Ido Schimmel
From: Petr Machata 

The indicated version fixes an issue whereby the MOMTE register would by
default enable mirroring of ECN-marked traffic from all traffic classes,
once the ECN mirroring was configured. This fix is necessary for offload
of RED "ecn_mark" qevent.

Signed-off-by: Petr Machata 
Signed-off-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 65e8407f4646..963eb0b1d9dd 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -45,7 +45,7 @@
 
 #define MLXSW_SP1_FWREV_MAJOR 13
 #define MLXSW_SP1_FWREV_MINOR 2008
-#define MLXSW_SP1_FWREV_SUBMINOR 1310
+#define MLXSW_SP1_FWREV_SUBMINOR 2018
 #define MLXSW_SP1_FWREV_CAN_RESET_MINOR 1702
 
 static const struct mlxsw_fw_rev mlxsw_sp1_fw_rev = {
@@ -62,7 +62,7 @@ static const struct mlxsw_fw_rev mlxsw_sp1_fw_rev = {
 
 #define MLXSW_SP2_FWREV_MAJOR 29
 #define MLXSW_SP2_FWREV_MINOR 2008
-#define MLXSW_SP2_FWREV_SUBMINOR 1310
+#define MLXSW_SP2_FWREV_SUBMINOR 2018
 
 static const struct mlxsw_fw_rev mlxsw_sp2_fw_rev = {
.major = MLXSW_SP2_FWREV_MAJOR,
@@ -77,7 +77,7 @@ static const struct mlxsw_fw_rev mlxsw_sp2_fw_rev = {
 
 #define MLXSW_SP3_FWREV_MAJOR 30
 #define MLXSW_SP3_FWREV_MINOR 2008
-#define MLXSW_SP3_FWREV_SUBMINOR 1310
+#define MLXSW_SP3_FWREV_SUBMINOR 2018
 
 static const struct mlxsw_fw_rev mlxsw_sp3_fw_rev = {
.major = MLXSW_SP3_FWREV_MAJOR,
-- 
2.28.0



Re: linux-next: ERROR: build error for arm64

2020-12-06 Thread Björn Töpel

On 2020-12-06 08:32, Hui Su wrote:

hi, all:

The error came out like this when i build the linux-next kernel
with ARCH=arm64, with the arm64_defconfig:

   CC  drivers/net/ethernet/freescale/dpaa/dpaa_eth.o
../drivers/net/ethernet/freescale/dpaa/dpaa_eth.c: In function ‘dpaa_fq_init’:
../drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:1135:9: error: too few 
arguments to function ‘xdp_rxq_info_reg’
  1135 |   err = xdp_rxq_info_reg(&dpaa_fq->xdp_rxq, dpaa_fq->net_dev,
   | ^~~~
In file included from ../include/linux/netdevice.h:42,
  from ../include/uapi/linux/if_arp.h:27,
  from ../include/linux/if_arp.h:23,
  from ../drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:40:
../include/net/xdp.h:229:5: note: declared here
   229 | int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
   | ^~~~
make[6]: *** [../scripts/Makefile.build:283: 
drivers/net/ethernet/freescale/dpaa/dpaa_eth.o] Error 1


Branch: git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
HEAD: 2996bd3f6ca9ea529b40c369a94b247657abdb4d
ARCH: arm64
CONFIG: arch/arm64/configs/defconfig
CMD: make ARCH=arm64 CROSS_COMPILE=/usr/bin/aarch64-linux-gnu- Image

It maybe introduced by the commit b02e5a0ebb172(xsk: Propagate napi_id to XDP 
socket Rx path),
and if anyone solved this, please add:

Reported-by: Hui Su 



Hui, was already resolved in commit fdd8b8249ef8 ("dpaa_eth: fix build
errorr in dpaa_fq_init").

Thanks,
Björn



Re: [PATCH v13 0/4] userspace MHI client interface driver

2020-12-06 Thread Leon Romanovsky
On Tue, Dec 01, 2020 at 09:59:53PM -0700, Jeffrey Hugo wrote:
> On 12/1/2020 7:55 PM, Jakub Kicinski wrote:
> > On Tue, 1 Dec 2020 13:48:36 -0700 Jeffrey Hugo wrote:
> > > On 12/1/2020 1:03 PM, Jakub Kicinski wrote:
> > > > On Tue, 1 Dec 2020 12:40:50 -0700 Jeffrey Hugo wrote:
> > > > > On 12/1/2020 12:29 PM, Jakub Kicinski wrote:
> > > > > > On Fri, 27 Nov 2020 19:26:02 -0800 Hemant Kumar wrote:
> > > > > > > This patch series adds support for UCI driver. UCI driver enables 
> > > > > > > userspace
> > > > > > > clients to communicate to external MHI devices like modem and 
> > > > > > > WLAN. UCI driver
> > > > > > > probe creates standard character device file nodes for userspace 
> > > > > > > clients to
> > > > > > > perform open, read, write, poll and release file operations. 
> > > > > > > These file
> > > > > > > operations call MHI core layer APIs to perform data transfer 
> > > > > > > using MHI bus
> > > > > > > to communicate with MHI device. Patch is tested using arm64 based 
> > > > > > > platform.
> > > > > >
> > > > > > Wait, I thought this was for modems.
> > > > > >
> > > > > > Why do WLAN devices need to communicate with user space?
> > > > >
> > > > > Why does it matter what type of device it is?  Are modems somehow 
> > > > > unique
> > > > > in that they are the only type of device that userspace is allowed to
> > > > > interact with?
> > > >
> > > > Yes modems are traditionally highly weird and require some serial
> > > > device dance I don't even know about.
> > > >
> > > > We have proper interfaces in Linux for configuring WiFi which work
> > > > across vendors. Having char device access to WiFi would be a step
> > > > back.
> > >
> > > So a WLAN device is only ever allowed to do Wi-Fi?  It can't also have
> > > GPS functionality for example?
> >
> > No, but it's also not true that the only way to implement GPS is by
> > opening a full on command/packet interface between fat proprietary
> > firmware and custom user space (which may or may not be proprietary
> > as well).
>
> Funny, that exactly what the GPS "API" in the kernel is, although a bit
> limited to the specifics on the standardized GPS "sentences" and not
> covering implementation specific configuration.
>
> >
> > > > > However, I'll bite.  Once such usecase would be QMI.  QMI is a generic
> > > > > messaging protocol, and is not strictly limited to the unique 
> > > > > operations
> > > > > of a modem.
> > > > >
> > > > > Another usecase would be Sahara - a custom file transfer protocol used
> > > > > for uploading firmware images, and downloading crashdumps.
> > > >
> > > > Thanks, I was asking for use cases, not which proprietary vendor
> > > > protocol you can implement over it.
> > > >
> > > > None of the use cases you mention here should require a direct FW -
> > > > user space backdoor for WLAN.
> > >
> > > Uploading runtime firmware, with variations based on the runtime mode.
> > > Flashing the onboard flash based on cryptographic keys.  Accessing
> > > configuration data.  Accessing device logs.  Configuring device logs.
> > > Synchronizing the device time reference to Linux local or remote time
> > > sources.  Enabling debugging/performance hardware.  Getting software
> > > diagnostic events.  Configuring redundancy hardware per workload.
> > > Uploading new cryptographic keys.  Invalidating cryptographic keys.
> > > Uploading factory test data and running factory tests.
> > >
> > > Need more?
> >
> > This conversation is going nowhere. Are you trying to say that creating
> > a common Linux API for those features is impossible and each vendor
> > should be allowed to add their own proprietary way?
> >
> > This has been proven incorrect again and again, and Wi-Fi is a good
> > example.
> >
> > You can do whatever you want for GPS etc. but don't come nowhere near
> > networking with this attitude please.
> >
>
> No I'm saying (and Bjorn/Mani by the looks of things), that there is
> commonality in the core features - IP traffic, Wi-Fi, etc but then there are
> vendor specific things which are either things you don't actually want in
> the kernel, don't want the kernel doing, or have little commonality between
> vendors such that attempting to unify them gains you little to nothing.
>
> Over in the networking space, I can see where standardization is plenty
> useful.
>
> I can't speak for other vendors, but a "modem" or a "wlan" device from
> Qualcomm is not something that just provides one service.  They tend to
> provide dozens of different functionalities, some of those are
> "standardized" like wi-fi where common wi-fi interfaces are used. Others are
> unique to Qualcomm.
>
> The point is "wlan device" is a superset of "wi-fi".  You seem to be
> equating them to be the same in a "shoot first, ask questions later" manner.
>
> This series provides a way for userspace to talk to remote MHI "widgets" for
> usecases not covered elsewhere.  Those "widgets" just happen to commonly
> provide modem/wlan services, but one

[PATCH net-next] net/mlx4: Remove unused #define MAX_MSIX_P_PORT

2020-12-06 Thread Tariq Toukan
From: Tariq Toukan 

All usages of the definition MAX_MSIX_P_PORT were removed.
It's not in use anymore. Remove it.

Signed-off-by: Tariq Toukan 
Reviewed-by: Moshe Shemesh 
---
 include/linux/mlx4/device.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 06e066e04a4b..236a7d04f891 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -46,7 +46,6 @@
 
 #define DEFAULT_UAR_PAGE_SHIFT  12
 
-#define MAX_MSIX_P_PORT17
 #define MAX_MSIX   128
 #define MIN_MSIX_P_PORT5
 #define MLX4_IS_LEGACY_EQ_MODE(dev_cap) ((dev_cap).num_comp_vectors < \
-- 
2.21.0



macb: should we revert 0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200") ?

2020-12-06 Thread Willy Tarreau
Hi Jakub,

Two months ago I implemented a small change in the macb driver to
support the two Tx descriptors that AT91RM9200 supports. I implemented
this using the only compatible device I had which is the MSC313E-based
Breadbee board. Since then I've met situations where the chip would stop
sending, mostly when doing bidirectional traffic. I've spent a few week-
ends on this already trying various things including blocking interrupts
on a larger place, switching to the 32-bit register access on the MSC313E
(previous code was using two 16-bit accesses and I thought it was the
cause), and tracing status registers along the various operations. Each
time the pattern looks similar, a write when the chips declares having
room results in an overrun, but the preceeding condition doesn't leave
any info suggesting this may happen.

Sadly we don't have the datasheet for this SoC, what is visible is that it
supports AT91RM9200's tx mode and that it works well when there's a single
descriptor in flight. In this case it complies with AT91RM9200's datasheet.
The chip reports other undocumented bits in its status registers, that
cannot even be correlated to the issue by the way. I couldn't spot anything
wrong there in my changes, even after doing heavy changes to progress on
debugging, and the SoC's behavior reporting an overrun after a single write
when there's room contradicts the RM9200's datasheet. In addition we know
the chip also supports other descriptor-based modes, so it's very possible
it doesn't perfectly implement the RM9200's 2-desc mode and that my change
is still fine.

Yesterday I hope to get my old AT91SAM9G20 board to execute this code and
test it, but it clearly does not work when I remap the init and tx functions,
which indicates that it really does not implement a hidden compatibility
mode with the old chip.

Thus at this point I have no way to make sure that the original SoC for
which I changed the code still works fine or if I broke it. As such, I'd
feel more comfortable if we'd revert my patch until someone has the
opportunity to test it on the original hardware (Alexandre suggested he
might, but later).

The commit in question is the following one:

  0a4e9ce17ba7 ("macb: support the two tx descriptors on at91rm9200")

If the maintainers prefer that we wait for an issue to be reported before
reverting it, that's fine for me as well. What's important to me is that
this potential issue is identified so as not to waste someone else's time
on it.

Thanks!
Willy


[PATCH 1/1] mwifiex: Fix possible buffer overflows in mwifiex_cmd_802_11_ad_hoc_start

2020-12-06 Thread Xiaohui Zhang
From: Zhang Xiaohui 

mwifiex_cmd_802_11_ad_hoc_start() calls memcpy() without checking
the destination size may trigger a buffer overflower,
which a local user could use to cause denial of service
or the execution of arbitrary code.
Fix it by putting the length check before calling memcpy().

Signed-off-by: Zhang Xiaohui 
---
 drivers/net/wireless/marvell/mwifiex/join.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/wireless/marvell/mwifiex/join.c 
b/drivers/net/wireless/marvell/mwifiex/join.c
index 5934f7147..173ccf79c 100644
--- a/drivers/net/wireless/marvell/mwifiex/join.c
+++ b/drivers/net/wireless/marvell/mwifiex/join.c
@@ -877,6 +877,8 @@ mwifiex_cmd_802_11_ad_hoc_start(struct mwifiex_private 
*priv,
 
memset(adhoc_start->ssid, 0, IEEE80211_MAX_SSID_LEN);
 
+   if (req_ssid->ssid_len > IEEE80211_MAX_SSID_LEN)
+   req_ssid->ssid_len = IEEE80211_MAX_SSID_LEN;
memcpy(adhoc_start->ssid, req_ssid->ssid, req_ssid->ssid_len);
 
mwifiex_dbg(adapter, INFO, "info: ADHOC_S_CMD: SSID = %s\n",
-- 
2.17.1



[PATCH v2 0/2] net: dsa: lantiq: add support for xRX300 and xRX330

2020-12-06 Thread Aleksander Jan Bajkowski
From: Aleksander Jan Bajkowski 

Changed since v1:
* gswip_mii_mask_cfg() can now change port 3 on xRX330
* changed alowed modes on port 0 and 5 for xRX300 and xRX330
* moved common part of phylink validation into gswip_phylink_set_capab()
* verify the compatible string against the hardware

Aleksander Jan Bajkowski (2):
  net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330
  dt-bindings: net: dsa: lantiq, lantiq-gswip: add example for xRX330

 .../bindings/net/dsa/lantiq-gswip.txt | 110 +++-
 drivers/net/dsa/lantiq_gswip.c| 170 +++---
 2 files changed, 250 insertions(+), 30 deletions(-)

-- 
2.20.1



[PATCH v2 1/2] net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330

2020-12-06 Thread Aleksander Jan Bajkowski
This patch allows to use all PHYs on GRX300 and GRX330. The ARX300 has 3
and the GRX330 has 4 integrated PHYs connected to different ports compared
to VRX200.

Port configurations:

xRX200:
GMAC0: RGMII/MII/REVMII/RMII port
GMAC1: RGMII/MII/REVMII/RMII port
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII) or RGMII port

xRX300:
GMAC0: RGMII port
GMAC1: GPHY2 (GMII)
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII) or RGMII port

xRX330:
GMAC0: RGMII/GMII/RMII port
GMAC1: GPHY2 (GMII)
GMAC2: GPHY0 (GMII)
GMAC3: GPHY0 (MII) or GPHY3 (GMII)
GMAC4: GPHY1 (GMII)
GMAC5: GPHY1 (MII) or RGMII/RMII port

Tested on D-Link DWR966 with OpenWRT.

Signed-off-by: Aleksander Jan Bajkowski 
---
 drivers/net/dsa/lantiq_gswip.c | 170 +++--
 1 file changed, 141 insertions(+), 29 deletions(-)

diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
index 09701c17f3f6..4c8f611ed397 100644
--- a/drivers/net/dsa/lantiq_gswip.c
+++ b/drivers/net/dsa/lantiq_gswip.c
@@ -94,6 +94,7 @@
 /* GSWIP MII Registers */
 #define GSWIP_MII_CFG0 0x00
 #define GSWIP_MII_CFG1 0x02
+#define GSWIP_MII_CFG3 0xc3
 #define GSWIP_MII_CFG5 0x04
 #define  GSWIP_MII_CFG_EN  BIT(14)
 #define  GSWIP_MII_CFG_LDCLKDISBIT(12)
@@ -102,6 +103,7 @@
 #define  GSWIP_MII_CFG_MODE_RMIIP  0x2
 #define  GSWIP_MII_CFG_MODE_RMIIM  0x3
 #define  GSWIP_MII_CFG_MODE_RGMII  0x4
+#define  GSWIP_MII_CFG_MODE_GMII   0x9
 #define  GSWIP_MII_CFG_MODE_MASK   0xf
 #define  GSWIP_MII_CFG_RATE_M2P5   0x00
 #define  GSWIP_MII_CFG_RATE_M250x10
@@ -222,6 +224,7 @@
 struct gswip_hw_info {
int max_ports;
int cpu_port;
+   struct dsa_switch_ops *ops;
 };
 
 struct xway_gphy_match_data {
@@ -392,12 +395,19 @@ static void gswip_mii_mask(struct gswip_priv *priv, u32 
clear, u32 set,
 static void gswip_mii_mask_cfg(struct gswip_priv *priv, u32 clear, u32 set,
   int port)
 {
+   struct device_node *np = priv->ds->dev->of_node;
+
switch (port) {
case 0:
gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG0);
break;
case 1:
-   gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
+   if (of_device_is_compatible(np, "lantiq,xrx200-gswip"))
+   gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
+   break;
+   case 3:
+   if (of_device_is_compatible(np, "lantiq,xrx330-gswip"))
+   gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG3);
break;
case 5:
gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG5);
@@ -1409,12 +1419,40 @@ static int gswip_port_fdb_dump(struct dsa_switch *ds, 
int port,
return 0;
 }
 
-static void gswip_phylink_validate(struct dsa_switch *ds, int port,
-  unsigned long *supported,
-  struct phylink_link_state *state)
+static void gswip_phylink_set_capab(unsigned long *supported, struct 
phylink_link_state *state)
 {
__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
 
+   /* Allow all the expected bits */
+   phylink_set(mask, Autoneg);
+   phylink_set_port_modes(mask);
+   phylink_set(mask, Pause);
+   phylink_set(mask, Asym_Pause);
+
+   /* With the exclusion of MII and Reverse MII, we support Gigabit,
+* including Half duplex
+*/
+   if (state->interface != PHY_INTERFACE_MODE_MII &&
+   state->interface != PHY_INTERFACE_MODE_REVMII) {
+   phylink_set(mask, 1000baseT_Full);
+   phylink_set(mask, 1000baseT_Half);
+   }
+
+   phylink_set(mask, 10baseT_Half);
+   phylink_set(mask, 10baseT_Full);
+   phylink_set(mask, 100baseT_Half);
+   phylink_set(mask, 100baseT_Full);
+
+   bitmap_and(supported, supported, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+   bitmap_and(state->advertising, state->advertising, mask,
+  __ETHTOOL_LINK_MODE_MASK_NBITS);
+}
+
+static void gswip_xrx200_phylink_validate(struct dsa_switch *ds, int port,
+ unsigned long *supported,
+ struct phylink_link_state *state)
+{
switch (port) {
case 0:
case 1:
@@ -1441,37 +1479,56 @@ static void gswip_phylink_validate(struct dsa_switch 
*ds, int port,
return;
}
 
-   /* Allow all the expected bits */
-   phylink_set(mask, Autoneg);
-   phylink_set_port_modes(mask);
-   phylink_set(mask, Pause);
-   phylink_set(mask, Asym_Pause);
+   gswip_phylink_set_capab(supported, state);
 
-   /* With the exclusion of MII and Reverse MII, we support Gigabit,
-* including Half duplex
-*/
-   if (state->inter

[PATCH v2 2/2] dt-bindings: net: dsa: lantiq, lantiq-gswip: add example for xRX330

2020-12-06 Thread Aleksander Jan Bajkowski
Add compatible string and example for xRX300 and xRX330.

Signed-off-by: Aleksander Jan Bajkowski 
---
 .../bindings/net/dsa/lantiq-gswip.txt | 110 +-
 1 file changed, 109 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt 
b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
index 886cbe8ffb38..7a90a6a1b065 100644
--- a/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
+++ b/Documentation/devicetree/bindings/net/dsa/lantiq-gswip.txt
@@ -3,7 +3,8 @@ Lantiq GSWIP Ethernet switches
 
 Required properties for GSWIP core:
 
-- compatible   : "lantiq,xrx200-gswip" for the embedded GSWIP in the
+- compatible   : "lantiq,xrx200-gswip", "lantiq,xrx300-gswip" or
+ "lantiq,xrx330-gswip" for the embedded GSWIP in the
  xRX200 SoC
 - reg  : memory range of the GSWIP core registers
: memory range of the GSWIP MDIO registers
@@ -141,3 +142,110 @@ switch@e108000 {
};
};
 };
+
+Ethernet switch on the GRX330 SoC:
+
+switch@e108000 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "lantiq,xrx300-gswip";
+   reg = < 0xe108000 0x3100/* switch */
+   0xe10b100 0xd8  /* mdio */
+   0xe10b1d8 0x130 /* mii */
+   >;
+   dsa,member = <0 0>;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   port@1 {
+   reg = <1>;
+   label = "lan1";
+   phy-mode = "internal";
+   phy-handle = <&phy1>;
+   };
+
+   port@2 {
+   reg = <2>;
+   label = "lan2";
+   phy-mode = "internal";
+   phy-handle = <&phy2>;
+   };
+
+   port@3 {
+   reg = <3>;
+   label = "lan3";
+   phy-mode = "internal";
+   phy-handle = <&phy3>;
+   };
+
+   port@4 {
+   reg = <4>;
+   label = "lan4";
+   phy-mode = "internal";
+   phy-handle = <&phy4>;
+   };
+
+   port@6 {
+   reg = <0x6>;
+   label = "cpu";
+   ethernet = <ð0>;
+   };
+   };
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "lantiq,xrx200-mdio";
+   reg = <0>;
+
+   phy1: ethernet-phy@1 {
+   reg = <0x1>;
+   };
+   phy2: ethernet-phy@2 {
+   reg = <0x2>;
+   };
+   phy3: ethernet-phy@3 {
+   reg = <0x3>;
+   };
+   phy4: ethernet-phy@4 {
+   reg = <0x4>;
+   };
+   };
+
+   gphy-fw {
+   compatible = "lantiq,xrx330-gphy-fw", "lantiq,gphy-fw";
+   lantiq,rcu = <&rcu0>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   gphy@20 {
+   reg = <0x20>;
+
+   resets = <&reset0 31 30>;
+   reset-names = "gphy";
+   };
+
+   gphy@68 {
+   reg = <0x68>;
+
+   resets = <&reset0 29 28>;
+   reset-names = "gphy";
+   };
+
+   gphy@ac {
+   reg = <0xac>;
+
+   resets = <&reset0 28 13>;
+   reset-names = "gphy";
+   };
+
+   gphy@264 {
+   reg = <0x264>;
+
+   resets = <&reset0 10 10>;
+   reset-names = "gphy";
+   };
+   };
+};
-- 
2.20.1



Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Eran Ben Elisha




On 12/4/2020 11:57 PM, Saeed Mahameed wrote:

We only forward ptp traffic to the new special queue but we create more
than one to avoid internal locking as we will utilize the tx softirq
percpu.

After double checking the code it seems Eran and Tariq have decided to
forward all UDP traffic, let me double check with them what happened
here.


We though about extending the support of these queues to UDP in general, 
and not just PTP. But we can role this back to PTP time critical events 
on dport 319 only.


Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Eran Ben Elisha




On 12/5/2020 1:17 AM, Jakub Kicinski wrote:

We only forward ptp traffic to the new special queue but we create more
than one to avoid internal locking as we will utilize the tx softirq
percpu.

In other words to make the driver implementation simpler we'll have
a pretty basic feature hidden behind a ethtool priv knob and a number
of queues which doesn't match reality reported to user space. Hm.


We are not hiding these queues from the netdev stack. We report them in 
real num of TX queues and manage them as any other queue. The only 
change is that select_queue() will select a stream to them if and only 
if they match specific criteria.


Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Eran Ben Elisha




On 12/5/2020 2:24 AM, Jakub Kicinski wrote:

On Fri, 04 Dec 2020 15:57:36 -0800 Saeed Mahameed wrote:

On Fri, 2020-12-04 at 15:17 -0800, Jakub Kicinski wrote:

On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:

option 2) route PTP traffic to a special SQs per ring, this SQ
will
be
PTP port accurate, Normal traffic will continue through regular
SQs

Pros: Regular non PTP traffic not affected.
Cons: High memory footprint for creating special SQs

So we prefer (2) + private flag to avoid the performance hit
and
the
redundant memory usage out of the box.


Option 3 - have only one special PTP queue in the system. PTP
traffic
is rather low rate, queue per core doesn't seem necessary.


We only forward ptp traffic to the new special queue but we create
more
than one to avoid internal locking as we will utilize the tx
softirq
percpu.


In other words to make the driver implementation simpler we'll have
a pretty basic feature hidden behind a ethtool priv knob and a number
of queues which doesn't match reality reported to user space. Hm.


I look at these queues as a special HW objects to allow the accurate
PTP stamping, they piggyback on the reported txqs, so they are
transparent,


But they are visible to the stack, via sysfs, netlink. Any check
in the kernel that tries to help the driver by validating user input
against real_num_tx_queues will be moot for mlx5e.


Re-writing it here,  we report them in real num of TX queues.



mlx5e hides the AF_XDP queues behind normal RSS queues, but it would
have extra visible queues for TX PTP.


they just increase the memory footprint of each ring.


For every ring or for every TC? (which is hopefully 1 in any non-DCB
deployment?)


For every TC, not for every ring.




for the priv flags, one of the floating ideas was to
use hwtstamp_rx_filters flags:
  
https://elixir.bootlin.com/linux/latest/source/include/uapi/linux/net_tstamp.h#L107


Our hardware timestamps all packets for free whether you request it or
not, Currently there is no option to setup "ALL_PTP" traffic in ethtool
-T, but we can add this flag as it make sense to be in ethtool -T, thus
we could use it in mlx5 to determine if user selected ALL_PTP, then ptp
packets will go through this accurate special path.

This is not a W/A or an abuse to the new flag, it just means if you
select ALL_PTP then a side effect will be our HW will be more accurate
for PTP traffic.

What do you think ?


That sounds much better than the priv flag, yes.


Our Hardware can provide a better accurate time stamp under few 
limitations. It requires higher memory consumption ({SQ, 2 x CQ, 
internal HW LB RQ} per TC), and also has performance impact (more CQEs 
to consume for example).
Some customers are happy with the accuracy they get today and don't want 
the extra penalty, so they don't want to be automatically shifted to the 
new TS logic.


Adding new enum to the ioctl means we have add 
(HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way - drivers, 
kernel ptp, user space ptp, ethtool.


My concerns are:
1. Timestamp applications (like ptp4l or similar) will have to add 
support for configuring the driver to use 
HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY if supported via ioctl prior to 
packets transmit. From application point of view, the dual-modes 
(HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY , HWTSTAMP_TX_ON) support is 
redundant, as it offers nothing new.
2. Other vendors will have to support it as well, when not sure what is 
the expectation from them if they cannot improve accuracy between them.


This feature is just an internal enhancement, and as such it should be 
added only as a vendor private configuration flag. We are not offering 
here about any standard for others to follow.


If we did not have the limitation above, it could have been added as the 
default silently.


I suggest we reconsider the ethtool private-flag, the ioctl change might 
be a long journey in a wrong direction.





Regarding reducing to a single special queue, i will discuss with Eran
and the Team on Sunday.


Okay, thanks.



[PATCH 1/1] ionic: fix array overflow on receiving too many fragments for a packet

2020-12-06 Thread Xiaohui Zhang
From: Zhang Xiaohui 

If the hardware receives an oversized packet with too many rx fragments,
skb_shinfo(skb)->frags can overflow and corrupt memory of adjacent pages.
This becomes especially visible if it corrupts the freelist pointer of
a slab page.

Signed-off-by: Zhang Xiaohui 
---
 drivers/net/ethernet/pensando/ionic/ionic_txrx.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c 
b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
index 169ac4f54..a3e274c65 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
@@ -102,8 +102,12 @@ static struct sk_buff *ionic_rx_frags(struct ionic_queue 
*q,
 
dma_unmap_page(dev, dma_unmap_addr(page_info, dma_addr),
   PAGE_SIZE, DMA_FROM_DEVICE);
-   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+   if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
+   skb_add_rx_frag(skb, shinfo->nr_frags,
page_info->page, 0, frag_len, PAGE_SIZE);
+   }
page_info->page = NULL;
page_info++;
i--;
-- 
2.17.1



[PATCH v2] can-isotp: add SF_BROADCAST support for functional addressing

2020-12-06 Thread Oliver Hartkopp
When CAN_ISOTP_SF_BROADCAST is set in the CAN_ISOTP_OPTS flags the
CAN_ISOTP socket is switched into functional addressing mode, where
only single frame (SF) protocol data units can be send on the specified
CAN interface and the given tp.tx_id after bind().

In opposite to normal and extended addressing this socket does not
register a CAN-ID for reception which would be needed for a 1-to-1
ISOTP connection with a segmented bi-directional data transfer.

Sending SFs on this socket is therefore a TX-only 'broadcast' operation.

Signed-off-by: Oliver Hartkopp 
Signed-off-by: Thomas Wagner 
Link: https://lore.kernel.org/r/20201203140604.25488-3-socket...@hartkopp.net
Link: https://lore.kernel.org/r/20201204135557.55599-1-th...@web.de
---
 include/uapi/linux/can/isotp.h |  2 +-
 net/can/isotp.c| 42 +++---
 2 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/uapi/linux/can/isotp.h b/include/uapi/linux/can/isotp.h
index 7793b26aa154..c55935b64ccc 100644
--- a/include/uapi/linux/can/isotp.h
+++ b/include/uapi/linux/can/isotp.h
@@ -133,11 +133,11 @@ struct can_isotp_ll_options {
 #define CAN_ISOTP_HALF_DUPLEX  0x040   /* half duplex error state handling */
 #define CAN_ISOTP_FORCE_TXSTMIN0x080   /* ignore stmin from received 
FC */
 #define CAN_ISOTP_FORCE_RXSTMIN0x100   /* ignore CFs depending on rx 
stmin */
 #define CAN_ISOTP_RX_EXT_ADDR  0x200   /* different rx extended addressing */
 #define CAN_ISOTP_WAIT_TX_DONE 0x400   /* wait for tx completion */
-
+#define CAN_ISOTP_SF_BROADCAST 0x800   /* 1-to-N functional addressing */
 
 /* default values */
 
 #define CAN_ISOTP_DEFAULT_FLAGS0
 #define CAN_ISOTP_DEFAULT_EXT_ADDRESS  0x00
diff --git a/net/can/isotp.c b/net/can/isotp.c
index 26bdc3c20b7e..7839c3b9e5be 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -863,10 +863,18 @@ static int isotp_sendmsg(struct socket *sock, struct 
msghdr *msg, size_t size)
}
 
if (!size || size > MAX_MSG_LENGTH)
return -EINVAL;
 
+   /* take care of a potential SF_DL ESC offset for TX_DL > 8 */
+   off = (so->tx.ll_dl > CAN_MAX_DLEN) ? 1 : 0;
+
+   /* does the given data fit into a single frame for SF_BROADCAST? */
+   if ((so->opt.flags & CAN_ISOTP_SF_BROADCAST) &&
+   (size > so->tx.ll_dl - SF_PCI_SZ4 - ae - off))
+   return -EINVAL;
+
err = memcpy_from_msg(so->tx.buf, msg, size);
if (err < 0)
return err;
 
dev = dev_get_by_index(sock_net(sk), so->ifindex);
@@ -889,13 +897,10 @@ static int isotp_sendmsg(struct socket *sock, struct 
msghdr *msg, size_t size)
so->tx.idx = 0;
 
cf = (struct canfd_frame *)skb->data;
skb_put(skb, so->ll.mtu);
 
-   /* take care of a potential SF_DL ESC offset for TX_DL > 8 */
-   off = (so->tx.ll_dl > CAN_MAX_DLEN) ? 1 : 0;
-
/* check for single frame transmission depending on TX_DL */
if (size <= so->tx.ll_dl - SF_PCI_SZ4 - ae - off) {
/* The message size generally fits into a SingleFrame - good.
 *
 * SF_DL ESC offset optimization:
@@ -1014,11 +1019,11 @@ static int isotp_release(struct socket *sock)
 
hrtimer_cancel(&so->txtimer);
hrtimer_cancel(&so->rxtimer);
 
/* remove current filters & unregister */
-   if (so->bound) {
+   if (so->bound && (!(so->opt.flags & CAN_ISOTP_SF_BROADCAST))) {
if (so->ifindex) {
struct net_device *dev;
 
dev = dev_get_by_index(net, so->ifindex);
if (dev) {
@@ -1050,19 +1055,29 @@ static int isotp_bind(struct socket *sock, struct 
sockaddr *uaddr, int len)
struct net *net = sock_net(sk);
int ifindex;
struct net_device *dev;
int err = 0;
int notify_enetdown = 0;
+   int do_rx_reg = 1;
 
if (len < CAN_REQUIRED_SIZE(struct sockaddr_can, can_addr.tp))
return -EINVAL;
 
-   if (addr->can_addr.tp.rx_id == addr->can_addr.tp.tx_id)
-   return -EADDRNOTAVAIL;
+   /* do not register frame reception for functional addressing */
+   if (so->opt.flags & CAN_ISOTP_SF_BROADCAST)
+   do_rx_reg = 0;
+
+   /* do not validate rx address for functional addressing */
+   if (do_rx_reg) {
+   if (addr->can_addr.tp.rx_id == addr->can_addr.tp.tx_id)
+   return -EADDRNOTAVAIL;
+
+   if (addr->can_addr.tp.rx_id & (CAN_ERR_FLAG | CAN_RTR_FLAG))
+   return -EADDRNOTAVAIL;
+   }
 
-   if ((addr->can_addr.tp.rx_id | addr->can_addr.tp.tx_id) &
-   (CAN_ERR_FLAG | CAN_RTR_FLAG))
+   if (addr->can_addr.tp.tx_id & (CAN_ERR_FLAG | CAN_RTR_FLAG))
return -EADDRNOTAVAIL;
 
if (!addr->can_ifindex)
return -ENODEV;
 
@@ -1091,17 +1106,18 @@ static int isotp_b

[PATCH] dpaa2-mac: Add a missing of_node_put after of_device_is_available

2020-12-06 Thread Christophe JAILLET
Add an 'of_node_put()' call when a tested device node is not available.

Fixes:94ae899b2096 ("dpaa2-mac: add PCS support through the Lynx module")
Signed-off-by: Christophe JAILLET 
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
index 90cd243070d7..828c177df03d 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-mac.c
@@ -269,6 +269,7 @@ static int dpaa2_pcs_create(struct dpaa2_mac *mac,
 
if (!of_device_is_available(node)) {
netdev_err(mac->net_dev, "pcs-handle node not available\n");
+   of_node_put(node);
return -ENODEV;
}
 
-- 
2.27.0



Re: WARNING in __cfg80211_ibss_joined (2)

2020-12-06 Thread syzbot
syzbot has found a reproducer for the following issue on:

HEAD commit:7059c2c0 Merge branch 'for-linus' of git://git.kernel.org/..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11a1199b50
kernel config:  https://syzkaller.appspot.com/x/.config?x=e49433cfed49b7d9
dashboard link: https://syzkaller.appspot.com/bug?extid=7f064ba1704c2466e36d
compiler:   gcc (GCC) 10.1.0-syz 20200507
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=146ff2ef50
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=105d68df50

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+7f064ba1704c2466e...@syzkaller.appspotmail.com

[ cut here ]
WARNING: CPU: 1 PID: 9804 at net/wireless/ibss.c:36 
__cfg80211_ibss_joined+0x487/0x520 net/wireless/ibss.c:36
Modules linked in:
CPU: 1 PID: 9804 Comm: kworker/u4:6 Not tainted 5.10.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
Workqueue: cfg80211 cfg80211_event_work
RIP: 0010:__cfg80211_ibss_joined+0x487/0x520 net/wireless/ibss.c:36
Code: 0f 0b e9 0c fe ff ff e8 b7 55 7a f9 e9 41 fc ff ff e8 8d 55 7a f9 e9 7d 
fc ff ff e8 a3 55 7a f9 e9 0d ff ff ff e8 29 d7 38 f9 <0f> 0b e9 7e fc ff ff e8 
1d d7 38 f9 0f 0b e8 96 55 7a f9 e9 e4 fb
RSP: 0018:c9000a85fbd8 EFLAGS: 00010293
RAX:  RBX: 888014edcc10 RCX: 8155a937
RDX: 88802295b480 RSI: 88372d57 RDI: 
RBP: 888014edc000 R08: 0001 R09: 8ebaf6bf
R10: fbfff1d75ed7 R11:  R12: 19200150bf7d
R13: 88803471e818 R14:  R15: 0006
FS:  () GS:8880b9f0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20e0a000 CR3: 25d73000 CR4: 00350ee0
Call Trace:
 cfg80211_process_wdev_events+0x3de/0x5b0 net/wireless/util.c:942
 cfg80211_process_rdev_events+0x6e/0x100 net/wireless/util.c:968
 cfg80211_event_work+0x1a/0x20 net/wireless/core.c:322
 process_one_work+0x933/0x15a0 kernel/workqueue.c:2272
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2418
 kthread+0x3b1/0x4a0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296



Re: LRO: creating vlan subports affects parent port's LRO settings

2020-12-06 Thread Michal Kubecek
On Sat, Dec 05, 2020 at 07:04:06PM -0500, Jarod Wilson wrote:
> On Mon, Nov 23, 2020 at 7:27 PM Jakub Kicinski  wrote:
> >
> > On Thu, 19 Nov 2020 20:37:27 -0500 Limin Wang wrote:
> > > Under relatively recent kernels (v4.4+), creating a vlan subport on a
> > > LRO supported parent NIC may turn LRO off on the parent port and
> > > further render its LRO feature practically unchangeable.
> >
> > That does sound like an oversight in commit fd867d51f889 ("net/core:
> > generic support for disabling netdev features down stack").
> >
> > Are you able to create a patch to fix this?
> 
> Something like this, perhaps? Completely untested copy-pasta'd
> theoretical patch:
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8588ade790cb..a5ce372e02ba 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -9605,8 +9605,10 @@ int __netdev_update_features(struct net_device *dev)
> features = netdev_fix_features(dev, features);
> 
> /* some features can't be enabled if they're off on an upper device */
> -   netdev_for_each_upper_dev_rcu(dev, upper, iter)
> -   features = netdev_sync_upper_features(dev, upper, features);
> +   netdev_for_each_upper_dev_rcu(dev, upper, iter) {
> +   if (netif_is_lag_master(upper) || 
> netif_is_bridge_master(upper))
> +   features = netdev_sync_upper_features(dev,
> upper, features);
> +   }
> 
> if (dev->features == features)
> goto sync_lower;
> @@ -9633,8 +9635,10 @@ int __netdev_update_features(struct net_device *dev)
> /* some features must be disabled on lower devices when disabled
>  * on an upper device (think: bonding master or bridge)
>  */
> -   netdev_for_each_lower_dev(dev, lower, iter)
> -   netdev_sync_lower_features(dev, lower, features);
> +   if (netif_is_lag_master(dev) || netif_is_bridge_master(dev)) {
> +   netdev_for_each_lower_dev(dev, lower, iter)
> +   netdev_sync_lower_features(dev, lower, features);
> +   }
> 
> if (!err) {
> netdev_features_t diff = features ^ dev->features;
> 
> I'm not sure what all other upper devices this excludes besides just
> vlan ports though, so perhaps safer add upper device types to not do
> feature sync on than to choose which ones to do them on?

I'm not sure excluding devices from feature sync is the right way,
whether it's an explicit list types or default. The logic still makes
sense to me. Couldn't we address the issue by either setting features in
NETIF_F_UPPER_DISABLES) by default for a new vlan (and probably macvlan)
device? Or perhaps inheriting their values from the lower device.

Michal


Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Richard Cochran
On Sun, Dec 06, 2020 at 03:37:47PM +0200, Eran Ben Elisha wrote:
> Adding new enum to the ioctl means we have add
> (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY for example) all the way - drivers,
> kernel ptp, user space ptp, ethtool.
> 
> My concerns are:
> 1. Timestamp applications (like ptp4l or similar) will have to add support
> for configuring the driver to use HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY if
> supported via ioctl prior to packets transmit. From application point of
> view, the dual-modes (HWTSTAMP_TX_ON_TIME_CRITICAL_ONLY , HWTSTAMP_TX_ON)
> support is redundant, as it offers nothing new.

Well said.

> 2. Other vendors will have to support it as well, when not sure what is the
> expectation from them if they cannot improve accuracy between them.

If there were multiple different devices out there with this kind of
implementation (different levels of accuracy with increasing run time
performance cost), then we could consider such a flag.  However, to my
knowledge, this feature is unique to your device.

> This feature is just an internal enhancement, and as such it should be added
> only as a vendor private configuration flag. We are not offering here about
> any standard for others to follow.

+1

Thanks,
Richard


Re: [PATCH v3 0/7] Improve s0ix flows for systems i219LM

2020-12-06 Thread Alexander Duyck
On Sat, Dec 5, 2020 at 3:49 PM Jakub Kicinski  wrote:
>
> On Fri, 4 Dec 2020 14:38:03 -0800 Alexander Duyck wrote:
> > > > The patches look good to me. Just need to address the minor issue that
> > > > seems to have been present prior to the introduction of this patch
> > > > set.
> > > >
> > > > Reviewed-by: Alexander Duyck 
> > >
> > > Thanks for your review.  Just some operational questions - since this 
> > > previously
> > > existed do you want me to re-spin the series to a v4 for this, or should 
> > > it be
> > > a follow up after the series?
> > >
> > > If I respin it, would you prefer that change to occur at the start or end
> > > of the series?
> >
> > I don't need a respin, but if you are going to fix it you should
> > probably put out the patch as something like a 8/7. If you respin it
> > should happen near the start of the series as it is a bug you are
> > addressing.
>
> Don't we need that patch to be before this series so it can be
> back ported easily? Or is it not really a bug?

You're right. For backports it would make it easier to have the patch
be before the changes. As far as being a bug, it is one, but it isn't
an urgent bug as it is basically some bad exception handling so the
likelihood of seeing it should be quite low.


Re: [PATCH ethtool v2] Improve error message when SFP module is missing

2020-12-06 Thread Michal Kubecek
On Wed, Dec 02, 2020 at 07:22:14PM +0200, Baruch Siach wrote:
> ETHTOOL_GMODULEINFO request success indicates that SFP cage is present.
> Failure of ETHTOOL_GMODULEEEPROM is most likely because SFP module is
> not plugged in. Add an indication to the user as to what might be the
> reason for the failure.
> 
> Signed-off-by: Baruch Siach 

Applied (with minor whitespace formatting cleanup), thank you.

Michal

> ---
> v2: Limit message to likely errno values (Andrew Lunn)
> ---
>  ethtool.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/ethtool.c b/ethtool.c
> index 1d9067e774af..63b3a095eded 100644
> --- a/ethtool.c
> +++ b/ethtool.c
> @@ -4855,7 +4855,11 @@ static int do_getmodule(struct cmd_context *ctx)
>   eeprom->offset = geeprom_offset;
>   err = send_ioctl(ctx, eeprom);
>   if (err < 0) {
> + int saved_errno = errno;
>   perror("Cannot get Module EEPROM data");
> + if (saved_errno == ENODEV || saved_errno == EIO ||
> + saved_errno == ENXIO)
> + fprintf(stderr, "SFP module not in cage?\n");
>   free(eeprom);
>   return 1;
>   }
> -- 
> 2.29.2
> 


signature.asc
Description: PGP signature


Re: [PATCH v2 1/2] net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330

2020-12-06 Thread Vladimir Oltean
On Sun, Dec 06, 2020 at 02:27:12PM +0100, Aleksander Jan Bajkowski wrote:
> This patch allows to use all PHYs on GRX300 and GRX330. The ARX300 has 3
> and the GRX330 has 4 integrated PHYs connected to different ports compared
> to VRX200.
> 
> Port configurations:
> 
> xRX200:
> GMAC0: RGMII/MII/REVMII/RMII port
> GMAC1: RGMII/MII/REVMII/RMII port
> GMAC2: GPHY0 (GMII)
> GMAC3: GPHY0 (MII)
> GMAC4: GPHY1 (GMII)
> GMAC5: GPHY1 (MII) or RGMII port
> 
> xRX300:
> GMAC0: RGMII port
> GMAC1: GPHY2 (GMII)
> GMAC2: GPHY0 (GMII)
> GMAC3: GPHY0 (MII)
> GMAC4: GPHY1 (GMII)
> GMAC5: GPHY1 (MII) or RGMII port
> 
> xRX330:
> GMAC0: RGMII/GMII/RMII port
> GMAC1: GPHY2 (GMII)
> GMAC2: GPHY0 (GMII)
> GMAC3: GPHY0 (MII) or GPHY3 (GMII)
> GMAC4: GPHY1 (GMII)
> GMAC5: GPHY1 (MII) or RGMII/RMII port

When you say GMII/MII when you are talking to the GPHY ports, what you
are really talking about is 1000Base-T vs 100Base-TX, right?

How about xRX330, does that really expose the full parallel GMAC pinout
on GMAC0?

> 
> Tested on D-Link DWR966 with OpenWRT.
> 
> Signed-off-by: Aleksander Jan Bajkowski 
> ---
>  drivers/net/dsa/lantiq_gswip.c | 170 +++--
>  1 file changed, 141 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
> index 09701c17f3f6..4c8f611ed397 100644
> --- a/drivers/net/dsa/lantiq_gswip.c
> +++ b/drivers/net/dsa/lantiq_gswip.c
> @@ -94,6 +94,7 @@
>  /* GSWIP MII Registers */
>  #define GSWIP_MII_CFG0   0x00
>  #define GSWIP_MII_CFG1   0x02
> +#define GSWIP_MII_CFG3   0xc3
>  #define GSWIP_MII_CFG5   0x04
>  #define  GSWIP_MII_CFG_ENBIT(14)
>  #define  GSWIP_MII_CFG_LDCLKDIS  BIT(12)
> @@ -102,6 +103,7 @@
>  #define  GSWIP_MII_CFG_MODE_RMIIP0x2
>  #define  GSWIP_MII_CFG_MODE_RMIIM0x3
>  #define  GSWIP_MII_CFG_MODE_RGMII0x4
> +#define  GSWIP_MII_CFG_MODE_GMII 0x9
>  #define  GSWIP_MII_CFG_MODE_MASK 0xf
>  #define  GSWIP_MII_CFG_RATE_M2P5 0x00
>  #define  GSWIP_MII_CFG_RATE_M25  0x10
> @@ -222,6 +224,7 @@
>  struct gswip_hw_info {
>   int max_ports;
>   int cpu_port;
> + struct dsa_switch_ops *ops;
>  };
>  
>  struct xway_gphy_match_data {
> @@ -392,12 +395,19 @@ static void gswip_mii_mask(struct gswip_priv *priv, u32 
> clear, u32 set,
>  static void gswip_mii_mask_cfg(struct gswip_priv *priv, u32 clear, u32 set,
>  int port)
>  {
> + struct device_node *np = priv->ds->dev->of_node;
> +
>   switch (port) {
>   case 0:
>   gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG0);
>   break;
>   case 1:
> - gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
> + if (of_device_is_compatible(np, "lantiq,xrx200-gswip"))
> + gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG1);
> + break;
> + case 3:
> + if (of_device_is_compatible(np, "lantiq,xrx330-gswip"))
> + gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG3);
>   break;
>   case 5:
>   gswip_mii_mask(priv, clear, set, GSWIP_MII_CFG5);
> @@ -1409,12 +1419,40 @@ static int gswip_port_fdb_dump(struct dsa_switch *ds, 
> int port,
>   return 0;
>  }
>  
> -static void gswip_phylink_validate(struct dsa_switch *ds, int port,
> -unsigned long *supported,
> -struct phylink_link_state *state)
> +static void gswip_phylink_set_capab(unsigned long *supported, struct 
> phylink_link_state *state)
>  {
>   __ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
>  
> + /* Allow all the expected bits */
> + phylink_set(mask, Autoneg);
> + phylink_set_port_modes(mask);
> + phylink_set(mask, Pause);
> + phylink_set(mask, Asym_Pause);
> +
> + /* With the exclusion of MII and Reverse MII, we support Gigabit,
> +  * including Half duplex
> +  */
> + if (state->interface != PHY_INTERFACE_MODE_MII &&
> + state->interface != PHY_INTERFACE_MODE_REVMII) {
> + phylink_set(mask, 1000baseT_Full);
> + phylink_set(mask, 1000baseT_Half);
> + }
> +
> + phylink_set(mask, 10baseT_Half);
> + phylink_set(mask, 10baseT_Full);
> + phylink_set(mask, 100baseT_Half);
> + phylink_set(mask, 100baseT_Full);
> +
> + bitmap_and(supported, supported, mask,
> +__ETHTOOL_LINK_MODE_MASK_NBITS);
> + bitmap_and(state->advertising, state->advertising, mask,
> +__ETHTOOL_LINK_MODE_MASK_NBITS);
> +}
> +
> +static void gswip_xrx200_phylink_validate(struct dsa_switch *ds, int port,
> +   unsigned long *supported,
> +   struct phylink_link_state *state)
> +{
>   switch (port) {
>   case 0:
>   case 1:
> @@ -1441,37 +1479,56 @@ static void gswip_phylink

Re: vlan_filtering=1 breaks all traffic

2020-12-06 Thread Vladimir Oltean
On Sat, Dec 05, 2020 at 09:13:37PM +0100, Rasmus Villemoes wrote:
> Yup, that corresponds pretty much to what I do. Just for good measure, I
> tried doing exactly the above (with only a change in IP address), and...
> it worked. So, first thought was "perhaps it's because you bring up br0
> before adding the ports". But no, bringing it up after still works.
> Second thought: "portS - hm, only one port is added here", and indeed,
> once I add two or more ports to the bridge, it stops working. Removing
> all but the single port that has a cable plugged in makes it work again.
> It doesn't seem to matter whether the other ports are up or down.
>
> I should probably mention that wireshark says that ARP (ipv4) and
> neighbor solicitation (ipv6ll) packets do reach my laptop when I attempt
> the ping. If I start by doing a succesful ping (i.e., no other ports
> added), then add another port, then do a ping, the ping packets do reach
> the laptop (and of course get answered). So the problem does not appear
> to be on egress.

It would be interesting to see what is the ingress drop reason, if that
could be deduced from the drop counters that are incrementing in ethtool -S.

I am not confident that it can be a VTU issue, given the fact that you
have not said that you see VTU violation warnings, which are fairly loud
on mv88e6xxx.

The procedure of joining a new port to the bridge does alter the VTU,
since that second port needs to be a part of the default_pvid of the
bridge as soon as it goes up (VID 1). However that is not the main thing
that bridging a new port does - instead, the in-chip Port VLAN map is
altered.

In theory it _would_ be possible (even if unlikely) for the VTU to get
overwritten by the second port join, in a way that removes the first
port from the bridge's VID 1. Remember that this issue only seems to be
observable on 8250, so it seems logical to search in 8250 specific code
first (therefore making the VTU a more likely suspect than the in-chip
Port VLAN map).

Since you've already made the effort to boot kernel 5.9, you could make
the extra leap to try out the 5.10 rc's and look at the VTU using
Andrew's devlink based tool:
https://github.com/lunn/mv88e6xxx_dump

# devlink dev
mdio_bus/d0032004.mdio-mii:11
mdio_bus/d0032004.mdio-mii:12
mdio_bus/d0032004.mdio-mii:10
# ./mv88e6xxx_dump --device mdio_bus/d0032004.mdio-mii:10 --vtu
VTU:
V - a member, egress not modified
U - a member, egress untagged
T - a member, egress tagged
X - not a member, Ingress frames with VID discarded
P  VID 0123456789a  FID  SID QPrio FPrio VidPolicy
# ip link add br0 type bridge vlan_filtering 1
# ip link set lan4 master br0
[   74.443547] br0: port 1(lan4) entered blocking state
[   74.446037] br0: port 1(lan4) entered disabled state
[   74.461416] device lan4 entered promiscuous mode
[   74.463564] device eth1 entered promiscuous mode

# ./mv88e6xxx_dump --device mdio_bus/d0032004.mdio-mii:10 --vtu
VTU:
V - a member, egress not modified
U - a member, egress untagged
T - a member, egress tagged
X - not a member, Ingress frames with VID discarded
P  VID 0123456789a  FID  SID QPrio FPrio VidPolicy
01 UVV10 - - 0

# ip link set lan5 master br0
[   84.533120] br0: port 2(lan5) entered blocking state
[   84.535563] br0: port 2(lan5) entered disabled state
[   84.552022] device lan5 entered promiscuous mode
#
# ./mv88e6xxx_dump --device mdio_bus/d0032004.mdio-mii:10 --vtu
VTU:
V - a member, egress not modified
U - a member, egress untagged
T - a member, egress tagged
X - not a member, Ingress frames with VID discarded
P  VID 0123456789a  FID  SID QPrio FPrio VidPolicy
01 UUXXXVV10 - - 0

You would expect to see two U's for your ports, and not something else
like X.

[PATCH] drivers: broadcom: save return value of pci_find_capability() in u8

2020-12-06 Thread Puranjay Mohan
Callers of pci_find_capability() should save the return value in u8.
change the type of pcix_cap from int to u8, to match the specification.

Signed-off-by: Puranjay Mohan 
---
 drivers/net/ethernet/broadcom/tg3.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.h 
b/drivers/net/ethernet/broadcom/tg3.h
index 1000c894064f..f1781d2dce0b 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -3268,7 +3268,7 @@ struct tg3 {
 
int pci_fn;
int msi_cap;
-   int pcix_cap;
+   u8  pcix_cap;
int pcie_readrq;
 
struct mii_bus  *mdio_bus;
-- 
2.27.0



Re: [PATCH V4 net-next 6/9] net: ena: use xdp_frame in XDP TX flow

2020-12-06 Thread Maciej Fijalkowski
On Fri, Dec 04, 2020 at 02:11:12PM +0200, akiy...@amazon.com wrote:
> From: Arthur Kiyanovski 
> 
> Rename the ena_xdp_xmit_buff() function to ena_xdp_xmit_frame() and pass
> it an xdp_frame struct instead of xdp_buff.
> This change lays the ground for XDP redirect implementation which uses
> xdp_frames when 'xmit'ing packets.
> 
> Signed-off-by: Shay Agroskin 
> Signed-off-by: Arthur Kiyanovski 
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 46 ++--
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 222bb576e30e..cbb07548409a 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -233,18 +233,18 @@ static int ena_xdp_io_poll(struct napi_struct *napi, 
> int budget)
>   return ret;
>  }
>  
> -static int ena_xdp_tx_map_buff(struct ena_ring *xdp_ring,
> -struct ena_tx_buffer *tx_info,
> -struct xdp_buff *xdp,
> -void **push_hdr,
> -u32 *push_len)
> +static int ena_xdp_tx_map_frame(struct ena_ring *xdp_ring,
> + struct ena_tx_buffer *tx_info,
> + struct xdp_frame *xdpf,
> + void **push_hdr,
> + u32 *push_len)
>  {
>   struct ena_adapter *adapter = xdp_ring->adapter;
>   struct ena_com_buf *ena_buf;
>   dma_addr_t dma = 0;
>   u32 size;
>  
> - tx_info->xdpf = xdp_convert_buff_to_frame(xdp);
> + tx_info->xdpf = xdpf;
>   size = tx_info->xdpf->len;
>   ena_buf = tx_info->bufs;
>  
> @@ -281,29 +281,31 @@ static int ena_xdp_tx_map_buff(struct ena_ring 
> *xdp_ring,
>   return -EINVAL;
>  }
>  
> -static int ena_xdp_xmit_buff(struct net_device *dev,
> -  struct xdp_buff *xdp,
> -  int qid,
> -  struct ena_rx_buffer *rx_info)
> +static int ena_xdp_xmit_frame(struct net_device *dev,
> +   struct xdp_frame *xdpf,
> +   int qid)
>  {
>   struct ena_adapter *adapter = netdev_priv(dev);
>   struct ena_com_tx_ctx ena_tx_ctx = {};
>   struct ena_tx_buffer *tx_info;
>   struct ena_ring *xdp_ring;
> + struct page *rx_buff_page;
>   u16 next_to_use, req_id;
>   int rc;
>   void *push_hdr;
>   u32 push_len;
>  
> + rx_buff_page = virt_to_page(xdpf->data);
> +
>   xdp_ring = &adapter->tx_ring[qid];
>   next_to_use = xdp_ring->next_to_use;
>   req_id = xdp_ring->free_ids[next_to_use];
>   tx_info = &xdp_ring->tx_buffer_info[req_id];
>   tx_info->num_of_bufs = 0;
> - page_ref_inc(rx_info->page);
> - tx_info->xdp_rx_page = rx_info->page;
> + page_ref_inc(rx_buff_page);
> + tx_info->xdp_rx_page = rx_buff_page;
>  
> - rc = ena_xdp_tx_map_buff(xdp_ring, tx_info, xdp, &push_hdr, &push_len);
> + rc = ena_xdp_tx_map_frame(xdp_ring, tx_info, xdpf, &push_hdr, 
> &push_len);
>   if (unlikely(rc))
>   goto error_drop_packet;
>  
> @@ -318,7 +320,7 @@ static int ena_xdp_xmit_buff(struct net_device *dev,
>tx_info,
>&ena_tx_ctx,
>next_to_use,
> -  xdp->data_end - xdp->data);
> +  xdpf->len);
>   if (rc)
>   goto error_unmap_dma;
>   /* trigger the dma engine. ena_com_write_sq_doorbell()
> @@ -337,12 +339,11 @@ static int ena_xdp_xmit_buff(struct net_device *dev,
>   return NETDEV_TX_OK;
>  }
>  
> -static int ena_xdp_execute(struct ena_ring *rx_ring,
> -struct xdp_buff *xdp,
> -struct ena_rx_buffer *rx_info)
> +static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>  {
>   struct bpf_prog *xdp_prog;
>   u32 verdict = XDP_PASS;
> + struct xdp_frame *xdpf;
>   u64 *xdp_stat;
>  
>   rcu_read_lock();
> @@ -354,10 +355,9 @@ static int ena_xdp_execute(struct ena_ring *rx_ring,
>   verdict = bpf_prog_run_xdp(xdp_prog, xdp);
>  
>   if (verdict == XDP_TX) {
> - ena_xdp_xmit_buff(rx_ring->netdev,
> -   xdp,
> -   rx_ring->qid + 
> rx_ring->adapter->num_io_queues,
> -   rx_info);
> + xdpf = xdp_convert_buff_to_frame(xdp);

Similar to Jakub's comment on another patch, xdp_convert_buff_to_frame can
return NULL and from what I can tell you never check that in
ena_xdp_xmit_frame.

> + ena_xdp_xmit_frame(rx_ring->netdev, xdpf,
> +rx_ring->qid + 
> rx_ring->adapter->num_io_queues);
>  
>   xdp_stat = &rx_ring->rx_stats.xdp_tx;
>   } else if (unlikely(verdict =

Re: [PATCH V4 net-next 9/9] net: ena: introduce ndo_xdp_xmit() function for XDP_REDIRECT

2020-12-06 Thread Maciej Fijalkowski
On Fri, Dec 04, 2020 at 02:11:15PM +0200, akiy...@amazon.com wrote:
> From: Arthur Kiyanovski 
> 
> This patch implements the ndo_xdp_xmit() net_device function which is
> called when a packet is redirected to this driver using an
> XDP_REDIRECT directive.
> 
> The function receives an array of xdp frames that it needs to xmit.
> The TX queues that are used to xmit these frames are the XDP
> queues used by the XDP_TX flow. Therefore a lock is added to synchronize
> both flows (XDP_TX and XDP_REDIRECT).
> 
> Signed-off-by: Shay Agroskin 
> Signed-off-by: Arthur Kiyanovski 
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 83 +---
>  drivers/net/ethernet/amazon/ena/ena_netdev.h |  1 +
>  2 files changed, 72 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 0d077a626604..6c5d8b8c4d13 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -281,20 +281,18 @@ static int ena_xdp_tx_map_frame(struct ena_ring 
> *xdp_ring,
>   return -EINVAL;
>  }
>  
> -static int ena_xdp_xmit_frame(struct net_device *dev,
> +static int ena_xdp_xmit_frame(struct ena_ring *xdp_ring,
> +   struct net_device *dev,
> struct xdp_frame *xdpf,
> -   int qid)
> +   int flags)
>  {
> - struct ena_adapter *adapter = netdev_priv(dev);
>   struct ena_com_tx_ctx ena_tx_ctx = {};
>   struct ena_tx_buffer *tx_info;
> - struct ena_ring *xdp_ring;
>   u16 next_to_use, req_id;
> - int rc;
>   void *push_hdr;
>   u32 push_len;
> + int rc;
>  
> - xdp_ring = &adapter->tx_ring[qid];
>   next_to_use = xdp_ring->next_to_use;
>   req_id = xdp_ring->free_ids[next_to_use];
>   tx_info = &xdp_ring->tx_buffer_info[req_id];
> @@ -321,25 +319,76 @@ static int ena_xdp_xmit_frame(struct net_device *dev,
>   /* trigger the dma engine. ena_com_write_sq_doorbell()
>* has a mb
>*/
> - ena_com_write_sq_doorbell(xdp_ring->ena_com_io_sq);
> - ena_increase_stat(&xdp_ring->tx_stats.doorbells, 1, &xdp_ring->syncp);
> + if (flags & XDP_XMIT_FLUSH) {
> + ena_com_write_sq_doorbell(xdp_ring->ena_com_io_sq);
> + ena_increase_stat(&xdp_ring->tx_stats.doorbells, 1,
> +   &xdp_ring->syncp);
> + }
>  
> - return NETDEV_TX_OK;
> + return rc;
>  
>  error_unmap_dma:
>   ena_unmap_tx_buff(xdp_ring, tx_info);
>   tx_info->xdpf = NULL;
>  error_drop_packet:
>   xdp_return_frame(xdpf);
> - return NETDEV_TX_OK;
> + return rc;
> +}
> +
> +static int ena_xdp_xmit(struct net_device *dev, int n,
> + struct xdp_frame **frames, u32 flags)
> +{
> + struct ena_adapter *adapter = netdev_priv(dev);
> + int qid, i, err, drops = 0;
> + struct ena_ring *xdp_ring;
> +
> + if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
> + return -EINVAL;
> +
> + if (!test_bit(ENA_FLAG_DEV_UP, &adapter->flags))
> + return -ENETDOWN;
> +
> + /* We assume that all rings have the same XDP program */
> + if (!READ_ONCE(adapter->rx_ring->xdp_bpf_prog))
> + return -ENXIO;
> +
> + qid = smp_processor_id() % adapter->xdp_num_queues;
> + qid += adapter->xdp_first_ring;
> + xdp_ring = &adapter->tx_ring[qid];
> +
> + /* Other CPU ids might try to send thorugh this queue */
> + spin_lock(&xdp_ring->xdp_tx_lock);

I have a feeling that we are not consistent with this locking approach as
some drivers do that and some don't.

> +
> + for (i = 0; i < n; i++) {
> + err = ena_xdp_xmit_frame(xdp_ring, dev, frames[i], 0);
> + /* The descriptor is freed by ena_xdp_xmit_frame in case
> +  * of an error.
> +  */
> + if (err)
> + drops++;
> + }
> +
> + /* Ring doorbell to make device aware of the packets */
> + if (flags & XDP_XMIT_FLUSH) {
> + ena_com_write_sq_doorbell(xdp_ring->ena_com_io_sq);
> + ena_increase_stat(&xdp_ring->tx_stats.doorbells, 1,
> +   &xdp_ring->syncp);

Have you thought of ringing the doorbell once per a batch of xmitted
frames?

> + }
> +
> + spin_unlock(&xdp_ring->xdp_tx_lock);
> +
> + /* Return number of packets sent */
> + return n - drops;
>  }
>  
>  static int ena_xdp_execute(struct ena_ring *rx_ring, struct xdp_buff *xdp)
>  {
>   struct bpf_prog *xdp_prog;
> + struct ena_ring *xdp_ring;
>   u32 verdict = XDP_PASS;
>   struct xdp_frame *xdpf;
>   u64 *xdp_stat;
> + int qid;
>  
>   rcu_read_lock();
>   xdp_prog = READ_ONCE(rx_ring->xdp_bpf_prog);
> @@ -352,8 +401,16 @@ static int ena_xdp_execute(struct ena_ring *rx_ring, 
> struct xdp_buff *xdp)
>   switch (verdict) 

Re:Re: linux-next: ERROR: build error for arm64

2020-12-06 Thread 苏辉
















At 2020-12-06 16:25:19, "Björn Töpel"  wrote:

Ok, thanks.

RE: [PATCH v2] Drivers: hv: vmbus: Copy packets sent by Hyper-V out of the ring buffer

2020-12-06 Thread Michael Kelley
From: Andrea Parri (Microsoft)  Sent: Monday, November 
9, 2020 2:07 AM
> 
> From: Andres Beltran 
> 
> Pointers to ring-buffer packets sent by Hyper-V are used within the
> guest VM. Hyper-V can send packets with erroneous values or modify
> packet fields after they are processed by the guest. To defend
> against these scenarios, return a copy of the incoming VMBus packet
> after validating its length and offset fields in hv_pkt_iter_first().
> In this way, the packet can no longer be modified by the host.
> 

[snip]

> @@ -419,17 +446,52 @@ static u32 hv_pkt_iter_avail(const struct 
> hv_ring_buffer_info *rbi)
>  struct vmpacket_descriptor *hv_pkt_iter_first(struct vmbus_channel *channel)
>  {
>   struct hv_ring_buffer_info *rbi = &channel->inbound;
> - struct vmpacket_descriptor *desc;
> + struct vmpacket_descriptor *desc, *desc_copy;
> + u32 bytes_avail, pkt_len, pkt_offset;
> 
> - hv_debug_delay_test(channel, MESSAGE_DELAY);
> - if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor))
> + desc = hv_pkt_iter_first_raw(channel);
> + if (!desc)
>   return NULL;
> 
> - desc = hv_get_ring_buffer(rbi) + rbi->priv_read_index;
> - if (desc)
> - prefetch((char *)desc + (desc->len8 << 3));
> + bytes_avail = hv_pkt_iter_avail(rbi);
> +
> + /*
> +  * Ensure the compiler does not use references to incoming Hyper-V 
> values (which
> +  * could change at any moment) when reading local variables later in 
> the code
> +  */
> + pkt_len = READ_ONCE(desc->len8) << 3;
> + pkt_offset = READ_ONCE(desc->offset8) << 3;
> +
> + /*
> +  * If pkt_len is invalid, set it to the smaller of hv_pkt_iter_avail() 
> and
> +  * rbi->pkt_buffer_size
> +  */
> + if (rbi->pkt_buffer_size < bytes_avail)
> + bytes_avail = rbi->pkt_buffer_size;

I think the above could be combined with the earlier call to 
hv_pkt_iter_avail(),
and more logically expressed as:

bytes_avail = min(rbi->pkt_buffer_size, hv_pkt_iter_avail(rbi));


This is a minor nit.  Everything else in this patch looks good to me.

Michael

> +
> + if (pkt_len < sizeof(struct vmpacket_descriptor) || pkt_len > 
> bytes_avail)
> + pkt_len = bytes_avail;
> +
> + /*
> +  * If pkt_offset is invalid, arbitrarily set it to
> +  * the size of vmpacket_descriptor
> +  */
> + if (pkt_offset < sizeof(struct vmpacket_descriptor) || pkt_offset > 
> pkt_len)
> + pkt_offset = sizeof(struct vmpacket_descriptor);
> +
> + /* Copy the Hyper-V packet out of the ring buffer */
> + desc_copy = (struct vmpacket_descriptor *)rbi->pkt_buffer;
> + memcpy(desc_copy, desc, pkt_len);
> +
> + /*
> +  * Hyper-V could still change len8 and offset8 after the earlier read.
> +  * Ensure that desc_copy has legal values for len8 and offset8 that
> +  * are consistent with the copy we just made
> +  */
> + desc_copy->len8 = pkt_len >> 3;
> + desc_copy->offset8 = pkt_offset >> 3;
> 
> - return desc;
> + return desc_copy;
>  }
>  EXPORT_SYMBOL_GPL(hv_pkt_iter_first);
> 


Re: LRO: creating vlan subports affects parent port's LRO settings

2020-12-06 Thread Jarod Wilson
On Sun, Dec 6, 2020 at 11:49 AM Michal Kubecek  wrote:
>
> On Sat, Dec 05, 2020 at 07:04:06PM -0500, Jarod Wilson wrote:
> > On Mon, Nov 23, 2020 at 7:27 PM Jakub Kicinski  wrote:
> > >
> > > On Thu, 19 Nov 2020 20:37:27 -0500 Limin Wang wrote:
> > > > Under relatively recent kernels (v4.4+), creating a vlan subport on a
> > > > LRO supported parent NIC may turn LRO off on the parent port and
> > > > further render its LRO feature practically unchangeable.
> > >
> > > That does sound like an oversight in commit fd867d51f889 ("net/core:
> > > generic support for disabling netdev features down stack").
> > >
> > > Are you able to create a patch to fix this?
> >
> > Something like this, perhaps? Completely untested copy-pasta'd
> > theoretical patch:
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 8588ade790cb..a5ce372e02ba 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9605,8 +9605,10 @@ int __netdev_update_features(struct net_device *dev)
> > features = netdev_fix_features(dev, features);
> >
> > /* some features can't be enabled if they're off on an upper device 
> > */
> > -   netdev_for_each_upper_dev_rcu(dev, upper, iter)
> > -   features = netdev_sync_upper_features(dev, upper, features);
> > +   netdev_for_each_upper_dev_rcu(dev, upper, iter) {
> > +   if (netif_is_lag_master(upper) || 
> > netif_is_bridge_master(upper))
> > +   features = netdev_sync_upper_features(dev,
> > upper, features);
> > +   }
> >
> > if (dev->features == features)
> > goto sync_lower;
> > @@ -9633,8 +9635,10 @@ int __netdev_update_features(struct net_device *dev)
> > /* some features must be disabled on lower devices when disabled
> >  * on an upper device (think: bonding master or bridge)
> >  */
> > -   netdev_for_each_lower_dev(dev, lower, iter)
> > -   netdev_sync_lower_features(dev, lower, features);
> > +   if (netif_is_lag_master(dev) || netif_is_bridge_master(dev)) {
> > +   netdev_for_each_lower_dev(dev, lower, iter)
> > +   netdev_sync_lower_features(dev, lower, features);
> > +   }
> >
> > if (!err) {
> > netdev_features_t diff = features ^ dev->features;
> >
> > I'm not sure what all other upper devices this excludes besides just
> > vlan ports though, so perhaps safer add upper device types to not do
> > feature sync on than to choose which ones to do them on?
>
> I'm not sure excluding devices from feature sync is the right way,
> whether it's an explicit list types or default. The logic still makes
> sense to me. Couldn't we address the issue by either setting features in
> NETIF_F_UPPER_DISABLES) by default for a new vlan (and probably macvlan)
> device? Or perhaps inheriting their values from the lower device.

Yeah, I think you're right, excluding devices entirely from sync is a
bad idea, it should be only certain features that don't get sync'd for
devices that say they don't want them (i.e., vlan devs and macvlan
devs). I'll do a bit more reading of the code and ponder. I'm not
familiar with the intricacies of NETIF_F_UPPER_DISABLES just yet.

-- 
Jarod Wilson
ja...@redhat.com



[RFC PATCH net-next 00/13] Make .ndo_get_stats64 sleepable

2020-12-06 Thread Vladimir Oltean
This series converts all callers of dev_get_stats() to be in sleepable
context, so that we can do more work in the .ndo_get_stats64 method.

The situation today is that if we have hardware that needs to be
accessed through a slow bus like SPI, or through a firmware, we cannot
do that directly in .ndo_get_stats64, so we have to poll counters
periodically and return a cached (not up to date) copy in the atomic NDO
callback. This is undesirable on both ends: more work than strictly
needed is being done, and the end result is also worse (not guaranteed
to be up to date). So converting the code paths to be compatible with
sleeping seems to make more sense.

Cc: Leon Romanovsky 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: linux-s...@vger.kernel.org
Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Cc: Sridhar Samudrala 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Cc: Christian Brauner 

Vladimir Oltean (13):
  RDMA/mlx4: remove bogus dev_base_lock usage
  net: mark dev_base_lock for deprecation
  net: introduce a mutex for the netns interface lists
  s390/appldata_net_sum: hold the netdev lists lock when retrieving
device statistics
  net: bonding: hold the netdev lists lock when retrieving device
statistics
  net_failover: hold the netdev lists lock when retrieving device
statistics
  parisc/led: remove trailing whitespaces
  parisc/led: reindent the code that gathers device statistics
  parisc/led: hold the netdev lists lock when retrieving device
statistics
  net: procfs: hold the netdev lists lock when retrieving device
statistics
  net: sysfs: don't hold dev_base_lock while retrieving device
statistics
  net: mark ndo_get_stats64 as being able to sleep
  net: remove obsolete comments about ndo_get_stats64 context from eth
drivers

 Documentation/networking/netdevices.rst |   4 +-
 Documentation/networking/statistics.rst |   9 +-
 arch/s390/appldata/appldata_net_sum.c   |   8 +-
 drivers/infiniband/hw/mlx4/main.c   |   3 -
 drivers/net/bonding/bond_main.c |  16 +-
 drivers/net/ethernet/cisco/enic/enic_main.c |   1 -
 drivers/net/ethernet/nvidia/forcedeth.c |   2 -
 drivers/net/ethernet/sfc/efx_common.c   |   1 -
 drivers/net/ethernet/sfc/falcon/efx.c   |   1 -
 drivers/net/net_failover.c  |  15 +-
 drivers/parisc/led.c| 164 ++--
 include/net/bonding.h   |   1 -
 include/net/net_failover.h  |   3 -
 include/net/net_namespace.h |   5 +
 net/core/dev.c  |  63 +---
 net/core/net-procfs.c   |  13 +-
 net/core/net-sysfs.c|   3 +-
 17 files changed, 162 insertions(+), 150 deletions(-)

-- 
2.25.1



[RFC PATCH net-next 13/13] net: remove obsolete comments about ndo_get_stats64 context from eth drivers

2020-12-06 Thread Vladimir Oltean
Now that we have a good summary in Documentation/networking/netdevices.rst,
these comments serve no purpose and are actually distracting/confusing.

Signed-off-by: Vladimir Oltean 
---
 drivers/net/ethernet/cisco/enic/enic_main.c | 1 -
 drivers/net/ethernet/nvidia/forcedeth.c | 2 --
 drivers/net/ethernet/sfc/efx_common.c   | 1 -
 drivers/net/ethernet/sfc/falcon/efx.c   | 1 -
 4 files changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/cisco/enic/enic_main.c 
b/drivers/net/ethernet/cisco/enic/enic_main.c
index fb269d587b74..7425f94f9091 100644
--- a/drivers/net/ethernet/cisco/enic/enic_main.c
+++ b/drivers/net/ethernet/cisco/enic/enic_main.c
@@ -870,7 +870,6 @@ static netdev_tx_t enic_hard_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-/* dev_base_lock rwlock held, nominally process context */
 static void enic_get_stats(struct net_device *netdev,
   struct rtnl_link_stats64 *net_stats)
 {
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 8724d6a9ed02..8fa254dc64e9 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1761,8 +1761,6 @@ static void nv_get_stats(int cpu, struct fe_priv *np,
 /*
  * nv_get_stats64: dev->ndo_get_stats64 function
  * Get latest stats value from the nic.
- * Called with read_lock(&dev_base_lock) held for read -
- * only synchronized against unregister_netdevice.
  */
 static void
 nv_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *storage)
diff --git a/drivers/net/ethernet/sfc/efx_common.c 
b/drivers/net/ethernet/sfc/efx_common.c
index de797e1ac5a9..4d8047b35fb2 100644
--- a/drivers/net/ethernet/sfc/efx_common.c
+++ b/drivers/net/ethernet/sfc/efx_common.c
@@ -596,7 +596,6 @@ void efx_stop_all(struct efx_nic *efx)
efx_stop_datapath(efx);
 }
 
-/* Context: process, dev_base_lock or RTNL held, non-blocking. */
 void efx_net_stats(struct net_device *net_dev, struct rtnl_link_stats64 *stats)
 {
struct efx_nic *efx = netdev_priv(net_dev);
diff --git a/drivers/net/ethernet/sfc/falcon/efx.c 
b/drivers/net/ethernet/sfc/falcon/efx.c
index f8979991970e..6db2b6583dec 100644
--- a/drivers/net/ethernet/sfc/falcon/efx.c
+++ b/drivers/net/ethernet/sfc/falcon/efx.c
@@ -2096,7 +2096,6 @@ int ef4_net_stop(struct net_device *net_dev)
return 0;
 }
 
-/* Context: process, dev_base_lock or RTNL held, non-blocking. */
 static void ef4_net_stats(struct net_device *net_dev,
  struct rtnl_link_stats64 *stats)
 {
-- 
2.25.1



[RFC PATCH net-next 05/13] net: bonding: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The bonding driver uses an RCU read-side critical section to ensure the
integrity of the list of network interfaces, because the driver iterates
through all net devices in the netns to find the ones which are its
configured slaves. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the netns mutex,
is fine for that, because it offers sleepable context.

This mutex now serves double duty. It offers code serialization,
something which the stats_lock already did. So now that serves no
purpose, let's remove it.

Cc: Jay Vosburgh 
Cc: Veaceslav Falico 
Cc: Andy Gospodarek 
Signed-off-by: Vladimir Oltean 
---
 drivers/net/bonding/bond_main.c | 16 +---
 include/net/bonding.h   |  1 -
 2 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index e0880a3840d7..f788f9fa1858 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3738,21 +3738,16 @@ static void bond_get_stats(struct net_device *bond_dev,
   struct rtnl_link_stats64 *stats)
 {
struct bonding *bond = netdev_priv(bond_dev);
+   struct net *net = dev_net(bond_dev);
struct rtnl_link_stats64 temp;
struct list_head *iter;
struct slave *slave;
-   int nest_level = 0;
 
+   mutex_lock(&net->netdev_lists_lock);
 
-   rcu_read_lock();
-#ifdef CONFIG_LOCKDEP
-   nest_level = bond_get_lowest_level_rcu(bond_dev);
-#endif
-
-   spin_lock_nested(&bond->stats_lock, nest_level);
memcpy(stats, &bond->bond_stats, sizeof(*stats));
 
-   bond_for_each_slave_rcu(bond, slave, iter) {
+   bond_for_each_slave(bond, slave, iter) {
const struct rtnl_link_stats64 *new =
dev_get_stats(slave->dev, &temp);
 
@@ -3763,8 +3758,8 @@ static void bond_get_stats(struct net_device *bond_dev,
}
 
memcpy(&bond->bond_stats, stats, sizeof(*stats));
-   spin_unlock(&bond->stats_lock);
-   rcu_read_unlock();
+
+   mutex_unlock(&net->netdev_lists_lock);
 }
 
 static int bond_do_ioctl(struct net_device *bond_dev, struct ifreq *ifr, int 
cmd)
@@ -5192,7 +5187,6 @@ static int bond_init(struct net_device *bond_dev)
if (!bond->wq)
return -ENOMEM;
 
-   spin_lock_init(&bond->stats_lock);
netdev_lockdep_set_classes(bond_dev);
 
list_add_tail(&bond->bond_list, &bn->dev_list);
diff --git a/include/net/bonding.h b/include/net/bonding.h
index d9d0ff3b0ad3..6fbde9713424 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -224,7 +224,6 @@ struct bonding {
 * ALB mode (6) - to sync the use and modifications of its hash table
 */
spinlock_t mode_lock;
-   spinlock_t stats_lock;
u8   send_peer_notif;
u8   igmp_retrans;
 #ifdef CONFIG_PROC_FS
-- 
2.25.1



[RFC PATCH net-next 04/13] s390/appldata_net_sum: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

In the case of the appldata driver, an RCU read-side critical section is
used to ensure the integrity of the list of network interfaces, because
the driver iterates through all net devices in the netns to aggregate
statistics. We still need some protection against an interface
registering or deregistering, and the writer-side lock, the netns's
mutex, is fine for that, because it offers sleepable context.

The ops->callback function is called from under appldata_ops_mutex
protection, so this is proof that the context is sleepable and holding
a mutex is therefore fine.

Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: linux-s...@vger.kernel.org
Signed-off-by: Vladimir Oltean 
---
 arch/s390/appldata/appldata_net_sum.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/s390/appldata/appldata_net_sum.c 
b/arch/s390/appldata/appldata_net_sum.c
index 59c282ca002f..27e9cc998d4d 100644
--- a/arch/s390/appldata/appldata_net_sum.c
+++ b/arch/s390/appldata/appldata_net_sum.c
@@ -78,8 +78,9 @@ static void appldata_get_net_sum_data(void *data)
tx_dropped = 0;
collisions = 0;
 
-   rcu_read_lock();
-   for_each_netdev_rcu(&init_net, dev) {
+   mutex_lock(&init_net.netdev_lists_lock);
+
+   for_each_netdev(&init_net, dev) {
const struct rtnl_link_stats64 *stats;
struct rtnl_link_stats64 temp;
 
@@ -95,7 +96,8 @@ static void appldata_get_net_sum_data(void *data)
collisions += stats->collisions;
i++;
}
-   rcu_read_unlock();
+
+   mutex_unlock(&init_net.netdev_lists_unlock);
 
net_data->nr_interfaces = i;
net_data->rx_packets = rx_packets;
-- 
2.25.1



[RFC PATCH net-next 07/13] parisc/led: remove trailing whitespaces

2020-12-06 Thread Vladimir Oltean
This file looks bad in text editors.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Vladimir Oltean 
---
 drivers/parisc/led.c | 128 +--
 1 file changed, 64 insertions(+), 64 deletions(-)

diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 36c6613f7a36..676a12bb94c9 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -47,10 +47,10 @@
 #include 
 #include 
 
-/* The control of the LEDs and LCDs on PARISC-machines have to be done 
+/* The control of the LEDs and LCDs on PARISC-machines have to be done
completely in software. The necessary calculations are done in a work queue
-   task which is scheduled regularly, and since the calculations may consume a 
-   relatively large amount of CPU time, some of the calculations can be 
+   task which is scheduled regularly, and since the calculations may consume a
+   relatively large amount of CPU time, some of the calculations can be
turned off with the following variables (controlled via procfs) */
 
 static int led_type __read_mostly = -1;
@@ -80,7 +80,7 @@ struct lcd_block {
 };
 
 /* Structure returned by PDC_RETURN_CHASSIS_INFO */
-/* NOTE: we use unsigned long:16 two times, since the following member 
+/* NOTE: we use unsigned long:16 two times, since the following member
lcd_cmd_reg_addr needs to be 64bit aligned on 64bit PA2.0-machines */
 struct pdc_chassis_lcd_info_ret_block {
unsigned long model:16; /* DISPLAY_MODEL_ */
@@ -103,7 +103,7 @@ struct pdc_chassis_lcd_info_ret_block {
 #define KITTYHAWK_LCD_CMD  F_EXTEND(0xf019UL) /* 64bit-ready */
 #define KITTYHAWK_LCD_DATA (KITTYHAWK_LCD_CMD+1)
 
-/* lcd_info is pre-initialized to the values needed to program KittyHawk LCD's 
+/* lcd_info is pre-initialized to the values needed to program KittyHawk LCD's
  * HP seems to have used Sharp/Hitachi HD44780 LCDs most of the time. */
 static struct pdc_chassis_lcd_info_ret_block
 lcd_info __attribute__((aligned(8))) __read_mostly =
@@ -119,16 +119,16 @@ lcd_info __attribute__((aligned(8))) __read_mostly =
 
 
 /* direct access to some of the lcd_info variables */
-#define LCD_CMD_REGlcd_info.lcd_cmd_reg_addr
-#define LCD_DATA_REG   lcd_info.lcd_data_reg_addr   
+#define LCD_CMD_REGlcd_info.lcd_cmd_reg_addr
+#define LCD_DATA_REG   lcd_info.lcd_data_reg_addr
 #define LED_DATA_REG   lcd_info.lcd_cmd_reg_addr   /* LASI & ASP only */
 
 #define LED_HASLCD 1
 #define LED_NOLCD  0
 
 /* The workqueue must be created at init-time */
-static int start_task(void) 
-{  
+static int start_task(void)
+{
/* Display the default text now */
if (led_type == LED_HASLCD) lcd_print( lcd_text_default );
 
@@ -136,7 +136,7 @@ static int start_task(void)
if (lcd_no_led_support) return 0;
 
/* Create the work queue and queue the LED task */
-   led_wq = create_singlethread_workqueue("led_wq");   
+   led_wq = create_singlethread_workqueue("led_wq");
queue_delayed_work(led_wq, &led_task, 0);
 
return 0;
@@ -214,14 +214,14 @@ static ssize_t led_proc_write(struct file *file, const 
char __user *buf,
case LED_HASLCD:
if (*cur && cur[strlen(cur)-1] == '\n')
cur[strlen(cur)-1] = 0;
-   if (*cur == 0) 
+   if (*cur == 0)
cur = lcd_text_default;
lcd_print(cur);
break;
default:
return 0;
}
-   
+
return count;
 
 parse_error:
@@ -267,9 +267,9 @@ static int __init led_create_procfs(void)
 #endif
 
 /*
-   ** 
+   **
** led_ASP_driver()
-   ** 
+   **
  */
 #defineLED_DATA0x01/* data to shift (0:on 1:off) */
 #defineLED_STROBE  0x02/* strobe to clock data */
@@ -289,9 +289,9 @@ static void led_ASP_driver(unsigned char leds)
 
 
 /*
-   ** 
+   **
** led_LASI_driver()
-   ** 
+   **
  */
 static void led_LASI_driver(unsigned char leds)
 {
@@ -301,16 +301,16 @@ static void led_LASI_driver(unsigned char leds)
 
 
 /*
-   ** 
+   **
** led_LCD_driver()
-   **   
+   **
  */
 static void led_LCD_driver(unsigned char leds)
 {
static int i;
static unsigned char mask[4] = { LED_HEARTBEAT, LED_DISK_IO,
LED_LAN_RCV, LED_LAN_TX };
-   
+
static struct lcd_block * blockp[4] = {
&lcd_info.heartbeat,
&lcd_info.disk_io,
@@ -320,15 +320,15 @@ static void led_LCD_driver(unsigned char leds)
 
/* Convert min_cmd_delay to milliseconds */
unsigned int msec_cmd_delay = 1 + (lcd_info.min_cmd_delay / 1000);
-   
-   for (i=0; i<4; ++i) 
+
+   for (i=0; i<4; ++i)
{
-   if ((leds & mask[i]) != (lastleds & mask[i])) 
+   if ((leds & mask[i]) != (lastleds & mask[i]))
{
gsc_writeb( blockp[i]->command, LCD_

[RFC PATCH net-next 03/13] net: introduce a mutex for the netns interface lists

2020-12-06 Thread Vladimir Oltean
Currently, any writer that wants to alter the lists of network
interfaces (either the plain list net->dev_base_head, or the hash tables
net->dev_index_head and net->dev_name_head) can keep other writers at
bay using the RTNL mutex.

However, the RTNL mutex has become a very contended resource over the
years, so there is a movement to do finer grained locking.

This patch adds one more way for writers to the network interface lists
to serialize themselves. We assume that all writers to the network
interface lists are easily identifiable because the write side of
dev_base_lock also needs to be held (note that some instances of that
were deliberately skipped, since they only dealt with protecting the
operational state of the netdev).

Holding the RTNL mutex is now optional for new code that alters the
lists, since all relevant writers were made to also hold the new lock.

Signed-off-by: Vladimir Oltean 
---
 include/net/net_namespace.h |  5 +
 net/core/dev.c  | 44 +
 2 files changed, 35 insertions(+), 14 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 29567875f428..90826982843d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -105,6 +105,11 @@ struct net {
struct hlist_head   *dev_index_head;
struct raw_notifier_headnetdev_chain;
 
+   /* Serializes writers to @dev_base_head, @dev_name_head
+* and @dev_index_head
+*/
+   struct mutexnetdev_lists_lock;
+
/* Note that @hash_mix can be read millions times per second,
 * it is critical that it is on a read_mostly cache line.
 */
diff --git a/net/core/dev.c b/net/core/dev.c
index 701a326682e8..18904acb7797 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -175,13 +175,16 @@ static struct napi_struct *napi_by_id(unsigned int 
napi_id);
  *
  * Pure readers should hold rcu_read_lock() which should protect them against
  * concurrent changes to the interface lists made by the writers. Pure writers
- * must serialize by holding the RTNL mutex while they loop through the list
- * and make changes to it.
+ * must serialize by holding the @net->netdev_lists_lock mutex while they loop
+ * through the list and make changes to it.
+ *
+ * It is possible to hold the RTNL mutex for serializing the writers too, but
+ * this should be avoided in new code due to lock contention.
  *
  * It is also possible to hold the global rwlock_t @dev_base_lock for
  * protection (holding its read side as an alternative to rcu_read_lock, and
- * its write side as an alternative to the RTNL mutex), however this should not
- * be done in new code, since it is deprecated and pending removal.
+ * its write side as an alternative to @net->netdev_lists_lock), however this
+ * should not be done in new code, since it is deprecated and pending removal.
  *
  * One other role of @dev_base_lock is to protect against changes in the
  * operational state of a network interface.
@@ -360,12 +363,14 @@ static void list_netdevice(struct net_device *dev)
 
ASSERT_RTNL();
 
+   mutex_lock(&net->netdev_lists_lock);
write_lock_bh(&dev_base_lock);
list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
netdev_name_node_add(net, dev->name_node);
hlist_add_head_rcu(&dev->index_hlist,
   dev_index_hash(net, dev->ifindex));
write_unlock_bh(&dev_base_lock);
+   mutex_unlock(&net->netdev_lists_lock);
 
dev_base_seq_inc(net);
 }
@@ -375,16 +380,20 @@ static void list_netdevice(struct net_device *dev)
  */
 static void unlist_netdevice(struct net_device *dev)
 {
+   struct net *net = dev_net(dev);
+
ASSERT_RTNL();
 
/* Unlink dev from the device chain */
+   mutex_lock(&net->netdev_lists_lock);
write_lock_bh(&dev_base_lock);
list_del_rcu(&dev->dev_list);
netdev_name_node_del(dev->name_node);
hlist_del_rcu(&dev->index_hlist);
write_unlock_bh(&dev_base_lock);
+   mutex_unlock(&net->netdev_lists_lock);
 
-   dev_base_seq_inc(dev_net(dev));
+   dev_base_seq_inc(net);
 }
 
 /*
@@ -850,11 +859,11 @@ EXPORT_SYMBOL_GPL(dev_fill_metadata_dst);
  * @net: the applicable net namespace
  * @name: name to find
  *
- * Find an interface by name. Must be called under RTNL semaphore
- * or @dev_base_lock. If the name is found a pointer to the device
- * is returned. If the name is not found then %NULL is returned. The
- * reference counters are not incremented so the caller must be
- * careful with locks.
+ * Find an interface by name. Must be called under RTNL semaphore,
+ * @net->netdev_lists_lock or @dev_base_lock. If the name is found,
+ * a pointer to the device is returned. If the name is not found then
+ * %NULL is returned. The reference counters are not incremented so the
+ * caller must be careful with locks.

[RFC PATCH net-next 06/13] net_failover: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The net_failover driver makes copious abuse of RCU protection for the
slave interfaces, which is probably unnecessary given the fact that it
already calls dev_hold on slave interfaces. Nonetheless, to avoid
regressions, we still need to offer the same level of protection against
unregistering the standby and primary devices. We can achieve this by
holding the netns mutex, which gives us the sleepable context that
dev_get_stats() wants to see. Holding this mutex also removes the need
for a separate lock for statistics.

Cc: Sridhar Samudrala 
Signed-off-by: Vladimir Oltean 
---
 drivers/net/net_failover.c | 15 +++
 include/net/net_failover.h |  3 ---
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c
index 2a4892402ed8..90db0358bc1d 100644
--- a/drivers/net/net_failover.c
+++ b/drivers/net/net_failover.c
@@ -184,32 +184,31 @@ static void net_failover_get_stats(struct net_device *dev,
 {
struct net_failover_info *nfo_info = netdev_priv(dev);
const struct rtnl_link_stats64 *new;
+   struct net *net = dev_net(dev);
struct rtnl_link_stats64 temp;
struct net_device *slave_dev;
 
-   spin_lock(&nfo_info->stats_lock);
-   memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
+   mutex_lock(&net->netdev_lists_lock);
 
-   rcu_read_lock();
+   memcpy(stats, &nfo_info->failover_stats, sizeof(*stats));
 
-   slave_dev = rcu_dereference(nfo_info->primary_dev);
+   slave_dev = nfo_info->primary_dev;
if (slave_dev) {
new = dev_get_stats(slave_dev, &temp);
net_failover_fold_stats(stats, new, &nfo_info->primary_stats);
memcpy(&nfo_info->primary_stats, new, sizeof(*new));
}
 
-   slave_dev = rcu_dereference(nfo_info->standby_dev);
+   slave_dev = nfo_info->standby_dev;
if (slave_dev) {
new = dev_get_stats(slave_dev, &temp);
net_failover_fold_stats(stats, new, &nfo_info->standby_stats);
memcpy(&nfo_info->standby_stats, new, sizeof(*new));
}
 
-   rcu_read_unlock();
-
memcpy(&nfo_info->failover_stats, stats, sizeof(*stats));
-   spin_unlock(&nfo_info->stats_lock);
+
+   mutex_unlock(&net->netdev_lists_lock);
 }
 
 static int net_failover_change_mtu(struct net_device *dev, int new_mtu)
diff --git a/include/net/net_failover.h b/include/net/net_failover.h
index b12a1c469d1c..1e0089800a28 100644
--- a/include/net/net_failover.h
+++ b/include/net/net_failover.h
@@ -22,9 +22,6 @@ struct net_failover_info {
 
/* aggregated stats */
struct rtnl_link_stats64 failover_stats;
-
-   /* spinlock while updating stats */
-   spinlock_t stats_lock;
 };
 
 struct failover *net_failover_create(struct net_device *standby_dev);
-- 
2.25.1



[RFC PATCH net-next 08/13] parisc/led: reindent the code that gathers device statistics

2020-12-06 Thread Vladimir Oltean
The standard in the Linux kernel is to use one tab character per
indentation level.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Vladimir Oltean 
---
 drivers/parisc/led.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 676a12bb94c9..b7005aaa782b 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -359,16 +359,19 @@ static __inline__ int led_get_net_activity(void)
/* we are running as a workqueue task, so we can use an RCU lookup */
rcu_read_lock();
for_each_netdev_rcu(&init_net, dev) {
-   const struct rtnl_link_stats64 *stats;
-   struct rtnl_link_stats64 temp;
-   struct in_device *in_dev = __in_dev_get_rcu(dev);
-   if (!in_dev || !in_dev->ifa_list)
-   continue;
-   if (ipv4_is_loopback(in_dev->ifa_list->ifa_local))
-   continue;
-   stats = dev_get_stats(dev, &temp);
-   rx_total += stats->rx_packets;
-   tx_total += stats->tx_packets;
+   const struct rtnl_link_stats64 *stats;
+   struct rtnl_link_stats64 temp;
+   struct in_device *in_dev = __in_dev_get_rcu(dev);
+
+   if (!in_dev || !in_dev->ifa_list)
+   continue;
+
+   if (ipv4_is_loopback(in_dev->ifa_list->ifa_local))
+   continue;
+
+   stats = dev_get_stats(dev, &temp);
+   rx_total += stats->rx_packets;
+   tx_total += stats->tx_packets;
}
rcu_read_unlock();
 
-- 
2.25.1



[RFC PATCH net-next 02/13] net: mark dev_base_lock for deprecation

2020-12-06 Thread Vladimir Oltean
There is a movement to eliminate the usage of dev_base_lock, which
exists since as far as I could track the kernel history down (the
"7a2deb329241 Import changeset" commit from the bitkeeper branch).

The dev_base_lock approach has multiple issues:
- It is global and not per netns.
- Its meaning has mutated over the years and the vast majority of
  current users is using it to protect against changes of network device
  state, as opposed to changes of the interface list.
- It is overlapping with other protection mechanisms introduced in the
  meantime, which have higher performance for readers, like RCU
  introduced in 2009 by Eric Dumazet for kernel 2.6.

The old comment that I just deleted made this distinction: writers
protect only against readers by holding dev_base_lock, whereas they need
to protect against other writers by holding the RTNL mutex. This is
slightly confusing/incorrect, since a rwlock_t cannot have more than one
concurrently running writer, so that explanation does not justify why
the RTNL mutex would be necessary.

Instead, Stephen Hemminger makes this clarification here:
https://lore.kernel.org/netdev/20201129211230.4d704931@hermes.local/T/#u

| There are really two different domains being covered by locks here.
|
| The first area is change of state of network devices. This has traditionally
| been covered by RTNL because there are places that depend on coordinating
| state between multiple devices. RTNL is too big and held too long but getting
| rid of it is hard because there are corner cases (like state changes from 
userspace
| for VPN devices).
|
| The other area is code that wants to do read access to look at list of 
devices.
| These pure readers can/should be converted to RCU by now. Writers should hold 
RTNL.

This patch edits the comment for dev_base_lock, minimizing its role in
the protection of network interface lists, and clarifies that it is has
other purposes as well.

Signed-off-by: Vladimir Oltean 
---
 net/core/dev.c | 33 -
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d33099fd7a20..701a326682e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -169,23 +169,22 @@ static int call_netdevice_notifiers_extack(unsigned long 
val,
 static struct napi_struct *napi_by_id(unsigned int napi_id);
 
 /*
- * The @dev_base_head list is protected by @dev_base_lock and the rtnl
- * semaphore.
- *
- * Pure readers hold dev_base_lock for reading, or rcu_read_lock()
- *
- * Writers must hold the rtnl semaphore while they loop through the
- * dev_base_head list, and hold dev_base_lock for writing when they do the
- * actual updates.  This allows pure readers to access the list even
- * while a writer is preparing to update it.
- *
- * To put it another way, dev_base_lock is held for writing only to
- * protect against pure readers; the rtnl semaphore provides the
- * protection against other writers.
- *
- * See, for example usages, register_netdevice() and
- * unregister_netdevice(), which must be called with the rtnl
- * semaphore held.
+ * The network interface list of a netns (@net->dev_base_head) and the hash
+ * tables per ifindex (@net->dev_index_head) and per name (@net->dev_name_head)
+ * are protected using the following rules:
+ *
+ * Pure readers should hold rcu_read_lock() which should protect them against
+ * concurrent changes to the interface lists made by the writers. Pure writers
+ * must serialize by holding the RTNL mutex while they loop through the list
+ * and make changes to it.
+ *
+ * It is also possible to hold the global rwlock_t @dev_base_lock for
+ * protection (holding its read side as an alternative to rcu_read_lock, and
+ * its write side as an alternative to the RTNL mutex), however this should not
+ * be done in new code, since it is deprecated and pending removal.
+ *
+ * One other role of @dev_base_lock is to protect against changes in the
+ * operational state of a network interface.
  */
 DEFINE_RWLOCK(dev_base_lock);
 EXPORT_SYMBOL(dev_base_lock);
-- 
2.25.1



[RFC PATCH net-next 09/13] parisc/led: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The LED driver for HP-PARISC workstations uses a workqueue to
periodically check for updates in network interface statistics, and
flicker when those have changed (i.e. there has been activity on the
line). Ignoring the fact that this driver is completely duplicating
drivers/leds/trigger/ledtrig-netdev.c, there is an even bigger problem.
Now, the dev_get_stats call can sleep, and iterating through the list of
network interfaces still needs to ensure the integrity of list of
network interfaces. So that leaves us only one locking option given the
current design of the network stack, and that is the netns mutex.

Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: linux-par...@vger.kernel.org
Signed-off-by: Vladimir Oltean 
---
 drivers/parisc/led.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index b7005aaa782b..1289dd3ea6e4 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -38,7 +38,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -356,24 +355,28 @@ static __inline__ int led_get_net_activity(void)
 
rx_total = tx_total = 0;
 
-   /* we are running as a workqueue task, so we can use an RCU lookup */
-   rcu_read_lock();
-   for_each_netdev_rcu(&init_net, dev) {
+   /* we are running as a workqueue task, so we can sleep */
+   mutex_lock(&init_net->netdev_lists_lock);
+
+   for_each_netdev(&init_net, dev) {
+   struct in_device *in_dev = in_dev_get(dev);
const struct rtnl_link_stats64 *stats;
struct rtnl_link_stats64 temp;
-   struct in_device *in_dev = __in_dev_get_rcu(dev);
 
-   if (!in_dev || !in_dev->ifa_list)
+   if (!in_dev || !in_dev->ifa_list ||
+   ipv4_is_loopback(in_dev->ifa_list->ifa_local)) {
+   in_dev_put(in_dev);
continue;
+   }
 
-   if (ipv4_is_loopback(in_dev->ifa_list->ifa_local))
-   continue;
+   in_dev_put(in_dev);
 
stats = dev_get_stats(dev, &temp);
rx_total += stats->rx_packets;
tx_total += stats->tx_packets;
}
-   rcu_read_unlock();
+
+   mutex_unlock(&init_net->netdev_lists_lock);
 
retval = 0;
 
-- 
2.25.1



[RFC PATCH net-next 01/13] RDMA/mlx4: remove bogus dev_base_lock usage

2020-12-06 Thread Vladimir Oltean
The dev_base_lock does not protect dev->dev_addr, so it serves no
purpose here.

Cc: Leon Romanovsky 
Signed-off-by: Vladimir Oltean 
---
 drivers/infiniband/hw/mlx4/main.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index f0864f40ea1a..e3cd402c079a 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -2265,10 +2265,7 @@ static void mlx4_ib_update_qps(struct mlx4_ib_dev *ibdev,
u64 release_mac = MLX4_IB_INVALID_MAC;
struct mlx4_ib_qp *qp;
 
-   read_lock(&dev_base_lock);
new_smac = mlx4_mac_to_u64(dev->dev_addr);
-   read_unlock(&dev_base_lock);
-
atomic64_set(&ibdev->iboe.mac[port - 1], new_smac);
 
/* no need for update QP1 and mac registration in non-SRIOV */
-- 
2.25.1



[RFC PATCH net-next 10/13] net: procfs: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

The /proc/net/dev file uses an RCU read-side critical section to ensure
the integrity of the list of network interfaces, because it iterates
through all net devices in the netns to show their statistics.

To offer the equivalent protection against an interface registering or
deregistering, while also remaining in sleepable context, we can use the
netns mutex for the interface lists.

Cc: Cong Wang 
Cc: Eric Dumazet 
Signed-off-by: Vladimir Oltean 
---
 net/core/net-procfs.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index c714e6a9dad4..83f8a89dfc5a 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -21,7 +21,7 @@ static inline struct net_device *dev_from_same_bucket(struct 
seq_file *seq, loff
unsigned int count = 0, offset = get_offset(*pos);
 
h = &net->dev_index_head[get_bucket(*pos)];
-   hlist_for_each_entry_rcu(dev, h, index_hlist) {
+   hlist_for_each_entry(dev, h, index_hlist) {
if (++count == offset)
return dev;
}
@@ -51,9 +51,11 @@ static inline struct net_device *dev_from_bucket(struct 
seq_file *seq, loff_t *p
  * in detail.
  */
 static void *dev_seq_start(struct seq_file *seq, loff_t *pos)
-   __acquires(RCU)
 {
-   rcu_read_lock();
+   struct net *net = seq_file_net(seq);
+
+   mutex_lock(&net->netdev_lists_lock);
+
if (!*pos)
return SEQ_START_TOKEN;
 
@@ -70,9 +72,10 @@ static void *dev_seq_next(struct seq_file *seq, void *v, 
loff_t *pos)
 }
 
 static void dev_seq_stop(struct seq_file *seq, void *v)
-   __releases(RCU)
 {
-   rcu_read_unlock();
+   struct net *net = seq_file_net(seq);
+
+   mutex_unlock(&net->netdev_lists_lock);
 }
 
 static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
-- 
2.25.1



[RFC PATCH net-next 11/13] net: sysfs: don't hold dev_base_lock while retrieving device statistics

2020-12-06 Thread Vladimir Oltean
In the effort of making .ndo_get_stats64 be able to sleep, we need to
ensure the callers of dev_get_stats do not use atomic context.

I need to preface this by saying that I have no idea why netstat_show
takes the dev_base_lock rwlock. Two things can be observed:
(a) it does not appear to be due to dev_isalive requiring it for some
reason, because broadcast_show() also calls dev_isalive() and has
had no problem existing since the beginning of git.
(b) the dev_get_stats function definitely does not need dev_base_lock
protection either. In fact, holding the dev_base_lock is the entire
problem here, because we want to make dev_get_stats sleepable, and
holding a rwlock gives us atomic context.

So since no protection seems to be necessary, just run unlocked while
retrieving the /sys/class/net/eth0/statistics/* values.

Cc: Christian Brauner 
Cc: Eric Dumazet 
Signed-off-by: Vladimir Oltean 
---
 net/core/net-sysfs.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 999b70c59761..0782a476b424 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -585,14 +585,13 @@ static ssize_t netstat_show(const struct device *d,
WARN_ON(offset > sizeof(struct rtnl_link_stats64) ||
offset % sizeof(u64) != 0);
 
-   read_lock(&dev_base_lock);
if (dev_isalive(dev)) {
struct rtnl_link_stats64 temp;
const struct rtnl_link_stats64 *stats = dev_get_stats(dev, 
&temp);
 
ret = sprintf(buf, fmt_u64, *(u64 *)(((u8 *)stats) + offset));
}
-   read_unlock(&dev_base_lock);
+
return ret;
 }
 
-- 
2.25.1



[RFC PATCH net-next 12/13] net: mark ndo_get_stats64 as being able to sleep

2020-12-06 Thread Vladimir Oltean
Now that all callers have been converted to not use atomic context when
calling dev_get_stats, it is time to update the documentation and put a
notice in the function that it expects process context.

Signed-off-by: Vladimir Oltean 
---
 Documentation/networking/netdevices.rst | 4 ++--
 Documentation/networking/statistics.rst | 9 -
 net/core/dev.c  | 2 ++
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/netdevices.rst 
b/Documentation/networking/netdevices.rst
index 5a85fcc80c76..9d005cbf84f7 100644
--- a/Documentation/networking/netdevices.rst
+++ b/Documentation/networking/netdevices.rst
@@ -64,8 +64,8 @@ ndo_do_ioctl:
Context: process
 
 ndo_get_stats:
-   Synchronization: dev_base_lock rwlock.
-   Context: nominally process, but don't sleep inside an rwlock
+   Synchronization: none. rtnl_lock() might be held, but not guaranteed.
+   Context: process
 
 ndo_start_xmit:
Synchronization: __netif_tx_lock spinlock.
diff --git a/Documentation/networking/statistics.rst 
b/Documentation/networking/statistics.rst
index 234abedc29b2..ad3e353df0dd 100644
--- a/Documentation/networking/statistics.rst
+++ b/Documentation/networking/statistics.rst
@@ -155,11 +155,10 @@ Drivers must ensure best possible compliance with
 Please note for example that detailed error statistics must be
 added into the general `rx_error` / `tx_error` counters.
 
-The `.ndo_get_stats64` callback can not sleep because of accesses
-via `/proc/net/dev`. If driver may sleep when retrieving the statistics
-from the device it should do so periodically asynchronously and only return
-a recent copy from `.ndo_get_stats64`. Ethtool interrupt coalescing interface
-allows setting the frequency of refreshing statistics, if needed.
+Drivers may sleep when retrieving the statistics from the device, or they might
+read the counters periodically and only return in `.ndo_get_stats64` a recent
+copy collected asynchronously. In the latter case, the ethtool interrupt
+coalescing interface allows setting the frequency of refreshing statistics.
 
 Retrieving ethtool statistics is a multi-syscall process, drivers are advised
 to keep the number of statistics constant to avoid race conditions with
diff --git a/net/core/dev.c b/net/core/dev.c
index 18904acb7797..45a845526b64 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10367,6 +10367,8 @@ struct rtnl_link_stats64 *dev_get_stats(struct 
net_device *dev,
 {
const struct net_device_ops *ops = dev->netdev_ops;
 
+   might_sleep();
+
if (ops->ndo_get_stats64) {
memset(storage, 0, sizeof(*storage));
ops->ndo_get_stats64(dev, storage);
-- 
2.25.1



Re: [RFC PATCH net-next 05/13] net: bonding: hold the netdev lists lock when retrieving device statistics

2020-12-06 Thread Vladimir Oltean
On Mon, Dec 07, 2020 at 01:59:11AM +0200, Vladimir Oltean wrote:
> In the effort of making .ndo_get_stats64 be able to sleep, we need to
> ensure the callers of dev_get_stats do not use atomic context.
>
> The bonding driver uses an RCU read-side critical section to ensure the
> integrity of the list of network interfaces, because the driver iterates
> through all net devices in the netns to find the ones which are its
> configured slaves. We still need some protection against an interface
> registering or deregistering, and the writer-side lock, the netns mutex,
> is fine for that, because it offers sleepable context.
>
> This mutex now serves double duty. It offers code serialization,
> something which the stats_lock already did. So now that serves no
> purpose, let's remove it.
>
> Cc: Jay Vosburgh 
> Cc: Veaceslav Falico 
> Cc: Andy Gospodarek 
> Signed-off-by: Vladimir Oltean 
> ---

There is a very obvious deadlock here which happens when we have
bond-over-bond and the upper calls dev_get_stats from the lower.

Conceptually, the same can happen even in any number of stacking
combinations between bonding, net_failover, [ insert any other driver
that takes net->netdev_lists_lock here ].

There would be two approaches trying to solve this issue:
- using mutex_lock_nested where we aren't sure that we are top level
- ensuring through convention that user space always takes
  net->netdev_lists_lock when calling dev_get_stats, and documenting
  that, and therefore making it unnecessary to lock in bonding.

I took neither of the two approaches (I don't really like either one too
much), hence [ one of ] the reasons for the RFC. Comments?

Re: [PATCH 1/1] ionic: fix array overflow on receiving too many fragments for a packet

2020-12-06 Thread Jesse Brandeburg
Xiaohui Zhang wrote:

> From: Zhang Xiaohui 
> 
> If the hardware receives an oversized packet with too many rx fragments,
> skb_shinfo(skb)->frags can overflow and corrupt memory of adjacent pages.
> This becomes especially visible if it corrupts the freelist pointer of
> a slab page.
> 
> Signed-off-by: Zhang Xiaohui 

Hi, thanks for your patch.

It appears this is a part of a series of patches (at least this one and
one to the ice driver) - please send as one series, with a cover letter
explanation.

Please justify how this is a bug and how this is found / reproduced.

I'll respond separately to the ice driver patch as I don't know this
hardware and it's limits, but I suspect that you've tried to fix a bug
where there was none. (It seems like something a code scanner might find
and be confused about)

> ---
>  drivers/net/ethernet/pensando/ionic/ionic_txrx.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c 
> b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> index 169ac4f54..a3e274c65 100644
> --- a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> +++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
> @@ -102,8 +102,12 @@ static struct sk_buff *ionic_rx_frags(struct ionic_queue 
> *q,
>  
>   dma_unmap_page(dev, dma_unmap_addr(page_info, dma_addr),
>  PAGE_SIZE, DMA_FROM_DEVICE);
> - skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> + struct skb_shared_info *shinfo = skb_shinfo(skb);

you can't declare variables in the middle of a code flow in C, did you
compile this?

> +
> + if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
> + skb_add_rx_frag(skb, shinfo->nr_frags,
>   page_info->page, 0, frag_len, PAGE_SIZE);
> + }
>   page_info->page = NULL;
>   page_info++;
>   i--;




Re: [PATCH 1/1] ice: fix array overflow on receiving too many fragments for a packet

2020-12-06 Thread Jesse Brandeburg
Xiaohui Zhang wrote:

> From: Zhang Xiaohui 
> 
> If the hardware receives an oversized packet with too many rx fragments,
> skb_shinfo(skb)->frags can overflow and corrupt memory of adjacent pages.
> This becomes especially visible if it corrupts the freelist pointer of
> a slab page.

As I replied to the ionic patch, please justify this with how you found
it and how you reproduced a problem. Resend the patches as a series so
we can discuss them as one change.

> 
> Signed-off-by: Zhang Xiaohui 
> ---
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c 
> b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index eae75260f..f0f034fa5 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -823,8 +823,12 @@ ice_add_rx_frag(struct ice_ring *rx_ring, struct 
> ice_rx_buf *rx_buf,
>  
>   if (!size)
>   return;
> - skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
> + struct skb_shared_info *shinfo = skb_shinfo(skb);
> +
> + if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
> + skb_add_rx_frag(skb, shinfo, rx_buf->page,
>   rx_buf->page_offset, size, truesize);
> + }

The driver is using 2kB receive buffers, and can chain them together up
to a max receive size of 9126 bytes (or so), so how can we receive more
than 18 fragments? Please explain your logic

>  
>   /* page is being used so we must update the page offset */
>   ice_rx_buf_adjust_pg_offset(rx_buf, truesize);

Your patch doesn't compile. You must compile test and explain your
patches better.

  CC [M]  drivers/net/ethernet/intel/ice//ice_main.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_controlq.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_common.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_nvm.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_switch.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_sched.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_base.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_lib.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_txrx_lib.o
  CC [M]  drivers/net/ethernet/intel/ice//ice_txrx.o
drivers/net/ethernet/intel/ice//ice_txrx.c: In function ‘ice_add_rx_frag’:
drivers/net/ethernet/intel/ice//ice_txrx.c:829:2: warning: ISO C90 forbids 
mixed declarations and code [-Wdeclaration-after-statement]
  829 |  struct skb_shared_info *shinfo = skb_shinfo(skb);
  |  ^~
drivers/net/ethernet/intel/ice//ice_txrx.c:832:24: warning: passing argument 2 
of ‘skb_add_rx_frag’ makes integer from pointer without a cast 
[-Wint-conversion]
  832 |   skb_add_rx_frag(skb, shinfo, rx_buf->page,
  |^~
  ||
  |struct skb_shared_info *
In file included from ./include/linux/if_ether.h:19,
 from ./include/uapi/linux/ethtool.h:19,
 from ./include/linux/ethtool.h:18,
 from ./include/linux/netdevice.h:37,
 from ./include/trace/events/xdp.h:8,
 from ./include/linux/bpf_trace.h:5,
 from drivers/net/ethernet/intel/ice//ice_txrx.c:8:
./include/linux/skbuff.h:2182:47: note: expected ‘int’ but argument is of type 
‘struct skb_shared_info *’
 2182 | void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int 
off,
  |   ^


[PATCH net] net: flow_offload: Fix memory leak for indirect flow block

2020-12-06 Thread Chris Mi
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.

Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block 
infrastructure")
Signed-off-by: Chris Mi 
Reviewed-by: Roi Dayan 
---
 net/core/flow_offload.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index d4474c812b64..715b67f6c62f 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -381,10 +381,8 @@ static void __flow_block_indr_cleanup(void (*release)(void 
*cb_priv),
 
list_for_each_entry_safe(this, next, &flow_block_indr_list, indr.list) {
if (this->release == release &&
-   this->indr.cb_priv == cb_priv) {
+   this->indr.cb_priv == cb_priv)
list_move(&this->indr.list, cleanup_list);
-   return;
-   }
}
 }
 
-- 
2.26.2



Re: [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support

2020-12-06 Thread David Ahern
On 11/12/20 12:24 PM, Parav Pandit wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig 
> b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> index 485478979b1a..10dfaf671c90 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> @@ -202,3 +202,12 @@ config MLX5_SW_STEERING
>   default y
>   help
>   Build support for software-managed steering in the NIC.
> +
> +config MLX5_SF
> + bool "Mellanox Technologies subfunction device support using auxiliary 
> device"
> + depends on MLX5_CORE && MLX5_CORE_EN
> + default n
> + help
> + Build support for subfuction device in the NIC. A Mellanox subfunction
> + device can support RDMA, netdevice and vdpa device.
> + It is similar to a SRIOV VF but it doesn't require SRIOV support.

per Dan's comment about AUXILIARY_BUS being select only, should this
config select AUXILIARY_BUS?


Re: LRO: creating vlan subports affects parent port's LRO settings

2020-12-06 Thread Limin Wang
Thanks to Jakub, Michal and Jarod for the time and consideration. I
have been caught up with other stuff, and not spent much time on this
since.

I am not that familiar with the upper/lower sync logic and
implications. But I agree with Michal,  it may not be necessary to
have a blanket exclusion of certain type of devices in those sync
functions.

I was thinking something maybe in vlan_dev.c:

static int vlan_dev_init(struct net_device *dev)
{
struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;

netif_carrier_off(dev);

/* IFF_BROADCAST|IFF_MULTICAST; ??? */
dev->flags  = real_dev->flags & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI |
 IFF_MASTER | IFF_SLAVE);
dev->state  = (real_dev->state & ((1<<__LINK_STATE_NOCARRIER) |
 (1<<__LINK_STATE_DORMANT))) |
 (1<<__LINK_STATE_PRESENT);

dev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG |
  NETIF_F_FRAGLIST | NETIF_F_GSO_SOFTWARE |
  NETIF_F_HIGHDMA | NETIF_F_SCTP_CRC |
  NETIF_F_ALL_FCOE;

here, maybe check if real_dev's NETIF_F_UPPER_DISABLES features bits
are set, and add them to vlan_dev?

On Sun, Dec 6, 2020 at 11:49 AM Michal Kubecek  wrote:
>
> On Sat, Dec 05, 2020 at 07:04:06PM -0500, Jarod Wilson wrote:
> > On Mon, Nov 23, 2020 at 7:27 PM Jakub Kicinski  wrote:
> > >
> > > On Thu, 19 Nov 2020 20:37:27 -0500 Limin Wang wrote:
> > > > Under relatively recent kernels (v4.4+), creating a vlan subport on a
> > > > LRO supported parent NIC may turn LRO off on the parent port and
> > > > further render its LRO feature practically unchangeable.
> > >
> > > That does sound like an oversight in commit fd867d51f889 ("net/core:
> > > generic support for disabling netdev features down stack").
> > >
> > > Are you able to create a patch to fix this?
> >
> > Something like this, perhaps? Completely untested copy-pasta'd
> > theoretical patch:
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 8588ade790cb..a5ce372e02ba 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -9605,8 +9605,10 @@ int __netdev_update_features(struct net_device *dev)
> > features = netdev_fix_features(dev, features);
> >
> > /* some features can't be enabled if they're off on an upper device 
> > */
> > -   netdev_for_each_upper_dev_rcu(dev, upper, iter)
> > -   features = netdev_sync_upper_features(dev, upper, features);
> > +   netdev_for_each_upper_dev_rcu(dev, upper, iter) {
> > +   if (netif_is_lag_master(upper) || 
> > netif_is_bridge_master(upper))
> > +   features = netdev_sync_upper_features(dev,
> > upper, features);
> > +   }
> >
> > if (dev->features == features)
> > goto sync_lower;
> > @@ -9633,8 +9635,10 @@ int __netdev_update_features(struct net_device *dev)
> > /* some features must be disabled on lower devices when disabled
> >  * on an upper device (think: bonding master or bridge)
> >  */
> > -   netdev_for_each_lower_dev(dev, lower, iter)
> > -   netdev_sync_lower_features(dev, lower, features);
> > +   if (netif_is_lag_master(dev) || netif_is_bridge_master(dev)) {
> > +   netdev_for_each_lower_dev(dev, lower, iter)
> > +   netdev_sync_lower_features(dev, lower, features);
> > +   }
> >
> > if (!err) {
> > netdev_features_t diff = features ^ dev->features;
> >
> > I'm not sure what all other upper devices this excludes besides just
> > vlan ports though, so perhaps safer add upper device types to not do
> > feature sync on than to choose which ones to do them on?
>
> I'm not sure excluding devices from feature sync is the right way,
> whether it's an explicit list types or default. The logic still makes
> sense to me. Couldn't we address the issue by either setting features in
> NETIF_F_UPPER_DISABLES) by default for a new vlan (and probably macvlan)
> device? Or perhaps inheriting their values from the lower device.
>
> Michal


linux-next: build failure after merge of the block tree

2020-12-06 Thread Stephen Rothwell
Hi all,

After merging the block tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

fs/io_uring.c: In function 'io_shutdown':
fs/io_uring.c:3782:9: error: too many arguments to function 'sock_from_file'
 3782 |  sock = sock_from_file(req->file, &ret);
  | ^~
In file included from fs/io_uring.c:63:
include/linux/net.h:243:16: note: declared here
  243 | struct socket *sock_from_file(struct file *file);
  |^~

Caused by commit

  36f4fa6886a8 ("io_uring: add support for shutdown(2)")

interacting with commit

  dba4a9256bb4 ("net: Remove the err argument from sock_from_file")

from the bpf-next tree.

I have applied the following merge fix patch.

From: Stephen Rothwell 
Date: Mon, 7 Dec 2020 14:04:10 +1100
Subject: [PATCH] fixup for "net: Remove the err argument from sock_from_file"

Signed-off-by: Stephen Rothwell 
---
 fs/io_uring.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index cd997264dbab..91d08408f1fe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3779,9 +3779,9 @@ static int io_shutdown(struct io_kiocb *req, bool 
force_nonblock)
if (force_nonblock)
return -EAGAIN;
 
-   sock = sock_from_file(req->file, &ret);
+   sock = sock_from_file(req->file);
if (unlikely(!sock))
-   return ret;
+   return -ENOTSOCK;
 
ret = __sys_shutdown_sock(sock, req->shutdown.how);
io_req_complete(req, ret);
-- 
2.29.2

-- 
Cheers,
Stephen Rothwell


pgph6hJLjafyJ.pgp
Description: OpenPGP digital signature


Re: LRO: creating vlan subports affects parent port's LRO settings

2020-12-06 Thread Limin Wang
I might be wrong. One potential issue I found in
netdev_sync_upper_features() is that it depends on the wanted_feature
of upper_dev

if (!(upper->wanted_features & feature)
   && (features & feature)) {
netdev_dbg(lower, "Dropping feature %pNF, upper dev %s has it off.\n",
  &feature, upper->name);
features &= ~feature;
}
}
Suppose a new vlan device will have the LRO bit in its features
because lower_dev (real_dev) supports LRO ( assuming with proposed
changes above), if the vlan_dev's wanted_feature doesn't include LRO,
the NETIF_F_LRO may still be dropped due to this.
One could manually use "ethtool -K vlan_dev lro on" to enable LRO in
the subport's wanted_features, but that has to be done on all
vlan_dev's of the same real_dev. (it is not uncommon that a parent
port may have hundreds of vlan subports)
Does that mean the vlan_dev->wanted_feature has to include LRO bit at
creation time to avoid explicitly setting later on for each and every
vlan subports?

On Sun, Dec 6, 2020 at 5:58 PM Jarod Wilson  wrote:
>
> On Sun, Dec 6, 2020 at 11:49 AM Michal Kubecek  wrote:
> >
> > On Sat, Dec 05, 2020 at 07:04:06PM -0500, Jarod Wilson wrote:
> > > On Mon, Nov 23, 2020 at 7:27 PM Jakub Kicinski  wrote:
> > > >
> > > > On Thu, 19 Nov 2020 20:37:27 -0500 Limin Wang wrote:
> > > > > Under relatively recent kernels (v4.4+), creating a vlan subport on a
> > > > > LRO supported parent NIC may turn LRO off on the parent port and
> > > > > further render its LRO feature practically unchangeable.
> > > >
> > > > That does sound like an oversight in commit fd867d51f889 ("net/core:
> > > > generic support for disabling netdev features down stack").
> > > >
> > > > Are you able to create a patch to fix this?
> > >
> > > Something like this, perhaps? Completely untested copy-pasta'd
> > > theoretical patch:
> > >
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index 8588ade790cb..a5ce372e02ba 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -9605,8 +9605,10 @@ int __netdev_update_features(struct net_device 
> > > *dev)
> > > features = netdev_fix_features(dev, features);
> > >
> > > /* some features can't be enabled if they're off on an upper 
> > > device */
> > > -   netdev_for_each_upper_dev_rcu(dev, upper, iter)
> > > -   features = netdev_sync_upper_features(dev, upper, 
> > > features);
> > > +   netdev_for_each_upper_dev_rcu(dev, upper, iter) {
> > > +   if (netif_is_lag_master(upper) || 
> > > netif_is_bridge_master(upper))
> > > +   features = netdev_sync_upper_features(dev,
> > > upper, features);
> > > +   }
> > >
> > > if (dev->features == features)
> > > goto sync_lower;
> > > @@ -9633,8 +9635,10 @@ int __netdev_update_features(struct net_device 
> > > *dev)
> > > /* some features must be disabled on lower devices when disabled
> > >  * on an upper device (think: bonding master or bridge)
> > >  */
> > > -   netdev_for_each_lower_dev(dev, lower, iter)
> > > -   netdev_sync_lower_features(dev, lower, features);
> > > +   if (netif_is_lag_master(dev) || netif_is_bridge_master(dev)) {
> > > +   netdev_for_each_lower_dev(dev, lower, iter)
> > > +   netdev_sync_lower_features(dev, lower, features);
> > > +   }
> > >
> > > if (!err) {
> > > netdev_features_t diff = features ^ dev->features;
> > >
> > > I'm not sure what all other upper devices this excludes besides just
> > > vlan ports though, so perhaps safer add upper device types to not do
> > > feature sync on than to choose which ones to do them on?
> >
> > I'm not sure excluding devices from feature sync is the right way,
> > whether it's an explicit list types or default. The logic still makes
> > sense to me. Couldn't we address the issue by either setting features in
> > NETIF_F_UPPER_DISABLES) by default for a new vlan (and probably macvlan)
> > device? Or perhaps inheriting their values from the lower device.
>
> Yeah, I think you're right, excluding devices entirely from sync is a
> bad idea, it should be only certain features that don't get sync'd for
> devices that say they don't want them (i.e., vlan devs and macvlan
> devs). I'll do a bit more reading of the code and ponder. I'm not
> familiar with the intricacies of NETIF_F_UPPER_DISABLES just yet.
>
> --
> Jarod Wilson
> ja...@redhat.com
>


Re: [PATCH v2 bpf-next 0/3] bpf: support module BTF in BTF display helpers

2020-12-06 Thread Yonghong Song




On 12/5/20 4:43 PM, Alan Maguire wrote:


On Sat, 5 Dec 2020, Yonghong Song wrote:




__builtin_btf_type_id() is really only supported in llvm12
and 64bit return value support is pushed to llvm12 trunk
a while back. The builtin is introduced in llvm11 but has a
corner bug, so llvm12 is recommended. So if people use the builtin,
you can assume 64bit return value. libbpf support is required
here. So in my opinion, there is no need to do feature detection.

Andrii has a patch to support 64bit return value for
__builtin_btf_type_id() and I assume that one should
be landed before or together with your patch.

Just for your info. The following is an example you could
use to determine whether __builtin_btf_type_id()
supports btf object id at llvm level.

-bash-4.4$ cat t.c
int test(int arg) {
   return __builtin_btf_type_id(arg, 1);
}

Compile to generate assembly code with latest llvm12 trunk:
   clang -target bpf -O2 -S -g -mcpu=v3 t.c
In the asm code, you should see one line with
   r0 = 1 ll

Or you can generate obj code:
   clang -target bpf -O2 -c -g -mcpu=v3 t.c
and then you disassemble the obj file
   llvm-objdump -d --no-show-raw-insn --no-leading-addr t.o
You should see below in the output
   r0 = 1 ll

Use earlier version of llvm12 trunk, the builtin has
32bit return value, you will see
   r0 = 1
which is a 32bit imm to r0, while "r0 = 1 ll" is
64bit imm to r0.



Thanks for this Yonghong!  I'm thinking the way I'll tackle it
is to simply verify that the upper 32 bits specifying the
veth module object id are non-zero; if they are zero, we'll skip
the test (I think a skip probably makes sense as not everyone will
have llvm12). Does that seem reasonable?


This should work too and we do not need to add a note in
README.rst for this test then.



With the additional few minor changes on top of Andrii's patch,
the use of __builtin_btf_type_id() worked perfectly. Thanks!

Alan



Re: [PATCH 1/1] ionic: fix array overflow on receiving too many fragments for a packet

2020-12-06 Thread Shannon Nelson

On 12/6/20 5:51 PM, Jesse Brandeburg wrote:

Xiaohui Zhang wrote:


From: Zhang Xiaohui 

If the hardware receives an oversized packet with too many rx fragments,
skb_shinfo(skb)->frags can overflow and corrupt memory of adjacent pages.
This becomes especially visible if it corrupts the freelist pointer of
a slab page.

Signed-off-by: Zhang Xiaohui 

Hi, thanks for your patch.

It appears this is a part of a series of patches (at least this one and
one to the ice driver) - please send as one series, with a cover letter
explanation.

Please justify how this is a bug and how this is found / reproduced.

I'll respond separately to the ice driver patch as I don't know this
hardware and it's limits, but I suspect that you've tried to fix a bug
where there was none. (It seems like something a code scanner might find
and be confused about)


---
  drivers/net/ethernet/pensando/ionic/ionic_txrx.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c 
b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
index 169ac4f54..a3e274c65 100644
--- a/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
+++ b/drivers/net/ethernet/pensando/ionic/ionic_txrx.c
@@ -102,8 +102,12 @@ static struct sk_buff *ionic_rx_frags(struct ionic_queue 
*q,
  
  		dma_unmap_page(dev, dma_unmap_addr(page_info, dma_addr),

   PAGE_SIZE, DMA_FROM_DEVICE);
-   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+   struct skb_shared_info *shinfo = skb_shinfo(skb);

you can't declare variables in the middle of a code flow in C, did you
compile this?


+
+   if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
+   skb_add_rx_frag(skb, shinfo->nr_frags,
page_info->page, 0, frag_len, PAGE_SIZE);
+   }


Is this just dropping the remaining frags without dropping the rest of 
the skb?  Is this going to leave an incorrect length in the skb?


A single statement after the 'if' doesn't need {}'s

This might be better handled by making sure ahead of time in 
configuration that the HW doesn't do this, rather than add a test into 
the fast path.  As it is, between the definitions of shinfo->frags[] and 
the ionic's rx sg list, I don't think this is a possible error.


As Jesse suggests, I'd like to see the test case so i can add it to our 
internal testing.


Thanks,
sln


page_info->page = NULL;
page_info++;
i--;






Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path

2020-12-06 Thread Jason Wang



On 2020/12/4 下午6:22, wangyunjian wrote:

-Original Message-
From: Jason Wang [mailto:jasow...@redhat.com]
Sent: Friday, December 4, 2020 2:11 PM
To: wangyunjian ; m...@redhat.com
Cc: virtualizat...@lists.linux-foundation.org; netdev@vger.kernel.org; Lilijun
(Jerry) ; xudingke 
Subject: Re: [PATCH net-next] tun: fix ubuf refcount incorrectly on error path


On 2020/12/3 下午4:00, wangyunjian wrote:

From: Yunjian Wang 

After setting callback for ubuf_info of skb, the callback
(vhost_net_zerocopy_callback) will be called to decrease the refcount
when freeing skb. But when an exception occurs afterwards, the error
handling in vhost handle_tx() will try to decrease the same refcount
again. This is wrong and fix this by clearing ubuf_info when meeting
errors.

Fixes: 4477138fa0ae ("tun: properly test for IFF_UP")
Fixes: 90e33d459407 ("tun: enable napi_gro_frags() for TUN/TAP
driver")

Signed-off-by: Yunjian Wang 
---
   drivers/net/tun.c | 11 +++
   1 file changed, 11 insertions(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c index
2dc1988a8973..3614bb1b6d35 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1861,6 +1861,12 @@ static ssize_t tun_get_user(struct tun_struct

*tun, struct tun_file *tfile,

if (unlikely(!(tun->dev->flags & IFF_UP))) {
err = -EIO;
rcu_read_unlock();
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
+
goto drop;
}

@@ -1874,6 +1880,11 @@ static ssize_t tun_get_user(struct tun_struct
*tun, struct tun_file *tfile,

if (unlikely(headlen > skb_headlen(skb))) {
atomic_long_inc(&tun->dev->rx_dropped);
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = NULL;
+   skb_shinfo(skb)->tx_flags &= 
~SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_SHARED_FRAG;
+   }
napi_free_frags(&tfile->napi);
rcu_read_unlock();
mutex_unlock(&tfile->napi_mutex);


It looks to me then we miss the failure feedback.

The issues comes from the inconsistent error handling in tun.

I wonder whether we can simply do uarg->callback(uarg, false) if necessary on
every failture path on tun_get_user().

How about this?

---
  drivers/net/tun.c | 29 ++---
  1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2dc1988a8973..36a8d8eacd7b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1637,6 +1637,19 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
return NULL;
  }
  
+/* copy ubuf_info for callback when skb has no error */

+inline static tun_copy_ubuf_info(struct sk_buff *skb, bool zerocopy, void 
*msg_control)
+{
+   if (zerocopy) {
+   skb_shinfo(skb)->destructor_arg = msg_control;
+   skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
+   } else if (msg_control) {
+   struct ubuf_info *uarg = msg_control;
+   uarg->callback(uarg, false);
+   }
+}
+
  /* Get packet from user space buffer */
  static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
void *msg_control, struct iov_iter *from,
@@ -1812,16 +1825,6 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
break;
}
  
-	/* copy skb_ubuf_info for callback when skb has no error */

-   if (zerocopy) {
-   skb_shinfo(skb)->destructor_arg = msg_control;
-   skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
-   skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
-   } else if (msg_control) {
-   struct ubuf_info *uarg = msg_control;
-   uarg->callback(uarg, false);
-   }
-
skb_reset_network_header(skb);
skb_probe_transport_header(skb);
skb_record_rx_queue(skb, tfile->queue_index);
@@ -1830,6 +1833,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
struct bpf_prog *xdp_prog;
int ret;
  
+		tun_copy_ubuf_info(skb, zerocopy, msg_control);



If you think disabling zerocopy for XDP (which I think it makes sense). 
It's better to do this in another patch.




local_bh_disable();
rcu_read_lock();
xdp_prog = rcu_dereference(tun->xdp_prog);
@@ -1880,7 +1884,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
WARN_ON(1);
return

RE: [PATCH net-next v4] devlink: Add devlink port documentation

2020-12-06 Thread Parav Pandit



> From: Jakub Kicinski 
> Sent: Sunday, December 6, 2020 1:57 AM
> 
> On Thu, 3 Dec 2020 20:02:55 +0200 Parav Pandit wrote:
> > Added documentation for devlink port and port function related commands.
> >
> > Signed-off-by: Parav Pandit 
> > Reviewed-by: Jiri Pirko 
> > Reviewed-by: Jacob Keller 
> 
> > +
> > +Devlink Port
> > +
> > +
> > +``devlink-port`` is a port that exists on the device.
> 
> Can we add something like:
> 
> Each port is a logically separate ingress/egress point of the device.
> 
> ?
This may not be true when both physical ports are under bond.

> 
> > A devlink port can
> > +be of one among many flavours. A devlink port flavour along with port
> > +attributes describe what a port represents.
> > +
> > +A device driver that intends to publish a devlink port sets the
> > +devlink port attributes and registers the devlink port.
> > +
> > +Devlink port types are described below.
> 
> How about:
> 
> Physical ports can have different types based on the link layer:
>
Ok. will add.
 
> > +.. list-table:: List of devlink port types
> > +   :widths: 23 90
> > +
> > +   * - Type
> > + - Description
> > +   * - ``DEVLINK_PORT_TYPE_ETH``
> > + - Driver should set this port type when a link layer of the port is 
> > Ethernet.
> > +   * - ``DEVLINK_PORT_TYPE_IB``
> > + - Driver should set this port type when a link layer of the port is 
> > InfiniBand.
> 
> Please wrap at 80 chars.
> 
Ack.
> > +   * - ``DEVLINK_PORT_TYPE_AUTO``
> > + - This type is indicated by the user when user prefers to set the 
> > port type
> > +   to be automatically detected by the device driver.
> 
> How about:
> 
> This type is indicated by the user when driver should detect the port type
> automatically.
> 
Will change.

> > +A controller consists of one or more PCI functions.
> 
> This need some intro. Like:
> 
> PCI controllers
> ---
> 
> In most cases PCI devices will have only one controller, with potentially 
> multiple
> physical and virtual functions. Devices connected to multiple CPUs and
> SmartNICs, however, may have multiple controllers.
> 
> > Such PCI function consists
> > +of one or more networking ports.
> 
> PCI function consists of networking ports? What do you mean by a networking
> port? All devlink ports are networking ports.
>
I am not sure this document should be a starting point to define such 
restriction.
 
> > A networking port of such PCI function is
> > +represented by the eswitch devlink port.
> 
> What's eswitch devlink port? It was never defined.
Eswitch devlink port is the port which sets eswitch attributes (id and length).

> 
> > A devlink instance holds ports of two
> > +types of controllers.
> 
> For devices with multiple controllers we can distinguish...
> 
Yes, will change.

> > +(1) controller discovered on same system where eswitch resides:
> > +This is the case where PCI PF/VF of a controller and devlink eswitch
> > +instance both are located on a single system.
> 
> How is eswitch located on a system? Eswitch is in the NIC
>
Yes, I meant eswitch devlink instance and controller devlink instance are same.
 Will rephase.

> I think you should say refer to eswitch being controlled by a system.
> 
> > +(2) controller located on external host system.
> > +This is the case where a controller is in one system and its devlink
> > +eswitch ports are in a different system. Such controller is called
> > +external controller.
> 
> > +An example view of two controller systems::
> > +
> > +Port function configuration
> > +===
> > +
> > +When a port flavor is ``DEVLINK_PORT_FLAVOUR_PCI_PF`` or
> > +``DEVLINK_PORT_FLAVOUR_PCI_VF``, it represents the networking port of
> > +a PCI function.
> 
> Networking port of a PCI function?
> 
> > A user can configure the port function attributes
> 
> port function attributes?
> 
> > before
> > +enumerating the function.
> 
> What does this mean? What does enumerate mean in this context?
> 
Enumerate means before creating the device of the function.
However today due to SR-IOV limitation, it is before probing the function 
device.

> > For example user may set the hardware address of
> > +the function represented by the devlink port function.
> 
> What's a hardware address? You mean MAC address?
Yes, MAC address.
Port function attribute is named as hardware address to be generic enough 
similar to other iproute2 tools.



RE: [PATCH net-next 07/13] net/mlx5: SF, Add auxiliary device support

2020-12-06 Thread Parav Pandit


> From: David Ahern 
> Sent: Monday, December 7, 2020 8:19 AM
> 
> On 11/12/20 12:24 PM, Parav Pandit wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > index 485478979b1a..10dfaf671c90 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/Kconfig
> > @@ -202,3 +202,12 @@ config MLX5_SW_STEERING
> > default y
> > help
> > Build support for software-managed steering in the NIC.
> > +
> > +config MLX5_SF
> > +   bool "Mellanox Technologies subfunction device support using auxiliary
> device"
> > +   depends on MLX5_CORE && MLX5_CORE_EN
> > +   default n
> > +   help
> > +   Build support for subfuction device in the NIC. A Mellanox subfunction
> > +   device can support RDMA, netdevice and vdpa device.
> > +   It is similar to a SRIOV VF but it doesn't require SRIOV support.
> 
> per Dan's comment about AUXILIARY_BUS being select only, should this config
> select AUXILIARY_BUS?
Yes.
However, my patchset depends on patchset [1].
With introduction of patchset [2], MLX5_CORE depends on AUXILIARY_BUS. 
MLX5_SF depends on MLX5_CORE.
So I omitted explicitly selecting AUXBUS by MLX5_SF.

[1] https://lore.kernel.org/linux-rdma/20201204182952.72263-1-sae...@nvidia.com/
[2] https://lore.kernel.org/alsa-devel/20201026111849.1035786-6-l...@kernel.org/




Re: [PATCH] vhost scsi: fix error return code in vhost_scsi_set_endpoint()

2020-12-06 Thread Jason Wang



On 2020/12/4 下午4:43, Zhang Changzhong wrote:

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: 25b98b64e284 ("vhost scsi: alloc cmds per vq instead of session")
Reported-by: Hulk Robot 
Signed-off-by: Zhang Changzhong 
---
  drivers/vhost/scsi.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 6ff8a5096..4ce9f00 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1643,7 +1643,8 @@ vhost_scsi_set_endpoint(struct vhost_scsi *vs,
if (!vhost_vq_is_setup(vq))
continue;
  
-			if (vhost_scsi_setup_vq_cmds(vq, vq->num))

+   ret = vhost_scsi_setup_vq_cmds(vq, vq->num);
+   if (ret)
goto destroy_vq_cmds;
}
  



Acked-by: Jason Wang 




[PATCH bpf] tools/bpftool: Add/Fix support for modules btf dump

2020-12-06 Thread saeed
From: Saeed Mahameed 

While playing with BTF for modules, i noticed that executing the command:
$ bpftool btf dump id 

Fails due to lack of information in the BTF data.

Maybe I am missing a step but actually adding the support for this is
very simple.

To completely parse modules BTF data, we need the vmlinux BTF as their
"base btf", which can be easily found by iterating through the btf ids and
looking for btf.name == "vmlinux".

I am not sure why this hasn't been added by the original patchset
"Integrate kernel module BTF support", as adding the support for
this is very trivial. Unless i am missing something, CCing Andrii..

Signed-off-by: Saeed Mahameed 
CC: Andrii Nakryiko 
---
 tools/lib/bpf/btf.c  | 57 
 tools/lib/bpf/btf.h  |  2 ++
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 60 insertions(+)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 3c3f2bc6c652..5900cccf82e2 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -1370,6 +1370,14 @@ struct btf *btf_get_from_fd(int btf_fd, struct btf 
*base_btf)
goto exit_free;
}
 
+   /* force base_btf for kernel modules */
+   if (btf_info.kernel_btf && !base_btf) {
+   int id = btf_get_kernel_id();
+
+   /* Double check our btf is not the kernel BTF itself */
+   if (id != btf_info.id)
+   btf__get_from_id(id, &base_btf);
+   }
btf = btf_new(ptr, btf_info.btf_size, base_btf);
 
 exit_free:
@@ -4623,3 +4631,52 @@ struct btf *libbpf_find_kernel_btf(void)
pr_warn("failed to find valid kernel BTF\n");
return ERR_PTR(-ESRCH);
 }
+
+#define foreach_btf_id(id, err) \
+   for (err = bpf_btf_get_next_id((id), (&id)); !err; )
+
+/*
+ * Scan all ids for a kernel btf with name == "vmlinux"
+ */
+int btf_get_kernel_id(void)
+{
+   struct bpf_btf_info info;
+   __u32 len = sizeof(info);
+   char name[64];
+   __u32 id = 0;
+   int err, fd;
+
+   foreach_btf_id(id, err) {
+   fd = bpf_btf_get_fd_by_id(id);
+   if (fd < 0) {
+   if (errno == ENOENT)
+   continue; /* expected race: BTF was unloaded */
+   err = -errno;
+   pr_warn("failed to get BTF object #%d FD: %d\n", id, 
err);
+   return err;
+   }
+
+   memset(&info, 0, sizeof(info));
+   info.name = ptr_to_u64(name);
+   info.name_len = sizeof(name);
+
+   err = bpf_obj_get_info_by_fd(fd, &info, &len);
+   if (err) {
+   err = -errno;
+   pr_warn("failed to get BTF object #%d info: %d\n", id, 
err);
+   return err;
+   }
+
+   if (info.kernel_btf && strcmp(name, "vmlinux") == 0)
+   return id;
+
+   }
+
+   if (err && errno != ENOENT) {
+   err = -errno;
+   pr_warn("failed to iterate BTF objects: %d\n", err);
+   return err;
+   }
+
+   return -ENOENT;
+}
\ No newline at end of file
diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h
index 1237bcd1dd17..44075b086d1c 100644
--- a/tools/lib/bpf/btf.h
+++ b/tools/lib/bpf/btf.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "libbpf_common.h"
 
@@ -90,6 +91,7 @@ LIBBPF_API __u32 btf_ext__func_info_rec_size(const struct 
btf_ext *btf_ext);
 LIBBPF_API __u32 btf_ext__line_info_rec_size(const struct btf_ext *btf_ext);
 
 LIBBPF_API struct btf *libbpf_find_kernel_btf(void);
+LIBBPF_API int btf_get_kernel_id(void);
 
 LIBBPF_API int btf__find_str(struct btf *btf, const char *s);
 LIBBPF_API int btf__add_str(struct btf *btf, const char *s);
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 7c4126542e2b..727daeb57f35 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -348,4 +348,5 @@ LIBBPF_0.3.0 {
btf__new_split;
xsk_setup_xdp_prog;
xsk_socket__update_xskmap;
+   btf_get_kernel_id
 } LIBBPF_0.2.0;
-- 
2.26.2



[pull request][for-next V2] mlx5-next auxbus support

2020-12-06 Thread saeed
Hi Jakub, Jason

v1->v2: Fix compilation warning when compiling with W=1 in the mlx5
patches.

This pull request is targeting net-next and rdma-next branches.

This series provides mlx5 support for auxiliary bus devices.

It starts with a merge commit of tag 'auxbus-5.11-rc1' from
gregkh/driver-core into mlx5-next, then the mlx5 patches that will convert
mlx5 ulp devices (netdev, rdma, vdpa) to use the proper auxbus
infrastructure instead of the internal mlx5 device and interface management
implementation, which Leon is deleting at the end of this patchset.

Link: 
https://lore.kernel.org/alsa-devel/20201026111849.1035786-1-l...@kernel.org/

Thanks to everyone for the joint effort !

Please pull and let me know if there's any problem.

Thanks,
Saeed.

---

The following changes since commit b65054597872ce3aefbc6a666385eabdf9e288da:

  Linux 5.10-rc6 (2020-11-29 15:50:50 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git mlx5-next

for you to fetch changes up to 04b222f9577396a8d19bf2937d2a218dc2a3c7ac:

  RDMA/mlx5: Remove IB representors dead code (2020-12-06 07:43:54 +0200)


Dave Ertman (1):
  Add auxiliary bus support

Greg Kroah-Hartman (3):
  driver core: auxiliary bus: move slab.h from include file
  driver core: auxiliary bus: make remove function return void
  driver core: auxiliary bus: minor coding style tweaks

Leon Romanovsky (11):
  Merge tag 'auxbus-5.11-rc1' of 
https://git.kernel.org/.../gregkh/driver-core into mlx5-next
  net/mlx5: Properly convey driver version to firmware
  net/mlx5_core: Clean driver version and name
  vdpa/mlx5: Make hardware definitions visible to all mlx5 devices
  net/mlx5: Register mlx5 devices to auxiliary virtual bus
  vdpa/mlx5: Connect mlx5_vdpa to auxiliary bus
  net/mlx5e: Connect ethernet part to auxiliary bus
  RDMA/mlx5: Convert mlx5_ib to use auxiliary bus
  net/mlx5: Delete custom device management logic
  net/mlx5: Simplify eswitch mode check
  RDMA/mlx5: Remove IB representors dead code

 Documentation/driver-api/auxiliary_bus.rst | 234 +
 Documentation/driver-api/index.rst |   1 +
 drivers/base/Kconfig   |   3 +
 drivers/base/Makefile  |   1 +
 drivers/base/auxiliary.c   | 274 ++
 drivers/infiniband/hw/mlx5/counters.c  |   7 -
 drivers/infiniband/hw/mlx5/ib_rep.c| 112 ++--
 drivers/infiniband/hw/mlx5/ib_rep.h|  45 +-
 drivers/infiniband/hw/mlx5/main.c  | 153 --
 drivers/infiniband/hw/mlx5/mlx5_ib.h   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/Kconfig|   1 +
 drivers/net/ethernet/mellanox/mlx5/core/dev.c  | 567 ++---
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 134 ++---
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.c   |  41 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rep.h   |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c|   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/eswitch.c  |  21 +-
 .../ethernet/mellanox/mlx5/core/ipoib/ethtool.c|   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/lag.c  |  58 +--
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  49 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h|  33 +-
 drivers/vdpa/mlx5/Makefile |   2 +-
 drivers/vdpa/mlx5/net/main.c   |  76 ---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  |  53 +-
 drivers/vdpa/mlx5/net/mlx5_vnet.h  |  24 -
 include/linux/auxiliary_bus.h  |  77 +++
 include/linux/mlx5/driver.h|  34 +-
 include/linux/mlx5/eswitch.h   |   8 +-
 .../linux/mlx5/mlx5_ifc_vdpa.h |   8 +-
 include/linux/mod_devicetable.h|   8 +
 scripts/mod/devicetable-offsets.c  |   3 +
 scripts/mod/file2alias.c   |   8 +
 34 files changed, 1413 insertions(+), 650 deletions(-)
 create mode 100644 Documentation/driver-api/auxiliary_bus.rst
 create mode 100644 drivers/base/auxiliary.c
 delete mode 100644 drivers/vdpa/mlx5/net/main.c
 delete mode 100644 drivers/vdpa/mlx5/net/mlx5_vnet.h
 create mode 100644 include/linux/auxiliary_bus.h
 rename drivers/vdpa/mlx5/core/mlx5_vdpa_ifc.h => 
include/linux/mlx5/mlx5_ifc_vdpa.h (96%)


[PATCH iproute2-net v2 0/3] devlink: Add devlink reload action limit and stats

2020-12-06 Thread Moshe Shemesh
Introduce new options on devlink reload API to enable the user to select
the reload action required and constrains limits on these actions that he
may want to ensure.

Add reload stats to show the history per reload action per limit.

Patch 1 adds the new API reload action and reload limit options to
devlink reload command.
Patch 2 adds pr_out_dev() helper function and modify monitor function to
use it.
Patch 3 adds reload stats and remote reload stats to devlink dev show.

Moshe Shemesh (3):
  devlink: Add devlink reload action and limit options
  devlink: Add pr_out_dev() helper function
  devlink: Add reload stats to dev show

 devlink/devlink.c  | 268 +++--
 man/man8/devlink-dev.8 |  34 ++
 2 files changed, 291 insertions(+), 11 deletions(-)

-- 
2.26.2



[PATCH iproute2-net v2 3/3] devlink: Add reload stats to dev show

2020-12-06 Thread Moshe Shemesh
Show reload statistics through devlink dev show using devlink stats
flag. The reload statistics show the history per reload action type and
limit. Add remote reload statistics to show the history of actions
performed due devlink reload commands initiated by remote host.

Output examples:
$ devlink dev show -s
pci/:82:00.0:
  stats:
  reload:
  driver_reinit:
unspecified 2
  fw_activate:
unspecified 1 no_reset 0
  remote_reload:
  driver_reinit:
unspecified 0
  fw_activate:
unspecified 0 no_reset 0
pci/:82:00.1:
  stats:
  reload:
  driver_reinit:
unspecified 0
  fw_activate:
unspecified 0 no_reset 0
  remote_reload:
  driver_reinit:
unspecified 1
  fw_activate:
unspecified 1 no_reset 0

$ devlink dev show -s -jp
{
"dev": {
"pci/:82:00.0": {
"stats": {
"reload": {
"driver_reinit": {
"unspecified": 2
},
"fw_activate": {
"unspecified": 1,
"no_reset": 0
}
},
"remote_reload": {
"driver_reinit": {
"unspecified": 0
},
"fw_activate": {
"unspecified": 0,
"no_reset": 0
}
}
}
},
"pci/:82:00.1": {
"stats": {
"reload": {
"driver_reinit": {
"unspecified": 0
},
"fw_activate": {
"unspecified": 0,
"no_reset": 0
}
},
"remote_reload": {
"driver_reinit": {
"unspecified": 1
},
"fw_activate": {
"unspecified": 1,
"no_reset": 0
}
}
}
}
}
}

Signed-off-by: Moshe Shemesh 
---
v1 -> v2:
- Use temp variables for the attributes to make the code more readable
- Wrap lines at 80 columns unless it is a print statement
---
 devlink/devlink.c | 110 +-
 1 file changed, 108 insertions(+), 2 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index b42eb1b9..296a1654 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -682,6 +682,15 @@ static const enum mnl_attr_data_type 
devlink_policy[DEVLINK_ATTR_MAX + 1] = {
[DEVLINK_ATTR_TRAP_METADATA] = MNL_TYPE_NESTED,
[DEVLINK_ATTR_TRAP_GROUP_NAME] = MNL_TYPE_STRING,
[DEVLINK_ATTR_RELOAD_FAILED] = MNL_TYPE_U8,
+   [DEVLINK_ATTR_DEV_STATS] = MNL_TYPE_NESTED,
+   [DEVLINK_ATTR_RELOAD_STATS] = MNL_TYPE_NESTED,
+   [DEVLINK_ATTR_RELOAD_STATS_ENTRY] = MNL_TYPE_NESTED,
+   [DEVLINK_ATTR_RELOAD_ACTION] = MNL_TYPE_U8,
+   [DEVLINK_ATTR_RELOAD_STATS_LIMIT] = MNL_TYPE_U8,
+   [DEVLINK_ATTR_RELOAD_STATS_VALUE] = MNL_TYPE_U32,
+   [DEVLINK_ATTR_REMOTE_RELOAD_STATS] = MNL_TYPE_NESTED,
+   [DEVLINK_ATTR_RELOAD_ACTION_INFO] = MNL_TYPE_NESTED,
+   [DEVLINK_ATTR_RELOAD_ACTION_STATS] = MNL_TYPE_NESTED,
[DEVLINK_ATTR_TRAP_POLICER_ID] = MNL_TYPE_U32,
[DEVLINK_ATTR_TRAP_POLICER_RATE] = MNL_TYPE_U64,
[DEVLINK_ATTR_TRAP_POLICER_BURST] = MNL_TYPE_U64,
@@ -2371,6 +2380,18 @@ static const char *reload_action_name(uint8_t 
reload_action)
}
 }
 
+static const char *reload_limit_name(uint8_t reload_limit)
+{
+   switch (reload_limit) {
+   case DEVLINK_RELOAD_LIMIT_UNSPEC:
+   return "unspecified";
+   case DEVLINK_RELOAD_LIMIT_NO_RESET:
+   return "no_reset";
+   default:
+   return "";
+   }
+}
+
 static const char *eswitch_mode_name(uint32_t mode)
 {
switch (mode) {
@@ -2976,17 +2997,101 @@ static int cmd_dev_param(struct dl *dl)
return -ENOENT;
 }
 
-static void pr_out_dev(struct dl *dl, struct nlattr **tb)
+static void pr_out_action_stats(struct dl *dl, struct nlattr *action_stats)
+{
+   struct nlattr *tb_stats_entry[DEVLINK_ATTR_MAX + 1] = {};
+   struct nlattr *nla_reload_stats_entry, *nla_limit, *nla_value;
+   enum devlink_reload_limit limit;
+   uint32_t value;
+   int err;
+
+   mnl_attr_for_each_nested(nla_reload_stats_entry, action_stats) {
+   err = mnl_attr_parse_nested(nla_reload_stats_entry, attr_cb,
+   tb_stats_entry);
+   if (err != MNL_CB_OK)
+   return;
+
+   nla_limit = tb_stats_entry[DEVLINK_ATTR_RELOAD_STATS_LIMIT];
+   nla_value = tb_stats_entry[DEVLINK_

[PATCH iproute2-net v2 2/3] devlink: Add pr_out_dev() helper function

2020-12-06 Thread Moshe Shemesh
Add pr_out_dev() helper function and use it both by cmd_dev_show_cb()
and by cmd_mon_show_cb().

Dev stats will be added on the next patch to dev context, so
cmd_mon_show_cb() should print the whole dev context and not just dev
handle.

Signed-off-by: Moshe Shemesh 
Reviewed-by: Jiri Pirko 
---
 devlink/devlink.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index 128c81b8..b42eb1b9 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -2975,17 +2975,11 @@ static int cmd_dev_param(struct dl *dl)
pr_err("Command \"%s\" not found\n", dl_argv(dl));
return -ENOENT;
 }
-static int cmd_dev_show_cb(const struct nlmsghdr *nlh, void *data)
+
+static void pr_out_dev(struct dl *dl, struct nlattr **tb)
 {
-   struct dl *dl = data;
-   struct nlattr *tb[DEVLINK_ATTR_MAX + 1] = {};
-   struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
uint8_t reload_failed = 0;
 
-   mnl_attr_parse(nlh, sizeof(*genl), attr_cb, tb);
-   if (!tb[DEVLINK_ATTR_BUS_NAME] || !tb[DEVLINK_ATTR_DEV_NAME])
-   return MNL_CB_ERROR;
-
if (tb[DEVLINK_ATTR_RELOAD_FAILED])
reload_failed = mnl_attr_get_u8(tb[DEVLINK_ATTR_RELOAD_FAILED]);
 
@@ -2997,7 +2991,19 @@ static int cmd_dev_show_cb(const struct nlmsghdr *nlh, 
void *data)
} else {
pr_out_handle(dl, tb);
}
+}
 
+static int cmd_dev_show_cb(const struct nlmsghdr *nlh, void *data)
+{
+   struct dl *dl = data;
+   struct nlattr *tb[DEVLINK_ATTR_MAX + 1] = {};
+   struct genlmsghdr *genl = mnl_nlmsg_get_payload(nlh);
+
+   mnl_attr_parse(nlh, sizeof(*genl), attr_cb, tb);
+   if (!tb[DEVLINK_ATTR_BUS_NAME] || !tb[DEVLINK_ATTR_DEV_NAME])
+   return MNL_CB_ERROR;
+
+   pr_out_dev(dl, tb);
return MNL_CB_OK;
 }
 
@@ -4842,7 +4848,7 @@ static int cmd_mon_show_cb(const struct nlmsghdr *nlh, 
void *data)
if (!tb[DEVLINK_ATTR_BUS_NAME] || !tb[DEVLINK_ATTR_DEV_NAME])
return MNL_CB_ERROR;
pr_out_mon_header(genl->cmd);
-   pr_out_handle(dl, tb);
+   pr_out_dev(dl, tb);
pr_out_mon_footer();
break;
case DEVLINK_CMD_PORT_GET: /* fall through */
-- 
2.26.2



[PATCH iproute2-net v2 1/3] devlink: Add devlink reload action and limit options

2020-12-06 Thread Moshe Shemesh
Add reload action and reload limit to devlink reload command to enable
the user to select the reload action required and constrains limits on
these actions that he may want to ensure.

The following reload actions are supported:
  driver_reinit: driver entities re-initialization, applying
 devlink-param and devlink-resource values.
  fw_activate: firmware activate.

The uAPI is backward compatible, if the reload action option is omitted
from the reload command, the driver reinit action will be used.
Note that when required to do firmware activation some drivers may need
to reload the driver. On the other hand some drivers may need to reset
the firmware to reinitialize the driver entities. Therefore, the devlink
reload command returns the actions which were actually performed.

By default reload actions are not limited and driver implementation may
include reset or downtime as needed to perform the actions. However, if
reload limit is selected, the driver should perform only if it can do it
while keeping the limit constraints.

Reload limit added:
  no_reset: No reset allowed, no down time allowed, no link flap and no
configuration is lost.

Command examples:
$devlink dev reload pci/:82:00.0 action driver_reinit
reload_actions_performed:
  driver_reinit

$devlink dev reload pci/:82:00.0 action fw_activate
reload_actions_performed:
  driver_reinit fw_activate

devlink dev reload pci/:82:00.1 action driver_reinit -jp
{
"reload": {
"reload_actions_performed": [ "driver_reinit" ]
}
}

devlink dev reload pci/:82:00.0 action fw_activate -jp
{
"reload": {
"reload_actions_performed": [ "driver_reinit","fw_activate" ]
}
}

Signed-off-by: Moshe Shemesh 
---
v1 -> v2:
- Wrap unreasonable print line length
- Use temp variables for the attributes to make the code more readable
- Wrap lines at 80 columns unless it is a print statement
- Add man page update
---
 devlink/devlink.c  | 138 -
 man/man8/devlink-dev.8 |  34 ++
 2 files changed, 170 insertions(+), 2 deletions(-)

diff --git a/devlink/devlink.c b/devlink/devlink.c
index ca99732e..128c81b8 100644
--- a/devlink/devlink.c
+++ b/devlink/devlink.c
@@ -304,6 +304,8 @@ static void ifname_map_free(struct ifname_map *ifname_map)
 #define DL_OPT_HEALTH_REPORTER_AUTO_DUMP BIT(37)
 #define DL_OPT_PORT_FUNCTION_HW_ADDR BIT(38)
 #define DL_OPT_FLASH_OVERWRITE BIT(39)
+#define DL_OPT_RELOAD_ACTION   BIT(40)
+#define DL_OPT_RELOAD_LIMITBIT(41)
 
 struct dl_opts {
uint64_t present; /* flags of present items */
@@ -352,6 +354,8 @@ struct dl_opts {
char port_function_hw_addr[MAX_ADDR_LEN];
uint32_t port_function_hw_addr_len;
uint32_t overwrite_mask;
+   enum devlink_reload_action reload_action;
+   enum devlink_reload_limit reload_limit;
 };
 
 struct dl {
@@ -1344,6 +1348,32 @@ static int hw_addr_parse(const char *addrstr, char 
*hw_addr, uint32_t *len)
return 0;
 }
 
+static int reload_action_get(struct dl *dl, const char *actionstr,
+enum devlink_reload_action *action)
+{
+   if (strcmp(actionstr, "driver_reinit") == 0) {
+   *action = DEVLINK_RELOAD_ACTION_DRIVER_REINIT;
+   } else if (strcmp(actionstr, "fw_activate") == 0) {
+   *action = DEVLINK_RELOAD_ACTION_FW_ACTIVATE;
+   } else {
+   pr_err("Unknown reload action \"%s\"\n", actionstr);
+   return -EINVAL;
+   }
+   return 0;
+}
+
+static int reload_limit_get(struct dl *dl, const char *limitstr,
+enum devlink_reload_limit *limit)
+{
+   if (strcmp(limitstr, "no_reset") == 0) {
+   *limit = DEVLINK_RELOAD_LIMIT_NO_RESET;
+   } else {
+   pr_err("Unknown reload limit \"%s\"\n", limitstr);
+   return -EINVAL;
+   }
+   return 0;
+}
+
 struct dl_args_metadata {
uint64_t o_flag;
char err_msg[DL_ARGS_REQUIRED_MAX_ERR_LEN];
@@ -1730,6 +1760,30 @@ static int dl_argv_parse(struct dl *dl, uint64_t 
o_required,
opts->netns_is_pid = true;
}
o_found |= DL_OPT_NETNS;
+   } else if (dl_argv_match(dl, "action") &&
+  (o_all & DL_OPT_RELOAD_ACTION)) {
+   const char *actionstr;
+
+   dl_arg_inc(dl);
+   err = dl_argv_str(dl, &actionstr);
+   if (err)
+   return err;
+   err = reload_action_get(dl, actionstr, 
&opts->reload_action);
+   if (err)
+   return err;
+   o_found |= DL_OPT_RELOAD_ACTION;
+   } else if (dl_argv_match(dl, "limit") &&
+  (o_all & DL_OPT_RELOAD_LIMIT)) {
+   const char *l

Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Saeed Mahameed
On Sat, 2020-12-05 at 03:49 +0200, Vladimir Oltean wrote:
> On Fri, Dec 04, 2020 at 12:26:13PM -0800, Jakub Kicinski wrote:
> > On Fri, 04 Dec 2020 11:33:26 -0800 Saeed Mahameed wrote:
> > > On Thu, 2020-12-03 at 18:29 -0800, Jakub Kicinski wrote:
> > > > On Wed, 2 Dec 2020 20:21:01 -0800 Saeed Mahameed wrote:
> > > > > Add TX PTP port object support for better TX timestamping
> > > > > accuracy.
> > > > > Currently, driver supports CQE based TX port timestamp.
> > > > > Device
> > > > > also offers TX port timestamp, which has less jitter and
> > > > > better
> > > > > reflects the actual time of a packet's transmit.
> > > > 
> > > > How much better is it?
> > > > 
> > > > Is the new implementation is standard compliant or just a
> > > > "better
> > > > guess"?
> > > 
> > > It is not a guess for sure, the closer to the output port you
> > > take the
> > > stamp the more accurate you get, this is why we need the HW
> > > timestamp
> > > in first place, i don't have the exact number though, but we
> > > target to
> > > be compliant with G.8273.2 class C, (30 nsec), and this code
> > > allow
> > > Linux systems to be deployed in the 5G telco edge. Where this
> > > standard
> > > is needed.
> > 
> > I see. IIRC there was also an IEEE standard which specified the
> > exact
> > time stamping point (i.e. SFD crosses layer X). If it's class C
> > that
> > answers the question, I think.
> 
> The ITU-T G.8273.2 specification just requires a Class C clock to
> have a
> maximum absolute time error under steady state of 30 ns. And taking
> timestamps closer to the wire eliminates some clock domain crossings
> from what is measured in the path delay, this is probably the reason
> why
> timestamping is more accurate, and it helps to achieve the required
> jitter figure.
> 
> The IEEE standard that you're thinking of is clause "7.3.4 Generation
> of
> event message timestamps" of IEEE 1588.
> 
> -[cut here]-
> 7.3.4.1 Event message timestamp point
> 
> Unless otherwise specified in a transport-specific annex to this
> standard, the message timestamp point for an event message shall be
> the
> beginning of the first symbol after the Start of Frame (SOF)
> delimiter.
> 
> 7.3.4.2 Event timestamp generation
> 
> All PTP event messages are timestamped on egress and ingress. The
> timestamp shall be the time at which the event message timestamp
> point
> passes the reference plane marking the boundary between the PTP node
> and
> the network.
> 
> NOTE 1— If an implementation generates event message timestamps using
> a
> point other than the message timestamp point, then the generated
> timestamps should be appropriately corrected by the time interval
> between the actual time of detection and the time the message
> timestamp
> point passed the reference plane. Failure to make these corrections
> results in a time offset between the slave and master clocks.
> -[cut here]-
> 
> So there you go, it just says "the reference plane marking the
> boundary
> between the PTP node and the network". So it depends on how you draw
> the
> borders. I cannot seem to find any more precise definition.
> 
> Regardless of the layer at which the timestamp is taken, it is the
> jitter that matters more than the reduced path delay. The latter is
> just
> a side effect.
> 

SO the closer to the wire you take the stamp the less potential for
jitter, since this is after ALL HW pipeline variable delays.

> "How much better" is an interesting question though.




Re: GRO: can't force packet up stack immediately?

2020-12-06 Thread John Ousterhout
On Fri, Dec 4, 2020 at 3:20 AM Edward Cree  wrote:
>
> On 03/12/2020 19:52, John Ousterhout wrote:
> > Homa uses GRO to collect batches of packets for protocol processing,
> > but there are times when it wants to push a batch of packet up through
> > the stack immediately (it doesn't want any more packets to be
> > processed at NAPI level before pushing the batch up). However, I can't
> > see a way to achieve this goal.
> It's kinda hacky, but you might be able to call netif_receive_skb_internal()
>  yourself, and then return ERR_PTR(-EINPROGRESS), so that dev_gro_receive()
>  returns GRO_CONSUMED.
> Of course, you'd need to be careful about out-of-order issues in case
>  any earlier homa packets were still sitting in the rx_list.

It doesn't appear to me that this approach will work, because the
packet I want to force up through the stack is not the new one being
passed into homa_gro_receive, but one of the packets on the list
passed in as the first argument (gro_head in dev_gro_receive).
Removing a packet from this list looks tricky, because it also
requires updating a count in the napi structure, which
homa_gro_receive doesn't have immediate access to. I might be able to
figure out a way to get the napi pointer, but this is starting to feel
pretty hacky (and brittle w.r.t. kernel changes).

BTW, out-of-order issues don't matter for Homa; this is one of the
areas where the "protocol independent" parts of the networking code
build in TCP-specific assumptions, which are either unnecessary or, in
some cases, problematic for Homa.

Thanks anyway for the suggestion.

-John-


[PATCH 1/1] ice: fix array overflow on receiving too many fragments for a packet

2020-12-06 Thread Xiaohui Zhang
From: Zhang Xiaohui 

If the hardware receives an oversized packet with too many rx fragments,
skb_shinfo(skb)->frags can overflow and corrupt memory of adjacent pages.
This becomes especially visible if it corrupts the freelist pointer of
a slab page.

Signed-off-by: Zhang Xiaohui 
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c 
b/drivers/net/ethernet/intel/ice/ice_txrx.c
index eae75260f..f0f034fa5 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -823,8 +823,12 @@ ice_add_rx_frag(struct ice_ring *rx_ring, struct 
ice_rx_buf *rx_buf,
 
if (!size)
return;
-   skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, rx_buf->page,
+   struct skb_shared_info *shinfo = skb_shinfo(skb);
+
+   if (shinfo->nr_frags < ARRAY_SIZE(shinfo->frags)) {
+   skb_add_rx_frag(skb, shinfo, rx_buf->page,
rx_buf->page_offset, size, truesize);
+   }
 
/* page is being used so we must update the page offset */
ice_rx_buf_adjust_pg_offset(rx_buf, truesize);
-- 
2.17.1



Re: [net-next V2 08/15] net/mlx5e: Add TX PTP port object support

2020-12-06 Thread Saeed Mahameed
On Sat, 2020-12-05 at 00:55 +, Vladimir Oltean wrote:
> Hi Jakub,
> 
> On Fri, Dec 04, 2020 at 02:52:40PM -0800, Jakub Kicinski wrote:
> > On Fri, 04 Dec 2020 13:57:49 -0800 Saeed Mahameed wrote:
> > > > Why not use the PTP classification helpers we already have?
> > > 
> > > do you mean ptp_parse_header() or the ebpf prog ?
> > > We use skb_flow_dissect() which should be simple enough.
> > 
> > Not sure which exact one TBH, I just know we have helpers for this,
> > so if we don't use them it'd be good to at least justify why.
> > 
> > Maybe someone with more practical knowledge here can chime in with
> > a recommendation for a helper to find PTP frames on TX?
> 
> ptp_classify_raw is optimized to identify PTP event messages (the
> only
> ones that need to be timestamped as far as the protocol is
> concerned).
> PTP general messages (Follow-Up, Delay_Resp, Announce etc) will
> return
> PTP_CLASS_NONE from ptp_classify_raw.
> 

I looked at the implementation, while it is nice to see that it is
running an ebpf program, but it seems these functions are meant for
those who care about the content of those PTP messages.

Select queue has to be consistent for a specific stream so
I'd rather lookup the well known ptp port via the standard flow
dissector and select the queue accordingly, using any other mechanism
might cause inconsistencies and ooo.

also the flow dissector handles non linear skbs very nicely, whereas,
the two ptp classifier methods don't. They actually have different
purposes than what we are looking for.

so I think we should stick with our simple flow dissector
implementation.

> But maybe there is an even better way, since this is on the TX path,
> maybe the .ndo_select_queue operation can simply look at
>   skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP
> when deciding whether to send it to the "good" queue or not. This has
> the advantage of being less expensive than any sort of frame
> classification.
> 

We also considered this, this is bad in our case because this will
easily break performance for users who do setsockopt(SO_TIMESTAMPING)
on TCP/UDP sockets that favor performance over precision but still want
HW timestamping.

> Nonetheless, some tests would need to be run. In theory, practice and
> theory are the same, whereas in practice they aren't.

In Theory, I don't agree ;-).





[PATCH net] net: hns3: remove a misused pragma packed

2020-12-06 Thread Huazhong Tan
hclge_dbg_reg_info[] is defined as an array of packed structure
accidentally. However, this array contains pointers, which are
no longer aligned naturally, and cannot be relocated on PPC64.
Hence, when compile-testing this driver on PPC64 with
CONFIG_RELOCATABLE=y (e.g. PowerPC allyesconfig), there will be
some warnings.

Since each field in structure hclge_qos_pri_map_cmd and
hclge_dbg_bitmap_cmd is type u8, the pragma packed is unnecessary
for these two structures as well, so remove the pragma packed in
hclge_debugfs.h to fix this issue, and this increases
hclge_dbg_reg_info[] by 4 bytes per entry.

Fixes: a582b78dfc33 ("net: hns3: code optimization for debugfs related to "dump 
reg"")
Reported-by: Stephen Rothwell 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.h
index a9066e6..ca2ab6c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.h
@@ -35,8 +35,6 @@
 
 #define HCLGE_DBG_DFX_SSU_2_OFFSET 12
 
-#pragma pack(1)
-
 struct hclge_qos_pri_map_cmd {
u8 pri0_tc  : 4,
   pri1_tc  : 4;
@@ -85,8 +83,6 @@ struct hclge_dbg_reg_type_info {
struct hclge_dbg_reg_common_msg reg_msg;
 };
 
-#pragma pack()
-
 static const struct hclge_dbg_dfx_message hclge_dbg_bios_common_reg[] = {
{false, "Reserved"},
{true,  "BP_CPU_STATE"},
-- 
2.7.4



Re: [PATCH net] net: hns3: remove a misused pragma packed

2020-12-06 Thread David Miller
From: Huazhong Tan 
Date: Mon, 7 Dec 2020 15:20:25 +0800

> hclge_dbg_reg_info[] is defined as an array of packed structure
> accidentally. However, this array contains pointers, which are
> no longer aligned naturally, and cannot be relocated on PPC64.
> Hence, when compile-testing this driver on PPC64 with
> CONFIG_RELOCATABLE=y (e.g. PowerPC allyesconfig), there will be
> some warnings.
> 
> Since each field in structure hclge_qos_pri_map_cmd and
> hclge_dbg_bitmap_cmd is type u8, the pragma packed is unnecessary
> for these two structures as well, so remove the pragma packed in
> hclge_debugfs.h to fix this issue, and this increases
> hclge_dbg_reg_info[] by 4 bytes per entry.
> 
> Fixes: a582b78dfc33 ("net: hns3: code optimization for debugfs related to 
> "dump reg"")
> Reported-by: Stephen Rothwell 
> Signed-off-by: Huazhong Tan 

Applied, thank you.


[PATCH net] udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments

2020-12-06 Thread Xin Long
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.

This can be reproduced with IPIP FOU:

  # ip fou add port  ipproto 4
  # ethtool -K eth1 rx-gro-list on

Fixes: cf329aa42b66 ("udp: cope with UDP GRO packet misdirection")
Reported-by: Guillaume Nault 
Signed-off-by: Xin Long 
---
 net/ipv4/udp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 09f0a23..9eeebd4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2173,7 +2173,7 @@ static int udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
__skb_pull(skb, skb_transport_offset(skb));
ret = udp_queue_rcv_one_skb(sk, skb);
if (ret > 0)
-   ip_protocol_deliver_rcu(dev_net(skb->dev), skb, -ret);
+   ip_protocol_deliver_rcu(dev_net(skb->dev), skb, ret);
}
return 0;
 }
-- 
2.1.0