[RFC PATCH 6/7] tun: populate hash in virtio-net header when needed

2021-01-12 Thread Yuri Benditovich
If the BPF program populated the hash in the skb the tun
propagates the hash value and hash report type to the
respective fields of virtio-net header.

Signed-off-by: Yuri Benditovich 
---
 drivers/net/tun.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 45f4f04a4a3e..214feb0b16fb 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -556,15 +556,20 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, 
struct sk_buff *skb)
 {
struct tun_prog *prog;
u32 numqueues;
-   u16 ret = 0;
+   u32 ret = 0;
 
numqueues = READ_ONCE(tun->numqueues);
if (!numqueues)
return 0;
 
prog = rcu_dereference(tun->steering_prog);
-   if (prog)
+   if (prog) {
ret = bpf_prog_run_clear_cb(prog->prog, skb);
+   if (tun->bpf_populates_hash) {
+   *skb_hash_report_type(skb) = (__u8)(ret >> 16);
+   ret &= 0x;
+   }
+   }
 
return ret % numqueues;
 }
@@ -2062,6 +2067,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
if (vnet_hdr_sz) {
struct virtio_net_hdr gso;
+   __u16 extra_copy = 0;
 
if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
@@ -2085,7 +2091,20 @@ static ssize_t tun_put_user(struct tun_struct *tun,
if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso))
return -EFAULT;
 
-   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
+   if (tun->bpf_populates_hash &&
+   vnet_hdr_sz >= sizeof(struct virtio_net_hdr_v1_hash)) {
+   struct virtio_net_hdr_v1_hash hdr;
+
+   hdr.hdr.num_buffers = 0;
+   hdr.hash_value = cpu_to_le32(skb_get_hash(skb));
+   hdr.hash_report = 
cpu_to_le16(*skb_hash_report_type(skb));
+   hdr.padding = 0;
+   extra_copy = sizeof(hdr) - sizeof(gso);
+   if (copy_to_iter(&hdr.hdr.num_buffers, extra_copy, 
iter) != extra_copy)
+   return -EFAULT;
+   }
+
+   iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso) - extra_copy);
}
 
if (vlan_hlen) {
-- 
2.17.1



[RFC PATCH 7/7] tun: report new tun feature IFF_HASH

2021-01-12 Thread Yuri Benditovich
IFF_HASH feature indicates that the tun supports
TUNSETHASHPOPULATION ioctl and can propagate the hash
data to the virtio-net packet.

Signed-off-by: Yuri Benditovich 
---
 drivers/net/tun.c   | 2 +-
 include/uapi/linux/if_tun.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 214feb0b16fb..b46aa8941a9d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -88,7 +88,7 @@ static void tun_default_link_ksettings(struct net_device *dev,
 #define TUN_VNET_LE 0x8000
 #define TUN_VNET_BE 0x4000
 
-#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \
+#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | IFF_HASH |\
  IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS)
 
 #define GOODCOPY_LEN 128
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index 0fd43533da26..116b84ede3a0 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -73,6 +73,7 @@
 #define IFF_ONE_QUEUE  0x2000
 #define IFF_VNET_HDR   0x4000
 #define IFF_TUN_EXCL   0x8000
+#define IFF_HASH   0x0080
 #define IFF_MULTI_QUEUE 0x0100
 #define IFF_ATTACH_QUEUE 0x0200
 #define IFF_DETACH_QUEUE 0x0400
-- 
2.17.1



Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling

2021-01-12 Thread Andrii Nakryiko
On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti  wrote:
>
> Add support for pointer to mem register spilling, to allow the verifier
> to track pointer to valid memory addresses. Such pointers are returned
> for example by a successful call of the bpf_ringbuf_reserve helper.
>
> This patch was suggested as a solution by Yonghong Song.
>
> The patch was partially contibuted by CyberArk Software, Inc.
>
> Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier
> support for it")

Surprised no one mentioned this yet. Fixes tag should always be on a
single unwrapped line, however long it is, please fix.


> Signed-off-by: Gilad Reti 
> ---
>  kernel/bpf/verifier.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 17270b8404f1..36af69fac591 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2217,6 +2217,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type)
> case PTR_TO_RDWR_BUF:
> case PTR_TO_RDWR_BUF_OR_NULL:
> case PTR_TO_PERCPU_BTF_ID:
> +   case PTR_TO_MEM:
> +   case PTR_TO_MEM_OR_NULL:
> return true;
> default:
> return false;
> --
> 2.27.0
>


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-12 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 11:42 AM Yuri Benditovich
 wrote:
>
> This program type can set skb hash value. It will be useful
> when the tun will support hash reporting feature if virtio-net.
>
> Signed-off-by: Yuri Benditovich 
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 7959b5c2d11f..455f7afc1f36 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct 
> tun_prog __rcu **prog_p,
> prog = NULL;
> } else {
> prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> +   if (IS_ERR(prog))
> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS);

You've ignored the feedback and just resend? what for?


Re: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote:
> Whenever we need to schedule a reset, we allocate an rwi (reset work
> item?) entry and add to the list of pending resets.
> 
> Since we only add one rwi for a given reason type to the list (no
> duplicates).
> we will only have a handful of reset types in the list - even in the
> worst case. In the common case we should just have a couple of
> entries
> at most.
> 
> Rather than allocating/freeing every time (and dealing with the
> corner
> case of the allocation failing), use a fixed number of rwi entries.
> The extra memory needed is tiny and most of it will be used over the
> active life of the adapter.
> 
> This also fixes a couple of tiny memory leaks. One is in
> ibmvnic_reset()
> where we don't free the rwi entries after deleting them from the list
> due
> to a transport event.  The second is in __ibmvnic_reset() where if we
> find that the adapter is being removed, we simply break out of the
> loop
> (with rc = EBUSY) but ignore any rwi entries that remain on the list.
> 
> Fixes: 2770a7984db58 ("Introduce hard reset recovery")
> Fixes: 36f1031c51a2 ("ibmvnic: Do not process reset during or after
> device removal")
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 123 +
> 
>  drivers/net/ethernet/ibm/ibmvnic.h |  14 ++--
>  2 files changed, 78 insertions(+), 59 deletions(-)
> 

[...]

> -enum ibmvnic_reset_reason {VNIC_RESET_FAILOVER = 1,
> +enum ibmvnic_reset_reason {VNIC_RESET_UNUSED = 0,
> +VNIC_RESET_FAILOVER = 1,
>  VNIC_RESET_MOBILITY,
>  VNIC_RESET_FATAL,
>  VNIC_RESET_NON_FATAL,
>  VNIC_RESET_TIMEOUT,
> -VNIC_RESET_CHANGE_PARAM};
> -
> -struct ibmvnic_rwi {
> - enum ibmvnic_reset_reason reset_reason;
> - struct list_head list;
> -};
> +VNIC_RESET_CHANGE_PARAM,
> +VNIC_RESET_MAX}; // must be last
   this is not the preferred comment style: ^^

I would just drop the comment here, it is clear from the name of the
enum.





Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-12 Thread Yuri Benditovich
On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
 wrote:
>
> Existing TUN module is able to use provided "steering eBPF" to
> calculate per-packet hash and derive the destination queue to
> place the packet to. The eBPF uses mapped configuration data
> containing a key for hash calculation and indirection table
> with array of queues' indices.
>
> This series of patches adds support for virtio-net hash reporting
> feature as defined in virtio specification. It extends the TUN module
> and the "steering eBPF" as follows:
>
> Extended steering eBPF calculates the hash value and hash type, keeps
> hash value in the skb->hash and returns index of destination virtqueue
> and the type of the hash. TUN module keeps returned hash type in
> (currently unused) field of the skb.
> skb->__unused renamed to 'hash_report_type'.
>
> When TUN module is called later to allocate and fill the virtio-net
> header and push it to destination virtqueue it populates the hash
> and the hash type into virtio-net header.
>
> VHOST driver is made aware of respective virtio-net feature that
> extends the virtio-net header to report the hash value and hash report
> type.

Comment from Willem de Bruijn:

Skbuff fields are in short supply. I don't think we need to add one
just for this narrow path entirely internal to the tun device.

Instead, you could just run the flow_dissector in tun_put_user if the
feature is negotiated. Indeed, the flow dissector seems more apt to me
than BPF here. Note that the flow dissector internally can be
overridden by a BPF program if the admin so chooses.

This also hits on a deeper point with the choice of hash values, that
I also noticed in my RFC patchset to implement the inverse [1][2]. It
is much more detailed than skb->hash + skb->l4_hash currently offers,
and that can be gotten for free from most hardware. In most practical
cases, that information suffices. I added less specific fields
VIRTIO_NET_HASH_REPORT_L4, VIRTIO_NET_HASH_REPORT_OTHER that work
without explicit flow dissection. I understand that the existing
fields are part of the standard. Just curious, what is their purpose
beyond 4-tuple based flow hashing?

[1] https://patchwork.kernel.org/project/netdevbpf/list/?series=406859&state=*
[2] 
https://github.com/wdebruij/linux/commit/0f77febf22cd6ffc242a575807fa8382a26e511e
>
> Yuri Benditovich (7):
>   skbuff: define field for hash report type
>   vhost: support for hash report virtio-net feature
>   tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
>   tun: free bpf_program by bpf_prog_put instead of bpf_prog_destroy
>   tun: add ioctl code TUNSETHASHPOPULATION
>   tun: populate hash in virtio-net header when needed
>   tun: report new tun feature IFF_HASH
>
>  drivers/net/tun.c   | 43 +++--
>  drivers/vhost/net.c | 37 ---
>  include/linux/skbuff.h  |  7 +-
>  include/uapi/linux/if_tun.h |  2 ++
>  4 files changed, 74 insertions(+), 15 deletions(-)
>
> --
> 2.17.1
>


[PATCH net-next v15 1/6] dt-bindings: net: Add 5GBASER phy interface

2021-01-12 Thread Marek Behún
From: Pavana Sharma 

Add 5gbase-r PHY interface mode.

Signed-off-by: Pavana Sharma 
Reviewed-by: Andrew Lunn 
Acked-by: Rob Herring 
Signed-off-by: Marek Behún 
---
 Documentation/devicetree/bindings/net/ethernet-controller.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet-controller.yaml 
b/Documentation/devicetree/bindings/net/ethernet-controller.yaml
index 0965f6515f9e..5507ae3c478d 100644
--- a/Documentation/devicetree/bindings/net/ethernet-controller.yaml
+++ b/Documentation/devicetree/bindings/net/ethernet-controller.yaml
@@ -89,6 +89,7 @@ properties:
   - trgmii
   - 1000base-x
   - 2500base-x
+  - 5gbase-r
   - rxaui
   - xaui
 
-- 
2.26.2



[PATCH net-next v15 0/6] Add support for mv88e6393x family of Marvell

2021-01-12 Thread Marek Behún
Hello,

this is version 15 of patches adding support for Amethyst family to
mv88e6xxx. It should apply cleanly on net-next.

This series is tested on Marvell CN9130-CRB.

Changes from v14:
- added my Signed-off-by tags to Pavana's patches, since I am sending
  them (as suggested by Andrew)
- added documentation to second patch adding 5gbase-r mode (as requested
  by Russell)
- added Reviewed-by tags
- applied Vladimir's suggestions:
  - reduced indentation level in mv88e6xxx_set_egress_port and
mv88e6393x_serdes_port_config
  - removed 1baseKR_Full from mv88e6393x_phylink_validate
  - removed PHY_INTERFACE_MODE_10GKR from mv88e6xxx_port_set_cmode

Changes from v13:
- added patch that wraps .set_egress_port into mv88e6xxx_set_egress_port,
  so that we do not have to set chip->*gress_dest_port members in every
  implementation of this method
- for the patch that adds Amethyst support:
  - added more information into commit message
  - added these methods for mv88e6393x_ops:
  .port_sync_link
  .port_setup_message_port
  .port_max_speed_mode (new implementation needed)
  .atu_get_hash
  .atu_set_hash
  .serdes_pcs_config
  .serdes_pcs_an_restart
  .serdes_pcs_link_up
  - this device can set upstream port per port, so implement
  .port_set_upstream_port
instead of
  .set_cpu_port
  - removed USXGMII cmode (not yet supported, working on it)
  - added debug messages into mv88e6393x_port_set_speed_duplex
  - added Amethyst errata 4.5 (EEE should be disabled on SERDES ports)
  - fixed 5gbase-r serdes configuration and interrupt handling
  - refactored mv88e6393x_serdes_setup_errata
  - refactored mv88e6393x_port_policy_write
- added patch implementing .port_set_policy for Amethyst

Marek

Marek Behún (2):
  net: dsa: mv88e6xxx: wrap .set_egress_port method
  net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst

Pavana Sharma (4):
  dt-bindings: net: Add 5GBASER phy interface
  net: phy: Add 5GBASER interface mode
  net: dsa: mv88e6xxx: Change serdes lane parameter type from u8 type to
int
  net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

 .../bindings/net/ethernet-controller.yaml |   1 +
 Documentation/networking/phy.rst  |   6 +
 drivers/net/dsa/mv88e6xxx/chip.c  | 227 --
 drivers/net/dsa/mv88e6xxx/chip.h  |  20 +-
 drivers/net/dsa/mv88e6xxx/global1.c   |  19 +-
 drivers/net/dsa/mv88e6xxx/global1.h   |   2 +
 drivers/net/dsa/mv88e6xxx/global2.h   |   8 +
 drivers/net/dsa/mv88e6xxx/port.c  | 397 --
 drivers/net/dsa/mv88e6xxx/port.h  |  50 ++-
 drivers/net/dsa/mv88e6xxx/serdes.c| 394 +++--
 drivers/net/dsa/mv88e6xxx/serdes.h| 108 +++--
 include/linux/phy.h   |   4 +
 12 files changed, 1073 insertions(+), 163 deletions(-)


base-commit: c73a45965dd54a10c368191804b9de661eee1007
-- 
2.26.2



[PATCH net-next v15 3/6] net: dsa: mv88e6xxx: Change serdes lane parameter type from u8 type to int

2021-01-12 Thread Marek Behún
From: Pavana Sharma 

Returning 0 is no more an error case with MV88E6393 family
which has serdes lane numbers 0, 9 or 10.
So with this change .serdes_get_lane will return lane number
or -errno (-ENODEV or -EOPNOTSUPP).

Signed-off-by: Pavana Sharma 
Reviewed-by: Andrew Lunn 
Reviewed-by: Vladimir Oltean 
Signed-off-by: Marek Behún 
---
 drivers/net/dsa/mv88e6xxx/chip.c   | 28 +-
 drivers/net/dsa/mv88e6xxx/chip.h   | 16 +++---
 drivers/net/dsa/mv88e6xxx/port.c   |  8 +--
 drivers/net/dsa/mv88e6xxx/serdes.c | 82 +++---
 drivers/net/dsa/mv88e6xxx/serdes.h | 64 +++
 5 files changed, 99 insertions(+), 99 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 4aa7d0a8f197..44591ef487af 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -485,12 +485,12 @@ static int mv88e6xxx_serdes_pcs_get_state(struct 
dsa_switch *ds, int port,
  struct phylink_link_state *state)
 {
struct mv88e6xxx_chip *chip = ds->priv;
-   u8 lane;
+   int lane;
int err;
 
mv88e6xxx_reg_lock(chip);
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (lane && chip->info->ops->serdes_pcs_get_state)
+   if (lane >= 0 && chip->info->ops->serdes_pcs_get_state)
err = chip->info->ops->serdes_pcs_get_state(chip, port, lane,
state);
else
@@ -506,11 +506,11 @@ static int mv88e6xxx_serdes_pcs_config(struct 
mv88e6xxx_chip *chip, int port,
   const unsigned long *advertise)
 {
const struct mv88e6xxx_ops *ops = chip->info->ops;
-   u8 lane;
+   int lane;
 
if (ops->serdes_pcs_config) {
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (lane)
+   if (lane >= 0)
return ops->serdes_pcs_config(chip, port, lane, mode,
  interface, advertise);
}
@@ -523,14 +523,14 @@ static void mv88e6xxx_serdes_pcs_an_restart(struct 
dsa_switch *ds, int port)
struct mv88e6xxx_chip *chip = ds->priv;
const struct mv88e6xxx_ops *ops;
int err = 0;
-   u8 lane;
+   int lane;
 
ops = chip->info->ops;
 
if (ops->serdes_pcs_an_restart) {
mv88e6xxx_reg_lock(chip);
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (lane)
+   if (lane >= 0)
err = ops->serdes_pcs_an_restart(chip, port, lane);
mv88e6xxx_reg_unlock(chip);
 
@@ -544,11 +544,11 @@ static int mv88e6xxx_serdes_pcs_link_up(struct 
mv88e6xxx_chip *chip, int port,
int speed, int duplex)
 {
const struct mv88e6xxx_ops *ops = chip->info->ops;
-   u8 lane;
+   int lane;
 
if (!phylink_autoneg_inband(mode) && ops->serdes_pcs_link_up) {
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (lane)
+   if (lane >= 0)
return ops->serdes_pcs_link_up(chip, port, lane,
   speed, duplex);
}
@@ -2429,11 +2429,11 @@ static irqreturn_t mv88e6xxx_serdes_irq_thread_fn(int 
irq, void *dev_id)
struct mv88e6xxx_chip *chip = mvp->chip;
irqreturn_t ret = IRQ_NONE;
int port = mvp->port;
-   u8 lane;
+   int lane;
 
mv88e6xxx_reg_lock(chip);
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (lane)
+   if (lane >= 0)
ret = mv88e6xxx_serdes_irq_status(chip, port, lane);
mv88e6xxx_reg_unlock(chip);
 
@@ -2441,7 +2441,7 @@ static irqreturn_t mv88e6xxx_serdes_irq_thread_fn(int 
irq, void *dev_id)
 }
 
 static int mv88e6xxx_serdes_irq_request(struct mv88e6xxx_chip *chip, int port,
-   u8 lane)
+   int lane)
 {
struct mv88e6xxx_port *dev_id = &chip->ports[port];
unsigned int irq;
@@ -2470,7 +2470,7 @@ static int mv88e6xxx_serdes_irq_request(struct 
mv88e6xxx_chip *chip, int port,
 }
 
 static int mv88e6xxx_serdes_irq_free(struct mv88e6xxx_chip *chip, int port,
-u8 lane)
+int lane)
 {
struct mv88e6xxx_port *dev_id = &chip->ports[port];
unsigned int irq = dev_id->serdes_irq;
@@ -2495,11 +2495,11 @@ static int mv88e6xxx_serdes_irq_free(struct 
mv88e6xxx_chip *chip, int port,
 static int mv88e6xxx_serdes_power(struct mv88e6xxx_chip *chip, int port,
  bool on)
 {
-   u8 lane;
+   int lane;
int err;
 
lane = mv88e6xxx_serdes_get_lane(chip, port);
-   if (!lane)
+   if (lane < 0)
return 0;
 
if (on) {
diff --git a

[PATCH net-next v15 2/6] net: phy: Add 5GBASER interface mode

2021-01-12 Thread Marek Behún
From: Pavana Sharma 

Add 5GBASE-R phy interface mode

Signed-off-by: Pavana Sharma 
Reviewed-by: Andrew Lunn 
Signed-off-by: Marek Behún 
---
 Documentation/networking/phy.rst | 6 ++
 include/linux/phy.h  | 4 
 2 files changed, 10 insertions(+)

diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst
index b2f7ec794bc8..d7be261b5a8d 100644
--- a/Documentation/networking/phy.rst
+++ b/Documentation/networking/phy.rst
@@ -267,6 +267,12 @@ Some of the interface modes are described below:
 duplex, pause or other settings.  This is dependent on the MAC and/or
 PHY behaviour.
 
+``PHY_INTERFACE_MODE_5GBASER``
+This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is
+identical to the 10GBASE-R protocol defined in Clause 49, with the
+exception that it operates at half the frequency. Please refer to the
+IEEE standard for the definition.
+
 ``PHY_INTERFACE_MODE_10GBASER``
 This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with
 various different mediums. Please refer to the IEEE standard for a
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 9effb511acde..548372eb253a 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -106,6 +106,7 @@ extern const int phy_10gbit_features_array[1];
  * @PHY_INTERFACE_MODE_TRGMII: Turbo RGMII
  * @PHY_INTERFACE_MODE_1000BASEX: 1000 BaseX
  * @PHY_INTERFACE_MODE_2500BASEX: 2500 BaseX
+ * @PHY_INTERFACE_MODE_5GBASER: 5G BaseR
  * @PHY_INTERFACE_MODE_RXAUI: Reduced XAUI
  * @PHY_INTERFACE_MODE_XAUI: 10 Gigabit Attachment Unit Interface
  * @PHY_INTERFACE_MODE_10GBASER: 10G BaseR
@@ -137,6 +138,7 @@ typedef enum {
PHY_INTERFACE_MODE_TRGMII,
PHY_INTERFACE_MODE_1000BASEX,
PHY_INTERFACE_MODE_2500BASEX,
+   PHY_INTERFACE_MODE_5GBASER,
PHY_INTERFACE_MODE_RXAUI,
PHY_INTERFACE_MODE_XAUI,
/* 10GBASE-R, XFI, SFI - single lane 10G Serdes */
@@ -207,6 +209,8 @@ static inline const char *phy_modes(phy_interface_t 
interface)
return "1000base-x";
case PHY_INTERFACE_MODE_2500BASEX:
return "2500base-x";
+   case PHY_INTERFACE_MODE_5GBASER:
+   return "5gbase-r";
case PHY_INTERFACE_MODE_RXAUI:
return "rxaui";
case PHY_INTERFACE_MODE_XAUI:
-- 
2.26.2



[PATCH net-next v15 4/6] net: dsa: mv88e6xxx: wrap .set_egress_port method

2021-01-12 Thread Marek Behún
There are two implementations of the .set_egress_port method, and both
of them, if successful, set chip->*gress_dest_port variable.

To avoid code repetition, wrap this method into
mv88e6xxx_set_egress_port.

Signed-off-by: Marek Behún 
Reviewed-by: Pavana Sharma 
---
 drivers/net/dsa/mv88e6xxx/chip.c| 49 ++---
 drivers/net/dsa/mv88e6xxx/global1.c | 19 ++-
 2 files changed, 33 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 44591ef487af..ed07cb29b285 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2519,6 +2519,27 @@ static int mv88e6xxx_serdes_power(struct mv88e6xxx_chip 
*chip, int port,
return err;
 }
 
+static int mv88e6xxx_set_egress_port(struct mv88e6xxx_chip *chip,
+enum mv88e6xxx_egress_direction direction,
+int port)
+{
+   int err;
+
+   if (!chip->info->ops->set_egress_port)
+   return -EOPNOTSUPP;
+
+   err = chip->info->ops->set_egress_port(chip, direction, port);
+   if (err)
+   return err;
+
+   if (direction == MV88E6XXX_EGRESS_DIR_INGRESS)
+   chip->ingress_dest_port = port;
+   else
+   chip->egress_dest_port = port;
+
+   return 0;
+}
+
 static int mv88e6xxx_setup_upstream_port(struct mv88e6xxx_chip *chip, int port)
 {
struct dsa_switch *ds = chip->ds;
@@ -2541,19 +2562,17 @@ static int mv88e6xxx_setup_upstream_port(struct 
mv88e6xxx_chip *chip, int port)
return err;
}
 
-   if (chip->info->ops->set_egress_port) {
-   err = chip->info->ops->set_egress_port(chip,
+   err = mv88e6xxx_set_egress_port(chip,
MV88E6XXX_EGRESS_DIR_INGRESS,
upstream_port);
-   if (err)
-   return err;
+   if (err && err != -EOPNOTSUPP)
+   return err;
 
-   err = chip->info->ops->set_egress_port(chip,
+   err = mv88e6xxx_set_egress_port(chip,
MV88E6XXX_EGRESS_DIR_EGRESS,
upstream_port);
-   if (err)
-   return err;
-   }
+   if (err && err != -EOPNOTSUPP)
+   return err;
}
 
return 0;
@@ -5286,9 +5305,6 @@ static int mv88e6xxx_port_mirror_add(struct dsa_switch 
*ds, int port,
int i;
int err;
 
-   if (!chip->info->ops->set_egress_port)
-   return -EOPNOTSUPP;
-
mutex_lock(&chip->reg_lock);
if ((ingress ? chip->ingress_dest_port : chip->egress_dest_port) !=
mirror->to_local_port) {
@@ -5303,9 +5319,8 @@ static int mv88e6xxx_port_mirror_add(struct dsa_switch 
*ds, int port,
goto out;
}
 
-   err = chip->info->ops->set_egress_port(chip,
-  direction,
-  mirror->to_local_port);
+   err = mv88e6xxx_set_egress_port(chip, direction,
+   mirror->to_local_port);
if (err)
goto out;
}
@@ -5338,10 +5353,8 @@ static void mv88e6xxx_port_mirror_del(struct dsa_switch 
*ds, int port,
 
/* Reset egress port when no other mirror is active */
if (!other_mirrors) {
-   if (chip->info->ops->set_egress_port(chip,
-direction,
-dsa_upstream_port(ds,
-  port)))
+   if (mv88e6xxx_set_egress_port(chip, direction,
+ dsa_upstream_port(ds, port)))
dev_err(ds->dev, "failed to set egress port\n");
}
 
diff --git a/drivers/net/dsa/mv88e6xxx/global1.c 
b/drivers/net/dsa/mv88e6xxx/global1.c
index 33d443a37efc..815b0f681d69 100644
--- a/drivers/net/dsa/mv88e6xxx/global1.c
+++ b/drivers/net/dsa/mv88e6xxx/global1.c
@@ -315,7 +315,6 @@ int mv88e6095_g1_set_egress_port(struct mv88e6xxx_chip 
*chip,
 enum mv88e6xxx_egress_direction direction,
 int port)
 {
-   int *dest_port_chip;
u16 reg;
int err;
 
@@ -325,13 +324,11 @@ int mv88e6095_g1_set_egress_port(struct mv88e6xxx_chip 
*chip,
 
switch (direction) {
case MV88E6XXX_EGRESS_DIR_INGRESS:
-   dest_port_chip = &chip->ingress_dest_port;
reg &= ~MV88E6185_G1_MONITOR_CTL_INGRESS_DEST

[PATCH net-next v15 6/6] net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst

2021-01-12 Thread Marek Behún
The 16-bit Port Policy CTL register from older chips is on 6393x changed
to Port Policy MGMT CTL, which can access more data, but indirectly and
via 8-bit registers.

The original 16-bit value is divided into first two 8-bit register in
the Port Policy MGMT CTL.

We can therefore use the previous code to compute the mask and shift,
and then
- if 0 <= shift < 8, we access register 0 in Port Policy MGMT CTL
- if 8 <= shift < 16, we access register 1 in Port Policy MGMT CTL

There are in fact other possible policy settings for Amethyst which
could be added here, but this can be done in the future.

Signed-off-by: Marek Behún 
Reviewed-by: Pavana Sharma 
---
 drivers/net/dsa/mv88e6xxx/chip.c |   1 +
 drivers/net/dsa/mv88e6xxx/port.c | 122 ---
 drivers/net/dsa/mv88e6xxx/port.h |   3 +
 3 files changed, 99 insertions(+), 27 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index c2dc6858481a..77f7a263d1a7 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -4570,6 +4570,7 @@ static const struct mv88e6xxx_ops mv88e6393x_ops = {
.port_set_speed_duplex = mv88e6393x_port_set_speed_duplex,
.port_max_speed_mode = mv88e6393x_port_max_speed_mode,
.port_tag_remap = mv88e6390_port_tag_remap,
+   .port_set_policy = mv88e6393x_port_set_policy,
.port_set_frame_mode = mv88e6351_port_set_frame_mode,
.port_set_egress_floods = mv88e6352_port_set_egress_floods,
.port_set_ether_type = mv88e6393x_port_set_ether_type,
diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 2ff38357c481..732e43569f80 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -1304,6 +1304,27 @@ int mv88e6xxx_port_disable_pri_override(struct 
mv88e6xxx_chip *chip, int port)
 
 /* Offset 0x0E: Policy & MGMT Control Register for FAMILY 6191X 6193X 6393X */
 
+static int mv88e6393x_port_policy_read(struct mv88e6xxx_chip *chip, int port,
+  u16 pointer, u8 *data)
+{
+   u16 reg;
+   int err;
+
+   err = mv88e6xxx_port_write(chip, port, MV88E6393X_PORT_POLICY_MGMT_CTL,
+  pointer);
+   if (err)
+   return err;
+
+   err = mv88e6xxx_port_read(chip, port, MV88E6393X_PORT_POLICY_MGMT_CTL,
+ ®);
+   if (err)
+   return err;
+
+   *data = reg;
+
+   return 0;
+}
+
 static int mv88e6393x_port_policy_write(struct mv88e6xxx_chip *chip, int port,
u16 pointer, u8 data)
 {
@@ -1505,46 +1526,43 @@ int mv88e6390_port_tag_remap(struct mv88e6xxx_chip 
*chip, int port)
 
 /* Offset 0x0E: Policy Control Register */
 
-int mv88e6352_port_set_policy(struct mv88e6xxx_chip *chip, int port,
- enum mv88e6xxx_policy_mapping mapping,
- enum mv88e6xxx_policy_action action)
+static int
+mv88e6xxx_port_policy_mapping_get_pos(enum mv88e6xxx_policy_mapping mapping,
+ enum mv88e6xxx_policy_action action,
+ u16 *mask, u16 *val, int *shift)
 {
-   u16 reg, mask, val;
-   int shift;
-   int err;
-
switch (mapping) {
case MV88E6XXX_POLICY_MAPPING_DA:
-   shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_DA_MASK);
-   mask = MV88E6XXX_PORT_POLICY_CTL_DA_MASK;
+   *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_DA_MASK);
+   *mask = MV88E6XXX_PORT_POLICY_CTL_DA_MASK;
break;
case MV88E6XXX_POLICY_MAPPING_SA:
-   shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_SA_MASK);
-   mask = MV88E6XXX_PORT_POLICY_CTL_SA_MASK;
+   *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_SA_MASK);
+   *mask = MV88E6XXX_PORT_POLICY_CTL_SA_MASK;
break;
case MV88E6XXX_POLICY_MAPPING_VTU:
-   shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_VTU_MASK);
-   mask = MV88E6XXX_PORT_POLICY_CTL_VTU_MASK;
+   *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_VTU_MASK);
+   *mask = MV88E6XXX_PORT_POLICY_CTL_VTU_MASK;
break;
case MV88E6XXX_POLICY_MAPPING_ETYPE:
-   shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK);
-   mask = MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK;
+   *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK);
+   *mask = MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK;
break;
case MV88E6XXX_POLICY_MAPPING_PPPOE:
-   shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK);
-   mask = MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK;
+   *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK);
+   *mask = MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK;
break;
cas

RE: [PATCH v0 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.

2021-01-12 Thread Praveen Chaudhary
Hi Jakub

Thanks for the review,

Sure, I will reraise the patch (again v0i, sonce no code changes) after adding 
space before '<'.

This patch adds lines in 'include/uapi/', that requires ABI version changes for 
debian build. I am not sure, if we need any such changes to avoid breaking 
allmodconfig. It will be really helpful, if you can look at the patch once 
'https://lkml.org/lkml/2021/1/11/1668' and suggest on this. Thanks a lot again.


Re: [PATCH net-next v15 1/6] dt-bindings: net: Add 5GBASER phy interface

2021-01-12 Thread Florian Fainelli
On 1/12/21 11:54 AM, Marek Behún wrote:
> From: Pavana Sharma 
> 
> Add 5gbase-r PHY interface mode.
> 
> Signed-off-by: Pavana Sharma 
> Reviewed-by: Andrew Lunn 
> Acked-by: Rob Herring 
> Signed-off-by: Marek Behún 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Marek Behún
From: Pavana Sharma 

The Marvell 88E6393X device is a single-chip integration of a 11-port
Ethernet switch with eight integrated Gigabit Ethernet (GbE)
transceivers and three 10-Gigabit interfaces.

This patch adds functionalities specific to mv88e6393x family (88E6393X,
88E6193X and 88E6191X).

The main differences between previous devices and this one are:
- port 0 can be a SERDES port
- all SERDESes are one-lane, eg. no XAUI nor RXAUI
- on the other hand the SERDESes can do USXGMII, 10GBASER and 5GBASER
  (on 6191X only one SERDES is capable of more than 1g; USXGMII is not
  yet supported with this change)
- Port Policy CTL register is changed to Port Policy MGMT CTL register,
  via which several more registers can be accessed indirectly
- egress monitor port is configured differently
- ingress monitor/CPU/mirror ports are configured differently and can be
  configured per port (ie. each port can have different ingress monitor
  port, for example)
- port speed AltBit works differently than previously
- PHY registers can be also accessed via MDIO address 0x18 and 0x19
  (on previous devices they could be accessed only via Global 2 offsets
   0x18 and 0x19, which means two indirections; this feature is not yet
   leveraged with this patch)

Co-developed-by: Ashkan Boldaji 
Signed-off-by: Ashkan Boldaji 
Signed-off-by: Pavana Sharma 
Co-developed-by: Marek Behún 
Signed-off-by: Marek Behún 
---
 drivers/net/dsa/mv88e6xxx/chip.c| 149 +
 drivers/net/dsa/mv88e6xxx/chip.h|   4 +
 drivers/net/dsa/mv88e6xxx/global1.h |   2 +
 drivers/net/dsa/mv88e6xxx/global2.h |   8 +
 drivers/net/dsa/mv88e6xxx/port.c| 267 
 drivers/net/dsa/mv88e6xxx/port.h|  47 -
 drivers/net/dsa/mv88e6xxx/serdes.c  | 312 
 drivers/net/dsa/mv88e6xxx/serdes.h  |  44 
 8 files changed, 831 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ed07cb29b285..c2dc6858481a 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -635,6 +635,28 @@ static void mv88e6390x_phylink_validate(struct 
mv88e6xxx_chip *chip, int port,
mv88e6390_phylink_validate(chip, port, mask, state);
 }
 
+static void mv88e6393x_phylink_validate(struct mv88e6xxx_chip *chip, int port,
+   unsigned long *mask,
+   struct phylink_link_state *state)
+{
+   if (port == 0 || port == 9 || port == 10) {
+   phylink_set(mask, 1baseT_Full);
+   phylink_set(mask, 1baseCR_Full);
+   phylink_set(mask, 1baseSR_Full);
+   phylink_set(mask, 1baseLR_Full);
+   phylink_set(mask, 1baseLRM_Full);
+   phylink_set(mask, 1baseER_Full);
+   phylink_set(mask, 5000baseT_Full);
+   phylink_set(mask, 2500baseX_Full);
+   phylink_set(mask, 2500baseT_Full);
+   }
+
+   phylink_set(mask, 1000baseT_Full);
+   phylink_set(mask, 1000baseX_Full);
+
+   mv88e6065_phylink_validate(chip, port, mask, state);
+}
+
 static void mv88e6xxx_validate(struct dsa_switch *ds, int port,
   unsigned long *supported,
   struct phylink_link_state *state)
@@ -4533,6 +4555,64 @@ static const struct mv88e6xxx_ops mv88e6390x_ops = {
.phylink_validate = mv88e6390x_phylink_validate,
 };
 
+static const struct mv88e6xxx_ops mv88e6393x_ops = {
+   /* MV88E6XXX_FAMILY_6393 */
+   .setup_errata = mv88e6393x_serdes_setup_errata,
+   .irl_init_all = mv88e6390_g2_irl_init_all,
+   .get_eeprom = mv88e6xxx_g2_get_eeprom8,
+   .set_eeprom = mv88e6xxx_g2_set_eeprom8,
+   .set_switch_mac = mv88e6xxx_g2_set_switch_mac,
+   .phy_read = mv88e6xxx_g2_smi_phy_read,
+   .phy_write = mv88e6xxx_g2_smi_phy_write,
+   .port_set_link = mv88e6xxx_port_set_link,
+   .port_sync_link = mv88e6xxx_port_sync_link,
+   .port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay,
+   .port_set_speed_duplex = mv88e6393x_port_set_speed_duplex,
+   .port_max_speed_mode = mv88e6393x_port_max_speed_mode,
+   .port_tag_remap = mv88e6390_port_tag_remap,
+   .port_set_frame_mode = mv88e6351_port_set_frame_mode,
+   .port_set_egress_floods = mv88e6352_port_set_egress_floods,
+   .port_set_ether_type = mv88e6393x_port_set_ether_type,
+   .port_set_jumbo_size = mv88e6165_port_set_jumbo_size,
+   .port_egress_rate_limiting = mv88e6097_port_egress_rate_limiting,
+   .port_pause_limit = mv88e6390_port_pause_limit,
+   .port_disable_learn_limit = mv88e6xxx_port_disable_learn_limit,
+   .port_disable_pri_override = mv88e6xxx_port_disable_pri_override,
+   .port_get_cmode = mv88e6352_port_get_cmode,
+   .port_set_cmode = mv88e6393x_port_set_cmode,
+   .port_setup_message_port = mv88e6xxx_setup_message_port,
+ 

Re: [PATCH net-next v15 2/6] net: phy: Add 5GBASER interface mode

2021-01-12 Thread Florian Fainelli
On 1/12/21 11:54 AM, Marek Behún wrote:
> From: Pavana Sharma 
> 
> Add 5GBASE-R phy interface mode
> 
> Signed-off-by: Pavana Sharma 
> Reviewed-by: Andrew Lunn 
> Signed-off-by: Marek Behún 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote:
> The work queue is used to queue reset requests like CHANGE-PARAM or
> FAILOVER resets for the worker thread. When the adapter is being
> removed
> the adapter state is set to VNIC_REMOVING and the work queue is
> flushed
> so no new work is added. However the check for adapter being removed
> is
> racy in that the adapter can go into REMOVING state just after we
> check
> and we might end up adding work just as it is being flushed (or
> after).
> 
> Use a new spin lock, ->remove_lock to ensure that after (while) the
> work
> queue is (being) flushed, no new work is queued.
> 
> A follow-on patch will convert the existing ->state_lock from a spin
> lock
> to to a mutex to allow us to hold it for longer periods of time.
> Since
> work can be scheduled from a tasklet, we cannot block on the mutex,
> so
> use a new spin lock.
> 
> Fixes: 6954a9e4192b ("ibmvnic: Flush existing work items before
> device removal")
> Signed-off-by: Sukadev Bhattiprolu 
> ---
>  drivers/net/ethernet/ibm/ibmvnic.c | 15 ++-
>  drivers/net/ethernet/ibm/ibmvnic.h |  2 ++
>  2 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c
> b/drivers/net/ethernet/ibm/ibmvnic.c
> index ad551418ac63..7645df37e886 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -2435,6 +2435,7 @@ static int ibmvnic_reset(struct ibmvnic_adapter
> *adapter,
>enum ibmvnic_reset_reason reason)
>  {
>   struct net_device *netdev = adapter->netdev;
> + unsigned long flags;
>   int ret;
>  
>   if (adapter->state == VNIC_PROBING) {
> @@ -2443,6 +2444,8 @@ static int ibmvnic_reset(struct ibmvnic_adapter
> *adapter,
>   goto err;
>   }
>  
> + spin_lock_irqsave(&adapter->remove_lock, flags);
> +
>   /*
>* If failover is pending don't schedule any other reset.
>* Instead let the failover complete. If there is already a
> @@ -2465,8 +2468,9 @@ static int ibmvnic_reset(struct ibmvnic_adapter
> *adapter,
>   netdev_dbg(adapter->netdev, "Scheduling reset (reason %d)\n",
> reason);
>   schedule_work(&adapter->ibmvnic_reset);
>  
> - return 0;
> + ret = 0;
>  err:
> + spin_unlock_irqrestore(&adapter->remove_lock, flags);
>   return -ret;
>  }
>  
> @@ -5387,6 +5391,7 @@ static int ibmvnic_probe(struct vio_dev *dev,
> const struct vio_device_id *id)
>   memset(&adapter->pending_resets, 0, sizeof(adapter-
> >pending_resets));
>   spin_lock_init(&adapter->rwi_lock);
>   spin_lock_init(&adapter->state_lock);
> + spin_lock_init(&adapter->remove_lock);
>   mutex_init(&adapter->fw_lock);
>   init_completion(&adapter->init_done);
>   init_completion(&adapter->fw_done);
> @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev)
>   return -EBUSY;
>   }
>  
> + /* If ibmvnic_reset() is scheduling a reset, wait for it to
> +  * finish. Then prevent it from scheduling any more resets
> +  * and have the reset functions ignore any resets that have
> +  * already been scheduled.
> +  */
> + spin_lock_irqsave(&adapter->remove_lock, flags);
>   adapter->state = VNIC_REMOVING;
> + spin_unlock_irqrestore(&adapter->remove_lock, flags);
> +

Why irqsave/restore variants ? are you expecting this spinlock to be
held in interruptcontext ?

>   spin_unlock_irqrestore(&adapter->state_lock, flags);
>  
>   flush_work(&adapter->ibmvnic_reset);
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.h
> b/drivers/net/ethernet/ibm/ibmvnic.h
> index 1179a95a3f92..2779696ade09 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.h
> +++ b/drivers/net/ethernet/ibm/ibmvnic.h
> @@ -1081,6 +1081,8 @@ struct ibmvnic_adapter {
>   spinlock_t rwi_lock;
>   enum ibmvnic_reset_reason pending_resets[VNIC_RESET_MAX-1];
>   short next_reset;
> + /* serialize ibmvnic_reset() and ibmvnic_remove() */
> + spinlock_t remove_lock;
>   struct work_struct ibmvnic_reset;
>   struct delayed_work ibmvnic_delayed_reset;
>   unsigned long resetting;



Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Andrew Lunn
> +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but
> + * instead via address 0x51, when SFP page is set to 0x03 and password to
> + * 0x.
> + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only at
> + * bus creation time, and expect it to remain set to 0x03 throughout the
> + * lifetime of the module plugged into the system. If the SFP code starts
> + * modifying SFP_PAGE in the future, this code will need to change.

...

> +/* In order to not interfere with other SFP code (which possibly may 
> manipulate
> + * SFP_PAGE), for every transfer we do this:
> + *   1. lock the bus
> + *   2. save content of SFP_PAGE
> + *   3. set SFP_PAGE to 3
> + *   4. do the transfer
> + *   5. restore original SFP_PAGE
> + *   6. unlock the bus

These two comments seem to contradict each other?

> +static int i2c_rollball_mii_poll(struct mii_bus *bus, int bus_addr, u8 *buf,
> +  size_t len)
> +{
> + struct i2c_adapter *i2c = bus->priv;
> + struct i2c_msg msgs[2];
> + u8 cmd_addr, tmp, *res;
> + int i, ret;
> +
> + cmd_addr = ROLLBALL_CMD_ADDR;
> +
> + res = buf ? buf : &tmp;
> + len = buf ? len : 1;
> +
> + msgs[0].addr = bus_addr;
> + msgs[0].flags = 0;
> + msgs[0].len = 1;
> + msgs[0].buf = &cmd_addr;
> +
> + msgs[1].addr = bus_addr;
> + msgs[1].flags = I2C_M_RD;
> + msgs[1].len = len;
> + msgs[1].buf = res;
> +
> + /* By experiment it takes up to 70 ms to access a register for these
> +  * SFPs. Sleep 20ms between iteratios and try 10 times.
> +  */

iterations

> +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int reg)
> +{
> + u8 buf[4], res[6];
> + int bus_addr, ret;
> + u16 val;
> +
> + if (!(reg & MII_ADDR_C45))
> + return -EOPNOTSUPP;
> +
> + bus_addr = i2c_mii_phy_addr(phy_id);
> + if (bus_addr != ROLLBALL_PHY_I2C_ADDR)
> + return 0x;
> +
> + buf[0] = ROLLBALL_DATA_ADDR;
> + buf[1] = (reg >> 16) & 0x1f;
> + buf[2] = (reg >> 8) & 0xff;
> + buf[3] = reg & 0xff;

This looks odd. There are only 32 registers for C22 transactions, so
it fits in one byte. You can set buf[1] and buf[2] to zero.

C45 transactions allow for 65536 registers, and has the devtype field,
so would need 3 bytes. Which suggests that C45 might actually be
supported? At least by the protocol, if not the device.

> + dev_dbg(&bus->dev, "read reg %02x:%04x = %04x\n", (reg >> 16) & 0x1f,
> + reg & 0x, val);

There is a tracepoint that allows access to this information, so you
can probably remove this.

Andrew


Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Andrew Lunn
> > > +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int 
> > > reg)
> > > +{
> > > + u8 buf[4], res[6];
> > > + int bus_addr, ret;
> > > + u16 val;
> > > +
> > > + if (!(reg & MII_ADDR_C45))
> > > + return -EOPNOTSUPP;
> > > +
> > > + bus_addr = i2c_mii_phy_addr(phy_id);
> > > + if (bus_addr != ROLLBALL_PHY_I2C_ADDR)
> > > + return 0x;
> > > +
> > > + buf[0] = ROLLBALL_DATA_ADDR;
> > > + buf[1] = (reg >> 16) & 0x1f;
> > > + buf[2] = (reg >> 8) & 0xff;
> > > + buf[3] = reg & 0xff;  
> > 
> > This looks odd. There are only 32 registers for C22 transactions, so
> > it fits in one byte. You can set buf[1] and buf[2] to zero.
> 
> C22 is not supported by this protocol.

Duh!

Sorry for the noise.

  Andrew


Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jarod Wilson
On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> > Jarod Wilson  wrote:
> > 
> > >This comes from an end-user request, where they're running multiple VMs on
> > >hosts with bonded interfaces connected to some interest switch topologies,
> > >where 802.3ad isn't an option. They're currently running a proprietary
> > >solution that effectively achieves load-balancing of VMs and bandwidth
> > >utilization improvements with a similar form of transmission algorithm.
> > >
> > >Basically, each VM has it's own vlan, so it always sends its traffic out
> > >the same interface, unless that interface fails. Traffic gets split
> > >between the interfaces, maintaining a consistent path, with failover still
> > >available if an interface goes down.
> > >
> > >This has been rudimetarily tested to provide similar results, suitable for
> > >them to use to move off their current proprietary solution.
> > >
> > >Still on the TODO list, if these even looks sane to begin with, is
> > >fleshing out Documentation/networking/bonding.rst.
> > 
> > I'm sure you're aware, but any final submission will also need
> > to include netlink and iproute2 support.
> 
> I believe everything for netlink support is already included, but I'll
> double-check that before submitting something for inclusion consideration.

I'm not certain if what you actually meant was that I'd have to patch
iproute2 as well, which I've definitely stumbled onto today, but it's a
2-line patch, and everything seems to be working fine with it:

$ sudo ip link set bond0 type bond xmit_hash_policy 5
$ ip -d link show bond0
11: bond0:  mtu 1500 qdisc noop state DOWN mode 
DEFAULT group default qlen 1000
link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 
maxmtu 65535
bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 
use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any 
primary_reselect always fail_over_mac none xmit_hash_policy vlansrc resend_igmp 
1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 
packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 
addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 
65535
$ grep Hash /proc/net/bonding/bond0
Transmit Hash Policy: vlansrc (5)

Nothing bad seems to happen on an older kernel if one tries to set the new
hash, you just get told that it's an invalid argument.

I *think* this is all ready for submission then, so I'll get both the kernel
and iproute2 patches out soon.

-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net-next] net: ipa: add config dependency on QCOM_SMEM

2021-01-12 Thread Randy Dunlap
On 1/12/21 11:21 AM, Alex Elder wrote:
> The IPA driver depends on some SMEM functionality (qcom_smem_init(),
> qcom_smem_alloc(), and qcom_smem_virt_to_phys()), but this is not
> reflected in the configuration dependencies.  Add a dependency on
> QCOM_SMEM to avoid attempts to build the IPA driver without SMEM.
> This avoids a link error for certain configurations.
> 
> Reported-by: Randy Dunlap 
> Fixes: 38a4066f593c5 ("net: ipa: support COMPILE_TEST")
> Signed-off-by: Alex Elder 

Acked-by: Randy Dunlap  # build-tested

Thanks.

> ---
>  drivers/net/ipa/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ipa/Kconfig b/drivers/net/ipa/Kconfig
> index 10a0e041ee775..b68f1289b89ef 100644
> --- a/drivers/net/ipa/Kconfig
> +++ b/drivers/net/ipa/Kconfig
> @@ -1,6 +1,6 @@
>  config QCOM_IPA
>   tristate "Qualcomm IPA support"
> - depends on 64BIT && NET
> + depends on 64BIT && NET && QCOM_SMEM
>   depends on ARCH_QCOM || COMPILE_TEST
>   depends on QCOM_RPROC_COMMON || (QCOM_RPROC_COMMON=n && COMPILE_TEST)
>   select QCOM_MDT_LOADER if ARCH_QCOM
> 


-- 
~Randy



Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Vladimir Oltean
On Tue, Jan 12, 2021 at 10:16:32PM +0100, Marek Behún wrote:
> On Tue, 12 Jan 2021 22:38:08 +0200
> Vladimir Oltean  wrote:
> 
> > > + phylink_set(mask, 1baseT_Full);
> > > + phylink_set(mask, 1baseCR_Full);
> > > + phylink_set(mask, 1baseSR_Full);
> > > + phylink_set(mask, 1baseLR_Full);
> > > + phylink_set(mask, 1baseLRM_Full);
> > > + phylink_set(mask, 1baseER_Full);
> > 
> > Why did you remove 1000baseKR_Full from here?
> 
> I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with
> clause 73 autonegotiation, so they are not compatible, or are they?

ETHTOOL_LINK_MODE_1baseKR_Full_BIT and most of everything else from
enum ethtool_link_mode_bit_indices are network-side media interfaces
(aka between the PHY and the link partner).

Whereas PHY_INTERFACE_MODE_10GKR and most of everything else from
phy_interface_t is a system-side media independent interface (aka
between the MAC and the PHY).

What the 6393X does not support is PHY_INTERFACE_MODE_10GKR, and my
feedback from your previous series did not ask you to remove
1000baseKR_Full from phylink_validate. There's nothing that would
prevent somebody from attaching a PHY with a different (and compatible)
system-side SERDES protocol, and 10GBase-KR on the media side.

What Russell said is that he's seriously thinking about reworking the
phylink_validate API so that the MAC driver would not need to sign off
on things it does not care about (such as what media-side interface will
the PHY use, of which there is an ever increasing plethora). But at the
moment, the API is what it is, and you should put as many media-side
link modes as possible, at the given speed and duplex that the MAC
supports.

_I_ am confused. What did you say you were happy to help with?


RE: [PATCH v2 0/3] hv_netvsc: Prevent packet loss during VF add/remove

2021-01-12 Thread Long Li
> Subject: Re: [PATCH v2 0/3] hv_netvsc: Prevent packet loss during VF
> add/remove
> 
> On Fri,  8 Jan 2021 16:53:40 -0800 Long Li wrote:
> > From: Long Li 
> >
> > This patch set fixes issues with packet loss on VF add/remove.
> 
> These patches are for net-next? They just optimize the amount of packet
> loss on switch, not fix bugs, right?

Yes, those patches are for net-next.

They eliminate the packet loss introduced from Linux side during VF changes. 
They can be seen as optimizations.


Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Russell King - ARM Linux admin
On Tue, Jan 12, 2021 at 09:54:24PM +0100, Andrew Lunn wrote:
> > +static int i2c_transfer_rollball(struct i2c_adapter *i2c,
> > +struct i2c_msg *msgs, int num)
> > +{
> > +   u8 saved_page;
> > +   int ret;
> > +
> > +   i2c_lock_bus(i2c, I2C_LOCK_SEGMENT);
> > +
> > +   /* save original page */
> > +   ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page);
> > +   if (ret)
> > +   goto unlock;
> > +
> > +   /* change to RollBall MDIO page */
> > +   ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO);
> > +   if (ret)
> > +   goto unlock;
> > +
> > +   /* do the transfer */
> > +   ret = __i2c_transfer_err(i2c, msgs, num);
> > +   if (ret)
> > +   goto unlock;
> 
> If get page and set page worked, and you get an error in during the
> actual data transfer, i wonder if you should try restoring the page
> before returning with the error?

That's what we do with paged PHYs - if the access encounters an error,
we still attempt to restore the page and propagate the _original_ error.
We would only propagate the error from the page restore if it was the
only error.  See the logic and comments for phy_restore_page().

Note that phy_(save|select)_page()..phy_restore_page() deal with the
locking, which is why they always have to be paired (also means that
the user doesn't have to think too hard about error handling.)

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [net] net: feature check mandating HW_CSUM is wrong

2021-01-12 Thread rohit maheshwari



On 07/01/21 12:47 AM, Jakub Kicinski wrote:

On Wed,  6 Jan 2021 23:23:27 +0530 Rohit Maheshwari wrote:

Mandating NETIF_F_HW_CSUM to enable TLS offload feature is wrong.
And it broke tls offload feature for the drivers, which are still
using NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM. We should use
NETIF_F_CSUM_MASK instead.

Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled")
Signed-off-by: Rohit Maheshwari 

Please use Tariq's suggestion.

HW_TLS_TX feature is for both IPv4/v6. And If one device is limited to
support only IPv4 checksum offload, TLS offload should be allowed for
that too. So I think putting a check of CSUM_MASK is enough.


Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Marek Behún
On Tue, 12 Jan 2021 21:54:24 +0100
Andrew Lunn  wrote:

> > +static int i2c_transfer_rollball(struct i2c_adapter *i2c,
> > +struct i2c_msg *msgs, int num)
> > +{
> > +   u8 saved_page;
> > +   int ret;
> > +
> > +   i2c_lock_bus(i2c, I2C_LOCK_SEGMENT);
> > +
> > +   /* save original page */
> > +   ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page);
> > +   if (ret)
> > +   goto unlock;
> > +
> > +   /* change to RollBall MDIO page */
> > +   ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO);
> > +   if (ret)
> > +   goto unlock;
> > +
> > +   /* do the transfer */
> > +   ret = __i2c_transfer_err(i2c, msgs, num);
> > +   if (ret)
> > +   goto unlock;  
> 
> If get page and set page worked, and you get an error in during the
> actual data transfer, i wonder if you should try restoring the page
> before returning with the error?

I don't know. Can i2c trasfer fail and the next one succeed?


[PATCH net-next] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

2021-01-12 Thread Alexander Lobakin
Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
not only added a support for fraglisted UDP GRO, but also tweaked
some logics the way that non-fraglisted UDP GRO started to work for
forwarding too.
Tests showed that currently forwarding and NATing of plain UDP GRO
packets are performed fully correctly, regardless if the target
netdevice has a support for hardware/driver GSO UDP L4 or not.
Add the last element and allow to form plain UDP GRO packets if
there is no socket -> we are on forwarding path.

Plain UDP GRO forwarding even shows better performance than fraglisted
UDP GRO in some cases due to not wasting one skbuff_head per every
segment.

Signed-off-by: Alexander Lobakin 
---
 net/ipv4/udp_offload.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index ff39e94781bf..9d71df3d52ce 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -460,12 +460,13 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
struct sk_buff *skb,
if (skb->dev->features & NETIF_F_GRO_FRAGLIST)
NAPI_GRO_CB(skb)->is_flist = sk ? !udp_sk(sk)->gro_enabled: 1;
 
-   if ((sk && udp_sk(sk)->gro_enabled) || NAPI_GRO_CB(skb)->is_flist) {
+   if (!sk || (sk && udp_sk(sk)->gro_enabled) ||
+   NAPI_GRO_CB(skb)->is_flist) {
pp = call_gro_receive(udp_gro_receive_segment, head, skb);
return pp;
}
 
-   if (!sk || NAPI_GRO_CB(skb)->encap_mark ||
+   if (NAPI_GRO_CB(skb)->encap_mark ||
(skb->ip_summed != CHECKSUM_PARTIAL &&
 NAPI_GRO_CB(skb)->csum_cnt == 0 &&
 !NAPI_GRO_CB(skb)->csum_valid) ||
-- 
2.30.0




Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Marek Behún
On Tue, 12 Jan 2021 22:38:08 +0200
Vladimir Oltean  wrote:

> > +   phylink_set(mask, 1baseT_Full);
> > +   phylink_set(mask, 1baseCR_Full);
> > +   phylink_set(mask, 1baseSR_Full);
> > +   phylink_set(mask, 1baseLR_Full);
> > +   phylink_set(mask, 1baseLRM_Full);
> > +   phylink_set(mask, 1baseER_Full);  
> 
> Why did you remove 1000baseKR_Full from here?

I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with
clause 73 autonegotiation, so they are not compatible, or are they?

Marek


Re: [PATCH] igb: avoid premature Rx buffer reuse

2021-01-12 Thread Alexander Duyck
On Mon, Jan 11, 2021 at 6:54 PM Li,Rongqing  wrote:
>
>
>
> > -Original Message-
> > From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> > Sent: Tuesday, January 12, 2021 4:54 AM
> > To: Li,Rongqing 
> > Cc: Netdev ; intel-wired-lan
> > ; Björn Töpel 
> > Subject: Re: [PATCH] igb: avoid premature Rx buffer reuse
> >
> > On Wed, Jan 6, 2021 at 7:53 PM Li RongQing  wrote:
> > >
> > > The page recycle code, incorrectly, relied on that a page fragment
> > > could not be freed inside xdp_do_redirect(). This assumption leads to
> > > that page fragments that are used by the stack/XDP redirect can be
> > > reused and overwritten.
> > >
> > > To avoid this, store the page count prior invoking xdp_do_redirect().
> > >
> > > Fixes: 9cbc948b5a20 ("igb: add XDP support")
> > > Signed-off-by: Li RongQing 
> > > Cc: Björn Töpel 
> >
> > I'm not sure what you are talking about here. We allow for a 0 to 1 count
> > difference in the pagecount bias. The idea is the driver should be holding 
> > onto
> > at least one reference from the driver at all times.
> > Are you saying that is not the case?
> >
> > As far as the code itself we hold onto the page as long as our difference 
> > does
> > not exceed 1. So specifically if the XDP call is freeing the page the page 
> > itself
> > should still be valid as the reference count shouldn't drop below 1, and in 
> > that
> > case the driver should be holding that one reference to the page.
> >
> > When we perform our check we are performing it such at output of either 0 if
> > the page is freed, or 1 if the page is not freed are acceptable for us to 
> > allow
> > reuse. The key bit is in igb_clean_rx_irq where we will flip the buffer for 
> > the
> > IGB_XDP_TX | IGB_XDP_REDIR case and just increment the pagecnt_bias
> > indicating that the page was dropped in the non-flipped case.
> >
> > Are you perhaps seeing a function that is returning an error and still 
> > consuming
> > the page? If so that might explain what you are seeing.
> > However the bug would be in the other driver not this one. The
> > xdp_do_redirect function is not supposed to free the page if it returns an 
> > error.
> > It is supposed to leave that up to the function that called xdp_do_redirect.
> >
> > > ---
> > >  drivers/net/ethernet/intel/igb/igb_main.c | 22 +++---
> > >  1 file changed, 15 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > > b/drivers/net/ethernet/intel/igb/igb_main.c
> > > index 03f78fdb0dcd..3e0d903cf919 100644
> > > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > > @@ -8232,7 +8232,8 @@ static inline bool igb_page_is_reserved(struct
> > page *page)
> > > return (page_to_nid(page) != numa_mem_id()) ||
> > > page_is_pfmemalloc(page);  }
> > >
> > > -static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer)
> > > +static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
> > > +
> > int
> > > +rx_buf_pgcnt)
> > >  {
> > > unsigned int pagecnt_bias = rx_buffer->pagecnt_bias;
> > > struct page *page = rx_buffer->page; @@ -8243,7 +8244,7 @@
> > > static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer)
> > >
> > >  #if (PAGE_SIZE < 8192)
> > > /* if we are only owner of page we can reuse it */
> > > -   if (unlikely((page_ref_count(page) - pagecnt_bias) > 1))
> > > +   if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1))
> > > return false;
> > >  #else
> > I would need more info on the actual issue. If nothing else it might be 
> > useful to
> > have an example where you print out the page_ref_count versus the
> > pagecnt_bias at a few points to verify exactly what is going on. As I said 
> > before
> > if the issue is the xdp_do_redirect returning an error and still consuming 
> > the
> > page then the bug is elsewhere and not here.
>
>
> This patch is same as 75aab4e10ae6a4593a60f66d13de755d4e91f400
>
>
> commit 75aab4e10ae6a4593a60f66d13de755d4e91f400
> Author: Björn Töpel 
> Date:   Tue Aug 25 19:27:34 2020 +0200
>
> i40e: avoid premature Rx buffer reuse
>
> The page recycle code, incorrectly, relied on that a page fragment
> could not be freed inside xdp_do_redirect(). This assumption leads to
> that page fragments that are used by the stack/XDP redirect can be
> reused and overwritten.
>
> To avoid this, store the page count prior invoking xdp_do_redirect().
>
> Longer explanation:
>
> Intel NICs have a recycle mechanism. The main idea is that a page is
> split into two parts. One part is owned by the driver, one part might
> be owned by someone else, such as the stack.
>
> t0: Page is allocated, and put on the Rx ring
>   +---
> used by NIC ->| upper buffer
> (rx_buffer)   +---
>   | lower buffer
>   +---
>   page count  == USHRT

Re: [MPTCP] [PATCH net 2/2] mptcp: better msk-level shutdown.

2021-01-12 Thread Mat Martineau

On Tue, 12 Jan 2021, Paolo Abeni wrote:


Instead of re-implementing most of inet_shutdown, re-use
such helper, and implement the MPTCP-specific bits at the
'proto' level.

The msk-level disconnect() can now be invoked, lets provide a
suitable implementation.

As a side effect, this fixes bad state management for listener
sockets. The latter could lead to division by 0 oops since
commit ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling").

Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine")
Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")
Signed-off-by: Paolo Abeni 
---
net/mptcp/protocol.c | 62 
1 file changed, 17 insertions(+), 45 deletions(-)



Reviewed-by: Mat Martineau 

--
Mat Martineau
Intel


Re: [MPTCP] [PATCH net 1/2] mptcp: more strict state checking for acks

2021-01-12 Thread Mat Martineau

On Tue, 12 Jan 2021, Paolo Abeni wrote:


Syzkaller found a way to trigger division by zero
in mptcp_subflow_cleanup_rbuf().

The current checks implemented into tcp_can_send_ack()
are too week, let's be more accurate.

Reported-by: Christoph Paasch 
Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling")
Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.")
Signed-off-by: Paolo Abeni 
---
net/mptcp/protocol.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)



Reviewed-by: Mat Martineau 

--
Mat Martineau
Intel


Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jay Vosburgh
Jarod Wilson  wrote:

>On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
>> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
>> > Jarod Wilson  wrote:
>> > 
>> > >This comes from an end-user request, where they're running multiple VMs on
>> > >hosts with bonded interfaces connected to some interest switch topologies,
>> > >where 802.3ad isn't an option. They're currently running a proprietary
>> > >solution that effectively achieves load-balancing of VMs and bandwidth
>> > >utilization improvements with a similar form of transmission algorithm.
>> > >
>> > >Basically, each VM has it's own vlan, so it always sends its traffic out
>> > >the same interface, unless that interface fails. Traffic gets split
>> > >between the interfaces, maintaining a consistent path, with failover still
>> > >available if an interface goes down.
>> > >
>> > >This has been rudimetarily tested to provide similar results, suitable for
>> > >them to use to move off their current proprietary solution.
>> > >
>> > >Still on the TODO list, if these even looks sane to begin with, is
>> > >fleshing out Documentation/networking/bonding.rst.
>> > 
>> >I'm sure you're aware, but any final submission will also need
>> > to include netlink and iproute2 support.
>> 
>> I believe everything for netlink support is already included, but I'll
>> double-check that before submitting something for inclusion consideration.
>
>I'm not certain if what you actually meant was that I'd have to patch
>iproute2 as well, which I've definitely stumbled onto today, but it's a
>2-line patch, and everything seems to be working fine with it:

Yes, that's what I meant.

>$ sudo ip link set bond0 type bond xmit_hash_policy 5

Does the above work with the text label (presumably "vlansrc")
as well as the number, and does "ip link add test type bond help" print
the correct text for XMIT_HASH_POLICY?

-J

>$ ip -d link show bond0
>11: bond0:  mtu 1500 qdisc noop state DOWN mode 
>DEFAULT group default qlen 1000
>link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 
> maxmtu 65535
>bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 
> use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any 
> primary_reselect always fail_over_mac none xmit_hash_policy vlansrc 
> resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 
> packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 
> addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 
> gso_max_segs 65535
>$ grep Hash /proc/net/bonding/bond0
>Transmit Hash Policy: vlansrc (5)
>
>Nothing bad seems to happen on an older kernel if one tries to set the new
>hash, you just get told that it's an invalid argument.
>
>I *think* this is all ready for submission then, so I'll get both the kernel
>and iproute2 patches out soon.
>
>-- 
>Jarod Wilson
>ja...@redhat.com

---
-Jay Vosburgh, jay.vosbu...@canonical.com


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-12 Thread Yuri Benditovich
On Tue, Jan 12, 2021 at 10:40 PM Yuri Benditovich
 wrote:
>
> On Tue, Jan 12, 2021 at 9:42 PM Yuri Benditovich
>  wrote:
> >
> > This program type can set skb hash value. It will be useful
> > when the tun will support hash reporting feature if virtio-net.
> >
> > Signed-off-by: Yuri Benditovich 
> > ---
> >  drivers/net/tun.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 7959b5c2d11f..455f7afc1f36 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, 
> > struct tun_prog __rcu **prog_p,
> > prog = NULL;
> > } else {
> > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> > +   if (IS_ERR(prog))
> > +   prog = bpf_prog_get_type(fd, 
> > BPF_PROG_TYPE_SCHED_CLS);
> > if (IS_ERR(prog))
> > return PTR_ERR(prog);
> > }
>
> Comment from Alexei Starovoitov:
> Patches 1 and 2 are missing for me, so I couldn't review properly,
> but this diff looks odd.
> It allows sched_cls prog type to attach to tun.
> That means everything that sched_cls progs can do will be done from tun hook?

We do not have an intention to modify the packet in this steering eBPF.
There is just one function that unavailable for BPF_PROG_TYPE_SOCKET_FILTER
that the eBPF needs to make possible to deliver the hash to the guest
VM - it is 'bpf_set_hash'

Does it mean that we need to define a new eBPF type for socket filter
operations + set_hash?

Our problem is that the eBPF calculates 32-bit hash, 16-bit queue
index and 8-bit of hash type.
But it is able to return only 32-bit integer, so in this set of
patches the eBPF returns
queue index and hash type and saves the hash in skb->hash using bpf_set_hash().

If this is unacceptable, can you please recommend a better solution?

> sched_cls assumes l2 and can modify the packet.

The steering eBPF in TUN module also assumes l2.

> I think crashes are inevitable.
>
> > --
> > 2.17.1
> >


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-12 Thread Yuri Benditovich
On Tue, Jan 12, 2021 at 9:42 PM Yuri Benditovich
 wrote:
>
> This program type can set skb hash value. It will be useful
> when the tun will support hash reporting feature if virtio-net.
>
> Signed-off-by: Yuri Benditovich 
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 7959b5c2d11f..455f7afc1f36 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct 
> tun_prog __rcu **prog_p,
> prog = NULL;
> } else {
> prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> +   if (IS_ERR(prog))
> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS);
> if (IS_ERR(prog))
> return PTR_ERR(prog);
> }

Comment from Alexei Starovoitov:
Patches 1 and 2 are missing for me, so I couldn't review properly,
but this diff looks odd.
It allows sched_cls prog type to attach to tun.
That means everything that sched_cls progs can do will be done from tun hook?
sched_cls assumes l2 and can modify the packet.
I think crashes are inevitable.

> --
> 2.17.1
>


Re: mlx5 error when the skb linear space is empty

2021-01-12 Thread Saeed Mahameed
On Mon, 2021-01-11 at 09:02 +0100, Magnus Karlsson wrote:
> On Tue, Jan 5, 2021 at 9:51 PM Saeed Mahameed 
> wrote:
> > On Mon, 2021-01-04 at 18:59 +0800, Xuan Zhuo wrote:
> > > hi
> > > 
> > > In the process of developing xdp socket, we tried to directly use
> > > page to
> > > construct skb directly, to avoid data copy. And the MAC
> > > information
> > > is also in
> > > the page, which caused the linear space of skb to be empty. In
> > > this
> > > case, I
> > > encountered a problem :
> > > 
> > > mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn
> > > 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
> > > : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
> > > WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
> > > : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
> > > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
> > > 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb
> > > 
> > > 
> > > And when I try to copy only the mac address into the linear space
> > > of
> > > skb, the
> > > other parts are still placed in the page. When constructing skb
> > > in
> > > this way, I
> > > found that although the data can be sent successfully, the
> > > sending
> > > performance
> > > is relatively poor!!
> > > 
> > 
> > Hi,
> > 
> > This is an expected behavior of ConnectX4-LX, ConnectX4-LX requires
> > the
> > driver to copy at least the L2 headers into the linear part, in
> > some
> > DCB/DSCP configuration it will require L3 headers.
> 
> Do I understand this correctly if I say whatever is calling
> ndo_start_xmit has to make sure at least the L2 headers is in the
> linear part of the skb? If Xuan does not do this, the ConnectX4
> driver
> crashes, but if he does, it works. So from an ndo_start_xmit
> interface
> perspective, what is the requirement of an skb that is passed to it?
> Do all users of ndo_start_xmit make sure the L2 header is in the
> linear part, or are there users that do not make sure this is the
> case? Judging from the ConnectX5 code it seems that the latter is
> possible (since it has code to deal with this), but from the
> ConnectX4, it seems like the former is true (since it does not copy
> the L2 headers into the linear part as far as I can see). Sorry for
> my
> confusion, but I think it is important to get some clarity here as it
> will decide if Xuan's patch is a good idea or not in its current
> form.
> 

To clarify: 
Connectx4Lx, doesn't really require data to be in the linear part, I
was refereing to a HW limitation that requires the driver to copy the
L2/L3 headers (depending on current HW config) to a special area in the
tx descriptor, currently the driver copy the L2/L3 headers only from
the linear part of the SKB, but this can be changed via calling
pskb_may_pull in mlx5 ConnectX4LX tx path to make sure the linear part
has the needed data .. 

Something like:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 61ed671fe741..5939fd8eed2c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -159,6 +159,7 @@ static inline int mlx5e_skb_l2_header_offset(struct
sk_buff *skb)
 {
 #define MLX5E_MIN_INLINE (ETH_HLEN + VLAN_HLEN)

+   /* need to check ret val */ 
+   pskb_may_pull(skb, MLX5E_MIN_INLINE);
return max(skb_network_offset(skb), MLX5E_MIN_INLINE);
 }




Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Marek Behún
On Tue, 12 Jan 2021 21:43:29 +0100
Andrew Lunn  wrote:

> > +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but
> > + * instead via address 0x51, when SFP page is set to 0x03 and password to
> > + * 0x.
> > + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only 
> > at
> > + * bus creation time, and expect it to remain set to 0x03 throughout the
> > + * lifetime of the module plugged into the system. If the SFP code starts
> > + * modifying SFP_PAGE in the future, this code will need to change.  
> 
> ...
> 
> > +/* In order to not interfere with other SFP code (which possibly may 
> > manipulate
> > + * SFP_PAGE), for every transfer we do this:
> > + *   1. lock the bus
> > + *   2. save content of SFP_PAGE
> > + *   3. set SFP_PAGE to 3
> > + *   4. do the transfer
> > + *   5. restore original SFP_PAGE
> > + *   6. unlock the bus  
> 
> These two comments seem to contradict each other?

I forgot about the first comment. I will rewrite it and send again.

> 
> > +static int i2c_rollball_mii_poll(struct mii_bus *bus, int bus_addr, u8 
> > *buf,
> > +size_t len)
> > +{
> > +   struct i2c_adapter *i2c = bus->priv;
> > +   struct i2c_msg msgs[2];
> > +   u8 cmd_addr, tmp, *res;
> > +   int i, ret;
> > +
> > +   cmd_addr = ROLLBALL_CMD_ADDR;
> > +
> > +   res = buf ? buf : &tmp;
> > +   len = buf ? len : 1;
> > +
> > +   msgs[0].addr = bus_addr;
> > +   msgs[0].flags = 0;
> > +   msgs[0].len = 1;
> > +   msgs[0].buf = &cmd_addr;
> > +
> > +   msgs[1].addr = bus_addr;
> > +   msgs[1].flags = I2C_M_RD;
> > +   msgs[1].len = len;
> > +   msgs[1].buf = res;
> > +
> > +   /* By experiment it takes up to 70 ms to access a register for these
> > +* SFPs. Sleep 20ms between iteratios and try 10 times.
> > +*/  
> 
> iterations

THX.

> 
> > +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int reg)
> > +{
> > +   u8 buf[4], res[6];
> > +   int bus_addr, ret;
> > +   u16 val;
> > +
> > +   if (!(reg & MII_ADDR_C45))
> > +   return -EOPNOTSUPP;
> > +
> > +   bus_addr = i2c_mii_phy_addr(phy_id);
> > +   if (bus_addr != ROLLBALL_PHY_I2C_ADDR)
> > +   return 0x;
> > +
> > +   buf[0] = ROLLBALL_DATA_ADDR;
> > +   buf[1] = (reg >> 16) & 0x1f;
> > +   buf[2] = (reg >> 8) & 0xff;
> > +   buf[3] = reg & 0xff;  
> 
> This looks odd. There are only 32 registers for C22 transactions, so
> it fits in one byte. You can set buf[1] and buf[2] to zero.

C22 is not supported by this protocol. A few line above there is
  if (!(reg & MII_ADDR_C45))
return -EOPNOTSUPP;

> C45 transactions allow for 65536 registers, and has the devtype field,
> so would need 3 bytes. Which suggests that C45 might actually be
> supported? At least by the protocol, if not the device.

> > +   dev_dbg(&bus->dev, "read reg %02x:%04x = %04x\n", (reg >> 16) & 0x1f,
> > +   reg & 0x, val);  
> 
> There is a tracepoint that allows access to this information, so you
> can probably remove this.
> 
> Andrew

OK then, I will remove this.

Thanks, Andrew.


Re: [PATCH net-next 4/4] tcp: remove limit on initial receive window

2021-01-12 Thread Heath Caldwell
On 2021-01-12 21:26 (+0100), Eric Dumazet  wrote:
> On Tue, Jan 12, 2021 at 8:25 PM Heath Caldwell  wrote:
> >
> > On 2021-01-12 18:05 (+0100), Eric Dumazet  wrote:
> > > On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell  
> > > wrote:
> > > >
> > > > On 2021-01-12 09:30 (+0100), Eric Dumazet  wrote:
> > > > > I think the whole patch series is an attempt to badly break TCP stack.
> > > >
> > > > Can you explain the concern that you have about how these changes might
> > > > break the TCP stack?
> > > >
> > > > Patches 1 and 3 fix clear bugs.
> > >
> > > Not clear to me at least.
> > >
> > > If they were bug fixes, a FIxes: tag would be provided.
> >
> > The underlying bugs that are addressed in patches 1 and 3 are present in
> > 1da177e4c3f4 ("Linux-2.6.12-rc2") which looks to be the earliest parent
> > commit in the repository.  What should I do for a Fixes: tag in this
> > case?
> >
> > > You are a first time contributor to linux TCP stack, so better make
> > > sure your claims are solid.
> >
> > I fear that I may not have expressed the problems and solutions in a
> > manner that imparted the ideas well.
> >
> > Maybe I added too much detail in the description for patch 1, which may
> > have obscured the problem: val is capped to sysctl_rmem_max *before* it
> > is doubled (resulting in the possibility for sk_rcvbuf to be set to
> > 2*sysctl_rmem_max, rather than it being capped at sysctl_rmem_max).
> 
> This is fine. This has been done forever. Your change might break 
> applications.

In what way might applications be broken?

It seems to be a very strange position to allow a configured maximum to
be violated because of obscure precedent.

It does not seem to be a supportable position to allow an application to
violate an installation's configuration because of a chance that the
application may behave differently if a setsockopt() call fails.  What
if a system administrator decides to reduce sysctl_rmem_max to half of
the current default?

> I would advise documenting this fact, since existing behavior will be kept
> in many linux hosts for years to come.
> 
> >
> > Maybe I was not explicit enough in the description for patch 3: space is
> > expanded into sock_net(sk)->ipv4.sysctl_tcp_rmem[2] and sysctl_rmem_max
> > without first shrinking them to discount the overhead.
> >
> > > > Patches 2 and 4 might be arguable, though.
> > >
> > > So we have to pick up whatever pleases us ?
> >
> > I have been treating all of these changes together because they all kind
> > of work together to provide a consistent model and configurability for
> > the initial receive window.
> >
> > Patches 1 and 3 address bugs.
> 
> Maybe, but will break applications.

How might patch 3 break an application?  It merely will reduce the
window scale value to something lower but still capable of representing
the largest window that a particular connection might advertise.

> > Patch 2 addresses an inconsistency in how overhead is treated specially
> > for TCP sockets.
> > Patch 4 addresses the 64KB limit which has been imposed.
> 
> For very good reasons.

What are the reasons?

> This is going nowhere. I will stop right now.

That is a shame :(.


Re: [PATCH v3 bpf-next 5/7] bpf: support BPF ksym variables in kernel modules

2021-01-12 Thread Andrii Nakryiko
On Tue, Jan 12, 2021 at 8:27 AM Daniel Borkmann  wrote:
>
> On 1/12/21 8:55 AM, Andrii Nakryiko wrote:
> > Add support for directly accessing kernel module variables from BPF programs
> > using special ldimm64 instructions. This functionality builds upon vmlinux
> > ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
> > specifying kernel module BTF's FD in insn[1].imm field.
> >
> > During BPF program load time, verifier will resolve FD to BTF object and 
> > will
> > take reference on BTF object itself and, for module BTFs, corresponding 
> > module
> > as well, to make sure it won't be unloaded from under running BPF program. 
> > The
> > mechanism used is similar to how bpf_prog keeps track of used bpf_maps.
> >
> > One interesting change is also in how per-CPU variable is determined. The
> > logic is to find .data..percpu data section in provided BTF, but both 
> > vmlinux
> > and module each have their own .data..percpu entries in BTF. So for module's
> > case, the search for DATASEC record needs to look at only module's added BTF
> > types. This is implemented with custom search function.
> >
> > Acked-by: Yonghong Song 
> > Acked-by: Hao Luo 
> > Signed-off-by: Andrii Nakryiko 
> [...]
> > +
> > +struct module *btf_try_get_module(const struct btf *btf)
> > +{
> > + struct module *res = NULL;
> > +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> > + struct btf_module *btf_mod, *tmp;
> > +
> > + mutex_lock(&btf_module_mutex);
> > + list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) {
> > + if (btf_mod->btf != btf)
> > + continue;
> > +
> > + if (try_module_get(btf_mod->module))
> > + res = btf_mod->module;
>
> One more thought (follow-up would be okay I'd think) ... when a module 
> references
> a symbol from another module, it similarly needs to bump the refcount of the 
> module
> that is owning it and thus disallowing to unload for that other module's 
> lifetime.
> That usage dependency is visible via /proc/modules however, so if unload 
> doesn't work
> then lsmod allows a way to introspect that to the user. This seems to be 
> achieved via
> resolve_symbol() where it records its dependency/usage. Would be great if we 
> could at
> some point also include the BPF prog name into that list so that this is more 
> obvious.
> Wdyt?
>

Yeah, it's definitely nice to see dependent bpf progs. There is struct
module_use, which is used to record these dependencies, but the
assumption there is that dependencies could be only other modules. So
one way is to somehow extend that or add another set of bpf_prog
dependencies. First is a bit intrusive, while the seconds sucks even
more, IMO.

Alternatively, we can rely on bpf_link info to emit module info, if
the BPF program is attached to BTF type from the module. Then with
bpftool it would be easy to see this, but it's not as
readily-available info as /proc/modules, of course.

Any preferences?

> > + break;
> > + }
> > + mutex_unlock(&btf_module_mutex);
> > +#endif
> > +
> > + return res;
> > +}
> > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> > index 261f8692d0d2..69c3c308de5e 100644
> > --- a/kernel/bpf/core.c
> > +++ b/kernel/bpf/core.c
> > @@ -2119,6 +2119,28 @@ static void bpf_free_used_maps(struct bpf_prog_aux 
> > *aux)
> >   kfree(aux->used_maps);
> >   }
> >
> > +void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
> > +   struct btf_mod_pair *used_btfs, u32 len)
> > +{
> > +#ifdef CONFIG_BPF_SYSCALL
> > + struct btf_mod_pair *btf_mod;
> > + u32 i;
> > +
> > + for (i = 0; i < len; i++) {
> > + btf_mod = &used_btfs[i];
> > + if (btf_mod->module)
> > + module_put(btf_mod->module);
> > + btf_put(btf_mod->btf);
> > + }
> > +#endif
> > +}


Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Vladimir Oltean
On Tue, Jan 12, 2021 at 08:54:04PM +0100, Marek Behún wrote:
> From: Pavana Sharma 
> 
> The Marvell 88E6393X device is a single-chip integration of a 11-port
> Ethernet switch with eight integrated Gigabit Ethernet (GbE)
> transceivers and three 10-Gigabit interfaces.
> 
> This patch adds functionalities specific to mv88e6393x family (88E6393X,
> 88E6193X and 88E6191X).
> 
> The main differences between previous devices and this one are:
> - port 0 can be a SERDES port
> - all SERDESes are one-lane, eg. no XAUI nor RXAUI
> - on the other hand the SERDESes can do USXGMII, 10GBASER and 5GBASER
>   (on 6191X only one SERDES is capable of more than 1g; USXGMII is not
>   yet supported with this change)
> - Port Policy CTL register is changed to Port Policy MGMT CTL register,
>   via which several more registers can be accessed indirectly
> - egress monitor port is configured differently
> - ingress monitor/CPU/mirror ports are configured differently and can be
>   configured per port (ie. each port can have different ingress monitor
>   port, for example)
> - port speed AltBit works differently than previously
> - PHY registers can be also accessed via MDIO address 0x18 and 0x19
>   (on previous devices they could be accessed only via Global 2 offsets
>0x18 and 0x19, which means two indirections; this feature is not yet
>leveraged with this patch)
> 
> Co-developed-by: Ashkan Boldaji 
> Signed-off-by: Ashkan Boldaji 
> Signed-off-by: Pavana Sharma 
> Co-developed-by: Marek Behún 
> Signed-off-by: Marek Behún 
> ---
>  drivers/net/dsa/mv88e6xxx/chip.c| 149 +
>  drivers/net/dsa/mv88e6xxx/chip.h|   4 +
>  drivers/net/dsa/mv88e6xxx/global1.h |   2 +
>  drivers/net/dsa/mv88e6xxx/global2.h |   8 +
>  drivers/net/dsa/mv88e6xxx/port.c| 267 
>  drivers/net/dsa/mv88e6xxx/port.h|  47 -
>  drivers/net/dsa/mv88e6xxx/serdes.c  | 312 
>  drivers/net/dsa/mv88e6xxx/serdes.h  |  44 
>  8 files changed, 831 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c 
> b/drivers/net/dsa/mv88e6xxx/chip.c
> index ed07cb29b285..c2dc6858481a 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -635,6 +635,28 @@ static void mv88e6390x_phylink_validate(struct 
> mv88e6xxx_chip *chip, int port,
>   mv88e6390_phylink_validate(chip, port, mask, state);
>  }
>  
> +static void mv88e6393x_phylink_validate(struct mv88e6xxx_chip *chip, int 
> port,
> + unsigned long *mask,
> + struct phylink_link_state *state)
> +{
> + if (port == 0 || port == 9 || port == 10) {
> + phylink_set(mask, 1baseT_Full);
> + phylink_set(mask, 1baseCR_Full);
> + phylink_set(mask, 1baseSR_Full);
> + phylink_set(mask, 1baseLR_Full);
> + phylink_set(mask, 1baseLRM_Full);
> + phylink_set(mask, 1baseER_Full);

Why did you remove 1000baseKR_Full from here?


Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Andrew Lunn
> +static int i2c_transfer_rollball(struct i2c_adapter *i2c,
> +  struct i2c_msg *msgs, int num)
> +{
> + u8 saved_page;
> + int ret;
> +
> + i2c_lock_bus(i2c, I2C_LOCK_SEGMENT);
> +
> + /* save original page */
> + ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page);
> + if (ret)
> + goto unlock;
> +
> + /* change to RollBall MDIO page */
> + ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO);
> + if (ret)
> + goto unlock;
> +
> + /* do the transfer */
> + ret = __i2c_transfer_err(i2c, msgs, num);
> + if (ret)
> + goto unlock;

If get page and set page worked, and you get an error in during the
actual data transfer, i wonder if you should try restoring the page
before returning with the error?

> +
> + /* restore original page */
> + ret = __i2c_rollball_set_page(i2c, msgs->addr, saved_page);
> +
> +unlock:
> + i2c_unlock_bus(i2c, I2C_LOCK_SEGMENT);
> +
> + return ret;
> +}


Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-12 Thread Yuri Benditovich
On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
 wrote:
>
> On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
>  wrote:
> >
> > Existing TUN module is able to use provided "steering eBPF" to
> > calculate per-packet hash and derive the destination queue to
> > place the packet to. The eBPF uses mapped configuration data
> > containing a key for hash calculation and indirection table
> > with array of queues' indices.
> >
> > This series of patches adds support for virtio-net hash reporting
> > feature as defined in virtio specification. It extends the TUN module
> > and the "steering eBPF" as follows:
> >
> > Extended steering eBPF calculates the hash value and hash type, keeps
> > hash value in the skb->hash and returns index of destination virtqueue
> > and the type of the hash. TUN module keeps returned hash type in
> > (currently unused) field of the skb.
> > skb->__unused renamed to 'hash_report_type'.
> >
> > When TUN module is called later to allocate and fill the virtio-net
> > header and push it to destination virtqueue it populates the hash
> > and the hash type into virtio-net header.
> >
> > VHOST driver is made aware of respective virtio-net feature that
> > extends the virtio-net header to report the hash value and hash report
> > type.
>
> Comment from Willem de Bruijn:
>
> Skbuff fields are in short supply. I don't think we need to add one
> just for this narrow path entirely internal to the tun device.
>

We understand that and try to minimize the impact by using an already
existing unused field of skb.


> Instead, you could just run the flow_dissector in tun_put_user if the
> feature is negotiated. Indeed, the flow dissector seems more apt to me
> than BPF here. Note that the flow dissector internally can be
> overridden by a BPF program if the admin so chooses.
>
When this set of patches is related to hash delivery in the virtio-net
packet in general,
it was prepared in context of RSS feature implementation as defined in
virtio spec [1]
In case of RSS it is not enough to run the flow_dissector in tun_put_user:
in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
hash type and queue index
according to the (mapped) parameters (key, hash types, indirection
table) received from the guest.
Our intention is to keep the hash and hash type in the skb to populate them
into a virtio-net header later in tun_put_user.
Note that in this case the type of calculated hash is selected not
only from flow dissections
but also from limitations provided by the guest.
This is already implemented in qemu (for case of vhost=off), see [2]
(virtio_net_process_rss)
For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN.

Note that exact way of selecting rx virtqueue depends on the guest,
it could be automatic steering (typical for Linux VM), RSS (typical
for Windows VM) or
any other steering mechanism implemented in loadable TUN steering BPF with
or without hash calculation.

[1] https://github.com/oasis-tcs/virtio-spec/blob/master/content.tex#L3740
[2] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L1591

> This also hits on a deeper point with the choice of hash values, that
> I also noticed in my RFC patchset to implement the inverse [1][2]. It
> is much more detailed than skb->hash + skb->l4_hash currently offers,
> and that can be gotten for free from most hardware.

Unfortunately in the case of RSS we can't get this hash from the hardware as
this requires configuration of the NIC's hardware with key and hash types for
Toeplitz hash calculation.

> In most practical
> cases, that information suffices. I added less specific fields
> VIRTIO_NET_HASH_REPORT_L4, VIRTIO_NET_HASH_REPORT_OTHER that work
> without explicit flow dissection. I understand that the existing
> fields are part of the standard. Just curious, what is their purpose
> beyond 4-tuple based flow hashing?

The hash is used in combination with the indirection table to select
destination rx virtqueue.
The hash and hash type are to be reported in virtio-net header, if requested.
For Windows VM - in case the device does not report the hash (even if
it calculated it to
schedule the packet to a proper queue), the driver must do that for each packet
(this is a certification requirement).

>
> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=406859&state=*
> [2] 
> https://github.com/wdebruij/linux/commit/0f77febf22cd6ffc242a575807fa8382a26e511e
> >
> > Yuri Benditovich (7):
> >   skbuff: define field for hash report type
> >   vhost: support for hash report virtio-net feature
> >   tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
> >   tun: free bpf_program by bpf_prog_put instead of bpf_prog_destroy
> >   tun: add ioctl code TUNSETHASHPOPULATION
> >   tun: populate hash in virtio-net header when needed
> >   tun: report new tun feature IFF_HASH
> >
> >  drivers/net/tun.c   | 43 +++--
> >  drivers/vhost/net.c | 37 

Re: [PATCH bpf 2/2] libbpf: allow loading empty BTFs

2021-01-12 Thread Andrii Nakryiko
On Tue, Jan 12, 2021 at 12:17 PM Daniel Borkmann  wrote:
>
> On 1/12/21 7:41 AM, Andrii Nakryiko wrote:
> > On Mon, Jan 11, 2021 at 5:16 PM Yonghong Song  wrote:
> >> On 1/11/21 12:51 PM, Andrii Nakryiko wrote:
> >>> On Mon, Jan 11, 2021 at 10:13 AM Yonghong Song  wrote:
>  On 1/9/21 11:03 PM, Andrii Nakryiko wrote:
> > Empty BTFs do come up (e.g., simple kernel modules with no new types and
> > strings, compared to the vmlinux BTF) and there is nothing technically 
> > wrong
> > with them. So remove unnecessary check preventing loading empty BTFs.
> >
> > Reported-by: Christopher William Snowhill 
> > Fixes: ("d8123624506c libbpf: Fix BTF data layout checks and allow 
> > empty BTF")
>
> Fixed up Fixes tag ^ while applying. ;-)

Oh the irony, eh? :) Thanks, Daniel!

>
> > Signed-off-by: Andrii Nakryiko 
> > ---
> > tools/lib/bpf/btf.c | 5 -
> > 1 file changed, 5 deletions(-)
> >
> > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
> > index 3c3f2bc6c652..9970a288dda5 100644
> > --- a/tools/lib/bpf/btf.c
> > +++ b/tools/lib/bpf/btf.c
> > @@ -240,11 +240,6 @@ static int btf_parse_hdr(struct btf *btf)
> > }
> >
> > meta_left = btf->raw_size - sizeof(*hdr);
> > - if (!meta_left) {
> > - pr_debug("BTF has no data\n");
> > - return -EINVAL;
> > - }
> 
>  Previous kernel patch allows empty btf only if that btf is module (not
>  base/vmlinux) btf. Here it seems we allow any empty non-module btf to be
>  loaded into the kernel. In such cases, loading may fail? Maybe we should
>  detect such cases in libbpf and error out instead of going to kernel and
>  get error back?
> >>>
> >>> I did this consciously. Kernel is more strict, because there is no
> >>> reasonable case when vmlinux BTF or BPF program's BTF can be empty (at
> >>> least not that now we have FUNCs in BTF). But allowing libbpf to load
> >>> empty BTF generically is helpful for bpftool, as one example, for
> >>> inspection. If you do `bpftool btf dump` on empty BTF, it will just
> >>> print nothing and you'll know that it's a valid (from BTF header
> >>> perspective) BTF, just doesn't have any types (besides VOID). If we
> >>> don't allow it, then we'll just get an error and then you'll have to
> >>> do painful hex dumping and decoding to see what's wrong.
> >>
> >> It is totally okay to allow empty btf in libbpf. I just want to check
> >> if this btf is going to be loaded into the kernel, right before it is
> >> loading whether libbpf could check whether it is a non-module empty btf
> >> or not, if it is, do not go to kernel.
> >
> > Ok, I see what you are proposing. We can do that, but it's definitely
> > separate from these bug fixes. But, to be honest, I wouldn't bother
> > because libbpf will return BTF verification log with a very readable
> > "No data" message in it.
>
> Right, seems okay to me for this particular case given the user will be
> able to make some sense of it from the log.
>
> Thanks,
> Daniel


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-12 Thread Yuri Benditovich
On Tue, Jan 12, 2021 at 9:46 PM Alexei Starovoitov
 wrote:
>
> On Tue, Jan 12, 2021 at 11:42 AM Yuri Benditovich
>  wrote:
> >
> > This program type can set skb hash value. It will be useful
> > when the tun will support hash reporting feature if virtio-net.
> >
> > Signed-off-by: Yuri Benditovich 
> > ---
> >  drivers/net/tun.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 7959b5c2d11f..455f7afc1f36 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, 
> > struct tun_prog __rcu **prog_p,
> > prog = NULL;
> > } else {
> > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> > +   if (IS_ERR(prog))
> > +   prog = bpf_prog_get_type(fd, 
> > BPF_PROG_TYPE_SCHED_CLS);
>
> You've ignored the feedback and just resend? what for?

No, I do not. Some patches did not reach relevant people at all, so I
just resent _all_ the patches to all the people.
I will copy your earlier comment to this patch and will address it in
the discussion.
Sorry for misunderstanding and some redundant noise.


Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Marek Behún
On Tue, 12 Jan 2021 23:30:48 +0200
Vladimir Oltean  wrote:

> On Tue, Jan 12, 2021 at 10:16:32PM +0100, Marek Behún wrote:
> > On Tue, 12 Jan 2021 22:38:08 +0200
> > Vladimir Oltean  wrote:
> >   
> > > > +   phylink_set(mask, 1baseT_Full);
> > > > +   phylink_set(mask, 1baseCR_Full);
> > > > +   phylink_set(mask, 1baseSR_Full);
> > > > +   phylink_set(mask, 1baseLR_Full);
> > > > +   phylink_set(mask, 1baseLRM_Full);
> > > > +   phylink_set(mask, 1baseER_Full);  
> > > 
> > > Why did you remove 1000baseKR_Full from here?  
> > 
> > I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with
> > clause 73 autonegotiation, so they are not compatible, or are they?  
> 
> ETHTOOL_LINK_MODE_1baseKR_Full_BIT and most of everything else from
> enum ethtool_link_mode_bit_indices are network-side media interfaces
> (aka between the PHY and the link partner).
> 
> Whereas PHY_INTERFACE_MODE_10GKR and most of everything else from
> phy_interface_t is a system-side media independent interface (aka
> between the MAC and the PHY).

OK that finaly clears things up :) Sorry, I misunderstood it: I thought
that PHY_INTERFACE_MODE_10GKR is related to 1baseKR_Full and that
they are incompatible with 10gbase-r.

> What the 6393X does not support is PHY_INTERFACE_MODE_10GKR, and my
> feedback from your previous series did not ask you to remove
> 1000baseKR_Full from phylink_validate. There's nothing that would
> prevent somebody from attaching a PHY with a different (and compatible)
> system-side SERDES protocol, and 10GBase-KR on the media side.
> 
> What Russell said is that he's seriously thinking about reworking the
> phylink_validate API so that the MAC driver would not need to sign off
> on things it does not care about (such as what media-side interface will
> the PHY use, of which there is an ever increasing plethora). But at the
> moment, the API is what it is, and you should put as many media-side
> link modes as possible, at the given speed and duplex that the MAC
> supports.
> 
> _I_ am confused. What did you say you were happy to help with?

I am happy to help with phylink_validate code refactorization -
refactorization of the code itself, reviewing, testing.

Marek


Re: [PATCH net-next v15 4/6] net: dsa: mv88e6xxx: wrap .set_egress_port method

2021-01-12 Thread Vladimir Oltean
On Tue, Jan 12, 2021 at 08:54:03PM +0100, Marek Behún wrote:
> There are two implementations of the .set_egress_port method, and both
> of them, if successful, set chip->*gress_dest_port variable.
> 
> To avoid code repetition, wrap this method into
> mv88e6xxx_set_egress_port.
> 
> Signed-off-by: Marek Behún 
> Reviewed-by: Pavana Sharma 
> ---

Reviewed-by: Vladimir Oltean 


Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling

2021-01-12 Thread Daniel Borkmann

On 1/12/21 8:46 PM, Andrii Nakryiko wrote:

On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti  wrote:


Add support for pointer to mem register spilling, to allow the verifier
to track pointer to valid memory addresses. Such pointers are returned
for example by a successful call of the bpf_ringbuf_reserve helper.

This patch was suggested as a solution by Yonghong Song.

The patch was partially contibuted by CyberArk Software, Inc.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier
support for it")


Surprised no one mentioned this yet. Fixes tag should always be on a
single unwrapped line, however long it is, please fix.


Especially for first-time contributors there is usually some luxury that we
would have fixed this up on the fly while applying. ;) But given a v2 is going
to be sent anyway it's good to point it out for future reference, agree.

Thanks,
Daniel


Re: [PATCH net-next 0/3] r8169: further improvements

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 09:27 +0100, Heiner Kallweit wrote:
> Series includes further smaller improvements.
> 
> Heiner Kallweit (3):
>   r8169: align rtl_wol_suspend_quirk with vendor driver and rename it
>   r8169: improve rtl8169_rx_csum
>   r8169: improve DASH support
> 
>  drivers/net/ethernet/realtek/r8169_main.c | 81 +--
> 
>  1 file changed, 31 insertions(+), 50 deletions(-)
> 

Reviewed-by: Saeed Mahameed 



Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules

2021-01-12 Thread Andrew Lunn
> -struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter 
> *i2c)
> +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but
> + * instead via address 0x51, when SFP page is set to 0x03 and password to
> + * 0x.
> + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only at
> + * bus creation time, and expect it to remain set to 0x03 throughout the
> + * lifetime of the module plugged into the system. If the SFP code starts
> + * modifying SFP_PAGE in the future, this code will need to change.

Just an FYI: This is likely to change. NVIDIA are working on
generalizing the API ethtool -m uses. The new API should allow access
to pages and banks. I've already said i will implement the API in the
generic phylink SFP code. But no need to address this now.

Andrew


Re: [PATCH bpf 1/2] bpf: allow empty module BTFs

2021-01-12 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to bpf/bpf.git (refs/heads/master):

On Sat, 9 Jan 2021 23:03:40 -0800 you wrote:
> Some modules don't declare any new types and end up with an empty BTF,
> containing only valid BTF header and no types or strings sections. This
> currently causes BTF validation error. There is nothing wrong with such BTF,
> so fix the issue by allowing module BTFs with no types or strings.
> 
> Reported-by: Christopher William Snowhill 
> Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs")
> Signed-off-by: Andrii Nakryiko 
> 
> [...]

Here is the summary with links:
  - [bpf,1/2] bpf: allow empty module BTFs
https://git.kernel.org/bpf/bpf/c/bcc5e6162d66
  - [bpf,2/2] libbpf: allow loading empty BTFs
https://git.kernel.org/bpf/bpf/c/b8d52264df85

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [PATCH bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints

2021-01-12 Thread Yonghong Song




On 1/11/21 10:20 AM, Qais Yousef wrote:

Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.

 From bpf presepective, bare tracepoints are what it calls
RAW_TRACEPOINT().

Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.

By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.

Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.

Update Documentation/bpf/bpf_design_QA.rst to document this contract.

Signed-off-by: Qais Yousef 
---
  Documentation/bpf/bpf_design_QA.rst |  6 ++
  include/trace/bpf_probe.h   | 12 ++--
  2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/bpf/bpf_design_QA.rst 
b/Documentation/bpf/bpf_design_QA.rst
index 2df7b067ab93..0e15f9b05c9d 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. 
Both of these
  kernel internals are subject to change and can break with newer kernels
  such that the program needs to be adapted accordingly.
  
+Q: Are tracepoints part of the stable ABI?

+--
+A: NO. Tracepoints are tied to internal implementation details hence they are
+subject to change and can break with newer kernels. BPF programs need to change
+accordingly when this happens.
+
  Q: How much stack space a BPF program uses?
  ---
  A: Currently all program types are limited to 512 bytes of stack
diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index cd74bffed5c6..cf1496b162b1 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -55,8 +55,7 @@
  /* tracepoints with more than 12 arguments will hit build error */
  #define CAST_TO_U64(...) CONCATENATE(__CAST, 
COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
  
-#undef DECLARE_EVENT_CLASS

-#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+#define __BPF_DECLARE_TRACE(call, proto, args) \
  static notrace void   \
  __bpf_trace_##call(void *__data, proto)   
\
  { \
@@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)  
\
CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args));  
\
  }
  
+#undef DECLARE_EVENT_CLASS

+#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
+   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
+
  /*
   * This part is compiled out, it is only here as a build time check
   * to make sure that if the tracepoint handling changes, the
@@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
PARAMS(args), size)
  #define DEFINE_EVENT_PRINT(template, name, proto, args, print)\
DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
  
+#undef DECLARE_TRACE

+#define DECLARE_TRACE(call, proto, args)   \
+   (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \
+__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0))


I applied the patch to my local bpf-next repo, and got the following
compilation error:

In file included from 
/data/users/yhs/work/net-next/include/trace/define_trace.h:104, 

 from 
/data/users/yhs/work/net-next/include/trace/events/sched.h:740, 

 from 
/data/users/yhs/work/net-next/kernel/sched/core.c:10: 

/data/users/yhs/work/net-next/include/trace/bpf_probe.h:59:1: error: 
expected identifier or ‘(’ before ‘static’
 static notrace void   \ 

 ^~ 

/data/users/yhs/work/net-next/include/trace/bpf_probe.h:119:3: note: in 
expansion of macro ‘__BPF_DECLARE_TRACE’
  (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \ 

   ^~~ 

/data/users/yhs/work/net-next/include/trace/events/sched.h:693:1: note: 
in expansion of macro ‘DECLARE_TRACE’
 DECLARE_TRACE(pelt_cfs_tp, 

 ^ 

/data/users/yhs/work/net-next/include/trace/bpf_probe.h:59:1: error: 
expected identifier or ‘(’ before ‘static’
 static notrace void   \ 

 ^~ 

/data/users/yhs/work/net-next/include/trace/bpf_probe.h:119:3: note: in 
expansion of macro ‘__BPF_DECLARE_TRACE’
  (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \ 

   ^~~ 

/dat

Re: [PATCH bpf 2/2] libbpf: allow loading empty BTFs

2021-01-12 Thread Daniel Borkmann

On 1/12/21 7:41 AM, Andrii Nakryiko wrote:

On Mon, Jan 11, 2021 at 5:16 PM Yonghong Song  wrote:

On 1/11/21 12:51 PM, Andrii Nakryiko wrote:

On Mon, Jan 11, 2021 at 10:13 AM Yonghong Song  wrote:

On 1/9/21 11:03 PM, Andrii Nakryiko wrote:

Empty BTFs do come up (e.g., simple kernel modules with no new types and
strings, compared to the vmlinux BTF) and there is nothing technically wrong
with them. So remove unnecessary check preventing loading empty BTFs.

Reported-by: Christopher William Snowhill 
Fixes: ("d8123624506c libbpf: Fix BTF data layout checks and allow empty BTF")


Fixed up Fixes tag ^ while applying. ;-)


Signed-off-by: Andrii Nakryiko 
---
tools/lib/bpf/btf.c | 5 -
1 file changed, 5 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 3c3f2bc6c652..9970a288dda5 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -240,11 +240,6 @@ static int btf_parse_hdr(struct btf *btf)
}

meta_left = btf->raw_size - sizeof(*hdr);
- if (!meta_left) {
- pr_debug("BTF has no data\n");
- return -EINVAL;
- }


Previous kernel patch allows empty btf only if that btf is module (not
base/vmlinux) btf. Here it seems we allow any empty non-module btf to be
loaded into the kernel. In such cases, loading may fail? Maybe we should
detect such cases in libbpf and error out instead of going to kernel and
get error back?


I did this consciously. Kernel is more strict, because there is no
reasonable case when vmlinux BTF or BPF program's BTF can be empty (at
least not that now we have FUNCs in BTF). But allowing libbpf to load
empty BTF generically is helpful for bpftool, as one example, for
inspection. If you do `bpftool btf dump` on empty BTF, it will just
print nothing and you'll know that it's a valid (from BTF header
perspective) BTF, just doesn't have any types (besides VOID). If we
don't allow it, then we'll just get an error and then you'll have to
do painful hex dumping and decoding to see what's wrong.


It is totally okay to allow empty btf in libbpf. I just want to check
if this btf is going to be loaded into the kernel, right before it is
loading whether libbpf could check whether it is a non-module empty btf
or not, if it is, do not go to kernel.


Ok, I see what you are proposing. We can do that, but it's definitely
separate from these bug fixes. But, to be honest, I wouldn't bother
because libbpf will return BTF verification log with a very readable
"No data" message in it.


Right, seems okay to me for this particular case given the user will be
able to make some sense of it from the log.

Thanks,
Daniel


Re: [PATCH v6 net-next 14/15] net: bonding: ensure .ndo_get_stats64 can sleep

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 16:37 +0200, Vladimir Oltean wrote:
> On Mon, Jan 11, 2021 at 03:38:49PM -0800, Saeed Mahameed wrote:
> > GFP_ATOMIC is a little bit aggressive especially when user daemons
> > are
> > periodically reading stats. This can be avoided.
> > 
> > You can pre-allocate with GFP_KERNEL an array with an "approximate"
> > size.
> > then fill the array up with whatever slaves the the bond has at
> > that
> > moment, num_of_slaves  can be less, equal or more than the array
> > you
> > just allocated but we shouldn't care ..
> > 
> > something like:
> > rcu_read_lock()
> > nslaves = bond_get_num_slaves();
> > rcu_read_unlock()
> > sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev),
> > GFP_KERNEL);
> > rcu_read_lock();
> > bond_fill_slaves_array(bond, sarray); // also do: dev_hold()
> > rcu_read_unlock();
> > 
> > 
> > bond_get_slaves_array_stats(sarray);
> > 
> > bond_put_slaves_array(sarray);
> 
> I don't know what to say about acquiring RCU read lock twice and
> traversing the list of interfaces three or four times.

You can optimize this by tracking #num_slaves.

> On the other hand, what's the worst that can happen if the GFP_ATOMIC
> memory allocation fails. It's not like there is any data loss.
> User space will retry when there is less memory pressure.

Anyway Up to you, i just don't like it when we use GFP_ATOMIC when it
can be avoided, especially for periodic jobs, like stats polling.. 



Re: [PATCH bpf v2] bpf: don't leak memory in bpf getsockopt when optlen == 0

2021-01-12 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to bpf/bpf.git (refs/heads/master):

On Tue, 12 Jan 2021 08:28:29 -0800 you wrote:
> optlen == 0 indicates that the kernel should ignore BPF buffer
> and use the original one from the user. We, however, forget
> to free the temporary buffer that we've allocated for BPF.
> 
> Reported-by: Martin KaFai Lau 
> Fixes: d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when 
> optlen > PAGE_SIZE")
> Signed-off-by: Stanislav Fomichev 
> 
> [...]

Here is the summary with links:
  - [bpf,v2] bpf: don't leak memory in bpf getsockopt when optlen == 0
https://git.kernel.org/bpf/bpf/c/4be34f3d0731

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints

2021-01-12 Thread Andrii Nakryiko
On Tue, Jan 12, 2021 at 11:27 AM Qais Yousef  wrote:
>
> On 01/11/21 23:26, Andrii Nakryiko wrote:
> > On Mon, Jan 11, 2021 at 10:20 AM Qais Yousef  wrote:
> > >
> > > Reuse module_attach infrastructure to add a new bare tracepoint to check
> > > we can attach to it as a raw tracepoint.
> > >
> > > Signed-off-by: Qais Yousef 
> > > ---
> > >
> > > Andrii
> > >
> > > I was getting the error below when I was trying to run the test.
> > > I had to comment out all related fentry* code to be able to test the 
> > > raw_tp
> > > stuff. Not sure something I've done wrong or it's broken for some reason.
> > > I was on v5.11-rc2.
> >
> > Check that you have all the required Kconfig options from
> > tools/testing/selftests/bpf/config. And also you will need to build
>
> Yep I have merged this config snippet using merge_config.sh script.
>
> > pahole from master, 1.19 doesn't have some fixes that add kernel
> > module support. I think pahole is the reasons why you have the failure
> > below.
>
> I am using pahole 1.19. I have built it from tip of master though.
>
> /trying using v1.19 tag
>
> Still fails the same.
>
> >
> > >
> > > $ sudo ./test_progs -v -t module_attach
> >
> > use -vv when debugging stuff like that with test_progs, it will output
> > libbpf detailed logs, that often are very helpful
>
> I tried that but it didn't help me. Full output is here
>
> https://paste.debian.net/1180846
>

It did help a bit for me to make sure that you have bpf_testmod
properly loaded and its BTF was found, so the problem is somewhere
else. Also, given load succeeded and attach failed with OPNOTSUPP, I
suspect you are missing some of FTRACE configs, which seems to be
missing from selftests's config as well. Check that you have
CONFIG_FTRACE=y and CONFIG_DYNAMIC_FTRACE=y, and you might need some
more. See [0] for a real config we are using to run all tests in
libbpf CI. If you figure out what you were missing, please also
contribute a patch to selftests' config.

  [0] 
https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/latest.config

> >
> > > bpf_testmod.ko is already unloaded.
> > > Loading bpf_testmod.ko...
> > > Successfully loaded bpf_testmod.ko.
> > > test_module_attach:PASS:skel_open 0 nsec
> > > test_module_attach:PASS:set_attach_target 0 nsec
> > > test_module_attach:PASS:skel_load 0 nsec
> > > libbpf: prog 'handle_fentry': failed to attach: ERROR: 
> > > strerror_r(-524)=22
> > > libbpf: failed to auto-attach program 'handle_fentry': -524
> > > test_module_attach:FAIL:skel_attach skeleton attach failed: -524
> > > #58 module_attach:FAIL
> > > Successfully unloaded bpf_testmod.ko.
> > > Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED
> > >
> >
> > But even apart from test failure, there seems to be kernel build
> > failure. See [0] for what fails in kernel-patches CI.
> >
> >[0] https://travis-ci.com/github/kernel-patches/bpf/builds/212730017
>
> Sorry about that. I did a last minute change because of checkpatch.pl error 
> and
> it seems I either forgot to rebuild or missed that the rebuild failed :/
>

no worries, just fix and re-submit. Good that we have CI that caught
this early on.

> >
> >
> > >
> > >  .../selftests/bpf/bpf_testmod/bpf_testmod-events.h |  6 ++
> > >  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c  |  2 ++
> > >  tools/testing/selftests/bpf/prog_tests/module_attach.c |  1 +
> > >  tools/testing/selftests/bpf/progs/test_module_attach.c | 10 ++
> > >  4 files changed, 19 insertions(+)
> > >
> > > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h 
> > > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > > index b83ea448bc79..e1ada753f10c 100644
> > > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h
> > > @@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read,
> > >   __entry->pid, __entry->comm, __entry->off, __entry->len)
> > >  );
> > >
> > > +/* A bare tracepoint with no event associated with it */
> > > +DECLARE_TRACE(bpf_testmod_test_read_bare,
> > > +   TP_PROTO(struct task_struct *task, struct 
> > > bpf_testmod_test_read_ctx *ctx),
> > > +   TP_ARGS(task, ctx)
> > > +);
> > > +
> > >  #endif /* _BPF_TESTMOD_EVENTS_H */
> > >
> > >  #undef TRACE_INCLUDE_PATH
> > > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c 
> > > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > > index 2df19d73ca49..d63cebdaca44 100644
> > > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c
> > > @@ -22,6 +22,8 @@ bpf_testmod_test_read(struct file *file, struct kobject 
> > > *kobj,
> > > };
> > >
> > > trace_bpf_testmod_test_read(current, &ctx);
> > > +   ctx.len++;
> > > +   trace_bpf_testm

Re: [PATCH v6 net-next 03/15] net: procfs: hold netif_lists_lock when retrieving device statistics

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 15:44 +0200, Vladimir Oltean wrote:
> On Mon, Jan 11, 2021 at 03:46:32PM -0800, Saeed Mahameed wrote:
> > This can be very costly, holding a mutex while traversing the whole
> > netedv lists and reading their stats, we need to at least allow
> > multiple readers to enter as it was before, so maybe you want to
> > use
> > rw_semaphore instead of the mutex.
> > 
> > or just have a unified approach of rcu+refcnt/dev_hold as you did
> > for
> > bonding and failover patches #13..#14, I used the same approach to
> > achieve the same for sysfs and procfs more than 2 years ago, you
> > are
> > welcome to use my patches:
> > https://lore.kernel.org/netdev/4cc44e85-cb5e-502c-30f3-c6ea564fe...@gmail.com/
> 
> Ok, what mail address do you want me to keep for your sign off?

Either is fine. Just make sure author and signed-off are the same :).
Thanks!




Re: [PATCH net-next v2 0/7] ibmvnic: Use more consistent locking

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote:
> Use more consistent locking when reading/writing the adapter->state
> field. This patch set fixes a race condition during ibmvnic_open()
> where the adapter could be left in the PROBED state if a reset occurs
> at the wrong time. This can cause networking to not come up during
> boot and potentially require manual intervention in bringing up
> applications that depend on the network.
> 
> Changelog[v2] [Address comments from Jakub Kicinski]
>   - Fix up commit log for patch 5/7 and drop unnecessary variable
>   - Format Fixes line properly (no wrapping, no blank lines)
> 
> Sukadev Bhattiprolu (7):
>   ibmvnic: restore state in change-param reset
>   ibmvnic: update reset function prototypes
>   ibmvnic: avoid allocating rwi entries
>   ibmvnic: switch order of checks in ibmvnic_reset
>   ibmvnic: serialize access to work queue
>   ibmvnic: check adapter->state under state_lock
>   ibmvnic: add comments about state_lock
> 
> 
Other than the two minor comments, 
Reviewed-by: Saeed Mahameed 



Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors

2021-01-12 Thread Alexander Duyck
On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky  wrote:
>
> On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote:
> > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky  wrote:
> > >
> > > From: Leon Romanovsky 
> > >
> > > Some SR-IOV capable devices provide an ability to configure specific
> > > number of MSI-X vectors on their VF prior driver is probed on that VF.
> > >
> > > In order to make management easy, provide new read-only sysfs file that
> > > returns a total number of possible to configure MSI-X vectors.
> > >
> > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix
> > >   = 0 - feature is not supported
> > >   > 0 - total number of MSI-X vectors to consume by the VFs
> > >
> > > Signed-off-by: Leon Romanovsky 
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci | 14 +++
> > >  drivers/pci/iov.c   | 31 +
> > >  drivers/pci/pci.h   |  3 +++
> > >  include/linux/pci.h |  2 ++
> > >  4 files changed, 50 insertions(+)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
> > > b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 05e26e5da54e..64e9b700acc9 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -395,3 +395,17 @@ Description:
> > > The file is writable if the PF is bound to a driver that
> > > supports the ->sriov_set_msix_vec_count() callback and 
> > > there
> > > is no driver bound to the VF.
> > > +
> > > +What:  /sys/bus/pci/devices/.../sriov_vf_total_msix
> >
> > In this case I would drop the "vf" and just go with sriov_total_msix
> > since now you are referring to a global value instead of a per VF
> > value.
>
> This field indicates the amount of MSI-X available for VFs, it doesn't
> include PFs. The missing "_vf_" will mislead users who will believe that
> it is all MSI-X vectors available for this device. They will need to take
> into consideration amount of PF MSI-X in order to calculate the VF 
> distribution.
>
> So I would leave "_vf_" here.

The problem is you aren't indicating how many are available for an
individual VF though, you are indicating how many are available for
use by SR-IOV to give to the VFs. The fact that you are dealing with a
pool makes things confusing in my opinion. For example sriov_vf_device
describes the device ID that will be given to each VF.

> >
> > > +Date:  January 2021
> > > +Contact:   Leon Romanovsky 
> > > +Description:
> > > +   This file is associated with the SR-IOV PFs.
> > > +   It returns a total number of possible to configure MSI-X
> > > +   vectors on the enabled VFs.
> > > +
> > > +   The values returned are:
> > > +* > 0 - this will be total number possible to consume by 
> > > VFs,
> > > +* = 0 - feature is not supported
> > > +
> > > +   If no SR-IOV VFs are enabled, this value will return 0.
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 42c0df4158d1..0a6ddf3230fd 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct 
> > > device *dev,
> > > return count;
> > >  }
> > >
> > > +static ssize_t sriov_vf_total_msix_show(struct device *dev,
> > > +   struct device_attribute *attr,
> > > +   char *buf)
> > > +{
> > > +   struct pci_dev *pdev = to_pci_dev(dev);
> > > +
> > > +   return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix);
> > > +}
> > > +
> >
> > You display it as a signed value, but unsigned values are not
> > supported, correct?
>
> Right, I made it similar to the vf_msix_set. I can change.
>
> >
> > >  static DEVICE_ATTR_RO(sriov_totalvfs);
> > >  static DEVICE_ATTR_RW(sriov_numvfs);
> > >  static DEVICE_ATTR_RO(sriov_offset);
> > >  static DEVICE_ATTR_RO(sriov_stride);
> > >  static DEVICE_ATTR_RO(sriov_vf_device);
> > >  static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
> > > +static DEVICE_ATTR_RO(sriov_vf_total_msix);
> > >
> > >  static struct attribute *sriov_dev_attrs[] = {
> > > &dev_attr_sriov_totalvfs.attr,
> > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
> > > &dev_attr_sriov_stride.attr,
> > > &dev_attr_sriov_vf_device.attr,
> > > &dev_attr_sriov_drivers_autoprobe.attr,
> > > +   &dev_attr_sriov_vf_total_msix.attr,
> > > NULL,
> > >  };
> > >
> > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev)
> > > sysfs_remove_link(&dev->dev.kobj, "dep_link");
> > >
> > > iov->num_VFs = 0;
> > > +   iov->vf_total_msix = 0;
> > > pci_iov_set_numvfs(dev, 0);
> > >  }
> > >
> > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
> > >  }
> > 

Re: [PATCH net-next v14 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Vladimir Oltean
On Tue, Jan 12, 2021 at 07:06:29PM +0100, Marek Behún wrote:
> On Tue, 12 Jan 2021 16:29:09 +
> Russell King - ARM Linux admin  wrote:
> 
> > I'm seriously thinking about changing the phylink_validate() interface
> > such that the question of which link _modes_ are supported no longer
> > comes up with MAC drivers, but instead MAC drivers say what interface
> > modes, speeds for each interface mode, duplexes for each speed are
> > supported.
> 
> BTW this would also solve the situation where DSA needs to know which
> interface modes are supported on a particular port to know which modes
> we can try on SFP connected to a DSA port.

What do you mean here? What makes this specific to DSA?


Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs

2021-01-12 Thread Alexander Duyck
On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky  wrote:
>
> On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote:
> > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky  wrote:
> > >
> > > From: Leon Romanovsky 
> > >
> > > Extend PCI sysfs interface with a new callback that allows configure
> > > the number of MSI-X vectors for specific SR-IO VF. This is needed
> > > to optimize the performance of newly bound devices by allocating
> > > the number of vectors based on the administrator knowledge of targeted VM.
> > >
> > > This function is applicable for SR-IOV VF because such devices allocate
> > > their MSI-X table before they will run on the VMs and HW can't guess the
> > > right number of vectors, so the HW allocates them statically and equally.
> > >
> > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen
> > > for the VFs and it is writable as long as a driver is not bounded to the 
> > > VF.
> > >
> > > The values accepted are:
> > >  * > 0 - this will be number reported by the VF's MSI-X capability
> > >  * < 0 - not valid
> > >  * = 0 - will reset to the device default value
> > >
> > > Signed-off-by: Leon Romanovsky 
> > > ---
> > >  Documentation/ABI/testing/sysfs-bus-pci | 20 
> > >  drivers/pci/iov.c   | 62 +
> > >  drivers/pci/msi.c   | 29 
> > >  drivers/pci/pci-sysfs.c |  1 +
> > >  drivers/pci/pci.h   |  2 +
> > >  include/linux/pci.h |  8 +++-
> > >  6 files changed, 121 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci 
> > > b/Documentation/ABI/testing/sysfs-bus-pci
> > > index 25c9c39770c6..05e26e5da54e 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci
> > > @@ -375,3 +375,23 @@ Description:
> > > The value comes from the PCI kernel device state and can 
> > > be one
> > > of: "unknown", "error", "D0", D1", "D2", "D3hot", 
> > > "D3cold".
> > > The file is read only.
> > > +
> > > +What:  /sys/bus/pci/devices/.../vf_msix_vec
> >
> > So the name for this doesn't seem to match existing SR-IOV naming.  It
> > seems like this should probably be something like sriov_vf_msix_count
> > in order to be closer to the actual naming of what is being dealt
> > with.
>
> I'm open for suggestions. I didn't use sriov_vf_msix_count because it
> seems too long for me.
>
> >
> > > +Date:  December 2020
> > > +Contact:   Leon Romanovsky 
> > > +Description:
> > > +   This file is associated with the SR-IOV VFs.
> > > +   It allows configuration of the number of MSI-X vectors for
> > > +   the VF. This is needed to optimize performance of newly 
> > > bound
> > > +   devices by allocating the number of vectors based on the
> > > +   administrator knowledge of targeted VM.
> > > +
> > > +   The values accepted are:
> > > +* > 0 - this will be number reported by the VF's MSI-X
> > > +capability
> > > +* < 0 - not valid
> > > +* = 0 - will reset to the device default value
> > > +
> > > +   The file is writable if the PF is bound to a driver that
> > > +   supports the ->sriov_set_msix_vec_count() callback and 
> > > there
> > > +   is no driver bound to the VF.
> > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > index 4afd4ee4f7f0..42c0df4158d1 100644
> > > --- a/drivers/pci/iov.c
> > > +++ b/drivers/pci/iov.c
> > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
> > > return (dev->devfn + dev->sriov->offset +
> > > dev->sriov->stride * vf_id) & 0xff;
> > >  }
> > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn);
> > >
> > >  /*
> > >   * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride 
> > > may
> > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = {
> > > .is_visible = sriov_attrs_are_visible,
> > >  };
> > >
> > > +#ifdef CONFIG_PCI_MSI
> > > +static ssize_t vf_msix_vec_show(struct device *dev,
> > > +   struct device_attribute *attr, char *buf)
> > > +{
> > > +   struct pci_dev *pdev = to_pci_dev(dev);
> > > +   int numb = pci_msix_vec_count(pdev);
> > > +   struct pci_dev *pfdev;
> > > +
> > > +   if (numb < 0)
> > > +   return numb;
> > > +
> > > +   pfdev = pci_physfn(pdev);
> > > +   if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count)
> > > +   return -EOPNOTSUPP;
> > > +
> >
> > This doesn't make sense to me. You are getting the vector count for
> > the PCI device and reporting that. Are you expecting to call this on
> > the PF or the VFs? It seems like this should be a PF attribute and not
> > b

[PATCH net 3/3] netfilter: nf_nat: Fix memleak in nf_nat_init

2021-01-12 Thread Pablo Neira Ayuso
From: Dinghao Liu 

When register_pernet_subsys() fails, nf_nat_bysource
should be freed just like when nf_ct_extend_register()
fails.

Fixes: 1cd472bf036ca ("netfilter: nf_nat: add nat hook register functions to 
nf_nat")
Signed-off-by: Dinghao Liu 
Acked-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_nat_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index ea923f8cf9c4..b7c3c902290f 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -1174,6 +1174,7 @@ static int __init nf_nat_init(void)
ret = register_pernet_subsys(&nat_net_ops);
if (ret < 0) {
nf_ct_extend_unregister(&nat_extend);
+   kvfree(nf_nat_bysource);
return ret;
}
 
-- 
2.20.1



[PATCH net 2/3] netfilter: conntrack: fix reading nf_conntrack_buckets

2021-01-12 Thread Pablo Neira Ayuso
From: Jesper Dangaard Brouer 

The old way of changing the conntrack hashsize runtime was through changing
the module param via file /sys/module/nf_conntrack/parameters/hashsize. This
was extended to sysctl change in commit 3183ab8997a4 ("netfilter: conntrack:
allow increasing bucket size via sysctl too").

The commit introduced second "user" variable nf_conntrack_htable_size_user
which shadow actual variable nf_conntrack_htable_size. When hashsize is
changed via module param this "user" variable isn't updated. This results in
sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users
update via the old way.

This patch fix the issue by always updating "user" variable when reading the
proc file. This will take care of changes to the actual variable without
sysctl need to be aware.

Fixes: 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via 
sysctl too")
Reported-by: Yoel Caspersen 
Signed-off-by: Jesper Dangaard Brouer 
Acked-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_standalone.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_conntrack_standalone.c 
b/net/netfilter/nf_conntrack_standalone.c
index 46c5557c1fec..0ee702d374b0 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -523,6 +523,9 @@ nf_conntrack_hash_sysctl(struct ctl_table *table, int write,
 {
int ret;
 
+   /* module_param hashsize could have changed value */
+   nf_conntrack_htable_size_user = nf_conntrack_htable_size;
+
ret = proc_dointvec(table, write, buffer, lenp, ppos);
if (ret < 0 || !write)
return ret;
-- 
2.20.1



[PATCH net 0/3] Netfilter fixes for net

2021-01-12 Thread Pablo Neira Ayuso
Hi,

The following patchset contains Netfilter fixes for net:

1) Pass conntrack -f to specify family in netfilter conntrack helper
   selftests, from Chen Yi.

2) Honor hashsize modparam from nf_conntrack_buckets sysctl,
   from Jesper D. Brouer.

3) Fix memleak in nf_nat_init() error path, from Dinghao Liu.

Chen Yi (1):
  selftests: netfilter: Pass family parameter "-f" to conntrack tool

Dinghao Liu (1):
  netfilter: nf_nat: Fix memleak in nf_nat_init

Jesper Dangaard Brouer (1):
  netfilter: conntrack: fix reading nf_conntrack_buckets

 net/netfilter/nf_conntrack_standalone.c  |  3 +++
 net/netfilter/nf_nat_core.c  |  1 +
 .../selftests/netfilter/nft_conntrack_helper.sh  | 12 +---
 3 files changed, 13 insertions(+), 3 deletions(-)

Please, pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!



The following changes since commit c49243e8898233de18edfaaa5b7b261ea457f221:

  Merge branch 'net-fix-issues-around-register_netdevice-failures' (2021-01-08 
19:27:44 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 869f4fdaf4ca7bb6e0d05caf6fa1108dddc346a7:

  netfilter: nf_nat: Fix memleak in nf_nat_init (2021-01-11 00:34:11 +0100)


Chen Yi (1):
  selftests: netfilter: Pass family parameter "-f" to conntrack tool

Dinghao Liu (1):
  netfilter: nf_nat: Fix memleak in nf_nat_init

Jesper Dangaard Brouer (1):
  netfilter: conntrack: fix reading nf_conntrack_buckets

 net/netfilter/nf_conntrack_standalone.c   |  3 +++
 net/netfilter/nf_nat_core.c   |  1 +
 tools/testing/selftests/netfilter/nft_conntrack_helper.sh | 12 +---
 3 files changed, 13 insertions(+), 3 deletions(-)


[PATCH net 1/3] selftests: netfilter: Pass family parameter "-f" to conntrack tool

2021-01-12 Thread Pablo Neira Ayuso
From: Chen Yi 

Fix nft_conntrack_helper.sh false fail report:

1) Conntrack tool need "-f ipv6" parameter to show out ipv6 traffic items.

2) Sleep 1 second after background nc send packet, to make sure check
is after this statement executed.

False report:
FAIL: ns1-lkjUemYw did not show attached helper ip set via ruleset
PASS: ns1-lkjUemYw connection on port 2121 has ftp helper attached
...

After fix:
PASS: ns1-2hUniwU2 connection on port 2121 has ftp helper attached
PASS: ns2-2hUniwU2 connection on port 2121 has ftp helper attached
...

Fixes: 619ae8e0697a6 ("selftests: netfilter: add test case for conntrack helper 
assignment")
Signed-off-by: Chen Yi 
Signed-off-by: Pablo Neira Ayuso 
---
 .../selftests/netfilter/nft_conntrack_helper.sh  | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/netfilter/nft_conntrack_helper.sh 
b/tools/testing/selftests/netfilter/nft_conntrack_helper.sh
index edf0a48da6bf..bf6b9626c7dd 100755
--- a/tools/testing/selftests/netfilter/nft_conntrack_helper.sh
+++ b/tools/testing/selftests/netfilter/nft_conntrack_helper.sh
@@ -94,7 +94,13 @@ check_for_helper()
local message=$2
local port=$3
 
-   ip netns exec ${netns} conntrack -L -p tcp --dport $port 2> /dev/null 
|grep -q 'helper=ftp'
+   if echo $message |grep -q 'ipv6';then
+   local family="ipv6"
+   else
+   local family="ipv4"
+   fi
+
+   ip netns exec ${netns} conntrack -L -f $family -p tcp --dport $port 2> 
/dev/null |grep -q 'helper=ftp'
if [ $? -ne 0 ] ; then
echo "FAIL: ${netns} did not show attached helper $message" 1>&2
ret=1
@@ -111,8 +117,8 @@ test_helper()
 
sleep 3 | ip netns exec ${ns2} nc -w 2 -l -p $port > /dev/null &
 
-   sleep 1
sleep 1 | ip netns exec ${ns1} nc -w 2 10.0.1.2 $port > /dev/null &
+   sleep 1
 
check_for_helper "$ns1" "ip $msg" $port
check_for_helper "$ns2" "ip $msg" $port
@@ -128,8 +134,8 @@ test_helper()
 
sleep 3 | ip netns exec ${ns2} nc -w 2 -6 -l -p $port > /dev/null &
 
-   sleep 1
sleep 1 | ip netns exec ${ns1} nc -w 2 -6 dead:1::2 $port > /dev/null &
+   sleep 1
 
check_for_helper "$ns1" "ipv6 $msg" $port
check_for_helper "$ns2" "ipv6 $msg" $port
-- 
2.20.1



Re: [PATCH net-next] net: ks8851: Connect and start/stop the internal PHY

2021-01-12 Thread Marek Vasut

On 1/11/21 3:47 PM, Andrew Lunn wrote:

On Mon, Jan 11, 2021 at 01:53:37PM +0100, Marek Vasut wrote:

Unless the internal PHY is connected and started, the phylib will not
poll the PHY for state and produce state updates. Connect the PHY and
start/stop it.


Hi Marek

Please continue the conversion and remove all mii_calls.

ks8851_set_link_ksettings() calling mii_ethtool_set_link_ksettings()
is not good, phylib will not know about changes which we made to the
PHY etc.


Hi,

I noticed a couple of drivers implement both the mii and mdiobus 
options, I was pondering why is that. Is there some legacy backward 
compatibility reason for keeping both or is it safe to remove the mii 
support completely from the driver?


Either way, I will do that in a separate patch, so it could be reverted 
if it breaks something.


Re: [PATCH net-next] net: ks8851: Connect and start/stop the internal PHY

2021-01-12 Thread Marek Vasut

On 1/11/21 3:43 PM, Heiner Kallweit wrote:

On 11.01.2021 15:10, Marek Vasut wrote:

On 1/11/21 2:50 PM, Heiner Kallweit wrote:

On 11.01.2021 14:38, Marek Vasut wrote:

On 1/11/21 2:26 PM, Heiner Kallweit wrote:
[...]


LGTM. When having a brief look at the driver I stumbled across two things:

1. Do MAC/PHY support any pause mode? Then a call to
  phy_support_(a)sym_pause() would be missing.


https://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8851-16MLL-Single-Port-Ethernet-MAC-Controller-with-8-Bit-or-16-Bit-Non-PCI-Interface-DS2357B.pdf
page 64

https://www.mouser.com/datasheet/2/268/ksz8851-16mll_ds-776208.pdf
page 65

The later is more complete.

Apparently it does support pause.


Based on the datasheet, does it support sym or asym pause ?



According to the description of flow control on p.23 it can support asym pause.
However on the MAC side flow control doesn't seem to be always active, it's
controlled by these two bits:

p.49, TXCR, bit 3
p.50, RXCR1, bit 10

Default seems to be that flow control is disabled.


So I guess this patch is OK as-is ?


Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option

2021-01-12 Thread Jarod Wilson
On Tue, Jan 12, 2021 at 01:39:10PM -0800, Jay Vosburgh wrote:
> Jarod Wilson  wrote:
> 
> >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> >> > Jarod Wilson  wrote:
> >> > 
> >> > >This comes from an end-user request, where they're running multiple VMs 
> >> > >on
> >> > >hosts with bonded interfaces connected to some interest switch 
> >> > >topologies,
> >> > >where 802.3ad isn't an option. They're currently running a proprietary
> >> > >solution that effectively achieves load-balancing of VMs and bandwidth
> >> > >utilization improvements with a similar form of transmission algorithm.
> >> > >
> >> > >Basically, each VM has it's own vlan, so it always sends its traffic out
> >> > >the same interface, unless that interface fails. Traffic gets split
> >> > >between the interfaces, maintaining a consistent path, with failover 
> >> > >still
> >> > >available if an interface goes down.
> >> > >
> >> > >This has been rudimetarily tested to provide similar results, suitable 
> >> > >for
> >> > >them to use to move off their current proprietary solution.
> >> > >
> >> > >Still on the TODO list, if these even looks sane to begin with, is
> >> > >fleshing out Documentation/networking/bonding.rst.
> >> > 
> >> >  I'm sure you're aware, but any final submission will also need
> >> > to include netlink and iproute2 support.
> >> 
> >> I believe everything for netlink support is already included, but I'll
> >> double-check that before submitting something for inclusion consideration.
> >
> >I'm not certain if what you actually meant was that I'd have to patch
> >iproute2 as well, which I've definitely stumbled onto today, but it's a
> >2-line patch, and everything seems to be working fine with it:
> 
>   Yes, that's what I meant.
> 
> >$ sudo ip link set bond0 type bond xmit_hash_policy 5
> 
>   Does the above work with the text label (presumably "vlansrc")
> as well as the number, and does "ip link add test type bond help" print
> the correct text for XMIT_HASH_POLICY?

All of the above looks correct to me, output below. Before submitting...
Could rename it from vlansrc to vlan+srcmac or some variation thereof if
it's desired. I tried to keep it relatively short, but it's perhaps a bit
less succinct like I have it now, and other modes include a +.

$ sudo modprobe bonding mode=2 max_bonds=1 xmit_hash_policy=0
$ sudo ip link set bond0 type bond xmit_hash_policy vlansrc
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v4.18.0-272.el8.vstx.x86_64

Bonding Mode: load balancing (xor)
Transmit Hash Policy: vlansrc (5)
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

$ sudo ip link add test type bond help
Usage: ... bond [ mode BONDMODE ] [ active_slave SLAVE_DEV ]
[ clear_active_slave ] [ miimon MIIMON ]
[ updelay UPDELAY ] [ downdelay DOWNDELAY ]
[ peer_notify_delay DELAY ]
[ use_carrier USE_CARRIER ]
[ arp_interval ARP_INTERVAL ]
[ arp_validate ARP_VALIDATE ]
[ arp_all_targets ARP_ALL_TARGETS ]
[ arp_ip_target [ ARP_IP_TARGET, ... ] ]
[ primary SLAVE_DEV ]
[ primary_reselect PRIMARY_RESELECT ]
[ fail_over_mac FAIL_OVER_MAC ]
[ xmit_hash_policy XMIT_HASH_POLICY ]
[ resend_igmp RESEND_IGMP ]
[ num_grat_arp|num_unsol_na NUM_GRAT_ARP|NUM_UNSOL_NA ]
[ all_slaves_active ALL_SLAVES_ACTIVE ]
[ min_links MIN_LINKS ]
[ lp_interval LP_INTERVAL ]
[ packets_per_slave PACKETS_PER_SLAVE ]
[ tlb_dynamic_lb TLB_DYNAMIC_LB ]
[ lacp_rate LACP_RATE ]
[ ad_select AD_SELECT ]
[ ad_user_port_key PORTKEY ]
[ ad_actor_sys_prio SYSPRIO ]
[ ad_actor_system LLADDR ]

BONDMODE := 
balance-rr|active-backup|balance-xor|broadcast|802.3ad|balance-tlb|balance-alb
ARP_VALIDATE := none|active|backup|all
ARP_ALL_TARGETS := any|all
PRIMARY_RESELECT := always|better|failure
FAIL_OVER_MAC := none|active|follow
XMIT_HASH_POLICY := layer2|layer2+3|layer3+4|encap2+3|encap3+4|vlansrc
LACP_RATE := slow|fast
AD_SELECT := stable|bandwidth|count


-- 
Jarod Wilson
ja...@redhat.com



Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell

2021-01-12 Thread Marek Behún
On Tue, 12 Jan 2021 20:54:04 +0100
Marek Behún  wrote:

> + /* mv88e6393x family errata 3.7 :
> +  * When changing cmode on SERDES port from any other mode to 1000BASE-X
> +  * mode the link may not come up due to invalid 1000BASE-X
> +  * advertisement.
> +  * Workaround: Correct advertisement and reset PHY core.
> +  */
> + if (cmode == MV88E6XXX_PORT_STS_CMODE_1000BASEX) {
> + reg = MV88E6390_SGMII_ANAR_1000BASEX_FD;
> + err = mv88e6390_serdes_write(chip, lane, MDIO_MMD_PHYXS,
> +  MV88E6390_SGMII_ANAR, reg);
> + if (err)
> + return err;
> +
> + /* soft reset the PCS/PMA */
> + err = mv88e6390_serdes_read(chip, lane, MDIO_MMD_PHYXS,
> + MV88E6390_SGMII_CONTROL, ®);
> + if (err)
> + return err;
> +
> + reg |= MV88E6390_SGMII_CONTROL_RESET;
> + err = mv88e6390_serdes_write(chip, lane, MDIO_MMD_PHYXS,
> +  MV88E6390_SGMII_CONTROL, reg);
> + if (err)
> + return err;

It would seem that this is already done in
mv88e6390_serdes_pcs_config, just without the last reset.

> +#define MV88E6390_SGMII_STATUS_AN_ABLE   BIT(3)
> +#define MV88E6390_SGMII_ANAR 0x2004
> +#define MV88E6390_SGMII_ANAR_1000BASEX_FDBIT(5)
> +#define MV88E6390_SGMII_CONTROL  0x2000

This register is already called MV88E6390_SGMII_BMCR and the bits are
defined as BMCR_* macros. Thse same for MV88E6390_SGMII_STATUS and
MV88E6390_SGMII_ANAR.

> +#define MV88E6390_SGMII_CONTROL_RESETBIT(15)
> +#define MV88E6390_SGMII_CONTROL_LOOPBACK BIT(14)
> +#define MV88E6390_SGMII_CONTROL_PDOWNBIT(11)
> +#define MV88E6390_SGMII_STATUS   0x2001

I shall fix this in another version.


Re: [PATCH net-next] net: ipa: add config dependency on QCOM_SMEM

2021-01-12 Thread Bjorn Andersson
On Tue 12 Jan 13:21 CST 2021, Alex Elder wrote:

> The IPA driver depends on some SMEM functionality (qcom_smem_init(),
> qcom_smem_alloc(), and qcom_smem_virt_to_phys()), but this is not
> reflected in the configuration dependencies.  Add a dependency on
> QCOM_SMEM to avoid attempts to build the IPA driver without SMEM.
> This avoids a link error for certain configurations.
> 
> Reported-by: Randy Dunlap 
> Fixes: 38a4066f593c5 ("net: ipa: support COMPILE_TEST")
> Signed-off-by: Alex Elder 

Reviewed-by: Bjorn Andersson 

Regards,
Bjorn

> ---
>  drivers/net/ipa/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ipa/Kconfig b/drivers/net/ipa/Kconfig
> index 10a0e041ee775..b68f1289b89ef 100644
> --- a/drivers/net/ipa/Kconfig
> +++ b/drivers/net/ipa/Kconfig
> @@ -1,6 +1,6 @@
>  config QCOM_IPA
>   tristate "Qualcomm IPA support"
> - depends on 64BIT && NET
> + depends on 64BIT && NET && QCOM_SMEM
>   depends on ARCH_QCOM || COMPILE_TEST
>   depends on QCOM_RPROC_COMMON || (QCOM_RPROC_COMMON=n && COMPILE_TEST)
>   select QCOM_MDT_LOADER if ARCH_QCOM
> -- 
> 2.20.1
> 


[PATCH net-next] net: phy: ar803x: disable extended next page bit

2021-01-12 Thread Russell King
This bit is enabled by default and advertises support for extended
next page support.  XNP is only needed for 10GBase-T and MultiGig
support which is not supported. Additionally, Cisco MultiGig switches
will read this bit and attempt 10Gb negotiation even though Next Page
support is disabled. This will cause timeouts when the interface is
forced to 100Mbps and auto-negotiation will fail. The interfaces are
only 1000Base-T and supporting auto-negotiation for this only requires
the Next Page bit to be set.

Taken from:
https://github.com/SolidRun/linux-stable/commit/7406c5244b7ea6bc17a2afe8568277a8c4b126a9
and adapted to mainline kernels by rmk.

Signed-off-by: Russell King 
---
 drivers/net/phy/at803x.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c
index 9636edb8d618..3909dc9fc94b 100644
--- a/drivers/net/phy/at803x.c
+++ b/drivers/net/phy/at803x.c
@@ -587,7 +587,13 @@ static int at803x_config_init(struct phy_device *phydev)
return ret;
}
 
-   return 0;
+   /* Ar803x extended next page bit is enabled by default. Cisco
+* multigig switches read this bit and attempt to negotiate 10Gbps
+* rates even if the next page bit is disabled. This is incorrect
+* behaviour but we still need to accomodate it. XNP is only needed
+* for 10Gbps support, so disable XNP.
+*/
+   return phy_modify(phydev, MII_ADVERTISE, MDIO_AN_CTRL1_XNP, 0);
 }
 
 static int at803x_ack_interrupt(struct phy_device *phydev)
-- 
2.20.1



Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 04:17:44 + Geethasowjanya Akula wrote:
> Hi All,
> 
> Any feedback on this patch.

I think Dave merged it already, see commit 81a4362016e7 ("octeontx2-pf:
Add RSS multi group support") in net-next.


Re: [net] net: feature check mandating HW_CSUM is wrong

2021-01-12 Thread Jakub Kicinski
On Wed, 13 Jan 2021 02:47:51 +0530 rohit maheshwari wrote:
> On 07/01/21 12:47 AM, Jakub Kicinski wrote:
> > On Wed,  6 Jan 2021 23:23:27 +0530 Rohit Maheshwari wrote:  
> >> Mandating NETIF_F_HW_CSUM to enable TLS offload feature is wrong.
> >> And it broke tls offload feature for the drivers, which are still
> >> using NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM. We should use
> >> NETIF_F_CSUM_MASK instead.
> >>
> >> Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is 
> >> disabled")
> >> Signed-off-by: Rohit Maheshwari   
> > Please use Tariq's suggestion.  
> HW_TLS_TX feature is for both IPv4/v6. And If one device is limited to
> support only IPv4 checksum offload, TLS offload should be allowed for
> that too. So I think putting a check of CSUM_MASK is enough.

If Tariq does not disagree please repost.


Re: [PATCH net-next 0/7] a set of fixes of coding style

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 00:42 -0600, Lijun Pan wrote:
> This series address several coding style problems.
> 
> Lijun Pan (7):
>   ibmvnic: prefer 'unsigned long' over 'unsigned long int'
>   ibmvnic: fix block comments
>   ibmvnic: fix braces
>   ibmvnic: avoid multiple line dereference
>   ibmvnic: fix miscellaneous checks
>   ibmvnic: add comments for spinlock_t definitions
>   ibmvnic: remove unused spinlock_t stats_lock definition
> 
>  drivers/net/ethernet/ibm/ibmvnic.c | 65 +---
> --
>  drivers/net/ethernet/ibm/ibmvnic.h | 11 ++---
>  2 files changed, 35 insertions(+), 41 deletions(-)
> 
Reviewed-by: Saeed Mahameed 



Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support

2021-01-12 Thread Saeed Mahameed
On Mon, 2021-01-04 at 12:50 +0530, Geetha sowjanya wrote:
> Hardware supports 8 RSS groups per interface. Currently we are using
> only group '0'. This patch allows user to create new RSS
> groups/contexts
> and use the same as destination for flow steering rules.
> 
> usage:
> To steer the traffic to RQ 2,3
> 
> ethtool -X eth0 weight 0 0 1 1 context new
> (It will print the allocated context id number)
> New RSS context is 1
> 
> ethtool -N eth0 flow-type tcp4 dst-port 80 context 1 loc 1
> 
> To delete the context
> ethtool -X eth0 context 1 delete
> 
> When an RSS context is removed, the active classification
> rules using this context are also removed.
> 
> Change-log:
> 
> v4
> - Fixed compiletime warning.
> - Address Saeed's comments on v3.
> 

This patch is marked as accepted in patchwork
https://patchwork.kernel.org/project/netdevbpf/patch/20210104072039.27297-1-gak...@marvell.com/

but it is not actually applied, maybe resend..


you can add:
Reviewed-by: Saeed Mahameed 




Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support

2021-01-12 Thread Saeed Mahameed
On Tue, 2021-01-12 at 15:16 -0800, Saeed Mahameed wrote:
> On Mon, 2021-01-04 at 12:50 +0530, Geetha sowjanya wrote:
> > Hardware supports 8 RSS groups per interface. Currently we are
> > using
> > only group '0'. This patch allows user to create new RSS
> > groups/contexts
> > and use the same as destination for flow steering rules.
> > 
> > usage:
> > To steer the traffic to RQ 2,3
> > 
> > ethtool -X eth0 weight 0 0 1 1 context new
> > (It will print the allocated context id number)
> > New RSS context is 1
> > 
> > ethtool -N eth0 flow-type tcp4 dst-port 80 context 1 loc 1
> > 
> > To delete the context
> > ethtool -X eth0 context 1 delete
> > 
> > When an RSS context is removed, the active classification
> > rules using this context are also removed.
> > 
> > Change-log:
> > 
> > v4
> > - Fixed compiletime warning.
> > - Address Saeed's comments on v3.
> > 
> 
> This patch is marked as accepted in patchwork
> https://patchwork.kernel.org/project/netdevbpf/patch/20210104072039.27297-1-gak...@marvell.com/
> 
> but it is not actually applied, maybe resend..
> 
> 
> you can add:
> Reviewed-by: Saeed Mahameed 
> 
> 

I missed Jakub's comment, The patch is already applied, i looked at the
wrong tree.

Thanks.



Re: [PATCH v3 bpf-next 5/7] bpf: support BPF ksym variables in kernel modules

2021-01-12 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 8:30 AM Daniel Borkmann  wrote:
>
> On 1/12/21 8:55 AM, Andrii Nakryiko wrote:
> > Add support for directly accessing kernel module variables from BPF programs
> > using special ldimm64 instructions. This functionality builds upon vmlinux
> > ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
> > specifying kernel module BTF's FD in insn[1].imm field.
> >
> > During BPF program load time, verifier will resolve FD to BTF object and 
> > will
> > take reference on BTF object itself and, for module BTFs, corresponding 
> > module
> > as well, to make sure it won't be unloaded from under running BPF program. 
> > The
> > mechanism used is similar to how bpf_prog keeps track of used bpf_maps.
> >
> > One interesting change is also in how per-CPU variable is determined. The
> > logic is to find .data..percpu data section in provided BTF, but both 
> > vmlinux
> > and module each have their own .data..percpu entries in BTF. So for module's
> > case, the search for DATASEC record needs to look at only module's added BTF
> > types. This is implemented with custom search function.
> >
> > Acked-by: Yonghong Song 
> > Acked-by: Hao Luo 
> > Signed-off-by: Andrii Nakryiko 
> [...]
> > +
> > +struct module *btf_try_get_module(const struct btf *btf)
> > +{
> > + struct module *res = NULL;
> > +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
> > + struct btf_module *btf_mod, *tmp;
> > +
> > + mutex_lock(&btf_module_mutex);
> > + list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) {
> > + if (btf_mod->btf != btf)
> > + continue;
> > +
> > + if (try_module_get(btf_mod->module))
> > + res = btf_mod->module;
>
> One more thought (follow-up would be okay I'd think) ... when a module 
> references
> a symbol from another module, it similarly needs to bump the refcount of the 
> module
> that is owning it and thus disallowing to unload for that other module's 
> lifetime.
> That usage dependency is visible via /proc/modules however, so if unload 
> doesn't work
> then lsmod allows a way to introspect that to the user. This seems to be 
> achieved via
> resolve_symbol() where it records its dependency/usage. Would be great if we 
> could at
> some point also include the BPF prog name into that list so that this is more 
> obvious.
> Wdyt?

I thought about it as well, but plenty of kernel things just grab the ref of ko
and don't add any way to introspect what piece of kernel is holding ko.
So this case won't be the first.
Also if we add it for bpf progs it could be confusing in lsmod.
Since it currently only shows other ko-s in there.
Long ago I had an awk script to parse that output to rmmod dependent modules
before rmmoding the main one. If somebody doing something like this
bpf prog names in the same place may break things.
So I think there are more cons than pros.
That is certainly a follow up if we agree on the direction.


Re: [PATCH] mm: net: memcg accounting for TCP rx zerocopy

2021-01-12 Thread Roman Gushchin
On Tue, Jan 12, 2021 at 01:41:05PM -0800, Shakeel Butt wrote:
> From: Arjun Roy 
> 
> TCP zerocopy receive is used by high performance network applications to
> further scale. For RX zerocopy, the memory containing the network data
> filled by network driver is directly mapped into the address space of
> high performance applications. To keep the TLB cost low, these
> applications unmaps the network memory in big batches. So, this memory
> can remain mapped for long time. This can cause memory isolation issue
> as this memory becomes unaccounted after getting mapped into the
> application address space. This patch adds the memcg accounting for such
> memory.
> 
> Accounting the network memory comes with its own unique challenge. The
> high performance NIC drivers use page pooling to reuse the pages to
> eliminate/reduce the expensive setup steps like IOMMU. These drivers
> keep an extra reference on the pages and thus we can not depends on the
> page reference for the uncharging. The page in the pool may keep a memcg
> pinned for arbitrary long time or may get used by other memcg.
> 
> This patch decouples the uncharging of the page from the refcnt and
> associate it with the map count i.e. the page gets uncharged when the
> last address space unmaps it. Now the question what if the driver drops
> its reference while the page is still mapped. That is fine as the
> address space also holds a reference to the page i.e. the reference
> count can not drop to zero before the map count.
> 
> Signed-off-by: Arjun Roy 
> Co-developed-by: Shakeel Butt 
> Signed-off-by: Shakeel Butt 
> ---
>  include/linux/memcontrol.h | 34 +++--
>  mm/memcontrol.c| 60 ++
>  mm/rmap.c  |  3 ++
>  net/ipv4/tcp.c | 27 +
>  4 files changed, 116 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7a38a1517a05..0b0e3b4615cf 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -349,11 +349,13 @@ extern struct mem_cgroup *root_mem_cgroup;
>  
>  enum page_memcg_data_flags {
>   /* page->memcg_data is a pointer to an objcgs vector */
> - MEMCG_DATA_OBJCGS = (1UL << 0),
> + MEMCG_DATA_OBJCGS   = (1UL << 0),
>   /* page has been accounted as a non-slab kernel page */
> - MEMCG_DATA_KMEM = (1UL << 1),
> + MEMCG_DATA_KMEM = (1UL << 1),
> + /* page has been accounted as network memory */
> + MEMCG_DATA_SOCK = (1UL << 2),
>   /* the next bit after the last actual flag */
> - __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
> + __NR_MEMCG_DATA_FLAGS   = (1UL << 3),
>  };
>  
>  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
> @@ -444,6 +446,11 @@ static inline bool PageMemcgKmem(struct page *page)
>   return page->memcg_data & MEMCG_DATA_KMEM;
>  }
>  
> +static inline bool PageMemcgSock(struct page *page)
> +{
> + return page->memcg_data & MEMCG_DATA_SOCK;
> +}
> +
>  #ifdef CONFIG_MEMCG_KMEM
>  /*
>   * page_objcgs - get the object cgroups vector associated with a page
> @@ -1095,6 +1102,11 @@ static inline bool PageMemcgKmem(struct page *page)
>   return false;
>  }
>  
> +static inline bool PageMemcgSock(struct page *page)
> +{
> + return false;
> +}
> +
>  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
>  {
>   return true;
> @@ -1561,6 +1573,10 @@ extern struct static_key_false 
> memcg_sockets_enabled_key;
>  #define mem_cgroup_sockets_enabled 
> static_branch_unlikely(&memcg_sockets_enabled_key)
>  void mem_cgroup_sk_alloc(struct sock *sk);
>  void mem_cgroup_sk_free(struct sock *sk);
> +int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, struct page 
> **pages,
> +  unsigned int nr_pages);
> +void mem_cgroup_uncharge_sock_pages(struct page **pages, unsigned int 
> nr_pages);
> +
>  static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
>  {
>   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
> @@ -1589,6 +1605,18 @@ static inline void memcg_set_shrinker_bit(struct 
> mem_cgroup *memcg,
> int nid, int shrinker_id)
>  {
>  }
> +
> +static inline int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg,
> +struct page **pages,
> +unsigned int nr_pages)
> +{
> + return 0;
> +}
> +
> +static inline void mem_cgroup_uncharge_sock_pages(struct page **pages,
> +   unsigned int nr_pages)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_MEMCG_KMEM
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index db9836f4b64b..38e94538e081 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -7061,6 +7061,66 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup 
> *memcg, unsigned int nr_pages)
>   refill_st

Re: (subset) [PATCH 00/10] Remove support for TX49xx

2021-01-12 Thread Alexandre Belloni
On Tue, 5 Jan 2021 15:02:45 +0100, Thomas Bogendoerfer wrote:
> I couldn't find any buyable product other than reference boards using
> TX49xx CPUs. And since nobody showed interest in keeping support for
> it, it's time to remove it.
> 
> I've split up the removal into seperate parts for different maintainers.
> So if the patch fits your needs, please take it via your tree or
> give me an ack so I can apply them  the mips-next tree.
> 
> [...]

Applied, thanks!

[08/10] rtc: tx4939: Remove driver
commit: 446667df283002fdda0530523347ffd1cf053373

Best regards,
-- 
Alexandre Belloni 


Re: [PATCH] tcp: keepalive fixes

2021-01-12 Thread Enke Chen
Hi, Yuchung:

I have attached the python script that reproduces the keepalive issues.
The script is a slight modification of the one written by Marek Majkowski:

https://github.com/cloudflare/cloudflare-blog/blob/master/2019-09-tcp-keepalives/test-zero.py

Please note that only the TCP keepalive is configured, and not the user timeout.

Thanks.  -- Enke

On Tue, Jan 12, 2021 at 02:48:01PM -0800, Yuchung Cheng wrote:
> On Tue, Jan 12, 2021 at 2:31 PM Enke Chen  wrote:
> >
> > From: Enke Chen 
> >
> > In this patch two issues with TCP keepalives are fixed:
> >
> > 1) TCP keepalive does not timeout when there are data waiting to be
> >delivered and then the connection got broken. The TCP keepalive
> >timeout is not evaluated in that condition.
> hi enke
> Do you have an example to demonstrate this issue -- in theory when
> there is data inflight, an RTO timer should be pending (which
> considers user-timeout setting). based on the user-timeout description
> (man tcp), the user timeout should abort the socket per the specified
> time after data commences. some data would help to understand the
> issue.
> 

--
#! /usr/bin/python

import io
import os
import select
import socket
import time
import utils
import ctypes

utils.new_ns()

port = 1

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.bind(('127.0.0.1', port))
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024)

s.listen(16)

tcpdump = utils.tcpdump_start(port)
c = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
c.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024)
c.connect(('127.0.0.1', port))

x, _ = s.accept()

if False:
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 90*1000)

if True:
c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10)
c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)

time.sleep(0.2)
print("[ ] c.send()")
import fcntl
TIOCOUTQ=0x5411
c.setblocking(False)
while True:
bytes_avail = ctypes.c_int()
fcntl.ioctl(c.fileno(), TIOCOUTQ, bytes_avail)
if bytes_avail.value > 64*1024:
break
try:
c.send(b"A" * 16384 * 4)
except io.BlockingIOError:
break
c.setblocking(True)
time.sleep(0.2)
utils.ss(port)
utils.check_buffer(c)

t0 = time.time()

if True:
utils.drop_start(dport=port)
utils.drop_start(sport=port)

poll = select.poll()
poll.register(c, select.POLLIN)
poll.poll()

utils.ss(port)


e = c.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
print("[ ] SO_ERROR = %s" % (e,))

t1 = time.time()
print("[ ] took: %f seconds" % (t1-t0,))


Re: [PATCH] kernel: trace: uprobe: Fix word to the correct spelling

2021-01-12 Thread Randy Dunlap
On 1/11/21 8:50 PM, Bhaskar Chowdhury wrote:
> s/controling/controlling/p
> 
> Signed-off-by: Bhaskar Chowdhury 

Acked-by: Randy Dunlap 

Thanks.

> ---
>  kernel/trace/trace_uprobe.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 3cf7128e1ad3..55c6afd8cb27 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1635,7 +1635,7 @@ void destroy_local_trace_uprobe(struct trace_event_call 
> *event_call)
>  }
>  #endif /* CONFIG_PERF_EVENTS */
> 
> -/* Make a trace interface for controling probe points */
> +/* Make a trace interface for controlling probe points */
>  static __init int init_uprobe_trace(void)
>  {
>   int ret;
> --
> 2.26.2
> 


-- 
~Randy
You can't do anything without having to do something else first.
-- Belefant's Law


mv88e6xxx: 2500base-x inband AN is broken on Amethyst? what to do?

2021-01-12 Thread Marek Behún
Hello,

it seems that inband AN is broken on Amethyst when on 2500base-x mode.

Even SERDES scripts for Amethyst from Marvell has autonegotiation
disabled for 2500base-x mode. For all the other supported Serdes modes
autonegotiation is enabled in these scripts.

The current implementation in mv88e6390_serdes_pcs_config() enables
autonegotiation if phylink_autoneg_inband(mode) is true:

  if (phylink_autoneg_inband(mode))
bmcr = val | BMCR_ANENABLE;
  else
bmcr = val & ~BMCR_ANENABLE;

But for PHY_INTERFACE_MODE_2500BASEX this is broken on Amethyst. The
2500base-x mode seems to work only with autoneg disabled.

The result is that when I connect via a passive SFP cable Amethyst
serdes port with a Peridot serdes port, they will not link. If I
disable autonegotiation on both sides, they will link, though.

What is strange is that if I don't use Peridot, but connect the SFP
directly to Serdes on Armada 3720, where the mvneta driver also enables
autonegotiation for 2500base-x mode, they will link even if Amethyst
does not enable 2500base-x.

To summarize:
Amethyst  <->   Peridot
AN -AN -works
AN -AN +does not work

Amethyst  <->   Armada 3720 serdes
AN -AN +works

(It is possible that Marvell may find some workaround by touch some
 undocumented registers, to solve this. I will try to open a bug
 report.)

Should we just print an error in the serdes_pcs_config method if inband
autonegotiation is being requested?

phylink's code currently allows connecting SFPs in non MLO_AN_INBAND
mode only for when there is Broadcom BCM84881 PHY inside the SFP (by
method phylink_phy_no_inband() in phylink.c).

I wonder whether we can somehow in a sane way implement code to inform
phylink from the mv88e6xxx driver that inband is not supported for the
specific mode. Maybe the .mac_config/.pcs_config method could return an
error indicating this? Or the mv88e6xxx driver can just print an error
that the mode is not supported, and try to ask the user to disable AN?
That would need implementing this in ethtool for SFP, though.

What do you guys think?

Marek


Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue

2021-01-12 Thread Sukadev Bhattiprolu
Saeed Mahameed [sa...@kernel.org] wrote:
> On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote:


> > @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev)
> > return -EBUSY;
> > }
> >  
> > +   /* If ibmvnic_reset() is scheduling a reset, wait for it to
> > +* finish. Then prevent it from scheduling any more resets
> > +* and have the reset functions ignore any resets that have
> > +* already been scheduled.
> > +*/
> > +   spin_lock_irqsave(&adapter->remove_lock, flags);
> > adapter->state = VNIC_REMOVING;
> > +   spin_unlock_irqrestore(&adapter->remove_lock, flags);
> > +
> 
> Why irqsave/restore variants ? are you expecting this spinlock to be
> held in interruptcontext ?
> 
> > spin_unlock_irqrestore(&adapter->state_lock, flags);

Good question.

One of the callers of ibmvnic_reset() is the ->ndo_tx_timeout()
method which gets called from the watchdog timer.

Thanks for the review.

Sukadev


Re: [PATCH net] net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands

2021-01-12 Thread patchwork-bot+netdevbpf
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Mon, 11 Jan 2021 18:07:07 +0100 you wrote:
> In commit 826f328e2b7e ("net: dcb: Validate netlink message in DCB
> handler"), Linux started rejecting RTM_GETDCB netlink messages if they
> contained a set-like DCB_CMD_ command.
> 
> The reason was that privileges were only verified for RTM_SETDCB messages,
> but the value that determined the action to be taken is the command, not
> the message type. And validation of message type against the DCB command
> was the obvious missing piece.
> 
> [...]

Here is the summary with links:
  - [net] net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands
https://git.kernel.org/netdev/net/c/df85bc140a4d

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [PATCH net-next 0/7] a set of fixes of coding style

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 00:42:58 -0600 Lijun Pan wrote:
> This series address several coding style problems.

Erm, are you sure this series will not conflict with the fixes
Sukadev is sending?


Re: [PATCH] mm: net: memcg accounting for TCP rx zerocopy

2021-01-12 Thread Roman Gushchin
On Tue, Jan 12, 2021 at 03:36:18PM -0800, Arjun Roy wrote:
> On Tue, Jan 12, 2021 at 3:31 PM Roman Gushchin  wrote:
> >
> > On Tue, Jan 12, 2021 at 01:41:05PM -0800, Shakeel Butt wrote:
> > > From: Arjun Roy 
> > >
> > > TCP zerocopy receive is used by high performance network applications to
> > > further scale. For RX zerocopy, the memory containing the network data
> > > filled by network driver is directly mapped into the address space of
> > > high performance applications. To keep the TLB cost low, these
> > > applications unmaps the network memory in big batches. So, this memory
> > > can remain mapped for long time. This can cause memory isolation issue
> > > as this memory becomes unaccounted after getting mapped into the
> > > application address space. This patch adds the memcg accounting for such
> > > memory.
> > >
> > > Accounting the network memory comes with its own unique challenge. The
> > > high performance NIC drivers use page pooling to reuse the pages to
> > > eliminate/reduce the expensive setup steps like IOMMU. These drivers
> > > keep an extra reference on the pages and thus we can not depends on the
> > > page reference for the uncharging. The page in the pool may keep a memcg
> > > pinned for arbitrary long time or may get used by other memcg.
> > >
> > > This patch decouples the uncharging of the page from the refcnt and
> > > associate it with the map count i.e. the page gets uncharged when the
> > > last address space unmaps it. Now the question what if the driver drops
> > > its reference while the page is still mapped. That is fine as the
> > > address space also holds a reference to the page i.e. the reference
> > > count can not drop to zero before the map count.
> > >
> > > Signed-off-by: Arjun Roy 
> > > Co-developed-by: Shakeel Butt 
> > > Signed-off-by: Shakeel Butt 
> > > ---
> > >  include/linux/memcontrol.h | 34 +++--
> > >  mm/memcontrol.c| 60 ++
> > >  mm/rmap.c  |  3 ++
> > >  net/ipv4/tcp.c | 27 +
> > >  4 files changed, 116 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 7a38a1517a05..0b0e3b4615cf 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -349,11 +349,13 @@ extern struct mem_cgroup *root_mem_cgroup;
> > >
> > >  enum page_memcg_data_flags {
> > >   /* page->memcg_data is a pointer to an objcgs vector */
> > > - MEMCG_DATA_OBJCGS = (1UL << 0),
> > > + MEMCG_DATA_OBJCGS   = (1UL << 0),
> > >   /* page has been accounted as a non-slab kernel page */
> > > - MEMCG_DATA_KMEM = (1UL << 1),
> > > + MEMCG_DATA_KMEM = (1UL << 1),
> > > + /* page has been accounted as network memory */
> > > + MEMCG_DATA_SOCK = (1UL << 2),
> > >   /* the next bit after the last actual flag */
> > > - __NR_MEMCG_DATA_FLAGS  = (1UL << 2),
> > > + __NR_MEMCG_DATA_FLAGS   = (1UL << 3),
> > >  };
> > >
> > >  #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
> > > @@ -444,6 +446,11 @@ static inline bool PageMemcgKmem(struct page *page)
> > >   return page->memcg_data & MEMCG_DATA_KMEM;
> > >  }
> > >
> > > +static inline bool PageMemcgSock(struct page *page)
> > > +{
> > > + return page->memcg_data & MEMCG_DATA_SOCK;
> > > +}
> > > +
> > >  #ifdef CONFIG_MEMCG_KMEM
> > >  /*
> > >   * page_objcgs - get the object cgroups vector associated with a page
> > > @@ -1095,6 +1102,11 @@ static inline bool PageMemcgKmem(struct page *page)
> > >   return false;
> > >  }
> > >
> > > +static inline bool PageMemcgSock(struct page *page)
> > > +{
> > > + return false;
> > > +}
> > > +
> > >  static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> > >  {
> > >   return true;
> > > @@ -1561,6 +1573,10 @@ extern struct static_key_false 
> > > memcg_sockets_enabled_key;
> > >  #define mem_cgroup_sockets_enabled 
> > > static_branch_unlikely(&memcg_sockets_enabled_key)
> > >  void mem_cgroup_sk_alloc(struct sock *sk);
> > >  void mem_cgroup_sk_free(struct sock *sk);
> > > +int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, struct page 
> > > **pages,
> > > +  unsigned int nr_pages);
> > > +void mem_cgroup_uncharge_sock_pages(struct page **pages, unsigned int 
> > > nr_pages);
> > > +
> > >  static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup 
> > > *memcg)
> > >  {
> > >   if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && 
> > > memcg->tcpmem_pressure)
> > > @@ -1589,6 +1605,18 @@ static inline void memcg_set_shrinker_bit(struct 
> > > mem_cgroup *memcg,
> > > int nid, int shrinker_id)
> > >  {
> > >  }
> > > +
> > > +static inline int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg,
> > > +struct page **pages,
> > > + 

Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-12 Thread Willem de Bruijn
On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
 wrote:
>
> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
>  wrote:
> >
> > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
> >  wrote:
> > >
> > > Existing TUN module is able to use provided "steering eBPF" to
> > > calculate per-packet hash and derive the destination queue to
> > > place the packet to. The eBPF uses mapped configuration data
> > > containing a key for hash calculation and indirection table
> > > with array of queues' indices.
> > >
> > > This series of patches adds support for virtio-net hash reporting
> > > feature as defined in virtio specification. It extends the TUN module
> > > and the "steering eBPF" as follows:
> > >
> > > Extended steering eBPF calculates the hash value and hash type, keeps
> > > hash value in the skb->hash and returns index of destination virtqueue
> > > and the type of the hash. TUN module keeps returned hash type in
> > > (currently unused) field of the skb.
> > > skb->__unused renamed to 'hash_report_type'.
> > >
> > > When TUN module is called later to allocate and fill the virtio-net
> > > header and push it to destination virtqueue it populates the hash
> > > and the hash type into virtio-net header.
> > >
> > > VHOST driver is made aware of respective virtio-net feature that
> > > extends the virtio-net header to report the hash value and hash report
> > > type.
> >
> > Comment from Willem de Bruijn:
> >
> > Skbuff fields are in short supply. I don't think we need to add one
> > just for this narrow path entirely internal to the tun device.
> >
>
> We understand that and try to minimize the impact by using an already
> existing unused field of skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from
data producer to consumer. So I don't think it needs to claim scarce
bits in the skb.

tun_ebpf_select_queue stores the field, tun_put_user reads it and
converts it to the virtio_net_hdr in the descriptor.

tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
field in skb->cb is fragile, as in theory some code could overwrite
that between field between ndo_select_queue and
ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
control again. But in practice, I don't believe anything does.

Alternatively an existing skb field that is used only on disjoint
datapaths, such as ingress-only, could be viable.

> > Instead, you could just run the flow_dissector in tun_put_user if the
> > feature is negotiated. Indeed, the flow dissector seems more apt to me
> > than BPF here. Note that the flow dissector internally can be
> > overridden by a BPF program if the admin so chooses.
> >
> When this set of patches is related to hash delivery in the virtio-net
> packet in general,
> it was prepared in context of RSS feature implementation as defined in
> virtio spec [1]
> In case of RSS it is not enough to run the flow_dissector in tun_put_user:
> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
> hash type and queue index
> according to the (mapped) parameters (key, hash types, indirection
> table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection
than the default in case of multiqueue tun. Not sure what the exact
use cases are.

But RSS is exactly the purpose of the flow dissector. It is used for
that purpose in the software variant RPS. The flow dissector
implements a superset of the RSS spec, and certainly computes a
four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
has already computed a 4-tuple hash.

What it does not give is a type indication, such as
VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
In datapaths where the NIC has already computed the four-tuple hash
and stored it in skb->hash --the common case for servers--, That type
field is the only reason to have to compute again.

> Our intention is to keep the hash and hash type in the skb to populate them
> into a virtio-net header later in tun_put_user.
> Note that in this case the type of calculated hash is selected not
> only from flow dissections
> but also from limitations provided by the guest.
>
> This is already implemented in qemu (for case of vhost=off), see [2]
> (virtio_net_process_rss)
> For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN.

> Note that exact way of selecting rx virtqueue depends on the guest,
> it could be automatic steering (typical for Linux VM), RSS (typical
> for Windows VM) or
> any other steering mechanism implemented in loadable TUN steering BPF with
> or without hash calculation.
>
> [1] https://github.com/oasis-tcs/virtio-spec/blob/master/content.tex#L3740
> [2] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L1591
>
> > This also hits on a deeper point with the choice of hash values, that
> > I also noticed in my RFC patchset to implement the inverse [1][2]. It
> > i

Re: [PATCH] bpf: Hoise pahole version checks into Kconfig

2021-01-12 Thread Sedat Dilek
On Mon, Jan 11, 2021 at 7:06 PM Nathan Chancellor
 wrote:
>
> After commit da5fb18225b4 ("bpf: Support pre-2.25-binutils objcopy for
> vmlinux BTF"), having CONFIG_DEBUG_INFO_BTF enabled but lacking a valid
> copy of pahole results in a kernel that will fully compile but fail to
> link. The user then has to either install pahole or disable
> CONFIG_DEBUG_INFO_BTF and rebuild the kernel but only after their build
> has failed, which could have been a significant amount of time depending
> on the hardware.
>
> Avoid a poor user experience and require pahole to be installed with an
> appropriate version to select and use CONFIG_DEBUG_INFO_BTF, which is
> standard for options that require a specific tools version.
>
> Suggested-by: Sedat Dilek 
> Signed-off-by: Nathan Chancellor 

Thanks for the patch, Nathan,

Might be good to gave a hint to the user if pahole version does not
match requirements?

Feel free to add my:

Tested-by: Sedat Dilek 

- Sedat -



> ---
>  MAINTAINERS   |  1 +
>  init/Kconfig  |  4 
>  lib/Kconfig.debug |  6 ++
>  scripts/link-vmlinux.sh   | 13 -
>  scripts/pahole-version.sh | 16 
>  5 files changed, 23 insertions(+), 17 deletions(-)
>  create mode 100755 scripts/pahole-version.sh
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b8db7637263a..6f6e24285a94 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3282,6 +3282,7 @@ F:net/core/filter.c
>  F: net/sched/act_bpf.c
>  F: net/sched/cls_bpf.c
>  F: samples/bpf/
> +F: scripts/pahole-version.sh
>  F: tools/bpf/
>  F: tools/lib/bpf/
>  F: tools/testing/selftests/bpf/
> diff --git a/init/Kconfig b/init/Kconfig
> index b77c60f8b963..872c61b5d204 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -74,6 +74,10 @@ config TOOLS_SUPPORT_RELR
>  config CC_HAS_ASM_INLINE
> def_bool $(success,echo 'void foo(void) { asm inline (""); }' | $(CC) 
> -x c - -c -o /dev/null)
>
> +config PAHOLE_VERSION
> +   int
> +   default $(shell,$(srctree)/scripts/pahole-version.sh $(PAHOLE))
> +
>  config CONSTRUCTORS
> bool
> depends on !UML
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 7937265ef879..70c446af9664 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -267,6 +267,7 @@ config DEBUG_INFO_DWARF4
>
>  config DEBUG_INFO_BTF
> bool "Generate BTF typeinfo"
> +   depends on PAHOLE_VERSION >= 116
> depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED
> depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST
> help
> @@ -274,12 +275,9 @@ config DEBUG_INFO_BTF
>   Turning this on expects presence of pahole tool, which will convert
>   DWARF type info into equivalent deduplicated BTF type info.
>
> -config PAHOLE_HAS_SPLIT_BTF
> -   def_bool $(success, test `$(PAHOLE) --version | sed -E 
> 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "119")
> -
>  config DEBUG_INFO_BTF_MODULES
> def_bool y
> -   depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF
> +   depends on DEBUG_INFO_BTF && MODULES && PAHOLE_VERSION >= 119
> help
>   Generate compact split BTF type information for kernel modules.
>
> diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh
> index 6eded325c837..eef40fa9485d 100755
> --- a/scripts/link-vmlinux.sh
> +++ b/scripts/link-vmlinux.sh
> @@ -139,19 +139,6 @@ vmlinux_link()
>  # ${2} - file to dump raw BTF data into
>  gen_btf()
>  {
> -   local pahole_ver
> -
> -   if ! [ -x "$(command -v ${PAHOLE})" ]; then
> -   echo >&2 "BTF: ${1}: pahole (${PAHOLE}) is not available"
> -   return 1
> -   fi
> -
> -   pahole_ver=$(${PAHOLE} --version | sed -E 
> 's/v([0-9]+)\.([0-9]+)/\1\2/')
> -   if [ "${pahole_ver}" -lt "116" ]; then
> -   echo >&2 "BTF: ${1}: pahole version $(${PAHOLE} --version) is 
> too old, need at least v1.16"
> -   return 1
> -   fi
> -
> vmlinux_link ${1}
>
> info "BTF" ${2}
> diff --git a/scripts/pahole-version.sh b/scripts/pahole-version.sh
> new file mode 100755
> index ..6de6f734a345
> --- /dev/null
> +++ b/scripts/pahole-version.sh
> @@ -0,0 +1,16 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Usage: $ ./scripts/pahole-version.sh pahole
> +#
> +# Print the pahole version as a three digit string
> +# such as `119' for pahole v1.19 etc.
> +
> +pahole="$*"
> +
> +if ! [ -x "$(command -v $pahole)" ]; then
> +echo 0
> +exit 1
> +fi
> +
> +$pahole --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/'
>
> base-commit: e22d7f05e445165e58feddb4e40cc9c0f94453bc
> --
> 2.30.0
>


Re: [PATCH v0 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 11:55:11 -0800 Praveen Chaudhary wrote:
> Hi Jakub
> 
> Thanks for the review,
> 
> Sure, I will reraise the patch (again v0i, sonce no code changes) after 
> adding space before '<'.
> 
> This patch adds lines in 'include/uapi/', that requires ABI version changes 
> for debian build. I am not sure, if we need any such changes to avoid 
> breaking allmodconfig. It will be really helpful, if you can look at the 
> patch once 'https://lkml.org/lkml/2021/1/11/1668' and suggest on this. Thanks 
> a lot again.

The code doesn't build, AFAIK it's because:

net/ipv6/ndisc.c:1322:8: error: too few arguments to function 
‘rt6_add_dflt_router’


[PATCH bpf-next] bpf: reject too big ctx_size_in for raw_tp test run

2021-01-12 Thread Song Liu
syzbot reported a WARNING for allocating too big memory:

WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 
__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011
Modules linked in:
CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 
01/01/2011
RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976
Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 
48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 
74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8
RSP: 0018:c900012efb10 EFLAGS: 00010246
RAX:  RBX: 19200025df66 RCX: 
RDX:  RSI: dc00 RDI: 00140dc0
RBP: 00140dc0 R08:  R09: 
R10: 81b1f7e1 R11:  R12: 0014
R13: 0014 R14:  R15: 
FS:  0190c880() GS:8880b9e0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f08b7f316c0 CR3: 12073000 CR4: 001506f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267
alloc_pages include/linux/gfp.h:547 [inline]
kmalloc_order+0x2e/0xb0 mm/slab_common.c:837
kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853
kmalloc include/linux/slab.h:557 [inline]
kzalloc include/linux/slab.h:682 [inline]
bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282
bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline]
__do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440499
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7ffe1f3bfb18 EFLAGS: 0246 ORIG_RAX: 0141
RAX: ffda RBX: 004002c8 RCX: 00440499
RDX: 0048 RSI: 2600 RDI: 000a
RBP: 006ca018 R08:  R09: 004002c8
R10:  R11: 0246 R12: 00401ca0
R13: 00401d30 R14:  R15: 

This is because we didn't filter out too big ctx_size_in. Fix it by
rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64
numbers.

Reported-by: syzbot+4f98876664c7337a4...@syzkaller.appspotmail.com
Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
Cc: sta...@vger.kernel.org # v5.10+
Signed-off-by: Song Liu 
---
 net/bpf/test_run.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 23dfb2010ba69..58bcb8c849d54 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -272,7 +272,8 @@ int bpf_prog_test_run_raw_tp(struct bpf_prog *prog,
kattr->test.repeat)
return -EINVAL;
 
-   if (ctx_size_in < prog->aux->max_ctx_offset)
+   if (ctx_size_in < prog->aux->max_ctx_offset ||
+   ctx_size_in > MAX_BPF_FUNC_ARGS * sizeof(u64))
return -EINVAL;
 
if ((kattr->test.flags & BPF_F_TEST_RUN_ON_CPU) == 0 && cpu != 0)
-- 
2.24.1



Re: [PATCH net-next 0/5] skbuff: introduce skbuff_heads bulking and reusing

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote:
> On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin  wrote:
> >
> > From: Edward Cree 
> > Date: Tue, 12 Jan 2021 09:54:04 +
> >  
> > > Without wishing to weigh in on whether this caching is a good idea...  
> >
> > Well, we already have a cache to bulk flush "consumed" skbs, although
> > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and
> > a page frag cache to allocate skb->head that is also bulking the
> > operations, since it contains a (compound) page with the size of
> > min(SZ_32K, PAGE_SIZE).
> > If they wouldn't give any visible boosts, I think they wouldn't hit
> > mainline.
> >  
> > > Wouldn't it be simpler, rather than having two separate "alloc" and 
> > > "flush"
> > >  caches, to have a single larger cache, such that whenever it becomes full
> > >  we bulk flush the top half, and when it's empty we bulk alloc the bottom
> > >  half?  That should mean fewer branches, fewer instructions etc. than
> > >  having to decide which cache to act upon every time.  
> >
> > I though about a unified cache, but couldn't decide whether to flush
> > or to allocate heads and how much to process. Your suggestion answers
> > these questions and generally seems great. I'll try that one, thanks!
>  
> The thing is : kmalloc() is supposed to have batches already, and nice
> per-cpu caches.
> 
> This looks like an mm issue, are we sure we want to get over it ?
> 
> I would like a full analysis of why SLAB/SLUB does not work well for
> your test workload.

+1, it does feel like we're getting into mm territory

> More details, more numbers before we accept yet another
> 'networking optimization' adding more code to the 'fast' path.
> 
> More code means more latencies when all code needs to be brought up in
> cpu caches.



Re: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries

2021-01-12 Thread Sukadev Bhattiprolu
Saeed Mahameed [sa...@kernel.org] wrote:
> > -struct ibmvnic_rwi {
> > -   enum ibmvnic_reset_reason reset_reason;
> > -   struct list_head list;
> > -};
> > +  VNIC_RESET_CHANGE_PARAM,
> > +  VNIC_RESET_MAX}; // must be last
>this is not the preferred comment style: ^^
> 
> I would just drop the comment here, it is clear from the name of the
> enum.
> 

Yeah, we debated and figured the comment could serve as another reminder.

Here is updated patch, fixing the comment style.

Thanks

Sukadev
---
>From 59d4b23fe1f97a67436e14829368744ee288157d Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu 
Date: Wed, 16 Dec 2020 23:00:34 -0600
Subject: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries

Whenever we need to schedule a reset, we allocate an rwi (reset work
item?) entry and add to the list of pending resets.

Since we only add one rwi for a given reason type to the list (no duplicates).
we will only have a handful of reset types in the list - even in the
worst case. In the common case we should just have a couple of entries
at most.

Rather than allocating/freeing every time (and dealing with the corner
case of the allocation failing), use a fixed number of rwi entries.
The extra memory needed is tiny and most of it will be used over the
active life of the adapter.

This also fixes a couple of tiny memory leaks. One is in ibmvnic_reset()
where we don't free the rwi entries after deleting them from the list due
to a transport event.  The second is in __ibmvnic_reset() where if we
find that the adapter is being removed, we simply break out of the loop
(with rc = EBUSY) but ignore any rwi entries that remain on the list.

Fixes: 2770a7984db58 ("Introduce hard reset recovery")
Fixes: 36f1031c51a2 ("ibmvnic: Do not process reset during or after device 
removal")
Signed-off-by: Sukadev Bhattiprolu 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 123 +
 drivers/net/ethernet/ibm/ibmvnic.h |  14 ++--
 2 files changed, 78 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index cd8108dbddec..d1c2aaed1478 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2257,29 +2257,81 @@ static int do_hard_reset(struct ibmvnic_adapter 
*adapter,
return rc;
 }
 
-static struct ibmvnic_rwi *get_next_rwi(struct ibmvnic_adapter *adapter)
+/**
+ * Next reset will always be the first on the list.
+ * When we take it off the list, we move any remaining resets so
+ * that the next one is again the first on the list. Most of the
+ * time the pending_resets[] should have a couple of types of resets
+ * (FAILOVER, TIMEOUT or CHANGE-PARAM and less often, MOBILITY).
+ */
+static enum ibmvnic_reset_reason get_pending_reset(struct ibmvnic_adapter 
*adapter)
 {
-   struct ibmvnic_rwi *rwi;
+   enum ibmvnic_reset_reason *pending_resets;
+   enum ibmvnic_reset_reason reason = 0;
unsigned long flags;
+   int i;
 
spin_lock_irqsave(&adapter->rwi_lock, flags);
 
-   if (!list_empty(&adapter->rwi_list)) {
-   rwi = list_first_entry(&adapter->rwi_list, struct ibmvnic_rwi,
-  list);
-   list_del(&rwi->list);
-   } else {
-   rwi = NULL;
+   pending_resets = &adapter->pending_resets[0];
+
+   reason = pending_resets[0];
+
+   if (reason)  {
+   for (i = 0; i < adapter->next_reset; i++) {
+   pending_resets[i] = pending_resets[i+1];
+   if (!pending_resets[i])
+   break;
+   }
+   adapter->next_reset--;
+   }
+
+   spin_unlock_irqrestore(&adapter->rwi_lock, flags);
+   return reason;
+}
+
+/**
+ * Add a pending reset, making sure not to add duplicates.
+ * If @clear is set, clear all existing resets before adding.
+ *
+ * TODO: If clear (i.e force_reset_recovery) is true AND we have a
+ *  duplicate reset, wouldn't it still make sense to clear the
+ *  queue including the duplicate and add this reset? Preserving
+ *  existing behavior for now.
+ */
+static void add_pending_reset(struct ibmvnic_adapter *adapter,
+ enum ibmvnic_reset_reason reason,
+ bool clear)
+{
+   enum ibmvnic_reset_reason *pending_resets;
+   unsigned long flags;
+   int i;
+
+   spin_lock_irqsave(&adapter->rwi_lock, flags);
+
+   pending_resets = &adapter->pending_resets[0];
+
+   for (i = 0; i < adapter->next_reset; i++) {
+   if (pending_resets[i] == reason)
+   goto out;
+   }
+
+   if (clear) {
+   for (i = 0; i < adapter->next_reset; i++) {
+   pending_resets[i] = 0;
+   }
+   adapter->next_reset = 0;
}
 
+   pending_resets[adapter->next_reset] = r

RE: [PATCH v2] net: phy: realtek: Add support for RTL9000AA/AN

2021-01-12 Thread ashid...@fujitsu.com
> > I think it's possible to return a Master/Slave configuration.
>
> Great. It would be good to add it.

OK. I think it will take some time to implement this feature, 
as we prioritize investigating comments from Russell.

> > By the way, do you need the cable test function as implementedin
> > nxp-tja11xx.c?
> 
> We don't need it. But if you want to implement it, that would be
> great.

We also don't need the cable test feature.
However, we are interested in adding this feature, so we will consider 
adding the cable test feature after the RTL9000AA/AN support are merged.

Thanks & Best Regards,
Yuusuke Ashiduka


Re: [PATCH bpf-next v2 1/2] bpf: allow to retrieve sol_socket opts from sock_addr progs

2021-01-12 Thread Alexei Starovoitov
On Mon, Jan 11, 2021 at 3:09 PM Daniel Borkmann  wrote:
>
> The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options,
> however, _bpf_getsockopt() has little support to actually retrieve them.
> This small patch adds few misc options such as SO_MARK, SO_PRIORITY and
> SO_BINDTOIFINDEX. For the latter getter and setter are added. The mark and
> priority in particular allow to retrieve the options from BPF cgroup hooks
> to then implement custom behavior / settings on the syscall hooks compared
> to other sockets that stick to the defaults, for example.
>
> Signed-off-by: Daniel Borkmann 
> Acked-by: Yonghong Song 

Applied.


Re: [PATCH v3 bpf-next 7/7] selftests/bpf: test kernel module ksym externs

2021-01-12 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 3:41 AM Andrii Nakryiko  wrote:
>
> Add per-CPU variable to bpf_testmod.ko and use those from new selftest to
> validate it works end-to-end.
>
> Acked-by: Yonghong Song 
> Acked-by: Hao Luo 
> Signed-off-by: Andrii Nakryiko 

Applied.

FYI for everyone. This test needs the latest pahole.


RE: [PATCH] igb: avoid premature Rx buffer reuse

2021-01-12 Thread Li,Rongqing


> -Original Message-
> From: Alexander Duyck [mailto:alexander.du...@gmail.com]
> Sent: Wednesday, January 13, 2021 5:23 AM
> To: Li,Rongqing 
> Cc: Netdev ; intel-wired-lan
> ; Björn Töpel 
> Subject: Re: [PATCH] igb: avoid premature Rx buffer r 
> Okay, this explanation makes much more sense. Could you please either include
> this explanation in your patch, or include a reference to this patch as this
> explains clearly what the issue is while yours didn't and led to the 
> confusion as I
> was assuming the freeing was happening closer to the t0 case, and really the
> problem is t1.
> 
> Thanks.
> 
> - Alex


Ok, I will send V2

Thanks

-Li


Re: [Patch net] cls_flower: call nla_ok() before nla_next()

2021-01-12 Thread Jakub Kicinski
On Mon, 11 Jan 2021 18:55:48 -0800 Cong Wang wrote:
> From: Cong Wang 
> 
> fl_set_enc_opt() simply checks if there are still bytes left to parse,
> but this is not sufficent as syzbot seems to be able to generate
> malformatted netlink messages. nla_ok() is more strict so should be
> used to validate the next nlattr here.
> 
> And nla_validate_nested_deprecated() has less strict check too, it is
> probably too late to switch to the strict version, but we can just
> call nla_ok() too after it.
> 
> Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com
> Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
> Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options")
> Cc: Pieter Jansen van Vuuren 
> Cc: Jamal Hadi Salim 
> Cc: Xin Long 
> Cc: Jiri Pirko 
> Signed-off-by: Cong Wang 

Thanks for keeping up with the syzbot bugs!

> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 1319986693fc..e265c443536e 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> fl_flow_key *key,
>  
>   nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
>   msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]);
> + if (!nla_ok(nla_opt_msk, msk_depth))
> + return -EINVAL;

Can we just add another call to nla_validate_nested_deprecated() 
here instead of having to worry about each attr individually?
See below..

>   }
>  
>   nla_for_each_attr(nla_opt_key, nla_enc_key,
> @@ -1308,7 +1310,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct 
> fl_flow_key *key,
>   return -EINVAL;
>   }
>  
> - if (msk_depth)
> + if (nla_ok(nla_opt_msk, msk_depth))
>   nla_opt_msk = nla_next(nla_opt_msk, &msk_depth);

Should we not error otherwise? if msk_depth && !nla_ok() then the
message is clearly misformatted. If we don't error out we'll keep
reusing the same mask over and over, while the intention of this 
code was to have mask per key AFAICT.

>   break;
>   case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:


Re: [PATCH v4 1/1] can: dev: add software tx timestamps

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 11:03:00 +0100 Marc Kleine-Budde wrote:
> On 1/12/21 10:54 AM, Vincent Mailhol wrote:
> > Call skb_tx_timestamp() within can_put_echo_skb() so that a software
> > tx timestamp gets attached on the skb.
> > 
> > There two main reasons to include this call in can_put_echo_skb():
> > 
> >   * It easily allow to enable the tx timestamp on all devices with
> > just one small change.
> > 
> >   * According to Documentation/networking/timestamping.rst, the tx
> > timestamps should be generated in the device driver as close as
> > possible, but always prior to passing the packet to the network
> > interface. During the call to can_put_echo_skb(), the skb gets
> > cloned meaning that the driver should not dereference the skb
> > variable anymore after can_put_echo_skb() returns. This makes
> > can_put_echo_skb() the very last place we can use the skb without
> > having to access the echo_skb[] array.
> > 
> > Remark: by default, skb_tx_timestamp() does nothing. It needs to be
> > activated by passing the SOF_TIMESTAMPING_TX_SOFTWARE flag either
> > through socket options or control messages.
> > 
> > References:
> > 
> >  * Support for the error queue in CAN RAW sockets (which is needed for
> >tx timestamps) was introduced in:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb88531bdbfaafb827192d1fc6c5a3fcc4fadd96
> > 
> >   * Put the call to skb_tx_timestamp() just before adding it to the
> > array: https://lkml.org/lkml/2021/1/10/54
> > 
> >   * About Tx hardware timestamps
> > 
> > https://lore.kernel.org/linux-can/2021071152.gb11...@hoboy.vegasvil.org/
> > 
> > Signed-off-by: Vincent Mailhol   
> 
> Applied to linux-can-next/testing

Please make sure to address the warnings before this hits net-next:

https://patchwork.kernel.org/project/netdevbpf/patch/20210112130538.14912-2-mailhol.vinc...@wanadoo.fr/

Actually it appears not to build with allmodconfig..?


Re: [PATCH v4 1/1] can: dev: add software tx timestamps

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 17:46:25 -0800 Jakub Kicinski wrote:
> On Tue, 12 Jan 2021 11:03:00 +0100 Marc Kleine-Budde wrote:
> > On 1/12/21 10:54 AM, Vincent Mailhol wrote:  
> > > Call skb_tx_timestamp() within can_put_echo_skb() so that a software
> > > tx timestamp gets attached on the skb.
> > > 
> > > There two main reasons to include this call in can_put_echo_skb():
> > > 
> > >   * It easily allow to enable the tx timestamp on all devices with
> > > just one small change.
> > > 
> > >   * According to Documentation/networking/timestamping.rst, the tx
> > > timestamps should be generated in the device driver as close as
> > > possible, but always prior to passing the packet to the network
> > > interface. During the call to can_put_echo_skb(), the skb gets
> > > cloned meaning that the driver should not dereference the skb
> > > variable anymore after can_put_echo_skb() returns. This makes
> > > can_put_echo_skb() the very last place we can use the skb without
> > > having to access the echo_skb[] array.
> > > 
> > > Remark: by default, skb_tx_timestamp() does nothing. It needs to be
> > > activated by passing the SOF_TIMESTAMPING_TX_SOFTWARE flag either
> > > through socket options or control messages.
> > > 
> > > References:
> > > 
> > >  * Support for the error queue in CAN RAW sockets (which is needed for
> > >tx timestamps) was introduced in:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb88531bdbfaafb827192d1fc6c5a3fcc4fadd96
> > > 
> > >   * Put the call to skb_tx_timestamp() just before adding it to the
> > > array: https://lkml.org/lkml/2021/1/10/54
> > > 
> > >   * About Tx hardware timestamps
> > > 
> > > https://lore.kernel.org/linux-can/2021071152.gb11...@hoboy.vegasvil.org/
> > > 
> > > Signed-off-by: Vincent Mailhol 
> > 
> > Applied to linux-can-next/testing  
> 
> Please make sure to address the warnings before this hits net-next:
> 
> https://patchwork.kernel.org/project/netdevbpf/patch/20210112130538.14912-2-mailhol.vinc...@wanadoo.fr/
> 
> Actually it appears not to build with allmodconfig..?

Erm, apologies, I confused different CAN patches, this one did not get
build tested.


[PATCH v1 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.

2021-01-12 Thread Praveen Chaudhary
For IPv4, default route is learned via DHCPv4 and user is allowed to change
metric using config etc/network/interfaces. But for IPv6, default route can
be learned via RA, for which, currently a fixed metric value 1024 is used.

Ideally, user should be able to configure metric on default route for IPv6
similar to IPv4. This fix adds sysctl for the same.

Signed-off-by: Praveen Chaudhary
Signed-off-by: Zhenggen Xu

Changes in v1.
---
1.) Correct the call to rt6_add_dflt_router.
---

---
 Documentation/networking/ip-sysctl.rst | 18 ++
 include/linux/ipv6.h   |  1 +
 include/net/ip6_route.h|  3 ++-
 include/uapi/linux/ipv6.h  |  1 +
 include/uapi/linux/sysctl.h|  1 +
 net/ipv6/addrconf.c| 10 ++
 net/ipv6/ndisc.c   | 14 ++
 net/ipv6/route.c   |  5 +++--
 8 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst 
b/Documentation/networking/ip-sysctl.rst
index dd2b12a32b73..384159081d91 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1871,6 +1871,24 @@ accept_ra_defrtr - BOOLEAN
- enabled if accept_ra is enabled.
- disabled if accept_ra is disabled.
 
+accept_ra_defrtr_metric - INTEGER
+   Route metric for default route learned in Router Advertisement. This
+   value will be assigned as metric for the route learned via IPv6 Router
+   Advertisement.
+
+   Possible values are:
+   0:
+   Use default value i.e. IP6_RT_PRIO_USER 1024.
+   0x to -1:
+   -ve values represent high route metric, value will be 
treated as
+   unsigned value. This behaviour is inline with current 
IPv4 metric
+   shown with commands such as "route -n" or "ip route 
list".
+   1 to 0x7FF:
+   +ve values will be used as is for route metric.
+
+   Functional default: enabled if accept_ra_defrtr is enabled.
+   disabled if accept_ra_defrtr is disabled.
+
 accept_ra_from_local - BOOLEAN
Accept RA with source-address that is found on local machine
if the RA is otherwise proper and able to be accepted.
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index dda61d150a13..19af90c77200 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -31,6 +31,7 @@ struct ipv6_devconf {
__s32   max_desync_factor;
__s32   max_addresses;
__s32   accept_ra_defrtr;
+   __s32   accept_ra_defrtr_metric;
__s32   accept_ra_min_hop_limit;
__s32   accept_ra_pinfo;
__s32   ignore_routes_with_linkdown;
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 2a5277758379..a470bdab2420 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -174,7 +174,8 @@ struct fib6_info *rt6_get_dflt_router(struct net *net,
 struct net_device *dev);
 struct fib6_info *rt6_add_dflt_router(struct net *net,
 const struct in6_addr *gwaddr,
-struct net_device *dev, unsigned int pref);
+struct net_device *dev, unsigned int pref,
+unsigned int defrtr_usr_metric);
 
 void rt6_purge_dflt_routers(struct net *net);
 
diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h
index 13e8751bf24a..945de5de5144 100644
--- a/include/uapi/linux/ipv6.h
+++ b/include/uapi/linux/ipv6.h
@@ -189,6 +189,7 @@ enum {
DEVCONF_ACCEPT_RA_RT_INFO_MIN_PLEN,
DEVCONF_NDISC_TCLASS,
DEVCONF_RPL_SEG_ENABLED,
+   DEVCONF_ACCEPT_RA_DEFRTR_METRIC,
DEVCONF_MAX
 };
 
diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
index 458179df9b27..5e79c196e33c 100644
--- a/include/uapi/linux/sysctl.h
+++ b/include/uapi/linux/sysctl.h
@@ -571,6 +571,7 @@ enum {
NET_IPV6_ACCEPT_SOURCE_ROUTE=25,
NET_IPV6_ACCEPT_RA_FROM_LOCAL=26,
NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27,
+   NET_IPV6_ACCEPT_RA_DEFRTR_METRIC=28,
__NET_IPV6_MAX
 };
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index eff2cacd5209..702ec4a33936 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -205,6 +205,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = {
.max_desync_factor  = MAX_DESYNC_FACTOR,
.max_addresses  = IPV6_MAX_ADDRESSES,
.accept_ra_defrtr   = 1,
+   .accept_ra_defrtr_metric = 0,
.accept_ra_from_local   = 0,
.accept_ra_min_hop_limit= 1,
.accept_ra_pinfo= 1,
@@ -260,6 +261,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly 
= {
.ma

[PATCH v1 net-next 0/1] Allow user to set metric on default route learned via Router Advertisement.

2021-01-12 Thread Praveen Chaudhary
Allow user to set metric on default route learned via Router Advertisement.

Note: RFC 4191 does not say anything for metric for IPv6 default route.

Fix:
For IPv4, default route is learned via DHCPv4 and user is allowed to change
metric using config in etc/network/interfaces. But for IPv6, default route can
be learned via RA, for which, currently a fixed metric value 1024 is used.

Ideally, user should be able to configure metric on default route for IPv6
similar to IPv4. This fix adds sysctl for the same.

Logs:

For IPv4:


Config in etc/network/interfaces

```
auto eth0
iface eth0 inet dhcp
metric 4261413864
```

IPv4 Kernel Route Table:

```
$ sudo route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric RefUse Iface
0.0.0.0 172.11.44.1 0.0.0.0 UG-33553432 00 eth0
```

FRR Table, if a static route is configured. [In real scenario, it is useful to 
prefer BGP learned default route over DHCPv4 default route.]

```
Codes: K - kernel route, C - connected, S - static, R - RIP,
   O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
   T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
   > - selected route, * - FIB route

S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03
K   0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m
```


i.e. User can prefer Default Router learned via Routing Protocol,
Similar behavior is not possible for IPv6, without this fix.



After fix [for IPv6]:

```
sudo sysctl -w 
net.ipv6.conf.eth0.net.ipv6.conf.eth0.accept_ra_defrtr_metric=0x770003e9
```

IP monitor:

```
default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 1996489705  pref 
high
```

Kernel IPv6 routing table

```
DestinationNext Hop   Flag Met Ref Use If
::/0   fe80::xx16::feb3:ce8e  UGDAe 1996489705 0
 0 eth0
```

FRR Table, if a static route is configured. [In real scenario, it is useful to 
prefer BGP learned default route over IPv6 RA default route.]
```

Codes: K - kernel route, C - connected, S - static, R - RIPng,
   O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
   v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
   > - selected route, * - FIB route

S>* ::/0 [20/0] is directly connected, eth0, 00:00:06
K   ::/0 [119/1001] via fe80::xx16::feb3:ce8e, eth0, 6d07h43m

```

If the metric is changed later, the effect will be seen only when IPv6 RA is 
received, because the default route must be fully controlled by RA msg.
```
admin@lnos-x1-a-asw03:~$ sudo sysctl -w 
net.ipv6.conf.eth0.accept_ra_defrtr_metric=0x770003e8
net.ipv6.conf.eth0.accept_ra_defrtr_metric = 0x770003e8

```

IP monitor: when metric is changed after learning Default Route from previous 
IPv6 RA msg:
```
Deleted default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 
1996489705  expires 3sec hoplimit 64 pref high
default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 1996489704  pref 
high
```

Praveen Chaudhary (1):
  Allow user to set metric on default route learned via Router
Advertisement.

 Documentation/networking/ip-sysctl.rst | 18 ++
 include/linux/ipv6.h   |  1 +
 include/net/ip6_route.h|  3 ++-
 include/uapi/linux/ipv6.h  |  1 +
 include/uapi/linux/sysctl.h|  1 +
 net/ipv6/addrconf.c| 10 ++
 net/ipv6/ndisc.c   | 14 ++
 net/ipv6/route.c   |  5 +++--
 8 files changed, 46 insertions(+), 7 deletions(-)


base-commit: 139711f033f636cc78b6aaf7363252241b9698ef
-- 
2.29.0



Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue

2021-01-12 Thread Jakub Kicinski
On Tue, 12 Jan 2021 16:40:49 -0800 Sukadev Bhattiprolu wrote:
> Saeed Mahameed [sa...@kernel.org] wrote:
> > On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote:  
> 
> 
> > > @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev)
> > >   return -EBUSY;
> > >   }
> > >  
> > > + /* If ibmvnic_reset() is scheduling a reset, wait for it to
> > > +  * finish. Then prevent it from scheduling any more resets
> > > +  * and have the reset functions ignore any resets that have
> > > +  * already been scheduled.
> > > +  */
> > > + spin_lock_irqsave(&adapter->remove_lock, flags);
> > >   adapter->state = VNIC_REMOVING;
> > > + spin_unlock_irqrestore(&adapter->remove_lock, flags);
> > > +  
> > 
> > Why irqsave/restore variants ? are you expecting this spinlock to be
> > held in interruptcontext ?
> >   
> > >   spin_unlock_irqrestore(&adapter->state_lock, flags);  
> 
> Good question.
> 
> One of the callers of ibmvnic_reset() is the ->ndo_tx_timeout()
> method which gets called from the watchdog timer.

watchdog is a normal timer, so it's gonna run in softirq, you don't
need to mask irqs.



  1   2   3   4   5   >