[RFC PATCH 6/7] tun: populate hash in virtio-net header when needed
If the BPF program populated the hash in the skb the tun propagates the hash value and hash report type to the respective fields of virtio-net header. Signed-off-by: Yuri Benditovich --- drivers/net/tun.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 45f4f04a4a3e..214feb0b16fb 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -556,15 +556,20 @@ static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb) { struct tun_prog *prog; u32 numqueues; - u16 ret = 0; + u32 ret = 0; numqueues = READ_ONCE(tun->numqueues); if (!numqueues) return 0; prog = rcu_dereference(tun->steering_prog); - if (prog) + if (prog) { ret = bpf_prog_run_clear_cb(prog->prog, skb); + if (tun->bpf_populates_hash) { + *skb_hash_report_type(skb) = (__u8)(ret >> 16); + ret &= 0x; + } + } return ret % numqueues; } @@ -2062,6 +2067,7 @@ static ssize_t tun_put_user(struct tun_struct *tun, if (vnet_hdr_sz) { struct virtio_net_hdr gso; + __u16 extra_copy = 0; if (iov_iter_count(iter) < vnet_hdr_sz) return -EINVAL; @@ -2085,7 +2091,20 @@ static ssize_t tun_put_user(struct tun_struct *tun, if (copy_to_iter(&gso, sizeof(gso), iter) != sizeof(gso)) return -EFAULT; - iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso)); + if (tun->bpf_populates_hash && + vnet_hdr_sz >= sizeof(struct virtio_net_hdr_v1_hash)) { + struct virtio_net_hdr_v1_hash hdr; + + hdr.hdr.num_buffers = 0; + hdr.hash_value = cpu_to_le32(skb_get_hash(skb)); + hdr.hash_report = cpu_to_le16(*skb_hash_report_type(skb)); + hdr.padding = 0; + extra_copy = sizeof(hdr) - sizeof(gso); + if (copy_to_iter(&hdr.hdr.num_buffers, extra_copy, iter) != extra_copy) + return -EFAULT; + } + + iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso) - extra_copy); } if (vlan_hlen) { -- 2.17.1
[RFC PATCH 7/7] tun: report new tun feature IFF_HASH
IFF_HASH feature indicates that the tun supports TUNSETHASHPOPULATION ioctl and can propagate the hash data to the virtio-net packet. Signed-off-by: Yuri Benditovich --- drivers/net/tun.c | 2 +- include/uapi/linux/if_tun.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 214feb0b16fb..b46aa8941a9d 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -88,7 +88,7 @@ static void tun_default_link_ksettings(struct net_device *dev, #define TUN_VNET_LE 0x8000 #define TUN_VNET_BE 0x4000 -#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | \ +#define TUN_FEATURES (IFF_NO_PI | IFF_ONE_QUEUE | IFF_VNET_HDR | IFF_HASH |\ IFF_MULTI_QUEUE | IFF_NAPI | IFF_NAPI_FRAGS) #define GOODCOPY_LEN 128 diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h index 0fd43533da26..116b84ede3a0 100644 --- a/include/uapi/linux/if_tun.h +++ b/include/uapi/linux/if_tun.h @@ -73,6 +73,7 @@ #define IFF_ONE_QUEUE 0x2000 #define IFF_VNET_HDR 0x4000 #define IFF_TUN_EXCL 0x8000 +#define IFF_HASH 0x0080 #define IFF_MULTI_QUEUE 0x0100 #define IFF_ATTACH_QUEUE 0x0200 #define IFF_DETACH_QUEUE 0x0400 -- 2.17.1
Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling
On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti wrote: > > Add support for pointer to mem register spilling, to allow the verifier > to track pointer to valid memory addresses. Such pointers are returned > for example by a successful call of the bpf_ringbuf_reserve helper. > > This patch was suggested as a solution by Yonghong Song. > > The patch was partially contibuted by CyberArk Software, Inc. > > Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier > support for it") Surprised no one mentioned this yet. Fixes tag should always be on a single unwrapped line, however long it is, please fix. > Signed-off-by: Gilad Reti > --- > kernel/bpf/verifier.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 17270b8404f1..36af69fac591 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -2217,6 +2217,8 @@ static bool is_spillable_regtype(enum bpf_reg_type type) > case PTR_TO_RDWR_BUF: > case PTR_TO_RDWR_BUF_OR_NULL: > case PTR_TO_PERCPU_BTF_ID: > + case PTR_TO_MEM: > + case PTR_TO_MEM_OR_NULL: > return true; > default: > return false; > -- > 2.27.0 >
Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
On Tue, Jan 12, 2021 at 11:42 AM Yuri Benditovich wrote: > > This program type can set skb hash value. It will be useful > when the tun will support hash reporting feature if virtio-net. > > Signed-off-by: Yuri Benditovich > --- > drivers/net/tun.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 7959b5c2d11f..455f7afc1f36 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct > tun_prog __rcu **prog_p, > prog = NULL; > } else { > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > + if (IS_ERR(prog)) > + prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS); You've ignored the feedback and just resend? what for?
Re: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote: > Whenever we need to schedule a reset, we allocate an rwi (reset work > item?) entry and add to the list of pending resets. > > Since we only add one rwi for a given reason type to the list (no > duplicates). > we will only have a handful of reset types in the list - even in the > worst case. In the common case we should just have a couple of > entries > at most. > > Rather than allocating/freeing every time (and dealing with the > corner > case of the allocation failing), use a fixed number of rwi entries. > The extra memory needed is tiny and most of it will be used over the > active life of the adapter. > > This also fixes a couple of tiny memory leaks. One is in > ibmvnic_reset() > where we don't free the rwi entries after deleting them from the list > due > to a transport event. The second is in __ibmvnic_reset() where if we > find that the adapter is being removed, we simply break out of the > loop > (with rc = EBUSY) but ignore any rwi entries that remain on the list. > > Fixes: 2770a7984db58 ("Introduce hard reset recovery") > Fixes: 36f1031c51a2 ("ibmvnic: Do not process reset during or after > device removal") > Signed-off-by: Sukadev Bhattiprolu > --- > drivers/net/ethernet/ibm/ibmvnic.c | 123 + > > drivers/net/ethernet/ibm/ibmvnic.h | 14 ++-- > 2 files changed, 78 insertions(+), 59 deletions(-) > [...] > -enum ibmvnic_reset_reason {VNIC_RESET_FAILOVER = 1, > +enum ibmvnic_reset_reason {VNIC_RESET_UNUSED = 0, > +VNIC_RESET_FAILOVER = 1, > VNIC_RESET_MOBILITY, > VNIC_RESET_FATAL, > VNIC_RESET_NON_FATAL, > VNIC_RESET_TIMEOUT, > -VNIC_RESET_CHANGE_PARAM}; > - > -struct ibmvnic_rwi { > - enum ibmvnic_reset_reason reset_reason; > - struct list_head list; > -}; > +VNIC_RESET_CHANGE_PARAM, > +VNIC_RESET_MAX}; // must be last this is not the preferred comment style: ^^ I would just drop the comment here, it is clear from the name of the enum.
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich wrote: > > Existing TUN module is able to use provided "steering eBPF" to > calculate per-packet hash and derive the destination queue to > place the packet to. The eBPF uses mapped configuration data > containing a key for hash calculation and indirection table > with array of queues' indices. > > This series of patches adds support for virtio-net hash reporting > feature as defined in virtio specification. It extends the TUN module > and the "steering eBPF" as follows: > > Extended steering eBPF calculates the hash value and hash type, keeps > hash value in the skb->hash and returns index of destination virtqueue > and the type of the hash. TUN module keeps returned hash type in > (currently unused) field of the skb. > skb->__unused renamed to 'hash_report_type'. > > When TUN module is called later to allocate and fill the virtio-net > header and push it to destination virtqueue it populates the hash > and the hash type into virtio-net header. > > VHOST driver is made aware of respective virtio-net feature that > extends the virtio-net header to report the hash value and hash report > type. Comment from Willem de Bruijn: Skbuff fields are in short supply. I don't think we need to add one just for this narrow path entirely internal to the tun device. Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated. Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses. This also hits on a deeper point with the choice of hash values, that I also noticed in my RFC patchset to implement the inverse [1][2]. It is much more detailed than skb->hash + skb->l4_hash currently offers, and that can be gotten for free from most hardware. In most practical cases, that information suffices. I added less specific fields VIRTIO_NET_HASH_REPORT_L4, VIRTIO_NET_HASH_REPORT_OTHER that work without explicit flow dissection. I understand that the existing fields are part of the standard. Just curious, what is their purpose beyond 4-tuple based flow hashing? [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=406859&state=* [2] https://github.com/wdebruij/linux/commit/0f77febf22cd6ffc242a575807fa8382a26e511e > > Yuri Benditovich (7): > skbuff: define field for hash report type > vhost: support for hash report virtio-net feature > tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type > tun: free bpf_program by bpf_prog_put instead of bpf_prog_destroy > tun: add ioctl code TUNSETHASHPOPULATION > tun: populate hash in virtio-net header when needed > tun: report new tun feature IFF_HASH > > drivers/net/tun.c | 43 +++-- > drivers/vhost/net.c | 37 --- > include/linux/skbuff.h | 7 +- > include/uapi/linux/if_tun.h | 2 ++ > 4 files changed, 74 insertions(+), 15 deletions(-) > > -- > 2.17.1 >
[PATCH net-next v15 1/6] dt-bindings: net: Add 5GBASER phy interface
From: Pavana Sharma Add 5gbase-r PHY interface mode. Signed-off-by: Pavana Sharma Reviewed-by: Andrew Lunn Acked-by: Rob Herring Signed-off-by: Marek Behún --- Documentation/devicetree/bindings/net/ethernet-controller.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/Documentation/devicetree/bindings/net/ethernet-controller.yaml b/Documentation/devicetree/bindings/net/ethernet-controller.yaml index 0965f6515f9e..5507ae3c478d 100644 --- a/Documentation/devicetree/bindings/net/ethernet-controller.yaml +++ b/Documentation/devicetree/bindings/net/ethernet-controller.yaml @@ -89,6 +89,7 @@ properties: - trgmii - 1000base-x - 2500base-x + - 5gbase-r - rxaui - xaui -- 2.26.2
[PATCH net-next v15 0/6] Add support for mv88e6393x family of Marvell
Hello, this is version 15 of patches adding support for Amethyst family to mv88e6xxx. It should apply cleanly on net-next. This series is tested on Marvell CN9130-CRB. Changes from v14: - added my Signed-off-by tags to Pavana's patches, since I am sending them (as suggested by Andrew) - added documentation to second patch adding 5gbase-r mode (as requested by Russell) - added Reviewed-by tags - applied Vladimir's suggestions: - reduced indentation level in mv88e6xxx_set_egress_port and mv88e6393x_serdes_port_config - removed 1baseKR_Full from mv88e6393x_phylink_validate - removed PHY_INTERFACE_MODE_10GKR from mv88e6xxx_port_set_cmode Changes from v13: - added patch that wraps .set_egress_port into mv88e6xxx_set_egress_port, so that we do not have to set chip->*gress_dest_port members in every implementation of this method - for the patch that adds Amethyst support: - added more information into commit message - added these methods for mv88e6393x_ops: .port_sync_link .port_setup_message_port .port_max_speed_mode (new implementation needed) .atu_get_hash .atu_set_hash .serdes_pcs_config .serdes_pcs_an_restart .serdes_pcs_link_up - this device can set upstream port per port, so implement .port_set_upstream_port instead of .set_cpu_port - removed USXGMII cmode (not yet supported, working on it) - added debug messages into mv88e6393x_port_set_speed_duplex - added Amethyst errata 4.5 (EEE should be disabled on SERDES ports) - fixed 5gbase-r serdes configuration and interrupt handling - refactored mv88e6393x_serdes_setup_errata - refactored mv88e6393x_port_policy_write - added patch implementing .port_set_policy for Amethyst Marek Marek Behún (2): net: dsa: mv88e6xxx: wrap .set_egress_port method net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst Pavana Sharma (4): dt-bindings: net: Add 5GBASER phy interface net: phy: Add 5GBASER interface mode net: dsa: mv88e6xxx: Change serdes lane parameter type from u8 type to int net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell .../bindings/net/ethernet-controller.yaml | 1 + Documentation/networking/phy.rst | 6 + drivers/net/dsa/mv88e6xxx/chip.c | 227 -- drivers/net/dsa/mv88e6xxx/chip.h | 20 +- drivers/net/dsa/mv88e6xxx/global1.c | 19 +- drivers/net/dsa/mv88e6xxx/global1.h | 2 + drivers/net/dsa/mv88e6xxx/global2.h | 8 + drivers/net/dsa/mv88e6xxx/port.c | 397 -- drivers/net/dsa/mv88e6xxx/port.h | 50 ++- drivers/net/dsa/mv88e6xxx/serdes.c| 394 +++-- drivers/net/dsa/mv88e6xxx/serdes.h| 108 +++-- include/linux/phy.h | 4 + 12 files changed, 1073 insertions(+), 163 deletions(-) base-commit: c73a45965dd54a10c368191804b9de661eee1007 -- 2.26.2
[PATCH net-next v15 3/6] net: dsa: mv88e6xxx: Change serdes lane parameter type from u8 type to int
From: Pavana Sharma Returning 0 is no more an error case with MV88E6393 family which has serdes lane numbers 0, 9 or 10. So with this change .serdes_get_lane will return lane number or -errno (-ENODEV or -EOPNOTSUPP). Signed-off-by: Pavana Sharma Reviewed-by: Andrew Lunn Reviewed-by: Vladimir Oltean Signed-off-by: Marek Behún --- drivers/net/dsa/mv88e6xxx/chip.c | 28 +- drivers/net/dsa/mv88e6xxx/chip.h | 16 +++--- drivers/net/dsa/mv88e6xxx/port.c | 8 +-- drivers/net/dsa/mv88e6xxx/serdes.c | 82 +++--- drivers/net/dsa/mv88e6xxx/serdes.h | 64 +++ 5 files changed, 99 insertions(+), 99 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 4aa7d0a8f197..44591ef487af 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -485,12 +485,12 @@ static int mv88e6xxx_serdes_pcs_get_state(struct dsa_switch *ds, int port, struct phylink_link_state *state) { struct mv88e6xxx_chip *chip = ds->priv; - u8 lane; + int lane; int err; mv88e6xxx_reg_lock(chip); lane = mv88e6xxx_serdes_get_lane(chip, port); - if (lane && chip->info->ops->serdes_pcs_get_state) + if (lane >= 0 && chip->info->ops->serdes_pcs_get_state) err = chip->info->ops->serdes_pcs_get_state(chip, port, lane, state); else @@ -506,11 +506,11 @@ static int mv88e6xxx_serdes_pcs_config(struct mv88e6xxx_chip *chip, int port, const unsigned long *advertise) { const struct mv88e6xxx_ops *ops = chip->info->ops; - u8 lane; + int lane; if (ops->serdes_pcs_config) { lane = mv88e6xxx_serdes_get_lane(chip, port); - if (lane) + if (lane >= 0) return ops->serdes_pcs_config(chip, port, lane, mode, interface, advertise); } @@ -523,14 +523,14 @@ static void mv88e6xxx_serdes_pcs_an_restart(struct dsa_switch *ds, int port) struct mv88e6xxx_chip *chip = ds->priv; const struct mv88e6xxx_ops *ops; int err = 0; - u8 lane; + int lane; ops = chip->info->ops; if (ops->serdes_pcs_an_restart) { mv88e6xxx_reg_lock(chip); lane = mv88e6xxx_serdes_get_lane(chip, port); - if (lane) + if (lane >= 0) err = ops->serdes_pcs_an_restart(chip, port, lane); mv88e6xxx_reg_unlock(chip); @@ -544,11 +544,11 @@ static int mv88e6xxx_serdes_pcs_link_up(struct mv88e6xxx_chip *chip, int port, int speed, int duplex) { const struct mv88e6xxx_ops *ops = chip->info->ops; - u8 lane; + int lane; if (!phylink_autoneg_inband(mode) && ops->serdes_pcs_link_up) { lane = mv88e6xxx_serdes_get_lane(chip, port); - if (lane) + if (lane >= 0) return ops->serdes_pcs_link_up(chip, port, lane, speed, duplex); } @@ -2429,11 +2429,11 @@ static irqreturn_t mv88e6xxx_serdes_irq_thread_fn(int irq, void *dev_id) struct mv88e6xxx_chip *chip = mvp->chip; irqreturn_t ret = IRQ_NONE; int port = mvp->port; - u8 lane; + int lane; mv88e6xxx_reg_lock(chip); lane = mv88e6xxx_serdes_get_lane(chip, port); - if (lane) + if (lane >= 0) ret = mv88e6xxx_serdes_irq_status(chip, port, lane); mv88e6xxx_reg_unlock(chip); @@ -2441,7 +2441,7 @@ static irqreturn_t mv88e6xxx_serdes_irq_thread_fn(int irq, void *dev_id) } static int mv88e6xxx_serdes_irq_request(struct mv88e6xxx_chip *chip, int port, - u8 lane) + int lane) { struct mv88e6xxx_port *dev_id = &chip->ports[port]; unsigned int irq; @@ -2470,7 +2470,7 @@ static int mv88e6xxx_serdes_irq_request(struct mv88e6xxx_chip *chip, int port, } static int mv88e6xxx_serdes_irq_free(struct mv88e6xxx_chip *chip, int port, -u8 lane) +int lane) { struct mv88e6xxx_port *dev_id = &chip->ports[port]; unsigned int irq = dev_id->serdes_irq; @@ -2495,11 +2495,11 @@ static int mv88e6xxx_serdes_irq_free(struct mv88e6xxx_chip *chip, int port, static int mv88e6xxx_serdes_power(struct mv88e6xxx_chip *chip, int port, bool on) { - u8 lane; + int lane; int err; lane = mv88e6xxx_serdes_get_lane(chip, port); - if (!lane) + if (lane < 0) return 0; if (on) { diff --git a
[PATCH net-next v15 2/6] net: phy: Add 5GBASER interface mode
From: Pavana Sharma Add 5GBASE-R phy interface mode Signed-off-by: Pavana Sharma Reviewed-by: Andrew Lunn Signed-off-by: Marek Behún --- Documentation/networking/phy.rst | 6 ++ include/linux/phy.h | 4 2 files changed, 10 insertions(+) diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index b2f7ec794bc8..d7be261b5a8d 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -267,6 +267,12 @@ Some of the interface modes are described below: duplex, pause or other settings. This is dependent on the MAC and/or PHY behaviour. +``PHY_INTERFACE_MODE_5GBASER`` +This is the IEEE 802.3 Clause 129 defined 5GBASE-R protocol. It is +identical to the 10GBASE-R protocol defined in Clause 49, with the +exception that it operates at half the frequency. Please refer to the +IEEE standard for the definition. + ``PHY_INTERFACE_MODE_10GBASER`` This is the IEEE 802.3 Clause 49 defined 10GBASE-R protocol used with various different mediums. Please refer to the IEEE standard for a diff --git a/include/linux/phy.h b/include/linux/phy.h index 9effb511acde..548372eb253a 100644 --- a/include/linux/phy.h +++ b/include/linux/phy.h @@ -106,6 +106,7 @@ extern const int phy_10gbit_features_array[1]; * @PHY_INTERFACE_MODE_TRGMII: Turbo RGMII * @PHY_INTERFACE_MODE_1000BASEX: 1000 BaseX * @PHY_INTERFACE_MODE_2500BASEX: 2500 BaseX + * @PHY_INTERFACE_MODE_5GBASER: 5G BaseR * @PHY_INTERFACE_MODE_RXAUI: Reduced XAUI * @PHY_INTERFACE_MODE_XAUI: 10 Gigabit Attachment Unit Interface * @PHY_INTERFACE_MODE_10GBASER: 10G BaseR @@ -137,6 +138,7 @@ typedef enum { PHY_INTERFACE_MODE_TRGMII, PHY_INTERFACE_MODE_1000BASEX, PHY_INTERFACE_MODE_2500BASEX, + PHY_INTERFACE_MODE_5GBASER, PHY_INTERFACE_MODE_RXAUI, PHY_INTERFACE_MODE_XAUI, /* 10GBASE-R, XFI, SFI - single lane 10G Serdes */ @@ -207,6 +209,8 @@ static inline const char *phy_modes(phy_interface_t interface) return "1000base-x"; case PHY_INTERFACE_MODE_2500BASEX: return "2500base-x"; + case PHY_INTERFACE_MODE_5GBASER: + return "5gbase-r"; case PHY_INTERFACE_MODE_RXAUI: return "rxaui"; case PHY_INTERFACE_MODE_XAUI: -- 2.26.2
[PATCH net-next v15 4/6] net: dsa: mv88e6xxx: wrap .set_egress_port method
There are two implementations of the .set_egress_port method, and both of them, if successful, set chip->*gress_dest_port variable. To avoid code repetition, wrap this method into mv88e6xxx_set_egress_port. Signed-off-by: Marek Behún Reviewed-by: Pavana Sharma --- drivers/net/dsa/mv88e6xxx/chip.c| 49 ++--- drivers/net/dsa/mv88e6xxx/global1.c | 19 ++- 2 files changed, 33 insertions(+), 35 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 44591ef487af..ed07cb29b285 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -2519,6 +2519,27 @@ static int mv88e6xxx_serdes_power(struct mv88e6xxx_chip *chip, int port, return err; } +static int mv88e6xxx_set_egress_port(struct mv88e6xxx_chip *chip, +enum mv88e6xxx_egress_direction direction, +int port) +{ + int err; + + if (!chip->info->ops->set_egress_port) + return -EOPNOTSUPP; + + err = chip->info->ops->set_egress_port(chip, direction, port); + if (err) + return err; + + if (direction == MV88E6XXX_EGRESS_DIR_INGRESS) + chip->ingress_dest_port = port; + else + chip->egress_dest_port = port; + + return 0; +} + static int mv88e6xxx_setup_upstream_port(struct mv88e6xxx_chip *chip, int port) { struct dsa_switch *ds = chip->ds; @@ -2541,19 +2562,17 @@ static int mv88e6xxx_setup_upstream_port(struct mv88e6xxx_chip *chip, int port) return err; } - if (chip->info->ops->set_egress_port) { - err = chip->info->ops->set_egress_port(chip, + err = mv88e6xxx_set_egress_port(chip, MV88E6XXX_EGRESS_DIR_INGRESS, upstream_port); - if (err) - return err; + if (err && err != -EOPNOTSUPP) + return err; - err = chip->info->ops->set_egress_port(chip, + err = mv88e6xxx_set_egress_port(chip, MV88E6XXX_EGRESS_DIR_EGRESS, upstream_port); - if (err) - return err; - } + if (err && err != -EOPNOTSUPP) + return err; } return 0; @@ -5286,9 +5305,6 @@ static int mv88e6xxx_port_mirror_add(struct dsa_switch *ds, int port, int i; int err; - if (!chip->info->ops->set_egress_port) - return -EOPNOTSUPP; - mutex_lock(&chip->reg_lock); if ((ingress ? chip->ingress_dest_port : chip->egress_dest_port) != mirror->to_local_port) { @@ -5303,9 +5319,8 @@ static int mv88e6xxx_port_mirror_add(struct dsa_switch *ds, int port, goto out; } - err = chip->info->ops->set_egress_port(chip, - direction, - mirror->to_local_port); + err = mv88e6xxx_set_egress_port(chip, direction, + mirror->to_local_port); if (err) goto out; } @@ -5338,10 +5353,8 @@ static void mv88e6xxx_port_mirror_del(struct dsa_switch *ds, int port, /* Reset egress port when no other mirror is active */ if (!other_mirrors) { - if (chip->info->ops->set_egress_port(chip, -direction, -dsa_upstream_port(ds, - port))) + if (mv88e6xxx_set_egress_port(chip, direction, + dsa_upstream_port(ds, port))) dev_err(ds->dev, "failed to set egress port\n"); } diff --git a/drivers/net/dsa/mv88e6xxx/global1.c b/drivers/net/dsa/mv88e6xxx/global1.c index 33d443a37efc..815b0f681d69 100644 --- a/drivers/net/dsa/mv88e6xxx/global1.c +++ b/drivers/net/dsa/mv88e6xxx/global1.c @@ -315,7 +315,6 @@ int mv88e6095_g1_set_egress_port(struct mv88e6xxx_chip *chip, enum mv88e6xxx_egress_direction direction, int port) { - int *dest_port_chip; u16 reg; int err; @@ -325,13 +324,11 @@ int mv88e6095_g1_set_egress_port(struct mv88e6xxx_chip *chip, switch (direction) { case MV88E6XXX_EGRESS_DIR_INGRESS: - dest_port_chip = &chip->ingress_dest_port; reg &= ~MV88E6185_G1_MONITOR_CTL_INGRESS_DEST
[PATCH net-next v15 6/6] net: dsa: mv88e6xxx: implement .port_set_policy for Amethyst
The 16-bit Port Policy CTL register from older chips is on 6393x changed to Port Policy MGMT CTL, which can access more data, but indirectly and via 8-bit registers. The original 16-bit value is divided into first two 8-bit register in the Port Policy MGMT CTL. We can therefore use the previous code to compute the mask and shift, and then - if 0 <= shift < 8, we access register 0 in Port Policy MGMT CTL - if 8 <= shift < 16, we access register 1 in Port Policy MGMT CTL There are in fact other possible policy settings for Amethyst which could be added here, but this can be done in the future. Signed-off-by: Marek Behún Reviewed-by: Pavana Sharma --- drivers/net/dsa/mv88e6xxx/chip.c | 1 + drivers/net/dsa/mv88e6xxx/port.c | 122 --- drivers/net/dsa/mv88e6xxx/port.h | 3 + 3 files changed, 99 insertions(+), 27 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index c2dc6858481a..77f7a263d1a7 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -4570,6 +4570,7 @@ static const struct mv88e6xxx_ops mv88e6393x_ops = { .port_set_speed_duplex = mv88e6393x_port_set_speed_duplex, .port_max_speed_mode = mv88e6393x_port_max_speed_mode, .port_tag_remap = mv88e6390_port_tag_remap, + .port_set_policy = mv88e6393x_port_set_policy, .port_set_frame_mode = mv88e6351_port_set_frame_mode, .port_set_egress_floods = mv88e6352_port_set_egress_floods, .port_set_ether_type = mv88e6393x_port_set_ether_type, diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c index 2ff38357c481..732e43569f80 100644 --- a/drivers/net/dsa/mv88e6xxx/port.c +++ b/drivers/net/dsa/mv88e6xxx/port.c @@ -1304,6 +1304,27 @@ int mv88e6xxx_port_disable_pri_override(struct mv88e6xxx_chip *chip, int port) /* Offset 0x0E: Policy & MGMT Control Register for FAMILY 6191X 6193X 6393X */ +static int mv88e6393x_port_policy_read(struct mv88e6xxx_chip *chip, int port, + u16 pointer, u8 *data) +{ + u16 reg; + int err; + + err = mv88e6xxx_port_write(chip, port, MV88E6393X_PORT_POLICY_MGMT_CTL, + pointer); + if (err) + return err; + + err = mv88e6xxx_port_read(chip, port, MV88E6393X_PORT_POLICY_MGMT_CTL, + ®); + if (err) + return err; + + *data = reg; + + return 0; +} + static int mv88e6393x_port_policy_write(struct mv88e6xxx_chip *chip, int port, u16 pointer, u8 data) { @@ -1505,46 +1526,43 @@ int mv88e6390_port_tag_remap(struct mv88e6xxx_chip *chip, int port) /* Offset 0x0E: Policy Control Register */ -int mv88e6352_port_set_policy(struct mv88e6xxx_chip *chip, int port, - enum mv88e6xxx_policy_mapping mapping, - enum mv88e6xxx_policy_action action) +static int +mv88e6xxx_port_policy_mapping_get_pos(enum mv88e6xxx_policy_mapping mapping, + enum mv88e6xxx_policy_action action, + u16 *mask, u16 *val, int *shift) { - u16 reg, mask, val; - int shift; - int err; - switch (mapping) { case MV88E6XXX_POLICY_MAPPING_DA: - shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_DA_MASK); - mask = MV88E6XXX_PORT_POLICY_CTL_DA_MASK; + *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_DA_MASK); + *mask = MV88E6XXX_PORT_POLICY_CTL_DA_MASK; break; case MV88E6XXX_POLICY_MAPPING_SA: - shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_SA_MASK); - mask = MV88E6XXX_PORT_POLICY_CTL_SA_MASK; + *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_SA_MASK); + *mask = MV88E6XXX_PORT_POLICY_CTL_SA_MASK; break; case MV88E6XXX_POLICY_MAPPING_VTU: - shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_VTU_MASK); - mask = MV88E6XXX_PORT_POLICY_CTL_VTU_MASK; + *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_VTU_MASK); + *mask = MV88E6XXX_PORT_POLICY_CTL_VTU_MASK; break; case MV88E6XXX_POLICY_MAPPING_ETYPE: - shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK); - mask = MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK; + *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK); + *mask = MV88E6XXX_PORT_POLICY_CTL_ETYPE_MASK; break; case MV88E6XXX_POLICY_MAPPING_PPPOE: - shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK); - mask = MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK; + *shift = __bf_shf(MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK); + *mask = MV88E6XXX_PORT_POLICY_CTL_PPPOE_MASK; break; cas
RE: [PATCH v0 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.
Hi Jakub Thanks for the review, Sure, I will reraise the patch (again v0i, sonce no code changes) after adding space before '<'. This patch adds lines in 'include/uapi/', that requires ABI version changes for debian build. I am not sure, if we need any such changes to avoid breaking allmodconfig. It will be really helpful, if you can look at the patch once 'https://lkml.org/lkml/2021/1/11/1668' and suggest on this. Thanks a lot again.
Re: [PATCH net-next v15 1/6] dt-bindings: net: Add 5GBASER phy interface
On 1/12/21 11:54 AM, Marek Behún wrote: > From: Pavana Sharma > > Add 5gbase-r PHY interface mode. > > Signed-off-by: Pavana Sharma > Reviewed-by: Andrew Lunn > Acked-by: Rob Herring > Signed-off-by: Marek Behún Reviewed-by: Florian Fainelli -- Florian
[PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
From: Pavana Sharma The Marvell 88E6393X device is a single-chip integration of a 11-port Ethernet switch with eight integrated Gigabit Ethernet (GbE) transceivers and three 10-Gigabit interfaces. This patch adds functionalities specific to mv88e6393x family (88E6393X, 88E6193X and 88E6191X). The main differences between previous devices and this one are: - port 0 can be a SERDES port - all SERDESes are one-lane, eg. no XAUI nor RXAUI - on the other hand the SERDESes can do USXGMII, 10GBASER and 5GBASER (on 6191X only one SERDES is capable of more than 1g; USXGMII is not yet supported with this change) - Port Policy CTL register is changed to Port Policy MGMT CTL register, via which several more registers can be accessed indirectly - egress monitor port is configured differently - ingress monitor/CPU/mirror ports are configured differently and can be configured per port (ie. each port can have different ingress monitor port, for example) - port speed AltBit works differently than previously - PHY registers can be also accessed via MDIO address 0x18 and 0x19 (on previous devices they could be accessed only via Global 2 offsets 0x18 and 0x19, which means two indirections; this feature is not yet leveraged with this patch) Co-developed-by: Ashkan Boldaji Signed-off-by: Ashkan Boldaji Signed-off-by: Pavana Sharma Co-developed-by: Marek Behún Signed-off-by: Marek Behún --- drivers/net/dsa/mv88e6xxx/chip.c| 149 + drivers/net/dsa/mv88e6xxx/chip.h| 4 + drivers/net/dsa/mv88e6xxx/global1.h | 2 + drivers/net/dsa/mv88e6xxx/global2.h | 8 + drivers/net/dsa/mv88e6xxx/port.c| 267 drivers/net/dsa/mv88e6xxx/port.h| 47 - drivers/net/dsa/mv88e6xxx/serdes.c | 312 drivers/net/dsa/mv88e6xxx/serdes.h | 44 8 files changed, 831 insertions(+), 2 deletions(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index ed07cb29b285..c2dc6858481a 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -635,6 +635,28 @@ static void mv88e6390x_phylink_validate(struct mv88e6xxx_chip *chip, int port, mv88e6390_phylink_validate(chip, port, mask, state); } +static void mv88e6393x_phylink_validate(struct mv88e6xxx_chip *chip, int port, + unsigned long *mask, + struct phylink_link_state *state) +{ + if (port == 0 || port == 9 || port == 10) { + phylink_set(mask, 1baseT_Full); + phylink_set(mask, 1baseCR_Full); + phylink_set(mask, 1baseSR_Full); + phylink_set(mask, 1baseLR_Full); + phylink_set(mask, 1baseLRM_Full); + phylink_set(mask, 1baseER_Full); + phylink_set(mask, 5000baseT_Full); + phylink_set(mask, 2500baseX_Full); + phylink_set(mask, 2500baseT_Full); + } + + phylink_set(mask, 1000baseT_Full); + phylink_set(mask, 1000baseX_Full); + + mv88e6065_phylink_validate(chip, port, mask, state); +} + static void mv88e6xxx_validate(struct dsa_switch *ds, int port, unsigned long *supported, struct phylink_link_state *state) @@ -4533,6 +4555,64 @@ static const struct mv88e6xxx_ops mv88e6390x_ops = { .phylink_validate = mv88e6390x_phylink_validate, }; +static const struct mv88e6xxx_ops mv88e6393x_ops = { + /* MV88E6XXX_FAMILY_6393 */ + .setup_errata = mv88e6393x_serdes_setup_errata, + .irl_init_all = mv88e6390_g2_irl_init_all, + .get_eeprom = mv88e6xxx_g2_get_eeprom8, + .set_eeprom = mv88e6xxx_g2_set_eeprom8, + .set_switch_mac = mv88e6xxx_g2_set_switch_mac, + .phy_read = mv88e6xxx_g2_smi_phy_read, + .phy_write = mv88e6xxx_g2_smi_phy_write, + .port_set_link = mv88e6xxx_port_set_link, + .port_sync_link = mv88e6xxx_port_sync_link, + .port_set_rgmii_delay = mv88e6390_port_set_rgmii_delay, + .port_set_speed_duplex = mv88e6393x_port_set_speed_duplex, + .port_max_speed_mode = mv88e6393x_port_max_speed_mode, + .port_tag_remap = mv88e6390_port_tag_remap, + .port_set_frame_mode = mv88e6351_port_set_frame_mode, + .port_set_egress_floods = mv88e6352_port_set_egress_floods, + .port_set_ether_type = mv88e6393x_port_set_ether_type, + .port_set_jumbo_size = mv88e6165_port_set_jumbo_size, + .port_egress_rate_limiting = mv88e6097_port_egress_rate_limiting, + .port_pause_limit = mv88e6390_port_pause_limit, + .port_disable_learn_limit = mv88e6xxx_port_disable_learn_limit, + .port_disable_pri_override = mv88e6xxx_port_disable_pri_override, + .port_get_cmode = mv88e6352_port_get_cmode, + .port_set_cmode = mv88e6393x_port_set_cmode, + .port_setup_message_port = mv88e6xxx_setup_message_port, +
Re: [PATCH net-next v15 2/6] net: phy: Add 5GBASER interface mode
On 1/12/21 11:54 AM, Marek Behún wrote: > From: Pavana Sharma > > Add 5GBASE-R phy interface mode > > Signed-off-by: Pavana Sharma > Reviewed-by: Andrew Lunn > Signed-off-by: Marek Behún Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote: > The work queue is used to queue reset requests like CHANGE-PARAM or > FAILOVER resets for the worker thread. When the adapter is being > removed > the adapter state is set to VNIC_REMOVING and the work queue is > flushed > so no new work is added. However the check for adapter being removed > is > racy in that the adapter can go into REMOVING state just after we > check > and we might end up adding work just as it is being flushed (or > after). > > Use a new spin lock, ->remove_lock to ensure that after (while) the > work > queue is (being) flushed, no new work is queued. > > A follow-on patch will convert the existing ->state_lock from a spin > lock > to to a mutex to allow us to hold it for longer periods of time. > Since > work can be scheduled from a tasklet, we cannot block on the mutex, > so > use a new spin lock. > > Fixes: 6954a9e4192b ("ibmvnic: Flush existing work items before > device removal") > Signed-off-by: Sukadev Bhattiprolu > --- > drivers/net/ethernet/ibm/ibmvnic.c | 15 ++- > drivers/net/ethernet/ibm/ibmvnic.h | 2 ++ > 2 files changed, 16 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c > b/drivers/net/ethernet/ibm/ibmvnic.c > index ad551418ac63..7645df37e886 100644 > --- a/drivers/net/ethernet/ibm/ibmvnic.c > +++ b/drivers/net/ethernet/ibm/ibmvnic.c > @@ -2435,6 +2435,7 @@ static int ibmvnic_reset(struct ibmvnic_adapter > *adapter, >enum ibmvnic_reset_reason reason) > { > struct net_device *netdev = adapter->netdev; > + unsigned long flags; > int ret; > > if (adapter->state == VNIC_PROBING) { > @@ -2443,6 +2444,8 @@ static int ibmvnic_reset(struct ibmvnic_adapter > *adapter, > goto err; > } > > + spin_lock_irqsave(&adapter->remove_lock, flags); > + > /* >* If failover is pending don't schedule any other reset. >* Instead let the failover complete. If there is already a > @@ -2465,8 +2468,9 @@ static int ibmvnic_reset(struct ibmvnic_adapter > *adapter, > netdev_dbg(adapter->netdev, "Scheduling reset (reason %d)\n", > reason); > schedule_work(&adapter->ibmvnic_reset); > > - return 0; > + ret = 0; > err: > + spin_unlock_irqrestore(&adapter->remove_lock, flags); > return -ret; > } > > @@ -5387,6 +5391,7 @@ static int ibmvnic_probe(struct vio_dev *dev, > const struct vio_device_id *id) > memset(&adapter->pending_resets, 0, sizeof(adapter- > >pending_resets)); > spin_lock_init(&adapter->rwi_lock); > spin_lock_init(&adapter->state_lock); > + spin_lock_init(&adapter->remove_lock); > mutex_init(&adapter->fw_lock); > init_completion(&adapter->init_done); > init_completion(&adapter->fw_done); > @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev) > return -EBUSY; > } > > + /* If ibmvnic_reset() is scheduling a reset, wait for it to > + * finish. Then prevent it from scheduling any more resets > + * and have the reset functions ignore any resets that have > + * already been scheduled. > + */ > + spin_lock_irqsave(&adapter->remove_lock, flags); > adapter->state = VNIC_REMOVING; > + spin_unlock_irqrestore(&adapter->remove_lock, flags); > + Why irqsave/restore variants ? are you expecting this spinlock to be held in interruptcontext ? > spin_unlock_irqrestore(&adapter->state_lock, flags); > > flush_work(&adapter->ibmvnic_reset); > diff --git a/drivers/net/ethernet/ibm/ibmvnic.h > b/drivers/net/ethernet/ibm/ibmvnic.h > index 1179a95a3f92..2779696ade09 100644 > --- a/drivers/net/ethernet/ibm/ibmvnic.h > +++ b/drivers/net/ethernet/ibm/ibmvnic.h > @@ -1081,6 +1081,8 @@ struct ibmvnic_adapter { > spinlock_t rwi_lock; > enum ibmvnic_reset_reason pending_resets[VNIC_RESET_MAX-1]; > short next_reset; > + /* serialize ibmvnic_reset() and ibmvnic_remove() */ > + spinlock_t remove_lock; > struct work_struct ibmvnic_reset; > struct delayed_work ibmvnic_delayed_reset; > unsigned long resetting;
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
> +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but > + * instead via address 0x51, when SFP page is set to 0x03 and password to > + * 0x. > + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only at > + * bus creation time, and expect it to remain set to 0x03 throughout the > + * lifetime of the module plugged into the system. If the SFP code starts > + * modifying SFP_PAGE in the future, this code will need to change. ... > +/* In order to not interfere with other SFP code (which possibly may > manipulate > + * SFP_PAGE), for every transfer we do this: > + * 1. lock the bus > + * 2. save content of SFP_PAGE > + * 3. set SFP_PAGE to 3 > + * 4. do the transfer > + * 5. restore original SFP_PAGE > + * 6. unlock the bus These two comments seem to contradict each other? > +static int i2c_rollball_mii_poll(struct mii_bus *bus, int bus_addr, u8 *buf, > + size_t len) > +{ > + struct i2c_adapter *i2c = bus->priv; > + struct i2c_msg msgs[2]; > + u8 cmd_addr, tmp, *res; > + int i, ret; > + > + cmd_addr = ROLLBALL_CMD_ADDR; > + > + res = buf ? buf : &tmp; > + len = buf ? len : 1; > + > + msgs[0].addr = bus_addr; > + msgs[0].flags = 0; > + msgs[0].len = 1; > + msgs[0].buf = &cmd_addr; > + > + msgs[1].addr = bus_addr; > + msgs[1].flags = I2C_M_RD; > + msgs[1].len = len; > + msgs[1].buf = res; > + > + /* By experiment it takes up to 70 ms to access a register for these > + * SFPs. Sleep 20ms between iteratios and try 10 times. > + */ iterations > +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int reg) > +{ > + u8 buf[4], res[6]; > + int bus_addr, ret; > + u16 val; > + > + if (!(reg & MII_ADDR_C45)) > + return -EOPNOTSUPP; > + > + bus_addr = i2c_mii_phy_addr(phy_id); > + if (bus_addr != ROLLBALL_PHY_I2C_ADDR) > + return 0x; > + > + buf[0] = ROLLBALL_DATA_ADDR; > + buf[1] = (reg >> 16) & 0x1f; > + buf[2] = (reg >> 8) & 0xff; > + buf[3] = reg & 0xff; This looks odd. There are only 32 registers for C22 transactions, so it fits in one byte. You can set buf[1] and buf[2] to zero. C45 transactions allow for 65536 registers, and has the devtype field, so would need 3 bytes. Which suggests that C45 might actually be supported? At least by the protocol, if not the device. > + dev_dbg(&bus->dev, "read reg %02x:%04x = %04x\n", (reg >> 16) & 0x1f, > + reg & 0x, val); There is a tracepoint that allows access to this information, so you can probably remove this. Andrew
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
> > > +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int > > > reg) > > > +{ > > > + u8 buf[4], res[6]; > > > + int bus_addr, ret; > > > + u16 val; > > > + > > > + if (!(reg & MII_ADDR_C45)) > > > + return -EOPNOTSUPP; > > > + > > > + bus_addr = i2c_mii_phy_addr(phy_id); > > > + if (bus_addr != ROLLBALL_PHY_I2C_ADDR) > > > + return 0x; > > > + > > > + buf[0] = ROLLBALL_DATA_ADDR; > > > + buf[1] = (reg >> 16) & 0x1f; > > > + buf[2] = (reg >> 8) & 0xff; > > > + buf[3] = reg & 0xff; > > > > This looks odd. There are only 32 registers for C22 transactions, so > > it fits in one byte. You can set buf[1] and buf[2] to zero. > > C22 is not supported by this protocol. Duh! Sorry for the noise. Andrew
Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option
On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote: > On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote: > > Jarod Wilson wrote: > > > > >This comes from an end-user request, where they're running multiple VMs on > > >hosts with bonded interfaces connected to some interest switch topologies, > > >where 802.3ad isn't an option. They're currently running a proprietary > > >solution that effectively achieves load-balancing of VMs and bandwidth > > >utilization improvements with a similar form of transmission algorithm. > > > > > >Basically, each VM has it's own vlan, so it always sends its traffic out > > >the same interface, unless that interface fails. Traffic gets split > > >between the interfaces, maintaining a consistent path, with failover still > > >available if an interface goes down. > > > > > >This has been rudimetarily tested to provide similar results, suitable for > > >them to use to move off their current proprietary solution. > > > > > >Still on the TODO list, if these even looks sane to begin with, is > > >fleshing out Documentation/networking/bonding.rst. > > > > I'm sure you're aware, but any final submission will also need > > to include netlink and iproute2 support. > > I believe everything for netlink support is already included, but I'll > double-check that before submitting something for inclusion consideration. I'm not certain if what you actually meant was that I'd have to patch iproute2 as well, which I've definitely stumbled onto today, but it's a 2-line patch, and everything seems to be working fine with it: $ sudo ip link set bond0 type bond xmit_hash_policy 5 $ ip -d link show bond0 11: bond0: mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535 bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy vlansrc resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535 $ grep Hash /proc/net/bonding/bond0 Transmit Hash Policy: vlansrc (5) Nothing bad seems to happen on an older kernel if one tries to set the new hash, you just get told that it's an invalid argument. I *think* this is all ready for submission then, so I'll get both the kernel and iproute2 patches out soon. -- Jarod Wilson ja...@redhat.com
Re: [PATCH net-next] net: ipa: add config dependency on QCOM_SMEM
On 1/12/21 11:21 AM, Alex Elder wrote: > The IPA driver depends on some SMEM functionality (qcom_smem_init(), > qcom_smem_alloc(), and qcom_smem_virt_to_phys()), but this is not > reflected in the configuration dependencies. Add a dependency on > QCOM_SMEM to avoid attempts to build the IPA driver without SMEM. > This avoids a link error for certain configurations. > > Reported-by: Randy Dunlap > Fixes: 38a4066f593c5 ("net: ipa: support COMPILE_TEST") > Signed-off-by: Alex Elder Acked-by: Randy Dunlap # build-tested Thanks. > --- > drivers/net/ipa/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/ipa/Kconfig b/drivers/net/ipa/Kconfig > index 10a0e041ee775..b68f1289b89ef 100644 > --- a/drivers/net/ipa/Kconfig > +++ b/drivers/net/ipa/Kconfig > @@ -1,6 +1,6 @@ > config QCOM_IPA > tristate "Qualcomm IPA support" > - depends on 64BIT && NET > + depends on 64BIT && NET && QCOM_SMEM > depends on ARCH_QCOM || COMPILE_TEST > depends on QCOM_RPROC_COMMON || (QCOM_RPROC_COMMON=n && COMPILE_TEST) > select QCOM_MDT_LOADER if ARCH_QCOM > -- ~Randy
Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, Jan 12, 2021 at 10:16:32PM +0100, Marek Behún wrote: > On Tue, 12 Jan 2021 22:38:08 +0200 > Vladimir Oltean wrote: > > > > + phylink_set(mask, 1baseT_Full); > > > + phylink_set(mask, 1baseCR_Full); > > > + phylink_set(mask, 1baseSR_Full); > > > + phylink_set(mask, 1baseLR_Full); > > > + phylink_set(mask, 1baseLRM_Full); > > > + phylink_set(mask, 1baseER_Full); > > > > Why did you remove 1000baseKR_Full from here? > > I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with > clause 73 autonegotiation, so they are not compatible, or are they? ETHTOOL_LINK_MODE_1baseKR_Full_BIT and most of everything else from enum ethtool_link_mode_bit_indices are network-side media interfaces (aka between the PHY and the link partner). Whereas PHY_INTERFACE_MODE_10GKR and most of everything else from phy_interface_t is a system-side media independent interface (aka between the MAC and the PHY). What the 6393X does not support is PHY_INTERFACE_MODE_10GKR, and my feedback from your previous series did not ask you to remove 1000baseKR_Full from phylink_validate. There's nothing that would prevent somebody from attaching a PHY with a different (and compatible) system-side SERDES protocol, and 10GBase-KR on the media side. What Russell said is that he's seriously thinking about reworking the phylink_validate API so that the MAC driver would not need to sign off on things it does not care about (such as what media-side interface will the PHY use, of which there is an ever increasing plethora). But at the moment, the API is what it is, and you should put as many media-side link modes as possible, at the given speed and duplex that the MAC supports. _I_ am confused. What did you say you were happy to help with?
RE: [PATCH v2 0/3] hv_netvsc: Prevent packet loss during VF add/remove
> Subject: Re: [PATCH v2 0/3] hv_netvsc: Prevent packet loss during VF > add/remove > > On Fri, 8 Jan 2021 16:53:40 -0800 Long Li wrote: > > From: Long Li > > > > This patch set fixes issues with packet loss on VF add/remove. > > These patches are for net-next? They just optimize the amount of packet > loss on switch, not fix bugs, right? Yes, those patches are for net-next. They eliminate the packet loss introduced from Linux side during VF changes. They can be seen as optimizations.
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
On Tue, Jan 12, 2021 at 09:54:24PM +0100, Andrew Lunn wrote: > > +static int i2c_transfer_rollball(struct i2c_adapter *i2c, > > +struct i2c_msg *msgs, int num) > > +{ > > + u8 saved_page; > > + int ret; > > + > > + i2c_lock_bus(i2c, I2C_LOCK_SEGMENT); > > + > > + /* save original page */ > > + ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page); > > + if (ret) > > + goto unlock; > > + > > + /* change to RollBall MDIO page */ > > + ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO); > > + if (ret) > > + goto unlock; > > + > > + /* do the transfer */ > > + ret = __i2c_transfer_err(i2c, msgs, num); > > + if (ret) > > + goto unlock; > > If get page and set page worked, and you get an error in during the > actual data transfer, i wonder if you should try restoring the page > before returning with the error? That's what we do with paged PHYs - if the access encounters an error, we still attempt to restore the page and propagate the _original_ error. We would only propagate the error from the page restore if it was the only error. See the logic and comments for phy_restore_page(). Note that phy_(save|select)_page()..phy_restore_page() deal with the locking, which is why they always have to be paired (also means that the user doesn't have to think too hard about error handling.) -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Re: [net] net: feature check mandating HW_CSUM is wrong
On 07/01/21 12:47 AM, Jakub Kicinski wrote: On Wed, 6 Jan 2021 23:23:27 +0530 Rohit Maheshwari wrote: Mandating NETIF_F_HW_CSUM to enable TLS offload feature is wrong. And it broke tls offload feature for the drivers, which are still using NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM. We should use NETIF_F_CSUM_MASK instead. Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is disabled") Signed-off-by: Rohit Maheshwari Please use Tariq's suggestion. HW_TLS_TX feature is for both IPv4/v6. And If one device is limited to support only IPv4 checksum offload, TLS offload should be allowed for that too. So I think putting a check of CSUM_MASK is enough.
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
On Tue, 12 Jan 2021 21:54:24 +0100 Andrew Lunn wrote: > > +static int i2c_transfer_rollball(struct i2c_adapter *i2c, > > +struct i2c_msg *msgs, int num) > > +{ > > + u8 saved_page; > > + int ret; > > + > > + i2c_lock_bus(i2c, I2C_LOCK_SEGMENT); > > + > > + /* save original page */ > > + ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page); > > + if (ret) > > + goto unlock; > > + > > + /* change to RollBall MDIO page */ > > + ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO); > > + if (ret) > > + goto unlock; > > + > > + /* do the transfer */ > > + ret = __i2c_transfer_err(i2c, msgs, num); > > + if (ret) > > + goto unlock; > > If get page and set page worked, and you get an error in during the > actual data transfer, i wonder if you should try restoring the page > before returning with the error? I don't know. Can i2c trasfer fail and the next one succeed?
[PATCH net-next] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets
Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually not only added a support for fraglisted UDP GRO, but also tweaked some logics the way that non-fraglisted UDP GRO started to work for forwarding too. Tests showed that currently forwarding and NATing of plain UDP GRO packets are performed fully correctly, regardless if the target netdevice has a support for hardware/driver GSO UDP L4 or not. Add the last element and allow to form plain UDP GRO packets if there is no socket -> we are on forwarding path. Plain UDP GRO forwarding even shows better performance than fraglisted UDP GRO in some cases due to not wasting one skbuff_head per every segment. Signed-off-by: Alexander Lobakin --- net/ipv4/udp_offload.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c index ff39e94781bf..9d71df3d52ce 100644 --- a/net/ipv4/udp_offload.c +++ b/net/ipv4/udp_offload.c @@ -460,12 +460,13 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb, if (skb->dev->features & NETIF_F_GRO_FRAGLIST) NAPI_GRO_CB(skb)->is_flist = sk ? !udp_sk(sk)->gro_enabled: 1; - if ((sk && udp_sk(sk)->gro_enabled) || NAPI_GRO_CB(skb)->is_flist) { + if (!sk || (sk && udp_sk(sk)->gro_enabled) || + NAPI_GRO_CB(skb)->is_flist) { pp = call_gro_receive(udp_gro_receive_segment, head, skb); return pp; } - if (!sk || NAPI_GRO_CB(skb)->encap_mark || + if (NAPI_GRO_CB(skb)->encap_mark || (skb->ip_summed != CHECKSUM_PARTIAL && NAPI_GRO_CB(skb)->csum_cnt == 0 && !NAPI_GRO_CB(skb)->csum_valid) || -- 2.30.0
Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, 12 Jan 2021 22:38:08 +0200 Vladimir Oltean wrote: > > + phylink_set(mask, 1baseT_Full); > > + phylink_set(mask, 1baseCR_Full); > > + phylink_set(mask, 1baseSR_Full); > > + phylink_set(mask, 1baseLR_Full); > > + phylink_set(mask, 1baseLRM_Full); > > + phylink_set(mask, 1baseER_Full); > > Why did you remove 1000baseKR_Full from here? I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with clause 73 autonegotiation, so they are not compatible, or are they? Marek
Re: [PATCH] igb: avoid premature Rx buffer reuse
On Mon, Jan 11, 2021 at 6:54 PM Li,Rongqing wrote: > > > > > -Original Message- > > From: Alexander Duyck [mailto:alexander.du...@gmail.com] > > Sent: Tuesday, January 12, 2021 4:54 AM > > To: Li,Rongqing > > Cc: Netdev ; intel-wired-lan > > ; Björn Töpel > > Subject: Re: [PATCH] igb: avoid premature Rx buffer reuse > > > > On Wed, Jan 6, 2021 at 7:53 PM Li RongQing wrote: > > > > > > The page recycle code, incorrectly, relied on that a page fragment > > > could not be freed inside xdp_do_redirect(). This assumption leads to > > > that page fragments that are used by the stack/XDP redirect can be > > > reused and overwritten. > > > > > > To avoid this, store the page count prior invoking xdp_do_redirect(). > > > > > > Fixes: 9cbc948b5a20 ("igb: add XDP support") > > > Signed-off-by: Li RongQing > > > Cc: Björn Töpel > > > > I'm not sure what you are talking about here. We allow for a 0 to 1 count > > difference in the pagecount bias. The idea is the driver should be holding > > onto > > at least one reference from the driver at all times. > > Are you saying that is not the case? > > > > As far as the code itself we hold onto the page as long as our difference > > does > > not exceed 1. So specifically if the XDP call is freeing the page the page > > itself > > should still be valid as the reference count shouldn't drop below 1, and in > > that > > case the driver should be holding that one reference to the page. > > > > When we perform our check we are performing it such at output of either 0 if > > the page is freed, or 1 if the page is not freed are acceptable for us to > > allow > > reuse. The key bit is in igb_clean_rx_irq where we will flip the buffer for > > the > > IGB_XDP_TX | IGB_XDP_REDIR case and just increment the pagecnt_bias > > indicating that the page was dropped in the non-flipped case. > > > > Are you perhaps seeing a function that is returning an error and still > > consuming > > the page? If so that might explain what you are seeing. > > However the bug would be in the other driver not this one. The > > xdp_do_redirect function is not supposed to free the page if it returns an > > error. > > It is supposed to leave that up to the function that called xdp_do_redirect. > > > > > --- > > > drivers/net/ethernet/intel/igb/igb_main.c | 22 +++--- > > > 1 file changed, 15 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > > > b/drivers/net/ethernet/intel/igb/igb_main.c > > > index 03f78fdb0dcd..3e0d903cf919 100644 > > > --- a/drivers/net/ethernet/intel/igb/igb_main.c > > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > > > @@ -8232,7 +8232,8 @@ static inline bool igb_page_is_reserved(struct > > page *page) > > > return (page_to_nid(page) != numa_mem_id()) || > > > page_is_pfmemalloc(page); } > > > > > > -static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer) > > > +static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer, > > > + > > int > > > +rx_buf_pgcnt) > > > { > > > unsigned int pagecnt_bias = rx_buffer->pagecnt_bias; > > > struct page *page = rx_buffer->page; @@ -8243,7 +8244,7 @@ > > > static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer) > > > > > > #if (PAGE_SIZE < 8192) > > > /* if we are only owner of page we can reuse it */ > > > - if (unlikely((page_ref_count(page) - pagecnt_bias) > 1)) > > > + if (unlikely((rx_buf_pgcnt - pagecnt_bias) > 1)) > > > return false; > > > #else > > I would need more info on the actual issue. If nothing else it might be > > useful to > > have an example where you print out the page_ref_count versus the > > pagecnt_bias at a few points to verify exactly what is going on. As I said > > before > > if the issue is the xdp_do_redirect returning an error and still consuming > > the > > page then the bug is elsewhere and not here. > > > This patch is same as 75aab4e10ae6a4593a60f66d13de755d4e91f400 > > > commit 75aab4e10ae6a4593a60f66d13de755d4e91f400 > Author: Björn Töpel > Date: Tue Aug 25 19:27:34 2020 +0200 > > i40e: avoid premature Rx buffer reuse > > The page recycle code, incorrectly, relied on that a page fragment > could not be freed inside xdp_do_redirect(). This assumption leads to > that page fragments that are used by the stack/XDP redirect can be > reused and overwritten. > > To avoid this, store the page count prior invoking xdp_do_redirect(). > > Longer explanation: > > Intel NICs have a recycle mechanism. The main idea is that a page is > split into two parts. One part is owned by the driver, one part might > be owned by someone else, such as the stack. > > t0: Page is allocated, and put on the Rx ring > +--- > used by NIC ->| upper buffer > (rx_buffer) +--- > | lower buffer > +--- > page count == USHRT
Re: [MPTCP] [PATCH net 2/2] mptcp: better msk-level shutdown.
On Tue, 12 Jan 2021, Paolo Abeni wrote: Instead of re-implementing most of inet_shutdown, re-use such helper, and implement the MPTCP-specific bits at the 'proto' level. The msk-level disconnect() can now be invoked, lets provide a suitable implementation. As a side effect, this fixes bad state management for listener sockets. The latter could lead to division by 0 oops since commit ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling"). Fixes: 43b54c6ee382 ("mptcp: Use full MPTCP-level disconnect state machine") Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling") Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 62 1 file changed, 17 insertions(+), 45 deletions(-) Reviewed-by: Mat Martineau -- Mat Martineau Intel
Re: [MPTCP] [PATCH net 1/2] mptcp: more strict state checking for acks
On Tue, 12 Jan 2021, Paolo Abeni wrote: Syzkaller found a way to trigger division by zero in mptcp_subflow_cleanup_rbuf(). The current checks implemented into tcp_can_send_ack() are too week, let's be more accurate. Reported-by: Christoph Paasch Fixes: ea4ca586b16f ("mptcp: refine MPTCP-level ack scheduling") Fixes: fd8976790a6c ("mptcp: be careful on MPTCP-level ack.") Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: Mat Martineau -- Mat Martineau Intel
Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option
Jarod Wilson wrote: >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote: >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote: >> > Jarod Wilson wrote: >> > >> > >This comes from an end-user request, where they're running multiple VMs on >> > >hosts with bonded interfaces connected to some interest switch topologies, >> > >where 802.3ad isn't an option. They're currently running a proprietary >> > >solution that effectively achieves load-balancing of VMs and bandwidth >> > >utilization improvements with a similar form of transmission algorithm. >> > > >> > >Basically, each VM has it's own vlan, so it always sends its traffic out >> > >the same interface, unless that interface fails. Traffic gets split >> > >between the interfaces, maintaining a consistent path, with failover still >> > >available if an interface goes down. >> > > >> > >This has been rudimetarily tested to provide similar results, suitable for >> > >them to use to move off their current proprietary solution. >> > > >> > >Still on the TODO list, if these even looks sane to begin with, is >> > >fleshing out Documentation/networking/bonding.rst. >> > >> >I'm sure you're aware, but any final submission will also need >> > to include netlink and iproute2 support. >> >> I believe everything for netlink support is already included, but I'll >> double-check that before submitting something for inclusion consideration. > >I'm not certain if what you actually meant was that I'd have to patch >iproute2 as well, which I've definitely stumbled onto today, but it's a >2-line patch, and everything seems to be working fine with it: Yes, that's what I meant. >$ sudo ip link set bond0 type bond xmit_hash_policy 5 Does the above work with the text label (presumably "vlansrc") as well as the number, and does "ip link add test type bond help" print the correct text for XMIT_HASH_POLICY? -J >$ ip -d link show bond0 >11: bond0: mtu 1500 qdisc noop state DOWN mode >DEFAULT group default qlen 1000 >link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 > maxmtu 65535 >bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 > use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any > primary_reselect always fail_over_mac none xmit_hash_policy vlansrc > resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 > packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 > addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 > gso_max_segs 65535 >$ grep Hash /proc/net/bonding/bond0 >Transmit Hash Policy: vlansrc (5) > >Nothing bad seems to happen on an older kernel if one tries to set the new >hash, you just get told that it's an invalid argument. > >I *think* this is all ready for submission then, so I'll get both the kernel >and iproute2 patches out soon. > >-- >Jarod Wilson >ja...@redhat.com --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
On Tue, Jan 12, 2021 at 10:40 PM Yuri Benditovich wrote: > > On Tue, Jan 12, 2021 at 9:42 PM Yuri Benditovich > wrote: > > > > This program type can set skb hash value. It will be useful > > when the tun will support hash reporting feature if virtio-net. > > > > Signed-off-by: Yuri Benditovich > > --- > > drivers/net/tun.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > > index 7959b5c2d11f..455f7afc1f36 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, > > struct tun_prog __rcu **prog_p, > > prog = NULL; > > } else { > > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > > + if (IS_ERR(prog)) > > + prog = bpf_prog_get_type(fd, > > BPF_PROG_TYPE_SCHED_CLS); > > if (IS_ERR(prog)) > > return PTR_ERR(prog); > > } > > Comment from Alexei Starovoitov: > Patches 1 and 2 are missing for me, so I couldn't review properly, > but this diff looks odd. > It allows sched_cls prog type to attach to tun. > That means everything that sched_cls progs can do will be done from tun hook? We do not have an intention to modify the packet in this steering eBPF. There is just one function that unavailable for BPF_PROG_TYPE_SOCKET_FILTER that the eBPF needs to make possible to deliver the hash to the guest VM - it is 'bpf_set_hash' Does it mean that we need to define a new eBPF type for socket filter operations + set_hash? Our problem is that the eBPF calculates 32-bit hash, 16-bit queue index and 8-bit of hash type. But it is able to return only 32-bit integer, so in this set of patches the eBPF returns queue index and hash type and saves the hash in skb->hash using bpf_set_hash(). If this is unacceptable, can you please recommend a better solution? > sched_cls assumes l2 and can modify the packet. The steering eBPF in TUN module also assumes l2. > I think crashes are inevitable. > > > -- > > 2.17.1 > >
Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
On Tue, Jan 12, 2021 at 9:42 PM Yuri Benditovich wrote: > > This program type can set skb hash value. It will be useful > when the tun will support hash reporting feature if virtio-net. > > Signed-off-by: Yuri Benditovich > --- > drivers/net/tun.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > index 7959b5c2d11f..455f7afc1f36 100644 > --- a/drivers/net/tun.c > +++ b/drivers/net/tun.c > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct > tun_prog __rcu **prog_p, > prog = NULL; > } else { > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > + if (IS_ERR(prog)) > + prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS); > if (IS_ERR(prog)) > return PTR_ERR(prog); > } Comment from Alexei Starovoitov: Patches 1 and 2 are missing for me, so I couldn't review properly, but this diff looks odd. It allows sched_cls prog type to attach to tun. That means everything that sched_cls progs can do will be done from tun hook? sched_cls assumes l2 and can modify the packet. I think crashes are inevitable. > -- > 2.17.1 >
Re: mlx5 error when the skb linear space is empty
On Mon, 2021-01-11 at 09:02 +0100, Magnus Karlsson wrote: > On Tue, Jan 5, 2021 at 9:51 PM Saeed Mahameed > wrote: > > On Mon, 2021-01-04 at 18:59 +0800, Xuan Zhuo wrote: > > > hi > > > > > > In the process of developing xdp socket, we tried to directly use > > > page to > > > construct skb directly, to avoid data copy. And the MAC > > > information > > > is also in > > > the page, which caused the linear space of skb to be empty. In > > > this > > > case, I > > > encountered a problem : > > > > > > mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn > > > 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68 > > > : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 > > > WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 > > > : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 > > > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 > > > 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > > mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb > > > > > > > > > And when I try to copy only the mac address into the linear space > > > of > > > skb, the > > > other parts are still placed in the page. When constructing skb > > > in > > > this way, I > > > found that although the data can be sent successfully, the > > > sending > > > performance > > > is relatively poor!! > > > > > > > Hi, > > > > This is an expected behavior of ConnectX4-LX, ConnectX4-LX requires > > the > > driver to copy at least the L2 headers into the linear part, in > > some > > DCB/DSCP configuration it will require L3 headers. > > Do I understand this correctly if I say whatever is calling > ndo_start_xmit has to make sure at least the L2 headers is in the > linear part of the skb? If Xuan does not do this, the ConnectX4 > driver > crashes, but if he does, it works. So from an ndo_start_xmit > interface > perspective, what is the requirement of an skb that is passed to it? > Do all users of ndo_start_xmit make sure the L2 header is in the > linear part, or are there users that do not make sure this is the > case? Judging from the ConnectX5 code it seems that the latter is > possible (since it has code to deal with this), but from the > ConnectX4, it seems like the former is true (since it does not copy > the L2 headers into the linear part as far as I can see). Sorry for > my > confusion, but I think it is important to get some clarity here as it > will decide if Xuan's patch is a good idea or not in its current > form. > To clarify: Connectx4Lx, doesn't really require data to be in the linear part, I was refereing to a HW limitation that requires the driver to copy the L2/L3 headers (depending on current HW config) to a special area in the tx descriptor, currently the driver copy the L2/L3 headers only from the linear part of the SKB, but this can be changed via calling pskb_may_pull in mlx5 ConnectX4LX tx path to make sure the linear part has the needed data .. Something like: diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c index 61ed671fe741..5939fd8eed2c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c @@ -159,6 +159,7 @@ static inline int mlx5e_skb_l2_header_offset(struct sk_buff *skb) { #define MLX5E_MIN_INLINE (ETH_HLEN + VLAN_HLEN) + /* need to check ret val */ + pskb_may_pull(skb, MLX5E_MIN_INLINE); return max(skb_network_offset(skb), MLX5E_MIN_INLINE); }
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
On Tue, 12 Jan 2021 21:43:29 +0100 Andrew Lunn wrote: > > +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but > > + * instead via address 0x51, when SFP page is set to 0x03 and password to > > + * 0x. > > + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only > > at > > + * bus creation time, and expect it to remain set to 0x03 throughout the > > + * lifetime of the module plugged into the system. If the SFP code starts > > + * modifying SFP_PAGE in the future, this code will need to change. > > ... > > > +/* In order to not interfere with other SFP code (which possibly may > > manipulate > > + * SFP_PAGE), for every transfer we do this: > > + * 1. lock the bus > > + * 2. save content of SFP_PAGE > > + * 3. set SFP_PAGE to 3 > > + * 4. do the transfer > > + * 5. restore original SFP_PAGE > > + * 6. unlock the bus > > These two comments seem to contradict each other? I forgot about the first comment. I will rewrite it and send again. > > > +static int i2c_rollball_mii_poll(struct mii_bus *bus, int bus_addr, u8 > > *buf, > > +size_t len) > > +{ > > + struct i2c_adapter *i2c = bus->priv; > > + struct i2c_msg msgs[2]; > > + u8 cmd_addr, tmp, *res; > > + int i, ret; > > + > > + cmd_addr = ROLLBALL_CMD_ADDR; > > + > > + res = buf ? buf : &tmp; > > + len = buf ? len : 1; > > + > > + msgs[0].addr = bus_addr; > > + msgs[0].flags = 0; > > + msgs[0].len = 1; > > + msgs[0].buf = &cmd_addr; > > + > > + msgs[1].addr = bus_addr; > > + msgs[1].flags = I2C_M_RD; > > + msgs[1].len = len; > > + msgs[1].buf = res; > > + > > + /* By experiment it takes up to 70 ms to access a register for these > > +* SFPs. Sleep 20ms between iteratios and try 10 times. > > +*/ > > iterations THX. > > > +static int i2c_mii_read_rollball(struct mii_bus *bus, int phy_id, int reg) > > +{ > > + u8 buf[4], res[6]; > > + int bus_addr, ret; > > + u16 val; > > + > > + if (!(reg & MII_ADDR_C45)) > > + return -EOPNOTSUPP; > > + > > + bus_addr = i2c_mii_phy_addr(phy_id); > > + if (bus_addr != ROLLBALL_PHY_I2C_ADDR) > > + return 0x; > > + > > + buf[0] = ROLLBALL_DATA_ADDR; > > + buf[1] = (reg >> 16) & 0x1f; > > + buf[2] = (reg >> 8) & 0xff; > > + buf[3] = reg & 0xff; > > This looks odd. There are only 32 registers for C22 transactions, so > it fits in one byte. You can set buf[1] and buf[2] to zero. C22 is not supported by this protocol. A few line above there is if (!(reg & MII_ADDR_C45)) return -EOPNOTSUPP; > C45 transactions allow for 65536 registers, and has the devtype field, > so would need 3 bytes. Which suggests that C45 might actually be > supported? At least by the protocol, if not the device. > > + dev_dbg(&bus->dev, "read reg %02x:%04x = %04x\n", (reg >> 16) & 0x1f, > > + reg & 0x, val); > > There is a tracepoint that allows access to this information, so you > can probably remove this. > > Andrew OK then, I will remove this. Thanks, Andrew.
Re: [PATCH net-next 4/4] tcp: remove limit on initial receive window
On 2021-01-12 21:26 (+0100), Eric Dumazet wrote: > On Tue, Jan 12, 2021 at 8:25 PM Heath Caldwell wrote: > > > > On 2021-01-12 18:05 (+0100), Eric Dumazet wrote: > > > On Tue, Jan 12, 2021 at 5:02 PM Heath Caldwell > > > wrote: > > > > > > > > On 2021-01-12 09:30 (+0100), Eric Dumazet wrote: > > > > > I think the whole patch series is an attempt to badly break TCP stack. > > > > > > > > Can you explain the concern that you have about how these changes might > > > > break the TCP stack? > > > > > > > > Patches 1 and 3 fix clear bugs. > > > > > > Not clear to me at least. > > > > > > If they were bug fixes, a FIxes: tag would be provided. > > > > The underlying bugs that are addressed in patches 1 and 3 are present in > > 1da177e4c3f4 ("Linux-2.6.12-rc2") which looks to be the earliest parent > > commit in the repository. What should I do for a Fixes: tag in this > > case? > > > > > You are a first time contributor to linux TCP stack, so better make > > > sure your claims are solid. > > > > I fear that I may not have expressed the problems and solutions in a > > manner that imparted the ideas well. > > > > Maybe I added too much detail in the description for patch 1, which may > > have obscured the problem: val is capped to sysctl_rmem_max *before* it > > is doubled (resulting in the possibility for sk_rcvbuf to be set to > > 2*sysctl_rmem_max, rather than it being capped at sysctl_rmem_max). > > This is fine. This has been done forever. Your change might break > applications. In what way might applications be broken? It seems to be a very strange position to allow a configured maximum to be violated because of obscure precedent. It does not seem to be a supportable position to allow an application to violate an installation's configuration because of a chance that the application may behave differently if a setsockopt() call fails. What if a system administrator decides to reduce sysctl_rmem_max to half of the current default? > I would advise documenting this fact, since existing behavior will be kept > in many linux hosts for years to come. > > > > > Maybe I was not explicit enough in the description for patch 3: space is > > expanded into sock_net(sk)->ipv4.sysctl_tcp_rmem[2] and sysctl_rmem_max > > without first shrinking them to discount the overhead. > > > > > > Patches 2 and 4 might be arguable, though. > > > > > > So we have to pick up whatever pleases us ? > > > > I have been treating all of these changes together because they all kind > > of work together to provide a consistent model and configurability for > > the initial receive window. > > > > Patches 1 and 3 address bugs. > > Maybe, but will break applications. How might patch 3 break an application? It merely will reduce the window scale value to something lower but still capable of representing the largest window that a particular connection might advertise. > > Patch 2 addresses an inconsistency in how overhead is treated specially > > for TCP sockets. > > Patch 4 addresses the 64KB limit which has been imposed. > > For very good reasons. What are the reasons? > This is going nowhere. I will stop right now. That is a shame :(.
Re: [PATCH v3 bpf-next 5/7] bpf: support BPF ksym variables in kernel modules
On Tue, Jan 12, 2021 at 8:27 AM Daniel Borkmann wrote: > > On 1/12/21 8:55 AM, Andrii Nakryiko wrote: > > Add support for directly accessing kernel module variables from BPF programs > > using special ldimm64 instructions. This functionality builds upon vmlinux > > ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow > > specifying kernel module BTF's FD in insn[1].imm field. > > > > During BPF program load time, verifier will resolve FD to BTF object and > > will > > take reference on BTF object itself and, for module BTFs, corresponding > > module > > as well, to make sure it won't be unloaded from under running BPF program. > > The > > mechanism used is similar to how bpf_prog keeps track of used bpf_maps. > > > > One interesting change is also in how per-CPU variable is determined. The > > logic is to find .data..percpu data section in provided BTF, but both > > vmlinux > > and module each have their own .data..percpu entries in BTF. So for module's > > case, the search for DATASEC record needs to look at only module's added BTF > > types. This is implemented with custom search function. > > > > Acked-by: Yonghong Song > > Acked-by: Hao Luo > > Signed-off-by: Andrii Nakryiko > [...] > > + > > +struct module *btf_try_get_module(const struct btf *btf) > > +{ > > + struct module *res = NULL; > > +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES > > + struct btf_module *btf_mod, *tmp; > > + > > + mutex_lock(&btf_module_mutex); > > + list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { > > + if (btf_mod->btf != btf) > > + continue; > > + > > + if (try_module_get(btf_mod->module)) > > + res = btf_mod->module; > > One more thought (follow-up would be okay I'd think) ... when a module > references > a symbol from another module, it similarly needs to bump the refcount of the > module > that is owning it and thus disallowing to unload for that other module's > lifetime. > That usage dependency is visible via /proc/modules however, so if unload > doesn't work > then lsmod allows a way to introspect that to the user. This seems to be > achieved via > resolve_symbol() where it records its dependency/usage. Would be great if we > could at > some point also include the BPF prog name into that list so that this is more > obvious. > Wdyt? > Yeah, it's definitely nice to see dependent bpf progs. There is struct module_use, which is used to record these dependencies, but the assumption there is that dependencies could be only other modules. So one way is to somehow extend that or add another set of bpf_prog dependencies. First is a bit intrusive, while the seconds sucks even more, IMO. Alternatively, we can rely on bpf_link info to emit module info, if the BPF program is attached to BTF type from the module. Then with bpftool it would be easy to see this, but it's not as readily-available info as /proc/modules, of course. Any preferences? > > + break; > > + } > > + mutex_unlock(&btf_module_mutex); > > +#endif > > + > > + return res; > > +} > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > > index 261f8692d0d2..69c3c308de5e 100644 > > --- a/kernel/bpf/core.c > > +++ b/kernel/bpf/core.c > > @@ -2119,6 +2119,28 @@ static void bpf_free_used_maps(struct bpf_prog_aux > > *aux) > > kfree(aux->used_maps); > > } > > > > +void __bpf_free_used_btfs(struct bpf_prog_aux *aux, > > + struct btf_mod_pair *used_btfs, u32 len) > > +{ > > +#ifdef CONFIG_BPF_SYSCALL > > + struct btf_mod_pair *btf_mod; > > + u32 i; > > + > > + for (i = 0; i < len; i++) { > > + btf_mod = &used_btfs[i]; > > + if (btf_mod->module) > > + module_put(btf_mod->module); > > + btf_put(btf_mod->btf); > > + } > > +#endif > > +}
Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, Jan 12, 2021 at 08:54:04PM +0100, Marek Behún wrote: > From: Pavana Sharma > > The Marvell 88E6393X device is a single-chip integration of a 11-port > Ethernet switch with eight integrated Gigabit Ethernet (GbE) > transceivers and three 10-Gigabit interfaces. > > This patch adds functionalities specific to mv88e6393x family (88E6393X, > 88E6193X and 88E6191X). > > The main differences between previous devices and this one are: > - port 0 can be a SERDES port > - all SERDESes are one-lane, eg. no XAUI nor RXAUI > - on the other hand the SERDESes can do USXGMII, 10GBASER and 5GBASER > (on 6191X only one SERDES is capable of more than 1g; USXGMII is not > yet supported with this change) > - Port Policy CTL register is changed to Port Policy MGMT CTL register, > via which several more registers can be accessed indirectly > - egress monitor port is configured differently > - ingress monitor/CPU/mirror ports are configured differently and can be > configured per port (ie. each port can have different ingress monitor > port, for example) > - port speed AltBit works differently than previously > - PHY registers can be also accessed via MDIO address 0x18 and 0x19 > (on previous devices they could be accessed only via Global 2 offsets >0x18 and 0x19, which means two indirections; this feature is not yet >leveraged with this patch) > > Co-developed-by: Ashkan Boldaji > Signed-off-by: Ashkan Boldaji > Signed-off-by: Pavana Sharma > Co-developed-by: Marek Behún > Signed-off-by: Marek Behún > --- > drivers/net/dsa/mv88e6xxx/chip.c| 149 + > drivers/net/dsa/mv88e6xxx/chip.h| 4 + > drivers/net/dsa/mv88e6xxx/global1.h | 2 + > drivers/net/dsa/mv88e6xxx/global2.h | 8 + > drivers/net/dsa/mv88e6xxx/port.c| 267 > drivers/net/dsa/mv88e6xxx/port.h| 47 - > drivers/net/dsa/mv88e6xxx/serdes.c | 312 > drivers/net/dsa/mv88e6xxx/serdes.h | 44 > 8 files changed, 831 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c > b/drivers/net/dsa/mv88e6xxx/chip.c > index ed07cb29b285..c2dc6858481a 100644 > --- a/drivers/net/dsa/mv88e6xxx/chip.c > +++ b/drivers/net/dsa/mv88e6xxx/chip.c > @@ -635,6 +635,28 @@ static void mv88e6390x_phylink_validate(struct > mv88e6xxx_chip *chip, int port, > mv88e6390_phylink_validate(chip, port, mask, state); > } > > +static void mv88e6393x_phylink_validate(struct mv88e6xxx_chip *chip, int > port, > + unsigned long *mask, > + struct phylink_link_state *state) > +{ > + if (port == 0 || port == 9 || port == 10) { > + phylink_set(mask, 1baseT_Full); > + phylink_set(mask, 1baseCR_Full); > + phylink_set(mask, 1baseSR_Full); > + phylink_set(mask, 1baseLR_Full); > + phylink_set(mask, 1baseLRM_Full); > + phylink_set(mask, 1baseER_Full); Why did you remove 1000baseKR_Full from here?
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
> +static int i2c_transfer_rollball(struct i2c_adapter *i2c, > + struct i2c_msg *msgs, int num) > +{ > + u8 saved_page; > + int ret; > + > + i2c_lock_bus(i2c, I2C_LOCK_SEGMENT); > + > + /* save original page */ > + ret = __i2c_rollball_get_page(i2c, msgs->addr, &saved_page); > + if (ret) > + goto unlock; > + > + /* change to RollBall MDIO page */ > + ret = __i2c_rollball_set_page(i2c, msgs->addr, SFP_PAGE_ROLLBALL_MDIO); > + if (ret) > + goto unlock; > + > + /* do the transfer */ > + ret = __i2c_transfer_err(i2c, msgs, num); > + if (ret) > + goto unlock; If get page and set page worked, and you get an error in during the actual data transfer, i wonder if you should try restoring the page before returning with the error? > + > + /* restore original page */ > + ret = __i2c_rollball_set_page(i2c, msgs->addr, saved_page); > + > +unlock: > + i2c_unlock_bus(i2c, I2C_LOCK_SEGMENT); > + > + return ret; > +}
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich wrote: > > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich > wrote: > > > > Existing TUN module is able to use provided "steering eBPF" to > > calculate per-packet hash and derive the destination queue to > > place the packet to. The eBPF uses mapped configuration data > > containing a key for hash calculation and indirection table > > with array of queues' indices. > > > > This series of patches adds support for virtio-net hash reporting > > feature as defined in virtio specification. It extends the TUN module > > and the "steering eBPF" as follows: > > > > Extended steering eBPF calculates the hash value and hash type, keeps > > hash value in the skb->hash and returns index of destination virtqueue > > and the type of the hash. TUN module keeps returned hash type in > > (currently unused) field of the skb. > > skb->__unused renamed to 'hash_report_type'. > > > > When TUN module is called later to allocate and fill the virtio-net > > header and push it to destination virtqueue it populates the hash > > and the hash type into virtio-net header. > > > > VHOST driver is made aware of respective virtio-net feature that > > extends the virtio-net header to report the hash value and hash report > > type. > > Comment from Willem de Bruijn: > > Skbuff fields are in short supply. I don't think we need to add one > just for this narrow path entirely internal to the tun device. > We understand that and try to minimize the impact by using an already existing unused field of skb. > Instead, you could just run the flow_dissector in tun_put_user if the > feature is negotiated. Indeed, the flow dissector seems more apt to me > than BPF here. Note that the flow dissector internally can be > overridden by a BPF program if the admin so chooses. > When this set of patches is related to hash delivery in the virtio-net packet in general, it was prepared in context of RSS feature implementation as defined in virtio spec [1] In case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest. Our intention is to keep the hash and hash type in the skb to populate them into a virtio-net header later in tun_put_user. Note that in this case the type of calculated hash is selected not only from flow dissections but also from limitations provided by the guest. This is already implemented in qemu (for case of vhost=off), see [2] (virtio_net_process_rss) For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN. Note that exact way of selecting rx virtqueue depends on the guest, it could be automatic steering (typical for Linux VM), RSS (typical for Windows VM) or any other steering mechanism implemented in loadable TUN steering BPF with or without hash calculation. [1] https://github.com/oasis-tcs/virtio-spec/blob/master/content.tex#L3740 [2] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L1591 > This also hits on a deeper point with the choice of hash values, that > I also noticed in my RFC patchset to implement the inverse [1][2]. It > is much more detailed than skb->hash + skb->l4_hash currently offers, > and that can be gotten for free from most hardware. Unfortunately in the case of RSS we can't get this hash from the hardware as this requires configuration of the NIC's hardware with key and hash types for Toeplitz hash calculation. > In most practical > cases, that information suffices. I added less specific fields > VIRTIO_NET_HASH_REPORT_L4, VIRTIO_NET_HASH_REPORT_OTHER that work > without explicit flow dissection. I understand that the existing > fields are part of the standard. Just curious, what is their purpose > beyond 4-tuple based flow hashing? The hash is used in combination with the indirection table to select destination rx virtqueue. The hash and hash type are to be reported in virtio-net header, if requested. For Windows VM - in case the device does not report the hash (even if it calculated it to schedule the packet to a proper queue), the driver must do that for each packet (this is a certification requirement). > > [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=406859&state=* > [2] > https://github.com/wdebruij/linux/commit/0f77febf22cd6ffc242a575807fa8382a26e511e > > > > Yuri Benditovich (7): > > skbuff: define field for hash report type > > vhost: support for hash report virtio-net feature > > tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type > > tun: free bpf_program by bpf_prog_put instead of bpf_prog_destroy > > tun: add ioctl code TUNSETHASHPOPULATION > > tun: populate hash in virtio-net header when needed > > tun: report new tun feature IFF_HASH > > > > drivers/net/tun.c | 43 +++-- > > drivers/vhost/net.c | 37
Re: [PATCH bpf 2/2] libbpf: allow loading empty BTFs
On Tue, Jan 12, 2021 at 12:17 PM Daniel Borkmann wrote: > > On 1/12/21 7:41 AM, Andrii Nakryiko wrote: > > On Mon, Jan 11, 2021 at 5:16 PM Yonghong Song wrote: > >> On 1/11/21 12:51 PM, Andrii Nakryiko wrote: > >>> On Mon, Jan 11, 2021 at 10:13 AM Yonghong Song wrote: > On 1/9/21 11:03 PM, Andrii Nakryiko wrote: > > Empty BTFs do come up (e.g., simple kernel modules with no new types and > > strings, compared to the vmlinux BTF) and there is nothing technically > > wrong > > with them. So remove unnecessary check preventing loading empty BTFs. > > > > Reported-by: Christopher William Snowhill > > Fixes: ("d8123624506c libbpf: Fix BTF data layout checks and allow > > empty BTF") > > Fixed up Fixes tag ^ while applying. ;-) Oh the irony, eh? :) Thanks, Daniel! > > > Signed-off-by: Andrii Nakryiko > > --- > > tools/lib/bpf/btf.c | 5 - > > 1 file changed, 5 deletions(-) > > > > diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c > > index 3c3f2bc6c652..9970a288dda5 100644 > > --- a/tools/lib/bpf/btf.c > > +++ b/tools/lib/bpf/btf.c > > @@ -240,11 +240,6 @@ static int btf_parse_hdr(struct btf *btf) > > } > > > > meta_left = btf->raw_size - sizeof(*hdr); > > - if (!meta_left) { > > - pr_debug("BTF has no data\n"); > > - return -EINVAL; > > - } > > Previous kernel patch allows empty btf only if that btf is module (not > base/vmlinux) btf. Here it seems we allow any empty non-module btf to be > loaded into the kernel. In such cases, loading may fail? Maybe we should > detect such cases in libbpf and error out instead of going to kernel and > get error back? > >>> > >>> I did this consciously. Kernel is more strict, because there is no > >>> reasonable case when vmlinux BTF or BPF program's BTF can be empty (at > >>> least not that now we have FUNCs in BTF). But allowing libbpf to load > >>> empty BTF generically is helpful for bpftool, as one example, for > >>> inspection. If you do `bpftool btf dump` on empty BTF, it will just > >>> print nothing and you'll know that it's a valid (from BTF header > >>> perspective) BTF, just doesn't have any types (besides VOID). If we > >>> don't allow it, then we'll just get an error and then you'll have to > >>> do painful hex dumping and decoding to see what's wrong. > >> > >> It is totally okay to allow empty btf in libbpf. I just want to check > >> if this btf is going to be loaded into the kernel, right before it is > >> loading whether libbpf could check whether it is a non-module empty btf > >> or not, if it is, do not go to kernel. > > > > Ok, I see what you are proposing. We can do that, but it's definitely > > separate from these bug fixes. But, to be honest, I wouldn't bother > > because libbpf will return BTF verification log with a very readable > > "No data" message in it. > > Right, seems okay to me for this particular case given the user will be > able to make some sense of it from the log. > > Thanks, > Daniel
Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type
On Tue, Jan 12, 2021 at 9:46 PM Alexei Starovoitov wrote: > > On Tue, Jan 12, 2021 at 11:42 AM Yuri Benditovich > wrote: > > > > This program type can set skb hash value. It will be useful > > when the tun will support hash reporting feature if virtio-net. > > > > Signed-off-by: Yuri Benditovich > > --- > > drivers/net/tun.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c > > index 7959b5c2d11f..455f7afc1f36 100644 > > --- a/drivers/net/tun.c > > +++ b/drivers/net/tun.c > > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, > > struct tun_prog __rcu **prog_p, > > prog = NULL; > > } else { > > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER); > > + if (IS_ERR(prog)) > > + prog = bpf_prog_get_type(fd, > > BPF_PROG_TYPE_SCHED_CLS); > > You've ignored the feedback and just resend? what for? No, I do not. Some patches did not reach relevant people at all, so I just resent _all_ the patches to all the people. I will copy your earlier comment to this patch and will address it in the discussion. Sorry for misunderstanding and some redundant noise.
Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, 12 Jan 2021 23:30:48 +0200 Vladimir Oltean wrote: > On Tue, Jan 12, 2021 at 10:16:32PM +0100, Marek Behún wrote: > > On Tue, 12 Jan 2021 22:38:08 +0200 > > Vladimir Oltean wrote: > > > > > > + phylink_set(mask, 1baseT_Full); > > > > + phylink_set(mask, 1baseCR_Full); > > > > + phylink_set(mask, 1baseSR_Full); > > > > + phylink_set(mask, 1baseLR_Full); > > > > + phylink_set(mask, 1baseLRM_Full); > > > > + phylink_set(mask, 1baseER_Full); > > > > > > Why did you remove 1000baseKR_Full from here? > > > > I am confused now. Should 1000baseKR_Full be here? 10g-kr is 10g-r with > > clause 73 autonegotiation, so they are not compatible, or are they? > > ETHTOOL_LINK_MODE_1baseKR_Full_BIT and most of everything else from > enum ethtool_link_mode_bit_indices are network-side media interfaces > (aka between the PHY and the link partner). > > Whereas PHY_INTERFACE_MODE_10GKR and most of everything else from > phy_interface_t is a system-side media independent interface (aka > between the MAC and the PHY). OK that finaly clears things up :) Sorry, I misunderstood it: I thought that PHY_INTERFACE_MODE_10GKR is related to 1baseKR_Full and that they are incompatible with 10gbase-r. > What the 6393X does not support is PHY_INTERFACE_MODE_10GKR, and my > feedback from your previous series did not ask you to remove > 1000baseKR_Full from phylink_validate. There's nothing that would > prevent somebody from attaching a PHY with a different (and compatible) > system-side SERDES protocol, and 10GBase-KR on the media side. > > What Russell said is that he's seriously thinking about reworking the > phylink_validate API so that the MAC driver would not need to sign off > on things it does not care about (such as what media-side interface will > the PHY use, of which there is an ever increasing plethora). But at the > moment, the API is what it is, and you should put as many media-side > link modes as possible, at the given speed and duplex that the MAC > supports. > > _I_ am confused. What did you say you were happy to help with? I am happy to help with phylink_validate code refactorization - refactorization of the code itself, reviewing, testing. Marek
Re: [PATCH net-next v15 4/6] net: dsa: mv88e6xxx: wrap .set_egress_port method
On Tue, Jan 12, 2021 at 08:54:03PM +0100, Marek Behún wrote: > There are two implementations of the .set_egress_port method, and both > of them, if successful, set chip->*gress_dest_port variable. > > To avoid code repetition, wrap this method into > mv88e6xxx_set_egress_port. > > Signed-off-by: Marek Behún > Reviewed-by: Pavana Sharma > --- Reviewed-by: Vladimir Oltean
Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling
On 1/12/21 8:46 PM, Andrii Nakryiko wrote: On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti wrote: Add support for pointer to mem register spilling, to allow the verifier to track pointer to valid memory addresses. Such pointers are returned for example by a successful call of the bpf_ringbuf_reserve helper. This patch was suggested as a solution by Yonghong Song. The patch was partially contibuted by CyberArk Software, Inc. Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support for it") Surprised no one mentioned this yet. Fixes tag should always be on a single unwrapped line, however long it is, please fix. Especially for first-time contributors there is usually some luxury that we would have fixed this up on the fly while applying. ;) But given a v2 is going to be sent anyway it's good to point it out for future reference, agree. Thanks, Daniel
Re: [PATCH net-next 0/3] r8169: further improvements
On Tue, 2021-01-12 at 09:27 +0100, Heiner Kallweit wrote: > Series includes further smaller improvements. > > Heiner Kallweit (3): > r8169: align rtl_wol_suspend_quirk with vendor driver and rename it > r8169: improve rtl8169_rx_csum > r8169: improve DASH support > > drivers/net/ethernet/realtek/r8169_main.c | 81 +-- > > 1 file changed, 31 insertions(+), 50 deletions(-) > Reviewed-by: Saeed Mahameed
Re: [PATCH net-next v4 1/4] net: phy: mdio-i2c: support I2C MDIO protocol for RollBall SFP modules
> -struct mii_bus *mdio_i2c_alloc(struct device *parent, struct i2c_adapter > *i2c) > +/* RollBall SFPs do not access internal PHY via I2C address 0x56, but > + * instead via address 0x51, when SFP page is set to 0x03 and password to > + * 0x. > + * Since current SFP code does not modify SFP_PAGE, we set it to 0x03 only at > + * bus creation time, and expect it to remain set to 0x03 throughout the > + * lifetime of the module plugged into the system. If the SFP code starts > + * modifying SFP_PAGE in the future, this code will need to change. Just an FYI: This is likely to change. NVIDIA are working on generalizing the API ethtool -m uses. The new API should allow access to pages and banks. I've already said i will implement the API in the generic phylink SFP code. But no need to address this now. Andrew
Re: [PATCH bpf 1/2] bpf: allow empty module BTFs
Hello: This series was applied to bpf/bpf.git (refs/heads/master): On Sat, 9 Jan 2021 23:03:40 -0800 you wrote: > Some modules don't declare any new types and end up with an empty BTF, > containing only valid BTF header and no types or strings sections. This > currently causes BTF validation error. There is nothing wrong with such BTF, > so fix the issue by allowing module BTFs with no types or strings. > > Reported-by: Christopher William Snowhill > Fixes: 36e68442d1af ("bpf: Load and verify kernel module BTFs") > Signed-off-by: Andrii Nakryiko > > [...] Here is the summary with links: - [bpf,1/2] bpf: allow empty module BTFs https://git.kernel.org/bpf/bpf/c/bcc5e6162d66 - [bpf,2/2] libbpf: allow loading empty BTFs https://git.kernel.org/bpf/bpf/c/b8d52264df85 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH bpf-next 1/2] trace: bpf: Allow bpf to attach to bare tracepoints
On 1/11/21 10:20 AM, Qais Yousef wrote: Some subsystems only have bare tracepoints (a tracepoint with no associated trace event) to avoid the problem of trace events being an ABI that can't be changed. From bpf presepective, bare tracepoints are what it calls RAW_TRACEPOINT(). Since bpf assumed there's 1:1 mapping, it relied on hooking to DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had no knowledge about their existence. By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints. Enabling that comes with the contract that changes to raw tracepoints don't constitute a regression if they break existing bpf programs. We need the ability to continue to morph and modify these raw tracepoints without worrying about any ABI. Update Documentation/bpf/bpf_design_QA.rst to document this contract. Signed-off-by: Qais Yousef --- Documentation/bpf/bpf_design_QA.rst | 6 ++ include/trace/bpf_probe.h | 12 ++-- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst index 2df7b067ab93..0e15f9b05c9d 100644 --- a/Documentation/bpf/bpf_design_QA.rst +++ b/Documentation/bpf/bpf_design_QA.rst @@ -208,6 +208,12 @@ data structures and compile with kernel internal headers. Both of these kernel internals are subject to change and can break with newer kernels such that the program needs to be adapted accordingly. +Q: Are tracepoints part of the stable ABI? +-- +A: NO. Tracepoints are tied to internal implementation details hence they are +subject to change and can break with newer kernels. BPF programs need to change +accordingly when this happens. + Q: How much stack space a BPF program uses? --- A: Currently all program types are limited to 512 bytes of stack diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h index cd74bffed5c6..cf1496b162b1 100644 --- a/include/trace/bpf_probe.h +++ b/include/trace/bpf_probe.h @@ -55,8 +55,7 @@ /* tracepoints with more than 12 arguments will hit build error */ #define CAST_TO_U64(...) CONCATENATE(__CAST, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__) -#undef DECLARE_EVENT_CLASS -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ +#define __BPF_DECLARE_TRACE(call, proto, args) \ static notrace void \ __bpf_trace_##call(void *__data, proto) \ { \ @@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto) \ CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, CAST_TO_U64(args)); \ } +#undef DECLARE_EVENT_CLASS +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \ + __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) + /* * This part is compiled out, it is only here as a build time check * to make sure that if the tracepoint handling changes, the @@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args), size) #define DEFINE_EVENT_PRINT(template, name, proto, args, print)\ DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args)) +#undef DECLARE_TRACE +#define DECLARE_TRACE(call, proto, args) \ + (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ +__DEFINE_EVENT(call, call, PARAMS(proto), PARAMS(args), 0)) I applied the patch to my local bpf-next repo, and got the following compilation error: In file included from /data/users/yhs/work/net-next/include/trace/define_trace.h:104, from /data/users/yhs/work/net-next/include/trace/events/sched.h:740, from /data/users/yhs/work/net-next/kernel/sched/core.c:10: /data/users/yhs/work/net-next/include/trace/bpf_probe.h:59:1: error: expected identifier or ‘(’ before ‘static’ static notrace void \ ^~ /data/users/yhs/work/net-next/include/trace/bpf_probe.h:119:3: note: in expansion of macro ‘__BPF_DECLARE_TRACE’ (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ ^~~ /data/users/yhs/work/net-next/include/trace/events/sched.h:693:1: note: in expansion of macro ‘DECLARE_TRACE’ DECLARE_TRACE(pelt_cfs_tp, ^ /data/users/yhs/work/net-next/include/trace/bpf_probe.h:59:1: error: expected identifier or ‘(’ before ‘static’ static notrace void \ ^~ /data/users/yhs/work/net-next/include/trace/bpf_probe.h:119:3: note: in expansion of macro ‘__BPF_DECLARE_TRACE’ (__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args)) \ ^~~ /dat
Re: [PATCH bpf 2/2] libbpf: allow loading empty BTFs
On 1/12/21 7:41 AM, Andrii Nakryiko wrote: On Mon, Jan 11, 2021 at 5:16 PM Yonghong Song wrote: On 1/11/21 12:51 PM, Andrii Nakryiko wrote: On Mon, Jan 11, 2021 at 10:13 AM Yonghong Song wrote: On 1/9/21 11:03 PM, Andrii Nakryiko wrote: Empty BTFs do come up (e.g., simple kernel modules with no new types and strings, compared to the vmlinux BTF) and there is nothing technically wrong with them. So remove unnecessary check preventing loading empty BTFs. Reported-by: Christopher William Snowhill Fixes: ("d8123624506c libbpf: Fix BTF data layout checks and allow empty BTF") Fixed up Fixes tag ^ while applying. ;-) Signed-off-by: Andrii Nakryiko --- tools/lib/bpf/btf.c | 5 - 1 file changed, 5 deletions(-) diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 3c3f2bc6c652..9970a288dda5 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -240,11 +240,6 @@ static int btf_parse_hdr(struct btf *btf) } meta_left = btf->raw_size - sizeof(*hdr); - if (!meta_left) { - pr_debug("BTF has no data\n"); - return -EINVAL; - } Previous kernel patch allows empty btf only if that btf is module (not base/vmlinux) btf. Here it seems we allow any empty non-module btf to be loaded into the kernel. In such cases, loading may fail? Maybe we should detect such cases in libbpf and error out instead of going to kernel and get error back? I did this consciously. Kernel is more strict, because there is no reasonable case when vmlinux BTF or BPF program's BTF can be empty (at least not that now we have FUNCs in BTF). But allowing libbpf to load empty BTF generically is helpful for bpftool, as one example, for inspection. If you do `bpftool btf dump` on empty BTF, it will just print nothing and you'll know that it's a valid (from BTF header perspective) BTF, just doesn't have any types (besides VOID). If we don't allow it, then we'll just get an error and then you'll have to do painful hex dumping and decoding to see what's wrong. It is totally okay to allow empty btf in libbpf. I just want to check if this btf is going to be loaded into the kernel, right before it is loading whether libbpf could check whether it is a non-module empty btf or not, if it is, do not go to kernel. Ok, I see what you are proposing. We can do that, but it's definitely separate from these bug fixes. But, to be honest, I wouldn't bother because libbpf will return BTF verification log with a very readable "No data" message in it. Right, seems okay to me for this particular case given the user will be able to make some sense of it from the log. Thanks, Daniel
Re: [PATCH v6 net-next 14/15] net: bonding: ensure .ndo_get_stats64 can sleep
On Tue, 2021-01-12 at 16:37 +0200, Vladimir Oltean wrote: > On Mon, Jan 11, 2021 at 03:38:49PM -0800, Saeed Mahameed wrote: > > GFP_ATOMIC is a little bit aggressive especially when user daemons > > are > > periodically reading stats. This can be avoided. > > > > You can pre-allocate with GFP_KERNEL an array with an "approximate" > > size. > > then fill the array up with whatever slaves the the bond has at > > that > > moment, num_of_slaves can be less, equal or more than the array > > you > > just allocated but we shouldn't care .. > > > > something like: > > rcu_read_lock() > > nslaves = bond_get_num_slaves(); > > rcu_read_unlock() > > sarray = kcalloc(nslaves, sizeof(struct bonding_slave_dev), > > GFP_KERNEL); > > rcu_read_lock(); > > bond_fill_slaves_array(bond, sarray); // also do: dev_hold() > > rcu_read_unlock(); > > > > > > bond_get_slaves_array_stats(sarray); > > > > bond_put_slaves_array(sarray); > > I don't know what to say about acquiring RCU read lock twice and > traversing the list of interfaces three or four times. You can optimize this by tracking #num_slaves. > On the other hand, what's the worst that can happen if the GFP_ATOMIC > memory allocation fails. It's not like there is any data loss. > User space will retry when there is less memory pressure. Anyway Up to you, i just don't like it when we use GFP_ATOMIC when it can be avoided, especially for periodic jobs, like stats polling..
Re: [PATCH bpf v2] bpf: don't leak memory in bpf getsockopt when optlen == 0
Hello: This patch was applied to bpf/bpf.git (refs/heads/master): On Tue, 12 Jan 2021 08:28:29 -0800 you wrote: > optlen == 0 indicates that the kernel should ignore BPF buffer > and use the original one from the user. We, however, forget > to free the temporary buffer that we've allocated for BPF. > > Reported-by: Martin KaFai Lau > Fixes: d8fe449a9c51 ("bpf: Don't return EINVAL from {get,set}sockopt when > optlen > PAGE_SIZE") > Signed-off-by: Stanislav Fomichev > > [...] Here is the summary with links: - [bpf,v2] bpf: don't leak memory in bpf getsockopt when optlen == 0 https://git.kernel.org/bpf/bpf/c/4be34f3d0731 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH bpf-next 2/2] selftests: bpf: Add a new test for bare tracepoints
On Tue, Jan 12, 2021 at 11:27 AM Qais Yousef wrote: > > On 01/11/21 23:26, Andrii Nakryiko wrote: > > On Mon, Jan 11, 2021 at 10:20 AM Qais Yousef wrote: > > > > > > Reuse module_attach infrastructure to add a new bare tracepoint to check > > > we can attach to it as a raw tracepoint. > > > > > > Signed-off-by: Qais Yousef > > > --- > > > > > > Andrii > > > > > > I was getting the error below when I was trying to run the test. > > > I had to comment out all related fentry* code to be able to test the > > > raw_tp > > > stuff. Not sure something I've done wrong or it's broken for some reason. > > > I was on v5.11-rc2. > > > > Check that you have all the required Kconfig options from > > tools/testing/selftests/bpf/config. And also you will need to build > > Yep I have merged this config snippet using merge_config.sh script. > > > pahole from master, 1.19 doesn't have some fixes that add kernel > > module support. I think pahole is the reasons why you have the failure > > below. > > I am using pahole 1.19. I have built it from tip of master though. > > /trying using v1.19 tag > > Still fails the same. > > > > > > > > > $ sudo ./test_progs -v -t module_attach > > > > use -vv when debugging stuff like that with test_progs, it will output > > libbpf detailed logs, that often are very helpful > > I tried that but it didn't help me. Full output is here > > https://paste.debian.net/1180846 > It did help a bit for me to make sure that you have bpf_testmod properly loaded and its BTF was found, so the problem is somewhere else. Also, given load succeeded and attach failed with OPNOTSUPP, I suspect you are missing some of FTRACE configs, which seems to be missing from selftests's config as well. Check that you have CONFIG_FTRACE=y and CONFIG_DYNAMIC_FTRACE=y, and you might need some more. See [0] for a real config we are using to run all tests in libbpf CI. If you figure out what you were missing, please also contribute a patch to selftests' config. [0] https://github.com/libbpf/libbpf/blob/master/travis-ci/vmtest/configs/latest.config > > > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. > > > test_module_attach:PASS:skel_open 0 nsec > > > test_module_attach:PASS:set_attach_target 0 nsec > > > test_module_attach:PASS:skel_load 0 nsec > > > libbpf: prog 'handle_fentry': failed to attach: ERROR: > > > strerror_r(-524)=22 > > > libbpf: failed to auto-attach program 'handle_fentry': -524 > > > test_module_attach:FAIL:skel_attach skeleton attach failed: -524 > > > #58 module_attach:FAIL > > > Successfully unloaded bpf_testmod.ko. > > > Summary: 0/0 PASSED, 0 SKIPPED, 1 FAILED > > > > > > > But even apart from test failure, there seems to be kernel build > > failure. See [0] for what fails in kernel-patches CI. > > > >[0] https://travis-ci.com/github/kernel-patches/bpf/builds/212730017 > > Sorry about that. I did a last minute change because of checkpatch.pl error > and > it seems I either forgot to rebuild or missed that the rebuild failed :/ > no worries, just fix and re-submit. Good that we have CI that caught this early on. > > > > > > > > > > .../selftests/bpf/bpf_testmod/bpf_testmod-events.h | 6 ++ > > > tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 2 ++ > > > tools/testing/selftests/bpf/prog_tests/module_attach.c | 1 + > > > tools/testing/selftests/bpf/progs/test_module_attach.c | 10 ++ > > > 4 files changed, 19 insertions(+) > > > > > > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h > > > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h > > > index b83ea448bc79..e1ada753f10c 100644 > > > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h > > > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod-events.h > > > @@ -28,6 +28,12 @@ TRACE_EVENT(bpf_testmod_test_read, > > > __entry->pid, __entry->comm, __entry->off, __entry->len) > > > ); > > > > > > +/* A bare tracepoint with no event associated with it */ > > > +DECLARE_TRACE(bpf_testmod_test_read_bare, > > > + TP_PROTO(struct task_struct *task, struct > > > bpf_testmod_test_read_ctx *ctx), > > > + TP_ARGS(task, ctx) > > > +); > > > + > > > #endif /* _BPF_TESTMOD_EVENTS_H */ > > > > > > #undef TRACE_INCLUDE_PATH > > > diff --git a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c > > > b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c > > > index 2df19d73ca49..d63cebdaca44 100644 > > > --- a/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c > > > +++ b/tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c > > > @@ -22,6 +22,8 @@ bpf_testmod_test_read(struct file *file, struct kobject > > > *kobj, > > > }; > > > > > > trace_bpf_testmod_test_read(current, &ctx); > > > + ctx.len++; > > > + trace_bpf_testm
Re: [PATCH v6 net-next 03/15] net: procfs: hold netif_lists_lock when retrieving device statistics
On Tue, 2021-01-12 at 15:44 +0200, Vladimir Oltean wrote: > On Mon, Jan 11, 2021 at 03:46:32PM -0800, Saeed Mahameed wrote: > > This can be very costly, holding a mutex while traversing the whole > > netedv lists and reading their stats, we need to at least allow > > multiple readers to enter as it was before, so maybe you want to > > use > > rw_semaphore instead of the mutex. > > > > or just have a unified approach of rcu+refcnt/dev_hold as you did > > for > > bonding and failover patches #13..#14, I used the same approach to > > achieve the same for sysfs and procfs more than 2 years ago, you > > are > > welcome to use my patches: > > https://lore.kernel.org/netdev/4cc44e85-cb5e-502c-30f3-c6ea564fe...@gmail.com/ > > Ok, what mail address do you want me to keep for your sign off? Either is fine. Just make sure author and signed-off are the same :). Thanks!
Re: [PATCH net-next v2 0/7] ibmvnic: Use more consistent locking
On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote: > Use more consistent locking when reading/writing the adapter->state > field. This patch set fixes a race condition during ibmvnic_open() > where the adapter could be left in the PROBED state if a reset occurs > at the wrong time. This can cause networking to not come up during > boot and potentially require manual intervention in bringing up > applications that depend on the network. > > Changelog[v2] [Address comments from Jakub Kicinski] > - Fix up commit log for patch 5/7 and drop unnecessary variable > - Format Fixes line properly (no wrapping, no blank lines) > > Sukadev Bhattiprolu (7): > ibmvnic: restore state in change-param reset > ibmvnic: update reset function prototypes > ibmvnic: avoid allocating rwi entries > ibmvnic: switch order of checks in ibmvnic_reset > ibmvnic: serialize access to work queue > ibmvnic: check adapter->state under state_lock > ibmvnic: add comments about state_lock > > Other than the two minor comments, Reviewed-by: Saeed Mahameed
Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
On Mon, Jan 11, 2021 at 10:56 PM Leon Romanovsky wrote: > > On Mon, Jan 11, 2021 at 11:30:39AM -0800, Alexander Duyck wrote: > > On Sun, Jan 10, 2021 at 7:10 AM Leon Romanovsky wrote: > > > > > > From: Leon Romanovsky > > > > > > Some SR-IOV capable devices provide an ability to configure specific > > > number of MSI-X vectors on their VF prior driver is probed on that VF. > > > > > > In order to make management easy, provide new read-only sysfs file that > > > returns a total number of possible to configure MSI-X vectors. > > > > > > cat /sys/bus/pci/devices/.../sriov_vf_total_msix > > > = 0 - feature is not supported > > > > 0 - total number of MSI-X vectors to consume by the VFs > > > > > > Signed-off-by: Leon Romanovsky > > > --- > > > Documentation/ABI/testing/sysfs-bus-pci | 14 +++ > > > drivers/pci/iov.c | 31 + > > > drivers/pci/pci.h | 3 +++ > > > include/linux/pci.h | 2 ++ > > > 4 files changed, 50 insertions(+) > > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci > > > b/Documentation/ABI/testing/sysfs-bus-pci > > > index 05e26e5da54e..64e9b700acc9 100644 > > > --- a/Documentation/ABI/testing/sysfs-bus-pci > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci > > > @@ -395,3 +395,17 @@ Description: > > > The file is writable if the PF is bound to a driver that > > > supports the ->sriov_set_msix_vec_count() callback and > > > there > > > is no driver bound to the VF. > > > + > > > +What: /sys/bus/pci/devices/.../sriov_vf_total_msix > > > > In this case I would drop the "vf" and just go with sriov_total_msix > > since now you are referring to a global value instead of a per VF > > value. > > This field indicates the amount of MSI-X available for VFs, it doesn't > include PFs. The missing "_vf_" will mislead users who will believe that > it is all MSI-X vectors available for this device. They will need to take > into consideration amount of PF MSI-X in order to calculate the VF > distribution. > > So I would leave "_vf_" here. The problem is you aren't indicating how many are available for an individual VF though, you are indicating how many are available for use by SR-IOV to give to the VFs. The fact that you are dealing with a pool makes things confusing in my opinion. For example sriov_vf_device describes the device ID that will be given to each VF. > > > > > +Date: January 2021 > > > +Contact: Leon Romanovsky > > > +Description: > > > + This file is associated with the SR-IOV PFs. > > > + It returns a total number of possible to configure MSI-X > > > + vectors on the enabled VFs. > > > + > > > + The values returned are: > > > +* > 0 - this will be total number possible to consume by > > > VFs, > > > +* = 0 - feature is not supported > > > + > > > + If no SR-IOV VFs are enabled, this value will return 0. > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > index 42c0df4158d1..0a6ddf3230fd 100644 > > > --- a/drivers/pci/iov.c > > > +++ b/drivers/pci/iov.c > > > @@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct > > > device *dev, > > > return count; > > > } > > > > > > +static ssize_t sriov_vf_total_msix_show(struct device *dev, > > > + struct device_attribute *attr, > > > + char *buf) > > > +{ > > > + struct pci_dev *pdev = to_pci_dev(dev); > > > + > > > + return sprintf(buf, "%d\n", pdev->sriov->vf_total_msix); > > > +} > > > + > > > > You display it as a signed value, but unsigned values are not > > supported, correct? > > Right, I made it similar to the vf_msix_set. I can change. > > > > > > static DEVICE_ATTR_RO(sriov_totalvfs); > > > static DEVICE_ATTR_RW(sriov_numvfs); > > > static DEVICE_ATTR_RO(sriov_offset); > > > static DEVICE_ATTR_RO(sriov_stride); > > > static DEVICE_ATTR_RO(sriov_vf_device); > > > static DEVICE_ATTR_RW(sriov_drivers_autoprobe); > > > +static DEVICE_ATTR_RO(sriov_vf_total_msix); > > > > > > static struct attribute *sriov_dev_attrs[] = { > > > &dev_attr_sriov_totalvfs.attr, > > > @@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = { > > > &dev_attr_sriov_stride.attr, > > > &dev_attr_sriov_vf_device.attr, > > > &dev_attr_sriov_drivers_autoprobe.attr, > > > + &dev_attr_sriov_vf_total_msix.attr, > > > NULL, > > > }; > > > > > > @@ -658,6 +669,7 @@ static void sriov_disable(struct pci_dev *dev) > > > sysfs_remove_link(&dev->dev.kobj, "dep_link"); > > > > > > iov->num_VFs = 0; > > > + iov->vf_total_msix = 0; > > > pci_iov_set_numvfs(dev, 0); > > > } > > > > > > @@ -1116,6 +1128,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev) > > > } > >
Re: [PATCH net-next v14 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, Jan 12, 2021 at 07:06:29PM +0100, Marek Behún wrote: > On Tue, 12 Jan 2021 16:29:09 + > Russell King - ARM Linux admin wrote: > > > I'm seriously thinking about changing the phylink_validate() interface > > such that the question of which link _modes_ are supported no longer > > comes up with MAC drivers, but instead MAC drivers say what interface > > modes, speeds for each interface mode, duplexes for each speed are > > supported. > > BTW this would also solve the situation where DSA needs to know which > interface modes are supported on a particular port to know which modes > we can try on SFP connected to a DSA port. What do you mean here? What makes this specific to DSA?
Re: [PATCH mlx5-next v1 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
On Mon, Jan 11, 2021 at 10:39 PM Leon Romanovsky wrote: > > On Mon, Jan 11, 2021 at 11:30:33AM -0800, Alexander Duyck wrote: > > On Sun, Jan 10, 2021 at 7:12 AM Leon Romanovsky wrote: > > > > > > From: Leon Romanovsky > > > > > > Extend PCI sysfs interface with a new callback that allows configure > > > the number of MSI-X vectors for specific SR-IO VF. This is needed > > > to optimize the performance of newly bound devices by allocating > > > the number of vectors based on the administrator knowledge of targeted VM. > > > > > > This function is applicable for SR-IOV VF because such devices allocate > > > their MSI-X table before they will run on the VMs and HW can't guess the > > > right number of vectors, so the HW allocates them statically and equally. > > > > > > The newly added /sys/bus/pci/devices/.../vf_msix_vec file will be seen > > > for the VFs and it is writable as long as a driver is not bounded to the > > > VF. > > > > > > The values accepted are: > > > * > 0 - this will be number reported by the VF's MSI-X capability > > > * < 0 - not valid > > > * = 0 - will reset to the device default value > > > > > > Signed-off-by: Leon Romanovsky > > > --- > > > Documentation/ABI/testing/sysfs-bus-pci | 20 > > > drivers/pci/iov.c | 62 + > > > drivers/pci/msi.c | 29 > > > drivers/pci/pci-sysfs.c | 1 + > > > drivers/pci/pci.h | 2 + > > > include/linux/pci.h | 8 +++- > > > 6 files changed, 121 insertions(+), 1 deletion(-) > > > > > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci > > > b/Documentation/ABI/testing/sysfs-bus-pci > > > index 25c9c39770c6..05e26e5da54e 100644 > > > --- a/Documentation/ABI/testing/sysfs-bus-pci > > > +++ b/Documentation/ABI/testing/sysfs-bus-pci > > > @@ -375,3 +375,23 @@ Description: > > > The value comes from the PCI kernel device state and can > > > be one > > > of: "unknown", "error", "D0", D1", "D2", "D3hot", > > > "D3cold". > > > The file is read only. > > > + > > > +What: /sys/bus/pci/devices/.../vf_msix_vec > > > > So the name for this doesn't seem to match existing SR-IOV naming. It > > seems like this should probably be something like sriov_vf_msix_count > > in order to be closer to the actual naming of what is being dealt > > with. > > I'm open for suggestions. I didn't use sriov_vf_msix_count because it > seems too long for me. > > > > > > +Date: December 2020 > > > +Contact: Leon Romanovsky > > > +Description: > > > + This file is associated with the SR-IOV VFs. > > > + It allows configuration of the number of MSI-X vectors for > > > + the VF. This is needed to optimize performance of newly > > > bound > > > + devices by allocating the number of vectors based on the > > > + administrator knowledge of targeted VM. > > > + > > > + The values accepted are: > > > +* > 0 - this will be number reported by the VF's MSI-X > > > +capability > > > +* < 0 - not valid > > > +* = 0 - will reset to the device default value > > > + > > > + The file is writable if the PF is bound to a driver that > > > + supports the ->sriov_set_msix_vec_count() callback and > > > there > > > + is no driver bound to the VF. > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c > > > index 4afd4ee4f7f0..42c0df4158d1 100644 > > > --- a/drivers/pci/iov.c > > > +++ b/drivers/pci/iov.c > > > @@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id) > > > return (dev->devfn + dev->sriov->offset + > > > dev->sriov->stride * vf_id) & 0xff; > > > } > > > +EXPORT_SYMBOL(pci_iov_virtfn_devfn); > > > > > > /* > > > * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride > > > may > > > @@ -426,6 +427,67 @@ const struct attribute_group sriov_dev_attr_group = { > > > .is_visible = sriov_attrs_are_visible, > > > }; > > > > > > +#ifdef CONFIG_PCI_MSI > > > +static ssize_t vf_msix_vec_show(struct device *dev, > > > + struct device_attribute *attr, char *buf) > > > +{ > > > + struct pci_dev *pdev = to_pci_dev(dev); > > > + int numb = pci_msix_vec_count(pdev); > > > + struct pci_dev *pfdev; > > > + > > > + if (numb < 0) > > > + return numb; > > > + > > > + pfdev = pci_physfn(pdev); > > > + if (!pfdev->driver || !pfdev->driver->sriov_set_msix_vec_count) > > > + return -EOPNOTSUPP; > > > + > > > > This doesn't make sense to me. You are getting the vector count for > > the PCI device and reporting that. Are you expecting to call this on > > the PF or the VFs? It seems like this should be a PF attribute and not > > b
[PATCH net 3/3] netfilter: nf_nat: Fix memleak in nf_nat_init
From: Dinghao Liu When register_pernet_subsys() fails, nf_nat_bysource should be freed just like when nf_ct_extend_register() fails. Fixes: 1cd472bf036ca ("netfilter: nf_nat: add nat hook register functions to nf_nat") Signed-off-by: Dinghao Liu Acked-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_nat_core.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c index ea923f8cf9c4..b7c3c902290f 100644 --- a/net/netfilter/nf_nat_core.c +++ b/net/netfilter/nf_nat_core.c @@ -1174,6 +1174,7 @@ static int __init nf_nat_init(void) ret = register_pernet_subsys(&nat_net_ops); if (ret < 0) { nf_ct_extend_unregister(&nat_extend); + kvfree(nf_nat_bysource); return ret; } -- 2.20.1
[PATCH net 2/3] netfilter: conntrack: fix reading nf_conntrack_buckets
From: Jesper Dangaard Brouer The old way of changing the conntrack hashsize runtime was through changing the module param via file /sys/module/nf_conntrack/parameters/hashsize. This was extended to sysctl change in commit 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too"). The commit introduced second "user" variable nf_conntrack_htable_size_user which shadow actual variable nf_conntrack_htable_size. When hashsize is changed via module param this "user" variable isn't updated. This results in sysctl net/netfilter/nf_conntrack_buckets shows the wrong value when users update via the old way. This patch fix the issue by always updating "user" variable when reading the proc file. This will take care of changes to the actual variable without sysctl need to be aware. Fixes: 3183ab8997a4 ("netfilter: conntrack: allow increasing bucket size via sysctl too") Reported-by: Yoel Caspersen Signed-off-by: Jesper Dangaard Brouer Acked-by: Florian Westphal Signed-off-by: Pablo Neira Ayuso --- net/netfilter/nf_conntrack_standalone.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c index 46c5557c1fec..0ee702d374b0 100644 --- a/net/netfilter/nf_conntrack_standalone.c +++ b/net/netfilter/nf_conntrack_standalone.c @@ -523,6 +523,9 @@ nf_conntrack_hash_sysctl(struct ctl_table *table, int write, { int ret; + /* module_param hashsize could have changed value */ + nf_conntrack_htable_size_user = nf_conntrack_htable_size; + ret = proc_dointvec(table, write, buffer, lenp, ppos); if (ret < 0 || !write) return ret; -- 2.20.1
[PATCH net 0/3] Netfilter fixes for net
Hi, The following patchset contains Netfilter fixes for net: 1) Pass conntrack -f to specify family in netfilter conntrack helper selftests, from Chen Yi. 2) Honor hashsize modparam from nf_conntrack_buckets sysctl, from Jesper D. Brouer. 3) Fix memleak in nf_nat_init() error path, from Dinghao Liu. Chen Yi (1): selftests: netfilter: Pass family parameter "-f" to conntrack tool Dinghao Liu (1): netfilter: nf_nat: Fix memleak in nf_nat_init Jesper Dangaard Brouer (1): netfilter: conntrack: fix reading nf_conntrack_buckets net/netfilter/nf_conntrack_standalone.c | 3 +++ net/netfilter/nf_nat_core.c | 1 + .../selftests/netfilter/nft_conntrack_helper.sh | 12 +--- 3 files changed, 13 insertions(+), 3 deletions(-) Please, pull these changes from: git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git Thanks! The following changes since commit c49243e8898233de18edfaaa5b7b261ea457f221: Merge branch 'net-fix-issues-around-register_netdevice-failures' (2021-01-08 19:27:44 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD for you to fetch changes up to 869f4fdaf4ca7bb6e0d05caf6fa1108dddc346a7: netfilter: nf_nat: Fix memleak in nf_nat_init (2021-01-11 00:34:11 +0100) Chen Yi (1): selftests: netfilter: Pass family parameter "-f" to conntrack tool Dinghao Liu (1): netfilter: nf_nat: Fix memleak in nf_nat_init Jesper Dangaard Brouer (1): netfilter: conntrack: fix reading nf_conntrack_buckets net/netfilter/nf_conntrack_standalone.c | 3 +++ net/netfilter/nf_nat_core.c | 1 + tools/testing/selftests/netfilter/nft_conntrack_helper.sh | 12 +--- 3 files changed, 13 insertions(+), 3 deletions(-)
[PATCH net 1/3] selftests: netfilter: Pass family parameter "-f" to conntrack tool
From: Chen Yi Fix nft_conntrack_helper.sh false fail report: 1) Conntrack tool need "-f ipv6" parameter to show out ipv6 traffic items. 2) Sleep 1 second after background nc send packet, to make sure check is after this statement executed. False report: FAIL: ns1-lkjUemYw did not show attached helper ip set via ruleset PASS: ns1-lkjUemYw connection on port 2121 has ftp helper attached ... After fix: PASS: ns1-2hUniwU2 connection on port 2121 has ftp helper attached PASS: ns2-2hUniwU2 connection on port 2121 has ftp helper attached ... Fixes: 619ae8e0697a6 ("selftests: netfilter: add test case for conntrack helper assignment") Signed-off-by: Chen Yi Signed-off-by: Pablo Neira Ayuso --- .../selftests/netfilter/nft_conntrack_helper.sh | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/netfilter/nft_conntrack_helper.sh b/tools/testing/selftests/netfilter/nft_conntrack_helper.sh index edf0a48da6bf..bf6b9626c7dd 100755 --- a/tools/testing/selftests/netfilter/nft_conntrack_helper.sh +++ b/tools/testing/selftests/netfilter/nft_conntrack_helper.sh @@ -94,7 +94,13 @@ check_for_helper() local message=$2 local port=$3 - ip netns exec ${netns} conntrack -L -p tcp --dport $port 2> /dev/null |grep -q 'helper=ftp' + if echo $message |grep -q 'ipv6';then + local family="ipv6" + else + local family="ipv4" + fi + + ip netns exec ${netns} conntrack -L -f $family -p tcp --dport $port 2> /dev/null |grep -q 'helper=ftp' if [ $? -ne 0 ] ; then echo "FAIL: ${netns} did not show attached helper $message" 1>&2 ret=1 @@ -111,8 +117,8 @@ test_helper() sleep 3 | ip netns exec ${ns2} nc -w 2 -l -p $port > /dev/null & - sleep 1 sleep 1 | ip netns exec ${ns1} nc -w 2 10.0.1.2 $port > /dev/null & + sleep 1 check_for_helper "$ns1" "ip $msg" $port check_for_helper "$ns2" "ip $msg" $port @@ -128,8 +134,8 @@ test_helper() sleep 3 | ip netns exec ${ns2} nc -w 2 -6 -l -p $port > /dev/null & - sleep 1 sleep 1 | ip netns exec ${ns1} nc -w 2 -6 dead:1::2 $port > /dev/null & + sleep 1 check_for_helper "$ns1" "ipv6 $msg" $port check_for_helper "$ns2" "ipv6 $msg" $port -- 2.20.1
Re: [PATCH net-next] net: ks8851: Connect and start/stop the internal PHY
On 1/11/21 3:47 PM, Andrew Lunn wrote: On Mon, Jan 11, 2021 at 01:53:37PM +0100, Marek Vasut wrote: Unless the internal PHY is connected and started, the phylib will not poll the PHY for state and produce state updates. Connect the PHY and start/stop it. Hi Marek Please continue the conversion and remove all mii_calls. ks8851_set_link_ksettings() calling mii_ethtool_set_link_ksettings() is not good, phylib will not know about changes which we made to the PHY etc. Hi, I noticed a couple of drivers implement both the mii and mdiobus options, I was pondering why is that. Is there some legacy backward compatibility reason for keeping both or is it safe to remove the mii support completely from the driver? Either way, I will do that in a separate patch, so it could be reverted if it breaks something.
Re: [PATCH net-next] net: ks8851: Connect and start/stop the internal PHY
On 1/11/21 3:43 PM, Heiner Kallweit wrote: On 11.01.2021 15:10, Marek Vasut wrote: On 1/11/21 2:50 PM, Heiner Kallweit wrote: On 11.01.2021 14:38, Marek Vasut wrote: On 1/11/21 2:26 PM, Heiner Kallweit wrote: [...] LGTM. When having a brief look at the driver I stumbled across two things: 1. Do MAC/PHY support any pause mode? Then a call to phy_support_(a)sym_pause() would be missing. https://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8851-16MLL-Single-Port-Ethernet-MAC-Controller-with-8-Bit-or-16-Bit-Non-PCI-Interface-DS2357B.pdf page 64 https://www.mouser.com/datasheet/2/268/ksz8851-16mll_ds-776208.pdf page 65 The later is more complete. Apparently it does support pause. Based on the datasheet, does it support sym or asym pause ? According to the description of flow control on p.23 it can support asym pause. However on the MAC side flow control doesn't seem to be always active, it's controlled by these two bits: p.49, TXCR, bit 3 p.50, RXCR1, bit 10 Default seems to be that flow control is disabled. So I guess this patch is OK as-is ?
Re: [RFC PATCH net-next] bonding: add a vlan+srcmac tx hashing option
On Tue, Jan 12, 2021 at 01:39:10PM -0800, Jay Vosburgh wrote: > Jarod Wilson wrote: > > >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote: > >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote: > >> > Jarod Wilson wrote: > >> > > >> > >This comes from an end-user request, where they're running multiple VMs > >> > >on > >> > >hosts with bonded interfaces connected to some interest switch > >> > >topologies, > >> > >where 802.3ad isn't an option. They're currently running a proprietary > >> > >solution that effectively achieves load-balancing of VMs and bandwidth > >> > >utilization improvements with a similar form of transmission algorithm. > >> > > > >> > >Basically, each VM has it's own vlan, so it always sends its traffic out > >> > >the same interface, unless that interface fails. Traffic gets split > >> > >between the interfaces, maintaining a consistent path, with failover > >> > >still > >> > >available if an interface goes down. > >> > > > >> > >This has been rudimetarily tested to provide similar results, suitable > >> > >for > >> > >them to use to move off their current proprietary solution. > >> > > > >> > >Still on the TODO list, if these even looks sane to begin with, is > >> > >fleshing out Documentation/networking/bonding.rst. > >> > > >> > I'm sure you're aware, but any final submission will also need > >> > to include netlink and iproute2 support. > >> > >> I believe everything for netlink support is already included, but I'll > >> double-check that before submitting something for inclusion consideration. > > > >I'm not certain if what you actually meant was that I'd have to patch > >iproute2 as well, which I've definitely stumbled onto today, but it's a > >2-line patch, and everything seems to be working fine with it: > > Yes, that's what I meant. > > >$ sudo ip link set bond0 type bond xmit_hash_policy 5 > > Does the above work with the text label (presumably "vlansrc") > as well as the number, and does "ip link add test type bond help" print > the correct text for XMIT_HASH_POLICY? All of the above looks correct to me, output below. Before submitting... Could rename it from vlansrc to vlan+srcmac or some variation thereof if it's desired. I tried to keep it relatively short, but it's perhaps a bit less succinct like I have it now, and other modes include a +. $ sudo modprobe bonding mode=2 max_bonds=1 xmit_hash_policy=0 $ sudo ip link set bond0 type bond xmit_hash_policy vlansrc $ cat /proc/net/bonding/bond0 Ethernet Channel Bonding Driver: v4.18.0-272.el8.vstx.x86_64 Bonding Mode: load balancing (xor) Transmit Hash Policy: vlansrc (5) MII Status: down MII Polling Interval (ms): 0 Up Delay (ms): 0 Down Delay (ms): 0 Peer Notification Delay (ms): 0 $ sudo ip link add test type bond help Usage: ... bond [ mode BONDMODE ] [ active_slave SLAVE_DEV ] [ clear_active_slave ] [ miimon MIIMON ] [ updelay UPDELAY ] [ downdelay DOWNDELAY ] [ peer_notify_delay DELAY ] [ use_carrier USE_CARRIER ] [ arp_interval ARP_INTERVAL ] [ arp_validate ARP_VALIDATE ] [ arp_all_targets ARP_ALL_TARGETS ] [ arp_ip_target [ ARP_IP_TARGET, ... ] ] [ primary SLAVE_DEV ] [ primary_reselect PRIMARY_RESELECT ] [ fail_over_mac FAIL_OVER_MAC ] [ xmit_hash_policy XMIT_HASH_POLICY ] [ resend_igmp RESEND_IGMP ] [ num_grat_arp|num_unsol_na NUM_GRAT_ARP|NUM_UNSOL_NA ] [ all_slaves_active ALL_SLAVES_ACTIVE ] [ min_links MIN_LINKS ] [ lp_interval LP_INTERVAL ] [ packets_per_slave PACKETS_PER_SLAVE ] [ tlb_dynamic_lb TLB_DYNAMIC_LB ] [ lacp_rate LACP_RATE ] [ ad_select AD_SELECT ] [ ad_user_port_key PORTKEY ] [ ad_actor_sys_prio SYSPRIO ] [ ad_actor_system LLADDR ] BONDMODE := balance-rr|active-backup|balance-xor|broadcast|802.3ad|balance-tlb|balance-alb ARP_VALIDATE := none|active|backup|all ARP_ALL_TARGETS := any|all PRIMARY_RESELECT := always|better|failure FAIL_OVER_MAC := none|active|follow XMIT_HASH_POLICY := layer2|layer2+3|layer3+4|encap2+3|encap3+4|vlansrc LACP_RATE := slow|fast AD_SELECT := stable|bandwidth|count -- Jarod Wilson ja...@redhat.com
Re: [PATCH net-next v15 5/6] net: dsa: mv88e6xxx: Add support for mv88e6393x family of Marvell
On Tue, 12 Jan 2021 20:54:04 +0100 Marek Behún wrote: > + /* mv88e6393x family errata 3.7 : > + * When changing cmode on SERDES port from any other mode to 1000BASE-X > + * mode the link may not come up due to invalid 1000BASE-X > + * advertisement. > + * Workaround: Correct advertisement and reset PHY core. > + */ > + if (cmode == MV88E6XXX_PORT_STS_CMODE_1000BASEX) { > + reg = MV88E6390_SGMII_ANAR_1000BASEX_FD; > + err = mv88e6390_serdes_write(chip, lane, MDIO_MMD_PHYXS, > + MV88E6390_SGMII_ANAR, reg); > + if (err) > + return err; > + > + /* soft reset the PCS/PMA */ > + err = mv88e6390_serdes_read(chip, lane, MDIO_MMD_PHYXS, > + MV88E6390_SGMII_CONTROL, ®); > + if (err) > + return err; > + > + reg |= MV88E6390_SGMII_CONTROL_RESET; > + err = mv88e6390_serdes_write(chip, lane, MDIO_MMD_PHYXS, > + MV88E6390_SGMII_CONTROL, reg); > + if (err) > + return err; It would seem that this is already done in mv88e6390_serdes_pcs_config, just without the last reset. > +#define MV88E6390_SGMII_STATUS_AN_ABLE BIT(3) > +#define MV88E6390_SGMII_ANAR 0x2004 > +#define MV88E6390_SGMII_ANAR_1000BASEX_FDBIT(5) > +#define MV88E6390_SGMII_CONTROL 0x2000 This register is already called MV88E6390_SGMII_BMCR and the bits are defined as BMCR_* macros. Thse same for MV88E6390_SGMII_STATUS and MV88E6390_SGMII_ANAR. > +#define MV88E6390_SGMII_CONTROL_RESETBIT(15) > +#define MV88E6390_SGMII_CONTROL_LOOPBACK BIT(14) > +#define MV88E6390_SGMII_CONTROL_PDOWNBIT(11) > +#define MV88E6390_SGMII_STATUS 0x2001 I shall fix this in another version.
Re: [PATCH net-next] net: ipa: add config dependency on QCOM_SMEM
On Tue 12 Jan 13:21 CST 2021, Alex Elder wrote: > The IPA driver depends on some SMEM functionality (qcom_smem_init(), > qcom_smem_alloc(), and qcom_smem_virt_to_phys()), but this is not > reflected in the configuration dependencies. Add a dependency on > QCOM_SMEM to avoid attempts to build the IPA driver without SMEM. > This avoids a link error for certain configurations. > > Reported-by: Randy Dunlap > Fixes: 38a4066f593c5 ("net: ipa: support COMPILE_TEST") > Signed-off-by: Alex Elder Reviewed-by: Bjorn Andersson Regards, Bjorn > --- > drivers/net/ipa/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/ipa/Kconfig b/drivers/net/ipa/Kconfig > index 10a0e041ee775..b68f1289b89ef 100644 > --- a/drivers/net/ipa/Kconfig > +++ b/drivers/net/ipa/Kconfig > @@ -1,6 +1,6 @@ > config QCOM_IPA > tristate "Qualcomm IPA support" > - depends on 64BIT && NET > + depends on 64BIT && NET && QCOM_SMEM > depends on ARCH_QCOM || COMPILE_TEST > depends on QCOM_RPROC_COMMON || (QCOM_RPROC_COMMON=n && COMPILE_TEST) > select QCOM_MDT_LOADER if ARCH_QCOM > -- > 2.20.1 >
[PATCH net-next] net: phy: ar803x: disable extended next page bit
This bit is enabled by default and advertises support for extended next page support. XNP is only needed for 10GBase-T and MultiGig support which is not supported. Additionally, Cisco MultiGig switches will read this bit and attempt 10Gb negotiation even though Next Page support is disabled. This will cause timeouts when the interface is forced to 100Mbps and auto-negotiation will fail. The interfaces are only 1000Base-T and supporting auto-negotiation for this only requires the Next Page bit to be set. Taken from: https://github.com/SolidRun/linux-stable/commit/7406c5244b7ea6bc17a2afe8568277a8c4b126a9 and adapted to mainline kernels by rmk. Signed-off-by: Russell King --- drivers/net/phy/at803x.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/phy/at803x.c b/drivers/net/phy/at803x.c index 9636edb8d618..3909dc9fc94b 100644 --- a/drivers/net/phy/at803x.c +++ b/drivers/net/phy/at803x.c @@ -587,7 +587,13 @@ static int at803x_config_init(struct phy_device *phydev) return ret; } - return 0; + /* Ar803x extended next page bit is enabled by default. Cisco +* multigig switches read this bit and attempt to negotiate 10Gbps +* rates even if the next page bit is disabled. This is incorrect +* behaviour but we still need to accomodate it. XNP is only needed +* for 10Gbps support, so disable XNP. +*/ + return phy_modify(phydev, MII_ADVERTISE, MDIO_AN_CTRL1_XNP, 0); } static int at803x_ack_interrupt(struct phy_device *phydev) -- 2.20.1
Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support
On Tue, 12 Jan 2021 04:17:44 + Geethasowjanya Akula wrote: > Hi All, > > Any feedback on this patch. I think Dave merged it already, see commit 81a4362016e7 ("octeontx2-pf: Add RSS multi group support") in net-next.
Re: [net] net: feature check mandating HW_CSUM is wrong
On Wed, 13 Jan 2021 02:47:51 +0530 rohit maheshwari wrote: > On 07/01/21 12:47 AM, Jakub Kicinski wrote: > > On Wed, 6 Jan 2021 23:23:27 +0530 Rohit Maheshwari wrote: > >> Mandating NETIF_F_HW_CSUM to enable TLS offload feature is wrong. > >> And it broke tls offload feature for the drivers, which are still > >> using NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM. We should use > >> NETIF_F_CSUM_MASK instead. > >> > >> Fixes: ae0b04b238e2 ("net: Disable NETIF_F_HW_TLS_TX when HW_CSUM is > >> disabled") > >> Signed-off-by: Rohit Maheshwari > > Please use Tariq's suggestion. > HW_TLS_TX feature is for both IPv4/v6. And If one device is limited to > support only IPv4 checksum offload, TLS offload should be allowed for > that too. So I think putting a check of CSUM_MASK is enough. If Tariq does not disagree please repost.
Re: [PATCH net-next 0/7] a set of fixes of coding style
On Tue, 2021-01-12 at 00:42 -0600, Lijun Pan wrote: > This series address several coding style problems. > > Lijun Pan (7): > ibmvnic: prefer 'unsigned long' over 'unsigned long int' > ibmvnic: fix block comments > ibmvnic: fix braces > ibmvnic: avoid multiple line dereference > ibmvnic: fix miscellaneous checks > ibmvnic: add comments for spinlock_t definitions > ibmvnic: remove unused spinlock_t stats_lock definition > > drivers/net/ethernet/ibm/ibmvnic.c | 65 +--- > -- > drivers/net/ethernet/ibm/ibmvnic.h | 11 ++--- > 2 files changed, 35 insertions(+), 41 deletions(-) > Reviewed-by: Saeed Mahameed
Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support
On Mon, 2021-01-04 at 12:50 +0530, Geetha sowjanya wrote: > Hardware supports 8 RSS groups per interface. Currently we are using > only group '0'. This patch allows user to create new RSS > groups/contexts > and use the same as destination for flow steering rules. > > usage: > To steer the traffic to RQ 2,3 > > ethtool -X eth0 weight 0 0 1 1 context new > (It will print the allocated context id number) > New RSS context is 1 > > ethtool -N eth0 flow-type tcp4 dst-port 80 context 1 loc 1 > > To delete the context > ethtool -X eth0 context 1 delete > > When an RSS context is removed, the active classification > rules using this context are also removed. > > Change-log: > > v4 > - Fixed compiletime warning. > - Address Saeed's comments on v3. > This patch is marked as accepted in patchwork https://patchwork.kernel.org/project/netdevbpf/patch/20210104072039.27297-1-gak...@marvell.com/ but it is not actually applied, maybe resend.. you can add: Reviewed-by: Saeed Mahameed
Re: [PATCv4 net-next] octeontx2-pf: Add RSS multi group support
On Tue, 2021-01-12 at 15:16 -0800, Saeed Mahameed wrote: > On Mon, 2021-01-04 at 12:50 +0530, Geetha sowjanya wrote: > > Hardware supports 8 RSS groups per interface. Currently we are > > using > > only group '0'. This patch allows user to create new RSS > > groups/contexts > > and use the same as destination for flow steering rules. > > > > usage: > > To steer the traffic to RQ 2,3 > > > > ethtool -X eth0 weight 0 0 1 1 context new > > (It will print the allocated context id number) > > New RSS context is 1 > > > > ethtool -N eth0 flow-type tcp4 dst-port 80 context 1 loc 1 > > > > To delete the context > > ethtool -X eth0 context 1 delete > > > > When an RSS context is removed, the active classification > > rules using this context are also removed. > > > > Change-log: > > > > v4 > > - Fixed compiletime warning. > > - Address Saeed's comments on v3. > > > > This patch is marked as accepted in patchwork > https://patchwork.kernel.org/project/netdevbpf/patch/20210104072039.27297-1-gak...@marvell.com/ > > but it is not actually applied, maybe resend.. > > > you can add: > Reviewed-by: Saeed Mahameed > > I missed Jakub's comment, The patch is already applied, i looked at the wrong tree. Thanks.
Re: [PATCH v3 bpf-next 5/7] bpf: support BPF ksym variables in kernel modules
On Tue, Jan 12, 2021 at 8:30 AM Daniel Borkmann wrote: > > On 1/12/21 8:55 AM, Andrii Nakryiko wrote: > > Add support for directly accessing kernel module variables from BPF programs > > using special ldimm64 instructions. This functionality builds upon vmlinux > > ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow > > specifying kernel module BTF's FD in insn[1].imm field. > > > > During BPF program load time, verifier will resolve FD to BTF object and > > will > > take reference on BTF object itself and, for module BTFs, corresponding > > module > > as well, to make sure it won't be unloaded from under running BPF program. > > The > > mechanism used is similar to how bpf_prog keeps track of used bpf_maps. > > > > One interesting change is also in how per-CPU variable is determined. The > > logic is to find .data..percpu data section in provided BTF, but both > > vmlinux > > and module each have their own .data..percpu entries in BTF. So for module's > > case, the search for DATASEC record needs to look at only module's added BTF > > types. This is implemented with custom search function. > > > > Acked-by: Yonghong Song > > Acked-by: Hao Luo > > Signed-off-by: Andrii Nakryiko > [...] > > + > > +struct module *btf_try_get_module(const struct btf *btf) > > +{ > > + struct module *res = NULL; > > +#ifdef CONFIG_DEBUG_INFO_BTF_MODULES > > + struct btf_module *btf_mod, *tmp; > > + > > + mutex_lock(&btf_module_mutex); > > + list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) { > > + if (btf_mod->btf != btf) > > + continue; > > + > > + if (try_module_get(btf_mod->module)) > > + res = btf_mod->module; > > One more thought (follow-up would be okay I'd think) ... when a module > references > a symbol from another module, it similarly needs to bump the refcount of the > module > that is owning it and thus disallowing to unload for that other module's > lifetime. > That usage dependency is visible via /proc/modules however, so if unload > doesn't work > then lsmod allows a way to introspect that to the user. This seems to be > achieved via > resolve_symbol() where it records its dependency/usage. Would be great if we > could at > some point also include the BPF prog name into that list so that this is more > obvious. > Wdyt? I thought about it as well, but plenty of kernel things just grab the ref of ko and don't add any way to introspect what piece of kernel is holding ko. So this case won't be the first. Also if we add it for bpf progs it could be confusing in lsmod. Since it currently only shows other ko-s in there. Long ago I had an awk script to parse that output to rmmod dependent modules before rmmoding the main one. If somebody doing something like this bpf prog names in the same place may break things. So I think there are more cons than pros. That is certainly a follow up if we agree on the direction.
Re: [PATCH] mm: net: memcg accounting for TCP rx zerocopy
On Tue, Jan 12, 2021 at 01:41:05PM -0800, Shakeel Butt wrote: > From: Arjun Roy > > TCP zerocopy receive is used by high performance network applications to > further scale. For RX zerocopy, the memory containing the network data > filled by network driver is directly mapped into the address space of > high performance applications. To keep the TLB cost low, these > applications unmaps the network memory in big batches. So, this memory > can remain mapped for long time. This can cause memory isolation issue > as this memory becomes unaccounted after getting mapped into the > application address space. This patch adds the memcg accounting for such > memory. > > Accounting the network memory comes with its own unique challenge. The > high performance NIC drivers use page pooling to reuse the pages to > eliminate/reduce the expensive setup steps like IOMMU. These drivers > keep an extra reference on the pages and thus we can not depends on the > page reference for the uncharging. The page in the pool may keep a memcg > pinned for arbitrary long time or may get used by other memcg. > > This patch decouples the uncharging of the page from the refcnt and > associate it with the map count i.e. the page gets uncharged when the > last address space unmaps it. Now the question what if the driver drops > its reference while the page is still mapped. That is fine as the > address space also holds a reference to the page i.e. the reference > count can not drop to zero before the map count. > > Signed-off-by: Arjun Roy > Co-developed-by: Shakeel Butt > Signed-off-by: Shakeel Butt > --- > include/linux/memcontrol.h | 34 +++-- > mm/memcontrol.c| 60 ++ > mm/rmap.c | 3 ++ > net/ipv4/tcp.c | 27 + > 4 files changed, 116 insertions(+), 8 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 7a38a1517a05..0b0e3b4615cf 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -349,11 +349,13 @@ extern struct mem_cgroup *root_mem_cgroup; > > enum page_memcg_data_flags { > /* page->memcg_data is a pointer to an objcgs vector */ > - MEMCG_DATA_OBJCGS = (1UL << 0), > + MEMCG_DATA_OBJCGS = (1UL << 0), > /* page has been accounted as a non-slab kernel page */ > - MEMCG_DATA_KMEM = (1UL << 1), > + MEMCG_DATA_KMEM = (1UL << 1), > + /* page has been accounted as network memory */ > + MEMCG_DATA_SOCK = (1UL << 2), > /* the next bit after the last actual flag */ > - __NR_MEMCG_DATA_FLAGS = (1UL << 2), > + __NR_MEMCG_DATA_FLAGS = (1UL << 3), > }; > > #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1) > @@ -444,6 +446,11 @@ static inline bool PageMemcgKmem(struct page *page) > return page->memcg_data & MEMCG_DATA_KMEM; > } > > +static inline bool PageMemcgSock(struct page *page) > +{ > + return page->memcg_data & MEMCG_DATA_SOCK; > +} > + > #ifdef CONFIG_MEMCG_KMEM > /* > * page_objcgs - get the object cgroups vector associated with a page > @@ -1095,6 +1102,11 @@ static inline bool PageMemcgKmem(struct page *page) > return false; > } > > +static inline bool PageMemcgSock(struct page *page) > +{ > + return false; > +} > + > static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) > { > return true; > @@ -1561,6 +1573,10 @@ extern struct static_key_false > memcg_sockets_enabled_key; > #define mem_cgroup_sockets_enabled > static_branch_unlikely(&memcg_sockets_enabled_key) > void mem_cgroup_sk_alloc(struct sock *sk); > void mem_cgroup_sk_free(struct sock *sk); > +int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, struct page > **pages, > + unsigned int nr_pages); > +void mem_cgroup_uncharge_sock_pages(struct page **pages, unsigned int > nr_pages); > + > static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg) > { > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure) > @@ -1589,6 +1605,18 @@ static inline void memcg_set_shrinker_bit(struct > mem_cgroup *memcg, > int nid, int shrinker_id) > { > } > + > +static inline int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, > +struct page **pages, > +unsigned int nr_pages) > +{ > + return 0; > +} > + > +static inline void mem_cgroup_uncharge_sock_pages(struct page **pages, > + unsigned int nr_pages) > +{ > +} > #endif > > #ifdef CONFIG_MEMCG_KMEM > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index db9836f4b64b..38e94538e081 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -7061,6 +7061,66 @@ void mem_cgroup_uncharge_skmem(struct mem_cgroup > *memcg, unsigned int nr_pages) > refill_st
Re: (subset) [PATCH 00/10] Remove support for TX49xx
On Tue, 5 Jan 2021 15:02:45 +0100, Thomas Bogendoerfer wrote: > I couldn't find any buyable product other than reference boards using > TX49xx CPUs. And since nobody showed interest in keeping support for > it, it's time to remove it. > > I've split up the removal into seperate parts for different maintainers. > So if the patch fits your needs, please take it via your tree or > give me an ack so I can apply them the mips-next tree. > > [...] Applied, thanks! [08/10] rtc: tx4939: Remove driver commit: 446667df283002fdda0530523347ffd1cf053373 Best regards, -- Alexandre Belloni
Re: [PATCH] tcp: keepalive fixes
Hi, Yuchung: I have attached the python script that reproduces the keepalive issues. The script is a slight modification of the one written by Marek Majkowski: https://github.com/cloudflare/cloudflare-blog/blob/master/2019-09-tcp-keepalives/test-zero.py Please note that only the TCP keepalive is configured, and not the user timeout. Thanks. -- Enke On Tue, Jan 12, 2021 at 02:48:01PM -0800, Yuchung Cheng wrote: > On Tue, Jan 12, 2021 at 2:31 PM Enke Chen wrote: > > > > From: Enke Chen > > > > In this patch two issues with TCP keepalives are fixed: > > > > 1) TCP keepalive does not timeout when there are data waiting to be > >delivered and then the connection got broken. The TCP keepalive > >timeout is not evaluated in that condition. > hi enke > Do you have an example to demonstrate this issue -- in theory when > there is data inflight, an RTO timer should be pending (which > considers user-timeout setting). based on the user-timeout description > (man tcp), the user timeout should abort the socket per the specified > time after data commences. some data would help to understand the > issue. > -- #! /usr/bin/python import io import os import select import socket import time import utils import ctypes utils.new_ns() port = 1 s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) s.bind(('127.0.0.1', port)) s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024) s.listen(16) tcpdump = utils.tcpdump_start(port) c = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0) c.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 1024) c.connect(('127.0.0.1', port)) x, _ = s.accept() if False: c.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 90*1000) if True: c.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1) c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5) c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 10) c.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10) time.sleep(0.2) print("[ ] c.send()") import fcntl TIOCOUTQ=0x5411 c.setblocking(False) while True: bytes_avail = ctypes.c_int() fcntl.ioctl(c.fileno(), TIOCOUTQ, bytes_avail) if bytes_avail.value > 64*1024: break try: c.send(b"A" * 16384 * 4) except io.BlockingIOError: break c.setblocking(True) time.sleep(0.2) utils.ss(port) utils.check_buffer(c) t0 = time.time() if True: utils.drop_start(dport=port) utils.drop_start(sport=port) poll = select.poll() poll.register(c, select.POLLIN) poll.poll() utils.ss(port) e = c.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) print("[ ] SO_ERROR = %s" % (e,)) t1 = time.time() print("[ ] took: %f seconds" % (t1-t0,))
Re: [PATCH] kernel: trace: uprobe: Fix word to the correct spelling
On 1/11/21 8:50 PM, Bhaskar Chowdhury wrote: > s/controling/controlling/p > > Signed-off-by: Bhaskar Chowdhury Acked-by: Randy Dunlap Thanks. > --- > kernel/trace/trace_uprobe.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c > index 3cf7128e1ad3..55c6afd8cb27 100644 > --- a/kernel/trace/trace_uprobe.c > +++ b/kernel/trace/trace_uprobe.c > @@ -1635,7 +1635,7 @@ void destroy_local_trace_uprobe(struct trace_event_call > *event_call) > } > #endif /* CONFIG_PERF_EVENTS */ > > -/* Make a trace interface for controling probe points */ > +/* Make a trace interface for controlling probe points */ > static __init int init_uprobe_trace(void) > { > int ret; > -- > 2.26.2 > -- ~Randy You can't do anything without having to do something else first. -- Belefant's Law
mv88e6xxx: 2500base-x inband AN is broken on Amethyst? what to do?
Hello, it seems that inband AN is broken on Amethyst when on 2500base-x mode. Even SERDES scripts for Amethyst from Marvell has autonegotiation disabled for 2500base-x mode. For all the other supported Serdes modes autonegotiation is enabled in these scripts. The current implementation in mv88e6390_serdes_pcs_config() enables autonegotiation if phylink_autoneg_inband(mode) is true: if (phylink_autoneg_inband(mode)) bmcr = val | BMCR_ANENABLE; else bmcr = val & ~BMCR_ANENABLE; But for PHY_INTERFACE_MODE_2500BASEX this is broken on Amethyst. The 2500base-x mode seems to work only with autoneg disabled. The result is that when I connect via a passive SFP cable Amethyst serdes port with a Peridot serdes port, they will not link. If I disable autonegotiation on both sides, they will link, though. What is strange is that if I don't use Peridot, but connect the SFP directly to Serdes on Armada 3720, where the mvneta driver also enables autonegotiation for 2500base-x mode, they will link even if Amethyst does not enable 2500base-x. To summarize: Amethyst <-> Peridot AN -AN -works AN -AN +does not work Amethyst <-> Armada 3720 serdes AN -AN +works (It is possible that Marvell may find some workaround by touch some undocumented registers, to solve this. I will try to open a bug report.) Should we just print an error in the serdes_pcs_config method if inband autonegotiation is being requested? phylink's code currently allows connecting SFPs in non MLO_AN_INBAND mode only for when there is Broadcom BCM84881 PHY inside the SFP (by method phylink_phy_no_inband() in phylink.c). I wonder whether we can somehow in a sane way implement code to inform phylink from the mv88e6xxx driver that inband is not supported for the specific mode. Maybe the .mac_config/.pcs_config method could return an error indicating this? Or the mv88e6xxx driver can just print an error that the mode is not supported, and try to ask the user to disable AN? That would need implementing this in ethtool for SFP, though. What do you guys think? Marek
Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue
Saeed Mahameed [sa...@kernel.org] wrote: > On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote: > > @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev) > > return -EBUSY; > > } > > > > + /* If ibmvnic_reset() is scheduling a reset, wait for it to > > +* finish. Then prevent it from scheduling any more resets > > +* and have the reset functions ignore any resets that have > > +* already been scheduled. > > +*/ > > + spin_lock_irqsave(&adapter->remove_lock, flags); > > adapter->state = VNIC_REMOVING; > > + spin_unlock_irqrestore(&adapter->remove_lock, flags); > > + > > Why irqsave/restore variants ? are you expecting this spinlock to be > held in interruptcontext ? > > > spin_unlock_irqrestore(&adapter->state_lock, flags); Good question. One of the callers of ibmvnic_reset() is the ->ndo_tx_timeout() method which gets called from the watchdog timer. Thanks for the review. Sukadev
Re: [PATCH net] net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands
Hello: This patch was applied to netdev/net.git (refs/heads/master): On Mon, 11 Jan 2021 18:07:07 +0100 you wrote: > In commit 826f328e2b7e ("net: dcb: Validate netlink message in DCB > handler"), Linux started rejecting RTM_GETDCB netlink messages if they > contained a set-like DCB_CMD_ command. > > The reason was that privileges were only verified for RTM_SETDCB messages, > but the value that determined the action to be taken is the command, not > the message type. And validation of message type against the DCB command > was the obvious missing piece. > > [...] Here is the summary with links: - [net] net: dcb: Accept RTM_GETDCB messages carrying set-like DCB commands https://git.kernel.org/netdev/net/c/df85bc140a4d You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH net-next 0/7] a set of fixes of coding style
On Tue, 12 Jan 2021 00:42:58 -0600 Lijun Pan wrote: > This series address several coding style problems. Erm, are you sure this series will not conflict with the fixes Sukadev is sending?
Re: [PATCH] mm: net: memcg accounting for TCP rx zerocopy
On Tue, Jan 12, 2021 at 03:36:18PM -0800, Arjun Roy wrote: > On Tue, Jan 12, 2021 at 3:31 PM Roman Gushchin wrote: > > > > On Tue, Jan 12, 2021 at 01:41:05PM -0800, Shakeel Butt wrote: > > > From: Arjun Roy > > > > > > TCP zerocopy receive is used by high performance network applications to > > > further scale. For RX zerocopy, the memory containing the network data > > > filled by network driver is directly mapped into the address space of > > > high performance applications. To keep the TLB cost low, these > > > applications unmaps the network memory in big batches. So, this memory > > > can remain mapped for long time. This can cause memory isolation issue > > > as this memory becomes unaccounted after getting mapped into the > > > application address space. This patch adds the memcg accounting for such > > > memory. > > > > > > Accounting the network memory comes with its own unique challenge. The > > > high performance NIC drivers use page pooling to reuse the pages to > > > eliminate/reduce the expensive setup steps like IOMMU. These drivers > > > keep an extra reference on the pages and thus we can not depends on the > > > page reference for the uncharging. The page in the pool may keep a memcg > > > pinned for arbitrary long time or may get used by other memcg. > > > > > > This patch decouples the uncharging of the page from the refcnt and > > > associate it with the map count i.e. the page gets uncharged when the > > > last address space unmaps it. Now the question what if the driver drops > > > its reference while the page is still mapped. That is fine as the > > > address space also holds a reference to the page i.e. the reference > > > count can not drop to zero before the map count. > > > > > > Signed-off-by: Arjun Roy > > > Co-developed-by: Shakeel Butt > > > Signed-off-by: Shakeel Butt > > > --- > > > include/linux/memcontrol.h | 34 +++-- > > > mm/memcontrol.c| 60 ++ > > > mm/rmap.c | 3 ++ > > > net/ipv4/tcp.c | 27 + > > > 4 files changed, 116 insertions(+), 8 deletions(-) > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > > index 7a38a1517a05..0b0e3b4615cf 100644 > > > --- a/include/linux/memcontrol.h > > > +++ b/include/linux/memcontrol.h > > > @@ -349,11 +349,13 @@ extern struct mem_cgroup *root_mem_cgroup; > > > > > > enum page_memcg_data_flags { > > > /* page->memcg_data is a pointer to an objcgs vector */ > > > - MEMCG_DATA_OBJCGS = (1UL << 0), > > > + MEMCG_DATA_OBJCGS = (1UL << 0), > > > /* page has been accounted as a non-slab kernel page */ > > > - MEMCG_DATA_KMEM = (1UL << 1), > > > + MEMCG_DATA_KMEM = (1UL << 1), > > > + /* page has been accounted as network memory */ > > > + MEMCG_DATA_SOCK = (1UL << 2), > > > /* the next bit after the last actual flag */ > > > - __NR_MEMCG_DATA_FLAGS = (1UL << 2), > > > + __NR_MEMCG_DATA_FLAGS = (1UL << 3), > > > }; > > > > > > #define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1) > > > @@ -444,6 +446,11 @@ static inline bool PageMemcgKmem(struct page *page) > > > return page->memcg_data & MEMCG_DATA_KMEM; > > > } > > > > > > +static inline bool PageMemcgSock(struct page *page) > > > +{ > > > + return page->memcg_data & MEMCG_DATA_SOCK; > > > +} > > > + > > > #ifdef CONFIG_MEMCG_KMEM > > > /* > > > * page_objcgs - get the object cgroups vector associated with a page > > > @@ -1095,6 +1102,11 @@ static inline bool PageMemcgKmem(struct page *page) > > > return false; > > > } > > > > > > +static inline bool PageMemcgSock(struct page *page) > > > +{ > > > + return false; > > > +} > > > + > > > static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) > > > { > > > return true; > > > @@ -1561,6 +1573,10 @@ extern struct static_key_false > > > memcg_sockets_enabled_key; > > > #define mem_cgroup_sockets_enabled > > > static_branch_unlikely(&memcg_sockets_enabled_key) > > > void mem_cgroup_sk_alloc(struct sock *sk); > > > void mem_cgroup_sk_free(struct sock *sk); > > > +int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, struct page > > > **pages, > > > + unsigned int nr_pages); > > > +void mem_cgroup_uncharge_sock_pages(struct page **pages, unsigned int > > > nr_pages); > > > + > > > static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup > > > *memcg) > > > { > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && > > > memcg->tcpmem_pressure) > > > @@ -1589,6 +1605,18 @@ static inline void memcg_set_shrinker_bit(struct > > > mem_cgroup *memcg, > > > int nid, int shrinker_id) > > > { > > > } > > > + > > > +static inline int mem_cgroup_charge_sock_pages(struct mem_cgroup *memcg, > > > +struct page **pages, > > > +
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich wrote: > > On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich > wrote: > > > > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich > > wrote: > > > > > > Existing TUN module is able to use provided "steering eBPF" to > > > calculate per-packet hash and derive the destination queue to > > > place the packet to. The eBPF uses mapped configuration data > > > containing a key for hash calculation and indirection table > > > with array of queues' indices. > > > > > > This series of patches adds support for virtio-net hash reporting > > > feature as defined in virtio specification. It extends the TUN module > > > and the "steering eBPF" as follows: > > > > > > Extended steering eBPF calculates the hash value and hash type, keeps > > > hash value in the skb->hash and returns index of destination virtqueue > > > and the type of the hash. TUN module keeps returned hash type in > > > (currently unused) field of the skb. > > > skb->__unused renamed to 'hash_report_type'. > > > > > > When TUN module is called later to allocate and fill the virtio-net > > > header and push it to destination virtqueue it populates the hash > > > and the hash type into virtio-net header. > > > > > > VHOST driver is made aware of respective virtio-net feature that > > > extends the virtio-net header to report the hash value and hash report > > > type. > > > > Comment from Willem de Bruijn: > > > > Skbuff fields are in short supply. I don't think we need to add one > > just for this narrow path entirely internal to the tun device. > > > > We understand that and try to minimize the impact by using an already > existing unused field of skb. Not anymore. It was repurposed as a flags field very recently. This use case is also very narrow in scope. And a very short path from data producer to consumer. So I don't think it needs to claim scarce bits in the skb. tun_ebpf_select_queue stores the field, tun_put_user reads it and converts it to the virtio_net_hdr in the descriptor. tun_ebpf_select_queue is called from .ndo_select_queue. Storing the field in skb->cb is fragile, as in theory some code could overwrite that between field between ndo_select_queue and ndo_start_xmit/tun_net_xmit, from which point it is fully under tun control again. But in practice, I don't believe anything does. Alternatively an existing skb field that is used only on disjoint datapaths, such as ingress-only, could be viable. > > Instead, you could just run the flow_dissector in tun_put_user if the > > feature is negotiated. Indeed, the flow dissector seems more apt to me > > than BPF here. Note that the flow dissector internally can be > > overridden by a BPF program if the admin so chooses. > > > When this set of patches is related to hash delivery in the virtio-net > packet in general, > it was prepared in context of RSS feature implementation as defined in > virtio spec [1] > In case of RSS it is not enough to run the flow_dissector in tun_put_user: > in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, > hash type and queue index > according to the (mapped) parameters (key, hash types, indirection > table) received from the guest. TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash. What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used. In datapaths where the NIC has already computed the four-tuple hash and stored it in skb->hash --the common case for servers--, That type field is the only reason to have to compute again. > Our intention is to keep the hash and hash type in the skb to populate them > into a virtio-net header later in tun_put_user. > Note that in this case the type of calculated hash is selected not > only from flow dissections > but also from limitations provided by the guest. > > This is already implemented in qemu (for case of vhost=off), see [2] > (virtio_net_process_rss) > For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN. > Note that exact way of selecting rx virtqueue depends on the guest, > it could be automatic steering (typical for Linux VM), RSS (typical > for Windows VM) or > any other steering mechanism implemented in loadable TUN steering BPF with > or without hash calculation. > > [1] https://github.com/oasis-tcs/virtio-spec/blob/master/content.tex#L3740 > [2] https://github.com/qemu/qemu/blob/master/hw/net/virtio-net.c#L1591 > > > This also hits on a deeper point with the choice of hash values, that > > I also noticed in my RFC patchset to implement the inverse [1][2]. It > > i
Re: [PATCH] bpf: Hoise pahole version checks into Kconfig
On Mon, Jan 11, 2021 at 7:06 PM Nathan Chancellor wrote: > > After commit da5fb18225b4 ("bpf: Support pre-2.25-binutils objcopy for > vmlinux BTF"), having CONFIG_DEBUG_INFO_BTF enabled but lacking a valid > copy of pahole results in a kernel that will fully compile but fail to > link. The user then has to either install pahole or disable > CONFIG_DEBUG_INFO_BTF and rebuild the kernel but only after their build > has failed, which could have been a significant amount of time depending > on the hardware. > > Avoid a poor user experience and require pahole to be installed with an > appropriate version to select and use CONFIG_DEBUG_INFO_BTF, which is > standard for options that require a specific tools version. > > Suggested-by: Sedat Dilek > Signed-off-by: Nathan Chancellor Thanks for the patch, Nathan, Might be good to gave a hint to the user if pahole version does not match requirements? Feel free to add my: Tested-by: Sedat Dilek - Sedat - > --- > MAINTAINERS | 1 + > init/Kconfig | 4 > lib/Kconfig.debug | 6 ++ > scripts/link-vmlinux.sh | 13 - > scripts/pahole-version.sh | 16 > 5 files changed, 23 insertions(+), 17 deletions(-) > create mode 100755 scripts/pahole-version.sh > > diff --git a/MAINTAINERS b/MAINTAINERS > index b8db7637263a..6f6e24285a94 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -3282,6 +3282,7 @@ F:net/core/filter.c > F: net/sched/act_bpf.c > F: net/sched/cls_bpf.c > F: samples/bpf/ > +F: scripts/pahole-version.sh > F: tools/bpf/ > F: tools/lib/bpf/ > F: tools/testing/selftests/bpf/ > diff --git a/init/Kconfig b/init/Kconfig > index b77c60f8b963..872c61b5d204 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -74,6 +74,10 @@ config TOOLS_SUPPORT_RELR > config CC_HAS_ASM_INLINE > def_bool $(success,echo 'void foo(void) { asm inline (""); }' | $(CC) > -x c - -c -o /dev/null) > > +config PAHOLE_VERSION > + int > + default $(shell,$(srctree)/scripts/pahole-version.sh $(PAHOLE)) > + > config CONSTRUCTORS > bool > depends on !UML > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index 7937265ef879..70c446af9664 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -267,6 +267,7 @@ config DEBUG_INFO_DWARF4 > > config DEBUG_INFO_BTF > bool "Generate BTF typeinfo" > + depends on PAHOLE_VERSION >= 116 > depends on !DEBUG_INFO_SPLIT && !DEBUG_INFO_REDUCED > depends on !GCC_PLUGIN_RANDSTRUCT || COMPILE_TEST > help > @@ -274,12 +275,9 @@ config DEBUG_INFO_BTF > Turning this on expects presence of pahole tool, which will convert > DWARF type info into equivalent deduplicated BTF type info. > > -config PAHOLE_HAS_SPLIT_BTF > - def_bool $(success, test `$(PAHOLE) --version | sed -E > 's/v([0-9]+)\.([0-9]+)/\1\2/'` -ge "119") > - > config DEBUG_INFO_BTF_MODULES > def_bool y > - depends on DEBUG_INFO_BTF && MODULES && PAHOLE_HAS_SPLIT_BTF > + depends on DEBUG_INFO_BTF && MODULES && PAHOLE_VERSION >= 119 > help > Generate compact split BTF type information for kernel modules. > > diff --git a/scripts/link-vmlinux.sh b/scripts/link-vmlinux.sh > index 6eded325c837..eef40fa9485d 100755 > --- a/scripts/link-vmlinux.sh > +++ b/scripts/link-vmlinux.sh > @@ -139,19 +139,6 @@ vmlinux_link() > # ${2} - file to dump raw BTF data into > gen_btf() > { > - local pahole_ver > - > - if ! [ -x "$(command -v ${PAHOLE})" ]; then > - echo >&2 "BTF: ${1}: pahole (${PAHOLE}) is not available" > - return 1 > - fi > - > - pahole_ver=$(${PAHOLE} --version | sed -E > 's/v([0-9]+)\.([0-9]+)/\1\2/') > - if [ "${pahole_ver}" -lt "116" ]; then > - echo >&2 "BTF: ${1}: pahole version $(${PAHOLE} --version) is > too old, need at least v1.16" > - return 1 > - fi > - > vmlinux_link ${1} > > info "BTF" ${2} > diff --git a/scripts/pahole-version.sh b/scripts/pahole-version.sh > new file mode 100755 > index ..6de6f734a345 > --- /dev/null > +++ b/scripts/pahole-version.sh > @@ -0,0 +1,16 @@ > +#!/bin/sh > +# SPDX-License-Identifier: GPL-2.0 > +# > +# Usage: $ ./scripts/pahole-version.sh pahole > +# > +# Print the pahole version as a three digit string > +# such as `119' for pahole v1.19 etc. > + > +pahole="$*" > + > +if ! [ -x "$(command -v $pahole)" ]; then > +echo 0 > +exit 1 > +fi > + > +$pahole --version | sed -E 's/v([0-9]+)\.([0-9]+)/\1\2/' > > base-commit: e22d7f05e445165e58feddb4e40cc9c0f94453bc > -- > 2.30.0 >
Re: [PATCH v0 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.
On Tue, 12 Jan 2021 11:55:11 -0800 Praveen Chaudhary wrote: > Hi Jakub > > Thanks for the review, > > Sure, I will reraise the patch (again v0i, sonce no code changes) after > adding space before '<'. > > This patch adds lines in 'include/uapi/', that requires ABI version changes > for debian build. I am not sure, if we need any such changes to avoid > breaking allmodconfig. It will be really helpful, if you can look at the > patch once 'https://lkml.org/lkml/2021/1/11/1668' and suggest on this. Thanks > a lot again. The code doesn't build, AFAIK it's because: net/ipv6/ndisc.c:1322:8: error: too few arguments to function ‘rt6_add_dflt_router’
[PATCH bpf-next] bpf: reject too big ctx_size_in for raw_tp test run
syzbot reported a WARNING for allocating too big memory: WARNING: CPU: 1 PID: 8484 at mm/page_alloc.c:4976 __alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:5011 Modules linked in: CPU: 1 PID: 8484 Comm: syz-executor862 Not tainted 5.11.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:__alloc_pages_nodemask+0x5f8/0x730 mm/page_alloc.c:4976 Code: 00 00 0c 00 0f 85 a7 00 00 00 8b 3c 24 4c 89 f2 44 89 e6 c6 44 24 70 00 48 89 6c 24 58 e8 d0 d7 ff ff 49 89 c5 e9 ea fc ff ff <0f> 0b e9 b5 fd ff ff 89 74 24 14 4c 89 4c 24 08 4c 89 74 24 18 e8 RSP: 0018:c900012efb10 EFLAGS: 00010246 RAX: RBX: 19200025df66 RCX: RDX: RSI: dc00 RDI: 00140dc0 RBP: 00140dc0 R08: R09: R10: 81b1f7e1 R11: R12: 0014 R13: 0014 R14: R15: FS: 0190c880() GS:8880b9e0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7f08b7f316c0 CR3: 12073000 CR4: 001506f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: alloc_pages_current+0x18c/0x2a0 mm/mempolicy.c:2267 alloc_pages include/linux/gfp.h:547 [inline] kmalloc_order+0x2e/0xb0 mm/slab_common.c:837 kmalloc_order_trace+0x14/0x120 mm/slab_common.c:853 kmalloc include/linux/slab.h:557 [inline] kzalloc include/linux/slab.h:682 [inline] bpf_prog_test_run_raw_tp+0x4b5/0x670 net/bpf/test_run.c:282 bpf_prog_test_run kernel/bpf/syscall.c:3120 [inline] __do_sys_bpf+0x1ea9/0x4f10 kernel/bpf/syscall.c:4398 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x440499 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:7ffe1f3bfb18 EFLAGS: 0246 ORIG_RAX: 0141 RAX: ffda RBX: 004002c8 RCX: 00440499 RDX: 0048 RSI: 2600 RDI: 000a RBP: 006ca018 R08: R09: 004002c8 R10: R11: 0246 R12: 00401ca0 R13: 00401d30 R14: R15: This is because we didn't filter out too big ctx_size_in. Fix it by rejecting ctx_size_in that are bigger than MAX_BPF_FUNC_ARGS (12) u64 numbers. Reported-by: syzbot+4f98876664c7337a4...@syzkaller.appspotmail.com Fixes: 1b4d60ec162f ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint") Cc: sta...@vger.kernel.org # v5.10+ Signed-off-by: Song Liu --- net/bpf/test_run.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index 23dfb2010ba69..58bcb8c849d54 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -272,7 +272,8 @@ int bpf_prog_test_run_raw_tp(struct bpf_prog *prog, kattr->test.repeat) return -EINVAL; - if (ctx_size_in < prog->aux->max_ctx_offset) + if (ctx_size_in < prog->aux->max_ctx_offset || + ctx_size_in > MAX_BPF_FUNC_ARGS * sizeof(u64)) return -EINVAL; if ((kattr->test.flags & BPF_F_TEST_RUN_ON_CPU) == 0 && cpu != 0) -- 2.24.1
Re: [PATCH net-next 0/5] skbuff: introduce skbuff_heads bulking and reusing
On Tue, 12 Jan 2021 13:23:16 +0100 Eric Dumazet wrote: > On Tue, Jan 12, 2021 at 12:08 PM Alexander Lobakin wrote: > > > > From: Edward Cree > > Date: Tue, 12 Jan 2021 09:54:04 + > > > > > Without wishing to weigh in on whether this caching is a good idea... > > > > Well, we already have a cache to bulk flush "consumed" skbs, although > > kmem_cache_free() is generally lighter than kmem_cache_alloc(), and > > a page frag cache to allocate skb->head that is also bulking the > > operations, since it contains a (compound) page with the size of > > min(SZ_32K, PAGE_SIZE). > > If they wouldn't give any visible boosts, I think they wouldn't hit > > mainline. > > > > > Wouldn't it be simpler, rather than having two separate "alloc" and > > > "flush" > > > caches, to have a single larger cache, such that whenever it becomes full > > > we bulk flush the top half, and when it's empty we bulk alloc the bottom > > > half? That should mean fewer branches, fewer instructions etc. than > > > having to decide which cache to act upon every time. > > > > I though about a unified cache, but couldn't decide whether to flush > > or to allocate heads and how much to process. Your suggestion answers > > these questions and generally seems great. I'll try that one, thanks! > > The thing is : kmalloc() is supposed to have batches already, and nice > per-cpu caches. > > This looks like an mm issue, are we sure we want to get over it ? > > I would like a full analysis of why SLAB/SLUB does not work well for > your test workload. +1, it does feel like we're getting into mm territory > More details, more numbers before we accept yet another > 'networking optimization' adding more code to the 'fast' path. > > More code means more latencies when all code needs to be brought up in > cpu caches.
Re: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries
Saeed Mahameed [sa...@kernel.org] wrote: > > -struct ibmvnic_rwi { > > - enum ibmvnic_reset_reason reset_reason; > > - struct list_head list; > > -}; > > + VNIC_RESET_CHANGE_PARAM, > > + VNIC_RESET_MAX}; // must be last >this is not the preferred comment style: ^^ > > I would just drop the comment here, it is clear from the name of the > enum. > Yeah, we debated and figured the comment could serve as another reminder. Here is updated patch, fixing the comment style. Thanks Sukadev --- >From 59d4b23fe1f97a67436e14829368744ee288157d Mon Sep 17 00:00:00 2001 From: Sukadev Bhattiprolu Date: Wed, 16 Dec 2020 23:00:34 -0600 Subject: [PATCH net-next v2 3/7] ibmvnic: avoid allocating rwi entries Whenever we need to schedule a reset, we allocate an rwi (reset work item?) entry and add to the list of pending resets. Since we only add one rwi for a given reason type to the list (no duplicates). we will only have a handful of reset types in the list - even in the worst case. In the common case we should just have a couple of entries at most. Rather than allocating/freeing every time (and dealing with the corner case of the allocation failing), use a fixed number of rwi entries. The extra memory needed is tiny and most of it will be used over the active life of the adapter. This also fixes a couple of tiny memory leaks. One is in ibmvnic_reset() where we don't free the rwi entries after deleting them from the list due to a transport event. The second is in __ibmvnic_reset() where if we find that the adapter is being removed, we simply break out of the loop (with rc = EBUSY) but ignore any rwi entries that remain on the list. Fixes: 2770a7984db58 ("Introduce hard reset recovery") Fixes: 36f1031c51a2 ("ibmvnic: Do not process reset during or after device removal") Signed-off-by: Sukadev Bhattiprolu --- drivers/net/ethernet/ibm/ibmvnic.c | 123 + drivers/net/ethernet/ibm/ibmvnic.h | 14 ++-- 2 files changed, 78 insertions(+), 59 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index cd8108dbddec..d1c2aaed1478 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2257,29 +2257,81 @@ static int do_hard_reset(struct ibmvnic_adapter *adapter, return rc; } -static struct ibmvnic_rwi *get_next_rwi(struct ibmvnic_adapter *adapter) +/** + * Next reset will always be the first on the list. + * When we take it off the list, we move any remaining resets so + * that the next one is again the first on the list. Most of the + * time the pending_resets[] should have a couple of types of resets + * (FAILOVER, TIMEOUT or CHANGE-PARAM and less often, MOBILITY). + */ +static enum ibmvnic_reset_reason get_pending_reset(struct ibmvnic_adapter *adapter) { - struct ibmvnic_rwi *rwi; + enum ibmvnic_reset_reason *pending_resets; + enum ibmvnic_reset_reason reason = 0; unsigned long flags; + int i; spin_lock_irqsave(&adapter->rwi_lock, flags); - if (!list_empty(&adapter->rwi_list)) { - rwi = list_first_entry(&adapter->rwi_list, struct ibmvnic_rwi, - list); - list_del(&rwi->list); - } else { - rwi = NULL; + pending_resets = &adapter->pending_resets[0]; + + reason = pending_resets[0]; + + if (reason) { + for (i = 0; i < adapter->next_reset; i++) { + pending_resets[i] = pending_resets[i+1]; + if (!pending_resets[i]) + break; + } + adapter->next_reset--; + } + + spin_unlock_irqrestore(&adapter->rwi_lock, flags); + return reason; +} + +/** + * Add a pending reset, making sure not to add duplicates. + * If @clear is set, clear all existing resets before adding. + * + * TODO: If clear (i.e force_reset_recovery) is true AND we have a + * duplicate reset, wouldn't it still make sense to clear the + * queue including the duplicate and add this reset? Preserving + * existing behavior for now. + */ +static void add_pending_reset(struct ibmvnic_adapter *adapter, + enum ibmvnic_reset_reason reason, + bool clear) +{ + enum ibmvnic_reset_reason *pending_resets; + unsigned long flags; + int i; + + spin_lock_irqsave(&adapter->rwi_lock, flags); + + pending_resets = &adapter->pending_resets[0]; + + for (i = 0; i < adapter->next_reset; i++) { + if (pending_resets[i] == reason) + goto out; + } + + if (clear) { + for (i = 0; i < adapter->next_reset; i++) { + pending_resets[i] = 0; + } + adapter->next_reset = 0; } + pending_resets[adapter->next_reset] = r
RE: [PATCH v2] net: phy: realtek: Add support for RTL9000AA/AN
> > I think it's possible to return a Master/Slave configuration. > > Great. It would be good to add it. OK. I think it will take some time to implement this feature, as we prioritize investigating comments from Russell. > > By the way, do you need the cable test function as implementedin > > nxp-tja11xx.c? > > We don't need it. But if you want to implement it, that would be > great. We also don't need the cable test feature. However, we are interested in adding this feature, so we will consider adding the cable test feature after the RTL9000AA/AN support are merged. Thanks & Best Regards, Yuusuke Ashiduka
Re: [PATCH bpf-next v2 1/2] bpf: allow to retrieve sol_socket opts from sock_addr progs
On Mon, Jan 11, 2021 at 3:09 PM Daniel Borkmann wrote: > > The _bpf_setsockopt() is able to set some of the SOL_SOCKET level options, > however, _bpf_getsockopt() has little support to actually retrieve them. > This small patch adds few misc options such as SO_MARK, SO_PRIORITY and > SO_BINDTOIFINDEX. For the latter getter and setter are added. The mark and > priority in particular allow to retrieve the options from BPF cgroup hooks > to then implement custom behavior / settings on the syscall hooks compared > to other sockets that stick to the defaults, for example. > > Signed-off-by: Daniel Borkmann > Acked-by: Yonghong Song Applied.
Re: [PATCH v3 bpf-next 7/7] selftests/bpf: test kernel module ksym externs
On Tue, Jan 12, 2021 at 3:41 AM Andrii Nakryiko wrote: > > Add per-CPU variable to bpf_testmod.ko and use those from new selftest to > validate it works end-to-end. > > Acked-by: Yonghong Song > Acked-by: Hao Luo > Signed-off-by: Andrii Nakryiko Applied. FYI for everyone. This test needs the latest pahole.
RE: [PATCH] igb: avoid premature Rx buffer reuse
> -Original Message- > From: Alexander Duyck [mailto:alexander.du...@gmail.com] > Sent: Wednesday, January 13, 2021 5:23 AM > To: Li,Rongqing > Cc: Netdev ; intel-wired-lan > ; Björn Töpel > Subject: Re: [PATCH] igb: avoid premature Rx buffer r > Okay, this explanation makes much more sense. Could you please either include > this explanation in your patch, or include a reference to this patch as this > explains clearly what the issue is while yours didn't and led to the > confusion as I > was assuming the freeing was happening closer to the t0 case, and really the > problem is t1. > > Thanks. > > - Alex Ok, I will send V2 Thanks -Li
Re: [Patch net] cls_flower: call nla_ok() before nla_next()
On Mon, 11 Jan 2021 18:55:48 -0800 Cong Wang wrote: > From: Cong Wang > > fl_set_enc_opt() simply checks if there are still bytes left to parse, > but this is not sufficent as syzbot seems to be able to generate > malformatted netlink messages. nla_ok() is more strict so should be > used to validate the next nlattr here. > > And nla_validate_nested_deprecated() has less strict check too, it is > probably too late to switch to the strict version, but we can just > call nla_ok() too after it. > > Reported-and-tested-by: syzbot+2624e3778b18fc497...@syzkaller.appspotmail.com > Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options") > Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options") > Cc: Pieter Jansen van Vuuren > Cc: Jamal Hadi Salim > Cc: Xin Long > Cc: Jiri Pirko > Signed-off-by: Cong Wang Thanks for keeping up with the syzbot bugs! > diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c > index 1319986693fc..e265c443536e 100644 > --- a/net/sched/cls_flower.c > +++ b/net/sched/cls_flower.c > @@ -1272,6 +1272,8 @@ static int fl_set_enc_opt(struct nlattr **tb, struct > fl_flow_key *key, > > nla_opt_msk = nla_data(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]); > msk_depth = nla_len(tb[TCA_FLOWER_KEY_ENC_OPTS_MASK]); > + if (!nla_ok(nla_opt_msk, msk_depth)) > + return -EINVAL; Can we just add another call to nla_validate_nested_deprecated() here instead of having to worry about each attr individually? See below.. > } > > nla_for_each_attr(nla_opt_key, nla_enc_key, > @@ -1308,7 +1310,7 @@ static int fl_set_enc_opt(struct nlattr **tb, struct > fl_flow_key *key, > return -EINVAL; > } > > - if (msk_depth) > + if (nla_ok(nla_opt_msk, msk_depth)) > nla_opt_msk = nla_next(nla_opt_msk, &msk_depth); Should we not error otherwise? if msk_depth && !nla_ok() then the message is clearly misformatted. If we don't error out we'll keep reusing the same mask over and over, while the intention of this code was to have mask per key AFAICT. > break; > case TCA_FLOWER_KEY_ENC_OPTS_VXLAN:
Re: [PATCH v4 1/1] can: dev: add software tx timestamps
On Tue, 12 Jan 2021 11:03:00 +0100 Marc Kleine-Budde wrote: > On 1/12/21 10:54 AM, Vincent Mailhol wrote: > > Call skb_tx_timestamp() within can_put_echo_skb() so that a software > > tx timestamp gets attached on the skb. > > > > There two main reasons to include this call in can_put_echo_skb(): > > > > * It easily allow to enable the tx timestamp on all devices with > > just one small change. > > > > * According to Documentation/networking/timestamping.rst, the tx > > timestamps should be generated in the device driver as close as > > possible, but always prior to passing the packet to the network > > interface. During the call to can_put_echo_skb(), the skb gets > > cloned meaning that the driver should not dereference the skb > > variable anymore after can_put_echo_skb() returns. This makes > > can_put_echo_skb() the very last place we can use the skb without > > having to access the echo_skb[] array. > > > > Remark: by default, skb_tx_timestamp() does nothing. It needs to be > > activated by passing the SOF_TIMESTAMPING_TX_SOFTWARE flag either > > through socket options or control messages. > > > > References: > > > > * Support for the error queue in CAN RAW sockets (which is needed for > >tx timestamps) was introduced in: > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb88531bdbfaafb827192d1fc6c5a3fcc4fadd96 > > > > * Put the call to skb_tx_timestamp() just before adding it to the > > array: https://lkml.org/lkml/2021/1/10/54 > > > > * About Tx hardware timestamps > > > > https://lore.kernel.org/linux-can/2021071152.gb11...@hoboy.vegasvil.org/ > > > > Signed-off-by: Vincent Mailhol > > Applied to linux-can-next/testing Please make sure to address the warnings before this hits net-next: https://patchwork.kernel.org/project/netdevbpf/patch/20210112130538.14912-2-mailhol.vinc...@wanadoo.fr/ Actually it appears not to build with allmodconfig..?
Re: [PATCH v4 1/1] can: dev: add software tx timestamps
On Tue, 12 Jan 2021 17:46:25 -0800 Jakub Kicinski wrote: > On Tue, 12 Jan 2021 11:03:00 +0100 Marc Kleine-Budde wrote: > > On 1/12/21 10:54 AM, Vincent Mailhol wrote: > > > Call skb_tx_timestamp() within can_put_echo_skb() so that a software > > > tx timestamp gets attached on the skb. > > > > > > There two main reasons to include this call in can_put_echo_skb(): > > > > > > * It easily allow to enable the tx timestamp on all devices with > > > just one small change. > > > > > > * According to Documentation/networking/timestamping.rst, the tx > > > timestamps should be generated in the device driver as close as > > > possible, but always prior to passing the packet to the network > > > interface. During the call to can_put_echo_skb(), the skb gets > > > cloned meaning that the driver should not dereference the skb > > > variable anymore after can_put_echo_skb() returns. This makes > > > can_put_echo_skb() the very last place we can use the skb without > > > having to access the echo_skb[] array. > > > > > > Remark: by default, skb_tx_timestamp() does nothing. It needs to be > > > activated by passing the SOF_TIMESTAMPING_TX_SOFTWARE flag either > > > through socket options or control messages. > > > > > > References: > > > > > > * Support for the error queue in CAN RAW sockets (which is needed for > > >tx timestamps) was introduced in: > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb88531bdbfaafb827192d1fc6c5a3fcc4fadd96 > > > > > > * Put the call to skb_tx_timestamp() just before adding it to the > > > array: https://lkml.org/lkml/2021/1/10/54 > > > > > > * About Tx hardware timestamps > > > > > > https://lore.kernel.org/linux-can/2021071152.gb11...@hoboy.vegasvil.org/ > > > > > > Signed-off-by: Vincent Mailhol > > > > Applied to linux-can-next/testing > > Please make sure to address the warnings before this hits net-next: > > https://patchwork.kernel.org/project/netdevbpf/patch/20210112130538.14912-2-mailhol.vinc...@wanadoo.fr/ > > Actually it appears not to build with allmodconfig..? Erm, apologies, I confused different CAN patches, this one did not get build tested.
[PATCH v1 net-next 1/1] Allow user to set metric on default route learned via Router Advertisement.
For IPv4, default route is learned via DHCPv4 and user is allowed to change metric using config etc/network/interfaces. But for IPv6, default route can be learned via RA, for which, currently a fixed metric value 1024 is used. Ideally, user should be able to configure metric on default route for IPv6 similar to IPv4. This fix adds sysctl for the same. Signed-off-by: Praveen Chaudhary Signed-off-by: Zhenggen Xu Changes in v1. --- 1.) Correct the call to rt6_add_dflt_router. --- --- Documentation/networking/ip-sysctl.rst | 18 ++ include/linux/ipv6.h | 1 + include/net/ip6_route.h| 3 ++- include/uapi/linux/ipv6.h | 1 + include/uapi/linux/sysctl.h| 1 + net/ipv6/addrconf.c| 10 ++ net/ipv6/ndisc.c | 14 ++ net/ipv6/route.c | 5 +++-- 8 files changed, 46 insertions(+), 7 deletions(-) diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index dd2b12a32b73..384159081d91 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -1871,6 +1871,24 @@ accept_ra_defrtr - BOOLEAN - enabled if accept_ra is enabled. - disabled if accept_ra is disabled. +accept_ra_defrtr_metric - INTEGER + Route metric for default route learned in Router Advertisement. This + value will be assigned as metric for the route learned via IPv6 Router + Advertisement. + + Possible values are: + 0: + Use default value i.e. IP6_RT_PRIO_USER 1024. + 0x to -1: + -ve values represent high route metric, value will be treated as + unsigned value. This behaviour is inline with current IPv4 metric + shown with commands such as "route -n" or "ip route list". + 1 to 0x7FF: + +ve values will be used as is for route metric. + + Functional default: enabled if accept_ra_defrtr is enabled. + disabled if accept_ra_defrtr is disabled. + accept_ra_from_local - BOOLEAN Accept RA with source-address that is found on local machine if the RA is otherwise proper and able to be accepted. diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h index dda61d150a13..19af90c77200 100644 --- a/include/linux/ipv6.h +++ b/include/linux/ipv6.h @@ -31,6 +31,7 @@ struct ipv6_devconf { __s32 max_desync_factor; __s32 max_addresses; __s32 accept_ra_defrtr; + __s32 accept_ra_defrtr_metric; __s32 accept_ra_min_hop_limit; __s32 accept_ra_pinfo; __s32 ignore_routes_with_linkdown; diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index 2a5277758379..a470bdab2420 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -174,7 +174,8 @@ struct fib6_info *rt6_get_dflt_router(struct net *net, struct net_device *dev); struct fib6_info *rt6_add_dflt_router(struct net *net, const struct in6_addr *gwaddr, -struct net_device *dev, unsigned int pref); +struct net_device *dev, unsigned int pref, +unsigned int defrtr_usr_metric); void rt6_purge_dflt_routers(struct net *net); diff --git a/include/uapi/linux/ipv6.h b/include/uapi/linux/ipv6.h index 13e8751bf24a..945de5de5144 100644 --- a/include/uapi/linux/ipv6.h +++ b/include/uapi/linux/ipv6.h @@ -189,6 +189,7 @@ enum { DEVCONF_ACCEPT_RA_RT_INFO_MIN_PLEN, DEVCONF_NDISC_TCLASS, DEVCONF_RPL_SEG_ENABLED, + DEVCONF_ACCEPT_RA_DEFRTR_METRIC, DEVCONF_MAX }; diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h index 458179df9b27..5e79c196e33c 100644 --- a/include/uapi/linux/sysctl.h +++ b/include/uapi/linux/sysctl.h @@ -571,6 +571,7 @@ enum { NET_IPV6_ACCEPT_SOURCE_ROUTE=25, NET_IPV6_ACCEPT_RA_FROM_LOCAL=26, NET_IPV6_ACCEPT_RA_RT_INFO_MIN_PLEN=27, + NET_IPV6_ACCEPT_RA_DEFRTR_METRIC=28, __NET_IPV6_MAX }; diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index eff2cacd5209..702ec4a33936 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -205,6 +205,7 @@ static struct ipv6_devconf ipv6_devconf __read_mostly = { .max_desync_factor = MAX_DESYNC_FACTOR, .max_addresses = IPV6_MAX_ADDRESSES, .accept_ra_defrtr = 1, + .accept_ra_defrtr_metric = 0, .accept_ra_from_local = 0, .accept_ra_min_hop_limit= 1, .accept_ra_pinfo= 1, @@ -260,6 +261,7 @@ static struct ipv6_devconf ipv6_devconf_dflt __read_mostly = { .ma
[PATCH v1 net-next 0/1] Allow user to set metric on default route learned via Router Advertisement.
Allow user to set metric on default route learned via Router Advertisement. Note: RFC 4191 does not say anything for metric for IPv6 default route. Fix: For IPv4, default route is learned via DHCPv4 and user is allowed to change metric using config in etc/network/interfaces. But for IPv6, default route can be learned via RA, for which, currently a fixed metric value 1024 is used. Ideally, user should be able to configure metric on default route for IPv6 similar to IPv4. This fix adds sysctl for the same. Logs: For IPv4: Config in etc/network/interfaces ``` auto eth0 iface eth0 inet dhcp metric 4261413864 ``` IPv4 Kernel Route Table: ``` $ sudo route -n Kernel IP routing table Destination Gateway Genmask Flags Metric RefUse Iface 0.0.0.0 172.11.44.1 0.0.0.0 UG-33553432 00 eth0 ``` FRR Table, if a static route is configured. [In real scenario, it is useful to prefer BGP learned default route over DHCPv4 default route.] ``` Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, > - selected route, * - FIB route S>* 0.0.0.0/0 [20/0] is directly connected, eth0, 00:00:03 K 0.0.0.0/0 [254/1000] via 172.21.47.1, eth0, 6d08h51m ``` i.e. User can prefer Default Router learned via Routing Protocol, Similar behavior is not possible for IPv6, without this fix. After fix [for IPv6]: ``` sudo sysctl -w net.ipv6.conf.eth0.net.ipv6.conf.eth0.accept_ra_defrtr_metric=0x770003e9 ``` IP monitor: ``` default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 1996489705 pref high ``` Kernel IPv6 routing table ``` DestinationNext Hop Flag Met Ref Use If ::/0 fe80::xx16::feb3:ce8e UGDAe 1996489705 0 0 eth0 ``` FRR Table, if a static route is configured. [In real scenario, it is useful to prefer BGP learned default route over IPv6 RA default route.] ``` Codes: K - kernel route, C - connected, S - static, R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP, > - selected route, * - FIB route S>* ::/0 [20/0] is directly connected, eth0, 00:00:06 K ::/0 [119/1001] via fe80::xx16::feb3:ce8e, eth0, 6d07h43m ``` If the metric is changed later, the effect will be seen only when IPv6 RA is received, because the default route must be fully controlled by RA msg. ``` admin@lnos-x1-a-asw03:~$ sudo sysctl -w net.ipv6.conf.eth0.accept_ra_defrtr_metric=0x770003e8 net.ipv6.conf.eth0.accept_ra_defrtr_metric = 0x770003e8 ``` IP monitor: when metric is changed after learning Default Route from previous IPv6 RA msg: ``` Deleted default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 1996489705 expires 3sec hoplimit 64 pref high default via fe80::xx16::feb3:ce8e dev eth0 proto ra metric 1996489704 pref high ``` Praveen Chaudhary (1): Allow user to set metric on default route learned via Router Advertisement. Documentation/networking/ip-sysctl.rst | 18 ++ include/linux/ipv6.h | 1 + include/net/ip6_route.h| 3 ++- include/uapi/linux/ipv6.h | 1 + include/uapi/linux/sysctl.h| 1 + net/ipv6/addrconf.c| 10 ++ net/ipv6/ndisc.c | 14 ++ net/ipv6/route.c | 5 +++-- 8 files changed, 46 insertions(+), 7 deletions(-) base-commit: 139711f033f636cc78b6aaf7363252241b9698ef -- 2.29.0
Re: [PATCH net-next v2 5/7] ibmvnic: serialize access to work queue
On Tue, 12 Jan 2021 16:40:49 -0800 Sukadev Bhattiprolu wrote: > Saeed Mahameed [sa...@kernel.org] wrote: > > On Tue, 2021-01-12 at 10:14 -0800, Sukadev Bhattiprolu wrote: > > > > > @@ -5467,7 +5472,15 @@ static int ibmvnic_remove(struct vio_dev *dev) > > > return -EBUSY; > > > } > > > > > > + /* If ibmvnic_reset() is scheduling a reset, wait for it to > > > + * finish. Then prevent it from scheduling any more resets > > > + * and have the reset functions ignore any resets that have > > > + * already been scheduled. > > > + */ > > > + spin_lock_irqsave(&adapter->remove_lock, flags); > > > adapter->state = VNIC_REMOVING; > > > + spin_unlock_irqrestore(&adapter->remove_lock, flags); > > > + > > > > Why irqsave/restore variants ? are you expecting this spinlock to be > > held in interruptcontext ? > > > > > spin_unlock_irqrestore(&adapter->state_lock, flags); > > Good question. > > One of the callers of ibmvnic_reset() is the ->ndo_tx_timeout() > method which gets called from the watchdog timer. watchdog is a normal timer, so it's gonna run in softirq, you don't need to mask irqs.