Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On Thu, Jan 14, 2021 at 5:39 AM Jason Wang wrote:
>
>
> On 2021/1/13 10:33 PM, Willem de Bruijn wrote:
> > On Tue, Jan 12, 2021 at 11:11 PM Jason Wang wrote:
> >>
> >> On 2021/1/13 7:47 AM, Willem de Bruijn wrote:
> >>> On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
> >>> wrote:
> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
> wrote:
> > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
> > wrote:
> >> Existing TUN module is able to use provided "steering eBPF" to
> >> calculate per-packet hash and derive the destination queue to
> >> place the packet to. The eBPF uses mapped configuration data
> >> containing a key for hash calculation and indirection table
> >> with array of queues' indices.
> >>
> >> This series of patches adds support for virtio-net hash reporting
> >> feature as defined in virtio specification. It extends the TUN module
> >> and the "steering eBPF" as follows:
> >>
> >> Extended steering eBPF calculates the hash value and hash type, keeps
> >> hash value in the skb->hash and returns index of destination virtqueue
> >> and the type of the hash. TUN module keeps returned hash type in
> >> (currently unused) field of the skb.
> >> skb->__unused renamed to 'hash_report_type'.
> >>
> >> When TUN module is called later to allocate and fill the virtio-net
> >> header and push it to destination virtqueue it populates the hash
> >> and the hash type into virtio-net header.
> >>
> >> VHOST driver is made aware of respective virtio-net feature that
> >> extends the virtio-net header to report the hash value and hash report
> >> type.
> Comment from Willem de Bruijn:
> >
> > Skbuff fields are in short supply. I don't think we need to add one
> > just for this narrow path entirely internal to the tun device.
> >
> We understand that and try to minimize the impact by using an already
> existing unused field of skb.
> >>> Not anymore. It was repurposed as a flags field very recently.
> >>>
> >>> This use case is also very narrow in scope. And a very short path from
> >>> data producer to consumer. So I don't think it needs to claim scarce
> >>> bits in the skb.
> >>>
> >>> tun_ebpf_select_queue stores the field, tun_put_user reads it and
> >>> converts it to the virtio_net_hdr in the descriptor.
> >>>
> >>> tun_ebpf_select_queue is called from .ndo_select_queue. Storing the
> >>> field in skb->cb is fragile, as in theory some code could overwrite
> >>> that field between ndo_select_queue and
> >>> ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
> >>> control again. But in practice, I don't believe anything does.
> >>>
> >>> Alternatively an existing skb field that is used only on disjoint
> >>> datapaths, such as ingress-only, could be viable.
> >>
> >> A question here. We had metadata support in XDP for cooperation between
> >> eBPF programs. Do we have something similar in the skb?
> >>
> >> E.g in the RSS, if we want to pass some metadata information between
> >> eBPF program and the logic that generates the vnet header (either hard
> >> logic in the kernel or another eBPF program). Is there any way that can
> >> avoid the possible conflicts of qdiscs?
> > Not that I am aware of. The closest thing is cb[].
> >
> > It'll have to alias a field like that, that is known unused for the given
> > path.
>
> Right, we need to make sure cb is not used by other ones. I'm not sure
> how hard to achieve that, considering Qemu installs the eBPF program but it
> doesn't deal with networking configurations.
> >
> >
> > One other approach that has been used within linear call stacks is out
> > of band. Like percpu variables softnet_data.xmit.more and
> > mirred_rec_level. But that is perhaps a bit overwrought for this use
> > case.
>
> Yes, and if we go that way then eBPF turns out to be a burden since we
> need to invent helpers to access those auxiliary data structures. It
> would be better then to hard-code the RSS in the kernel.
> >
> >
> > Instead, you could just run the flow_dissector in tun_put_user if the
> > feature is negotiated. Indeed, the flow dissector seems more apt to me
> > than BPF here. Note that the flow dissector internally can be
> > overridden by a BPF program if the admin so chooses.
>
> While this set of patches is related to hash delivery in the virtio-net
> packet in general,
> it was prepared in the context of the RSS feature implementation as defined
> in the virtio spec [1]
> In case of RSS it is not enough to run the flow_dissector in
> tun_put_user:
> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
> hash type and queue index
> according to the (mapped) parameters (key, hash types, indirection
> table) received from the guest.
> >>> TUNSETSTEERINGEBPF was added to support more diverse queue selection
> >>> than the default in
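
For reference, the TUNSETSTEERINGEBPF ioctl mentioned above takes the fd of a
loaded BPF program; a minimal userspace sketch (error handling elided; tun_fd
and prog_fd are assumed to be an open tap device and a loaded steering
program):

#include <sys/ioctl.h>
#include <linux/if_tun.h>

/* Sketch: attach a previously loaded steering eBPF program to a tap
 * device. The program's return value selects the destination queue
 * for each transmitted packet. */
static int attach_steering_prog(int tun_fd, int prog_fd)
{
	return ioctl(tun_fd, TUNSETSTEERINGEBPF, &prog_fd);
}

The extension discussed in this thread would additionally have the program
compute the hash and report its type, which is where the question of where to
carry that type between .ndo_select_queue and tun_put_user comes from.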
[PATCH net-next 4/5] selftests: mlxsw: sch_red_core: Drop two unused variables
From: Petr Machata

These variables are cut'n'pasted from other functions in the file and not
actually used.

Signed-off-by: Petr Machata
Signed-off-by: Ido Schimmel
---
 tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh | 2 --
 1 file changed, 2 deletions(-)

diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
index b0cb1aaffdda..d439ee0eaa8a 100644
--- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
+++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh
@@ -551,10 +551,8 @@ do_drop_test()
 	local trigger=$1; shift
 	local subtest=$1; shift
 	local fetch_counter=$1; shift
-	local backlog
 	local base
 	local now
-	local pct
 
 	RET=0
-- 
2.29.2
[PATCH net-next 2/5] mlxsw: spectrum_trap: Add ecn_mark trap
From: Petr Machata Register with devlink a new buffer drop trap, ecn_mark. Since Spectrum-1 does not support the ecn_mark trap, do it only for Spectrum-2 and above. Reviewed-by: Jiri Pirko Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c index 4ef12e3e021a..45cf6c027cac 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_trap.c @@ -51,6 +51,8 @@ enum { enum { /* Packet was early dropped. */ MLXSW_SP_MIRROR_REASON_INGRESS_WRED = 9, + /* Packet was ECN marked. */ + MLXSW_SP_MIRROR_REASON_EGRESS_ECN = 13, }; static int mlxsw_sp_rx_listener(struct mlxsw_sp *mlxsw_sp, struct sk_buff *skb, @@ -1760,6 +1762,13 @@ mlxsw_sp2_trap_items_arr[] = { }, .is_source = true, }, + { + .trap = MLXSW_SP_TRAP_BUFFER_DROP(ECN_MARK), + .listeners_arr = { + MLXSW_SP_RXL_BUFFER_DISCARD(EGRESS_ECN), + }, + .is_source = true, + }, }; static int -- 2.29.2
[PATCH net-next 1/5] devlink: Add ecn_mark trap
From: Petr Machata

Add the packet trap that can report packets that were ECN marked due to
RED AQM.

Signed-off-by: Amit Cohen
Reviewed-by: Jiri Pirko
Signed-off-by: Petr Machata
Signed-off-by: Ido Schimmel
---
 Documentation/networking/devlink/devlink-trap.rst | 4
 include/net/devlink.h                             | 3 +++
 net/core/devlink.c                                | 1 +
 3 files changed, 8 insertions(+)

diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index d875f3e1e9cf..c6026b44617e 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -480,6 +480,10 @@ be added to the following table:
      - ``drop``
      - Traps packets that the device decided to drop in case they hit a
        blackhole nexthop
+   * - ``ecn_mark``
+     - ``drop``
+     - Traps ECN-capable packets that were marked with the CE (Congestion
+       Experienced) code point by the RED algorithm instead of being dropped

 Driver-specific Packet Traps

diff --git a/include/net/devlink.h b/include/net/devlink.h
index f466819cc477..dd0c0b8fba6e 100644
--- a/include/net/devlink.h
+++ b/include/net/devlink.h
@@ -836,6 +836,7 @@ enum devlink_trap_generic_id {
 	DEVLINK_TRAP_GENERIC_ID_GTP_PARSING,
 	DEVLINK_TRAP_GENERIC_ID_ESP_PARSING,
 	DEVLINK_TRAP_GENERIC_ID_BLACKHOLE_NEXTHOP,
+	DEVLINK_TRAP_GENERIC_ID_ECN_MARK,
 
 	/* Add new generic trap IDs above */
 	__DEVLINK_TRAP_GENERIC_ID_MAX,
@@ -1061,6 +1062,8 @@ enum devlink_trap_group_generic_id {
 	"esp_parsing"
 #define DEVLINK_TRAP_GENERIC_NAME_BLACKHOLE_NEXTHOP \
 	"blackhole_nexthop"
+#define DEVLINK_TRAP_GENERIC_NAME_ECN_MARK \
+	"ecn_mark"
 
 #define DEVLINK_TRAP_GROUP_GENERIC_NAME_L2_DROPS \
 	"l2_drops"
diff --git a/net/core/devlink.c b/net/core/devlink.c
index ee828e4b1007..f86688bfad46 100644
--- a/net/core/devlink.c
+++ b/net/core/devlink.c
@@ -9508,6 +9508,7 @@ static const struct devlink_trap devlink_trap_generic[] = {
 	DEVLINK_TRAP(GTP_PARSING, DROP),
 	DEVLINK_TRAP(ESP_PARSING, DROP),
 	DEVLINK_TRAP(BLACKHOLE_NEXTHOP, DROP),
+	DEVLINK_TRAP(ECN_MARK, DROP),
 };
 
 #define DEVLINK_TRAP_GROUP(_id) \
-- 
2.29.2
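
As a usage sketch once a driver registers the trap (the device address, and
the output shown, are illustrative):

$ devlink trap set pci/0000:03:00.0 trap ecn_mark action trap
$ devlink trap show pci/0000:03:00.0 trap ecn_mark
pci/0000:03:00.0:
  name ecn_mark type drop generic true action trap group buffer_drops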
[PATCH net-next 3/5] mlxsw: spectrum_qdisc: Offload RED qevent mark
From: Petr Machata The RED "mark" qevent can be offloaded under the same exact conditions as the RED "tail_drop" qevent. Therefore recognize its binding type in the TC_SETUP_BLOCK handler and translate to the right SPAN trigger. The corresponding hardware trigger, MLXSW_SP_SPAN_TRIGGER_ECN, is considered "egress", unlike the previously-offloaded _EARLY_DROP. Add a helper to spectrum_span, mlxsw_sp_span_trigger_is_ingress(), to classify triggers to ingress and egress. Pass result of this instead of hardcoding true when calling mlxsw_sp_span_analyzed_port_get()/_put(). Signed-off-by: Petr Machata Reviewed-by: Jiri Pirko Signed-off-by: Ido Schimmel --- drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 2 ++ drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 2 ++ .../net/ethernet/mellanox/mlxsw/spectrum_qdisc.c | 14 +++--- .../net/ethernet/mellanox/mlxsw/spectrum_span.c | 16 .../net/ethernet/mellanox/mlxsw/spectrum_span.h | 1 + 5 files changed, 32 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c index 1650d9852b5b..e93b96837a44 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c @@ -992,6 +992,8 @@ static int mlxsw_sp_setup_tc_block(struct mlxsw_sp_port *mlxsw_sp_port, return mlxsw_sp_setup_tc_block_clsact(mlxsw_sp_port, f, false); case FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP: return mlxsw_sp_setup_tc_block_qevent_early_drop(mlxsw_sp_port, f); + case FLOW_BLOCK_BINDER_TYPE_RED_MARK: + return mlxsw_sp_setup_tc_block_qevent_mark(mlxsw_sp_port, f); default: return -EOPNOTSUPP; } diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h index a6956cfc9cb1..0e5fd6a73da1 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h @@ -1109,6 +1109,8 @@ int mlxsw_sp_setup_tc_fifo(struct mlxsw_sp_port *mlxsw_sp_port, struct tc_fifo_qopt_offload *p); int mlxsw_sp_setup_tc_block_qevent_early_drop(struct mlxsw_sp_port *mlxsw_sp_port, struct flow_block_offload *f); +int mlxsw_sp_setup_tc_block_qevent_mark(struct mlxsw_sp_port *mlxsw_sp_port, + struct flow_block_offload *f); /* spectrum_fid.c */ bool mlxsw_sp_fid_is_dummy(struct mlxsw_sp *mlxsw_sp, u16 fid_index); diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c index fd672c6c9133..5cd8854e6070 100644 --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_qdisc.c @@ -1327,6 +1327,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, const struct mlxsw_sp_span_agent_parms *agent_parms, int *p_span_id) { + bool ingress = mlxsw_sp_span_trigger_is_ingress(qevent_binding->span_trigger); struct mlxsw_sp_port *mlxsw_sp_port = qevent_binding->mlxsw_sp_port; struct mlxsw_sp_span_trigger_parms trigger_parms = {}; int span_id; @@ -1336,7 +1337,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, if (err) return err; - err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port, true); + err = mlxsw_sp_span_analyzed_port_get(mlxsw_sp_port, ingress); if (err) goto err_analyzed_port_get; @@ -1358,7 +1359,7 @@ static int mlxsw_sp_qevent_span_configure(struct mlxsw_sp *mlxsw_sp, mlxsw_sp_span_agent_unbind(mlxsw_sp, qevent_binding->span_trigger, mlxsw_sp_port, &trigger_parms); err_agent_bind: - mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, true); + 
mlxsw_sp_span_analyzed_port_put(mlxsw_sp_port, ingress); err_analyzed_port_get: mlxsw_sp_span_agent_put(mlxsw_sp, span_id); return err; @@ -1368,6 +1369,7 @@ static void mlxsw_sp_qevent_span_deconfigure(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_qevent_binding *qevent_binding, int span_id) { + bool ingress = mlxsw_sp_span_trigger_is_ingress(qevent_binding->span_trigger); struct mlxsw_sp_port *mlxsw_sp_port = qevent_binding->mlxsw_sp_port; struct mlxsw_sp_span_trigger_parms trigger_parms = { .span_id = span_id, @@ -1377,7 +1379,7 @@ static void mlxsw_sp_qevent_span_deconfigure(struct mlxsw_sp *mlxsw_sp, qevent_binding->tclass_num); mlxsw_sp_span_agent_unbind(mlxsw_sp, qevent_binding->span_trigger, mlxsw_sp_port, &trigger_pa
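
For context, a sketch of how this offload is exercised from userspace
(interface name and block number illustrative; the parameters mirror the
selftests later in this series): the "mark" qevent is bound to a filter block
when the RED qdisc is created, and filters on that block then run on
ECN-marked packets:

# Bind block 10 to the RED "mark" qevent; ECN mode is required for
# marking to happen at all.
tc qdisc add dev swp3 parent 10:7 handle 107: red \
	limit 1000000 min 500000 max 500001 probability 1.0 \
	avpkt 8000 burst 63 ecn qevent mark block 10
# Trap ECN-marked packets to the CPU; mlxsw offloads this via the
# MLXSW_SP_SPAN_TRIGGER_ECN egress trigger added above.
tc filter add block 10 matchall skip_sw action trap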
[PATCH v4 net-next] net: bridge: check vlan with eth_type_vlan() method
From: Menglong Dong Replace some checks for ETH_P_8021Q and ETH_P_8021AD with eth_type_vlan(). Signed-off-by: Menglong Dong --- v4: - remove unnecessary brackets. v3: - fix compile warning in br_vlan_set_proto() by casting 'val' to be16. v2: - use eth_type_vlan() in br_validate() and __br_vlan_set_proto() too. --- net/bridge/br_forward.c | 3 +-- net/bridge/br_netlink.c | 12 +++- net/bridge/br_vlan.c| 2 +- 3 files changed, 5 insertions(+), 12 deletions(-) diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c index e28ffadd1371..6e9b049ae521 100644 --- a/net/bridge/br_forward.c +++ b/net/bridge/br_forward.c @@ -39,8 +39,7 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb br_drop_fake_rtable(skb); if (skb->ip_summed == CHECKSUM_PARTIAL && - (skb->protocol == htons(ETH_P_8021Q) || -skb->protocol == htons(ETH_P_8021AD))) { + eth_type_vlan(skb->protocol)) { int depth; if (!__vlan_get_protocol(skb, skb->protocol, &depth)) diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index 49700ce0e919..762f273802cd 100644 --- a/net/bridge/br_netlink.c +++ b/net/bridge/br_netlink.c @@ -1096,15 +1096,9 @@ static int br_validate(struct nlattr *tb[], struct nlattr *data[], return 0; #ifdef CONFIG_BRIDGE_VLAN_FILTERING - if (data[IFLA_BR_VLAN_PROTOCOL]) { - switch (nla_get_be16(data[IFLA_BR_VLAN_PROTOCOL])) { - case htons(ETH_P_8021Q): - case htons(ETH_P_8021AD): - break; - default: - return -EPROTONOSUPPORT; - } - } + if (data[IFLA_BR_VLAN_PROTOCOL] && + !eth_type_vlan(nla_get_be16(data[IFLA_BR_VLAN_PROTOCOL]))) + return -EPROTONOSUPPORT; if (data[IFLA_BR_VLAN_DEFAULT_PVID]) { __u16 defpvid = nla_get_u16(data[IFLA_BR_VLAN_DEFAULT_PVID]); diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index 701cad646b20..bb2909738518 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -917,7 +917,7 @@ int __br_vlan_set_proto(struct net_bridge *br, __be16 proto) int br_vlan_set_proto(struct net_bridge *br, unsigned long val) { - if (val != ETH_P_8021Q && val != ETH_P_8021AD) + if (!eth_type_vlan(htons(val))) return -EPROTONOSUPPORT; return __br_vlan_set_proto(br, htons(val)); -- 2.30.0
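
For reference, the helper being substituted is a small inline from
include/linux/if_vlan.h, roughly:

static inline bool eth_type_vlan(__be16 ethertype)
{
	switch (ethertype) {
	case htons(ETH_P_8021Q):
	case htons(ETH_P_8021AD):
		return true;
	default:
		return false;
	}
}

It compares in network byte order, which is why br_vlan_set_proto() above has
to wrap the host-order 'val' in htons() before calling it.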
[PATCH mlx5-next v3 5/5] net/mlx5: Allow to the users to configure number of MSI-X vectors
From: Leon Romanovsky Implement ability to configure MSI-X for the SR-IOV VFs. Signed-off-by: Leon Romanovsky --- .../net/ethernet/mellanox/mlx5/core/main.c| 1 + .../ethernet/mellanox/mlx5/core/mlx5_core.h | 1 + .../net/ethernet/mellanox/mlx5/core/sriov.c | 38 +++ 3 files changed, 40 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c index 8269cfbfc69d..334b3b5077c5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c @@ -1647,6 +1647,7 @@ static struct pci_driver mlx5_core_driver = { .shutdown = shutdown, .err_handler= &mlx5_err_handler, .sriov_configure = mlx5_core_sriov_configure, + .sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count, }; static void mlx5_core_verify_params(void) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h index 5babb4434a87..8a2523d2d43a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h @@ -138,6 +138,7 @@ void mlx5_sriov_cleanup(struct mlx5_core_dev *dev); int mlx5_sriov_attach(struct mlx5_core_dev *dev); void mlx5_sriov_detach(struct mlx5_core_dev *dev); int mlx5_core_sriov_configure(struct pci_dev *dev, int num_vfs); +int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count); int mlx5_core_enable_hca(struct mlx5_core_dev *dev, u16 func_id); int mlx5_core_disable_hca(struct mlx5_core_dev *dev, u16 func_id); int mlx5_create_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c index f0ec86a1c8a6..febb7a5ec72d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c @@ -144,6 +144,7 @@ mlx5_device_disable_sriov(struct mlx5_core_dev *dev, int num_vfs, bool clear_vf) static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs) { struct mlx5_core_dev *dev = pci_get_drvdata(pdev); + u32 num_vf_msix; int err; err = mlx5_device_enable_sriov(dev, num_vfs); @@ -152,6 +153,8 @@ static int mlx5_sriov_enable(struct pci_dev *pdev, int num_vfs) return err; } + num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix); + pci_sriov_set_vf_total_msix(pdev, num_vf_msix); err = pci_enable_sriov(pdev, num_vfs); if (err) { mlx5_core_warn(dev, "pci_enable_sriov failed : %d\n", err); @@ -187,6 +190,41 @@ int mlx5_core_sriov_configure(struct pci_dev *pdev, int num_vfs) return err ? err : num_vfs; } +int mlx5_core_sriov_set_msix_vec_count(struct pci_dev *vf, int msix_vec_count) +{ + struct pci_dev *pf = pci_physfn(vf); + struct mlx5_core_sriov *sriov; + struct mlx5_core_dev *dev; + int num_vf_msix, id; + + dev = pci_get_drvdata(pf); + num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix); + if (!num_vf_msix) + return -EOPNOTSUPP; + + if (!msix_vec_count) + msix_vec_count = + mlx5_get_default_msix_vec_count(dev, pci_num_vf(pf)); + + sriov = &dev->priv.sriov; + + /* Reversed translation of PCI VF function number to the internal +* function_id, which exists in the name of virtfn symlink. 
+*/ + for (id = 0; id < pci_num_vf(pf); id++) { + if (!sriov->vfs_ctx[id].enabled) + continue; + + if (vf->devfn == pci_iov_virtfn_devfn(pf, id)) + break; + } + + if (id == pci_num_vf(pf) || !sriov->vfs_ctx[id].enabled) + return -EINVAL; + + return mlx5_set_msix_vec_count(dev, id + 1, msix_vec_count); +} + int mlx5_sriov_attach(struct mlx5_core_dev *dev) { if (!mlx5_core_is_pf(dev) || !pci_num_vf(dev->pdev)) -- 2.29.2
[PATCH mlx5-next v3 2/5] PCI: Add SR-IOV sysfs entry to read total number of dynamic MSI-X vectors
From: Leon Romanovsky

Some SR-IOV capable devices provide the ability to configure a specific
number of MSI-X vectors on their VFs before a driver is probed on the VF.
In order to make management easy, provide a new read-only sysfs file that
returns the total number of MSI-X vectors available for configuration.

cat /sys/bus/pci/devices/.../sriov_vf_total_msix
  = 0 - feature is not supported
  > 0 - total number of MSI-X vectors available for consumption by the VFs

Signed-off-by: Leon Romanovsky
---
 Documentation/ABI/testing/sysfs-bus-pci | 14 ++
 drivers/pci/iov.c                       | 12
 drivers/pci/pci.h                       | 3 +++
 include/linux/pci.h                     | 2 ++
 4 files changed, 31 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 34a8c6bcde70..530c249cc3da 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -395,3 +395,17 @@ Description:
 		The file is writable if the PF is bound to a driver that
 		set sriov_vf_total_msix > 0 and there is no driver bound
 		to the VF.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_total_msix
+Date:		January 2021
+Contact:	Leon Romanovsky
+Description:
+		This file is associated with the SR-IOV PFs.
+		It returns the total number of MSI-X vectors that can be
+		configured on the enabled VFs.
+
+		The values returned are:
+		 * > 0 - the total number of vectors the VFs can consume,
+		 * = 0 - the feature is not supported
+
+		If no SR-IOV VFs are enabled, this value will return 0.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index b8d7a3b26f87..6660632b7a14 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -394,12 +394,22 @@ static ssize_t sriov_drivers_autoprobe_store(struct device *dev,
 	return count;
 }
 
+static ssize_t sriov_vf_total_msix_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%u\n", pdev->sriov->vf_total_msix);
+}
+
 static DEVICE_ATTR_RO(sriov_totalvfs);
 static DEVICE_ATTR_RW(sriov_numvfs);
 static DEVICE_ATTR_RO(sriov_offset);
 static DEVICE_ATTR_RO(sriov_stride);
 static DEVICE_ATTR_RO(sriov_vf_device);
 static DEVICE_ATTR_RW(sriov_drivers_autoprobe);
+static DEVICE_ATTR_RO(sriov_vf_total_msix);
 
 static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_totalvfs.attr,
@@ -408,6 +418,7 @@ static struct attribute *sriov_dev_attrs[] = {
 	&dev_attr_sriov_stride.attr,
 	&dev_attr_sriov_vf_device.attr,
 	&dev_attr_sriov_drivers_autoprobe.attr,
+	&dev_attr_sriov_vf_total_msix.attr,
 	NULL,
 };
 
@@ -659,6 +670,7 @@ static void sriov_disable(struct pci_dev *dev)
 		sysfs_remove_link(&dev->dev.kobj, "dep_link");
 
 	iov->num_VFs = 0;
+	iov->vf_total_msix = 0;
 	pci_iov_set_numvfs(dev, 0);
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a275018beb2f..444e03ce9c97 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -327,6 +327,9 @@ struct pci_sriov {
 	u16		subsystem_device; /* VF subsystem device */
 	resource_size_t	barsz[PCI_SRIOV_NUM_BARS];	/* VF BAR size */
 	bool		drivers_autoprobe; /* Auto probing of VFs by driver */
+	u32		vf_total_msix;	/* Total number of MSI-X vectors the VFs
+					 * can consume
+					 */
 };
 
 /**
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6be96d468eda..c950513558b8 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -2075,6 +2075,7 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev);
 int pci_sriov_configure_simple(struct pci_dev *dev, int nr_virtfn);
 resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno);
 void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe);
+void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count);
 
 /* Arch may override these (weak) */
 int pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs);
@@ -2115,6 +2116,7 @@ static inline int pci_sriov_get_totalvfs(struct pci_dev *dev)
 static inline resource_size_t pci_iov_resource_size(struct pci_dev *dev, int resno)
 { return 0; }
 static inline void pci_vf_drivers_autoprobe(struct pci_dev *dev, bool probe) { }
+static inline void pci_sriov_set_vf_total_msix(struct pci_dev *dev, u32 count) {}
 #endif
 
 #if defined(CONFIG_HOTPLUG_PCI) || defined(CONFIG_HOTPLUG_PCI_MODULE)
-- 
2.29.2
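
A quick usage sketch (the PF address and the value shown are illustrative;
the value is whatever the PF driver passed to pci_sriov_set_vf_total_msix()):

$ cat /sys/bus/pci/devices/0000:08:00.0/sriov_vf_total_msix
2048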
[PATCH] net/qla3xxx: switch from 'pci_' to 'dma_' API
The wrappers in include/linux/pci-dma-compat.h should go away. The patch has been generated with the coccinelle script below and has been hand modified to replace GFP_ with a correct flag. It has been compile tested. When memory is allocated in 'ql_alloc_net_req_rsp_queues()' GFP_KERNEL can be used because it is only called from 'ql_alloc_mem_resources()' which already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see below) When memory is allocated in 'ql_alloc_buffer_queues()' GFP_KERNEL can be used because this flag is already used just a few line above. When memory is allocated in 'ql_alloc_small_buffers()' GFP_KERNEL can be used because it is only called from 'ql_alloc_mem_resources()' which already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see above) When memory is allocated in 'ql_alloc_mem_resources()' GFP_KERNEL can be used because this function already calls 'ql_alloc_buffer_queues()' which uses GFP_KERNEL. (see above) While at it, use 'dma_set_mask_and_coherent()' instead of 'dma_set_mask()/ dma_set_coherent_mask()' in order to slightly simplify code. @@ @@ -PCI_DMA_BIDIRECTIONAL +DMA_BIDIRECTIONAL @@ @@ -PCI_DMA_TODEVICE +DMA_TO_DEVICE @@ @@ -PCI_DMA_FROMDEVICE +DMA_FROM_DEVICE @@ @@ -PCI_DMA_NONE +DMA_NONE @@ expression e1, e2, e3; @@ -pci_alloc_consistent(e1, e2, e3) +dma_alloc_coherent(&e1->dev, e2, e3, GFP_) @@ expression e1, e2, e3; @@ -pci_zalloc_consistent(e1, e2, e3) +dma_alloc_coherent(&e1->dev, e2, e3, GFP_) @@ expression e1, e2, e3, e4; @@ -pci_free_consistent(e1, e2, e3, e4) +dma_free_coherent(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_map_single(e1, e2, e3, e4) +dma_map_single(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_unmap_single(e1, e2, e3, e4) +dma_unmap_single(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4, e5; @@ -pci_map_page(e1, e2, e3, e4, e5) +dma_map_page(&e1->dev, e2, e3, e4, e5) @@ expression e1, e2, e3, e4; @@ -pci_unmap_page(e1, e2, e3, e4) +dma_unmap_page(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_map_sg(e1, e2, e3, e4) +dma_map_sg(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_unmap_sg(e1, e2, e3, e4) +dma_unmap_sg(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_single_for_cpu(e1, e2, e3, e4) +dma_sync_single_for_cpu(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_single_for_device(e1, e2, e3, e4) +dma_sync_single_for_device(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_sg_for_cpu(e1, e2, e3, e4) +dma_sync_sg_for_cpu(&e1->dev, e2, e3, e4) @@ expression e1, e2, e3, e4; @@ -pci_dma_sync_sg_for_device(e1, e2, e3, e4) +dma_sync_sg_for_device(&e1->dev, e2, e3, e4) @@ expression e1, e2; @@ -pci_dma_mapping_error(e1, e2) +dma_mapping_error(&e1->dev, e2) @@ expression e1, e2; @@ -pci_set_dma_mask(e1, e2) +dma_set_mask(&e1->dev, e2) @@ expression e1, e2; @@ -pci_set_consistent_dma_mask(e1, e2) +dma_set_coherent_mask(&e1->dev, e2) Signed-off-by: Christophe JAILLET --- If needed, see post from Christoph Hellwig on the kernel-janitors ML: https://marc.info/?l=kernel-janitors&m=158745678307186&w=4 --- drivers/net/ethernet/qlogic/qla3xxx.c | 196 -- 1 file changed, 87 insertions(+), 109 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c b/drivers/net/ethernet/qlogic/qla3xxx.c index 27740c027681..214e347097a7 100644 --- a/drivers/net/ethernet/qlogic/qla3xxx.c +++ b/drivers/net/ethernet/qlogic/qla3xxx.c @@ -315,12 +315,11 @@ static void ql_release_to_lrg_buf_free_list(struct ql3_adapter *qdev, * 
buffer */ skb_reserve(lrg_buf_cb->skb, QL_HEADER_SPACE); - map = pci_map_single(qdev->pdev, + map = dma_map_single(&qdev->pdev->dev, lrg_buf_cb->skb->data, -qdev->lrg_buffer_len - -QL_HEADER_SPACE, -PCI_DMA_FROMDEVICE); - err = pci_dma_mapping_error(qdev->pdev, map); +qdev->lrg_buffer_len - QL_HEADER_SPACE, +DMA_FROM_DEVICE); + err = dma_mapping_error(&qdev->pdev->dev, map); if (err) { netdev_err(qdev->ndev, "PCI mapping failed with error: %d\n", @@ -1802,13 +1801,12 @@ static int ql_populate_free_queue(struct ql3_adapter *qdev) * first buffer */ skb_reserve(lrg_buf_cb->skb, QL_HEADER_SPAC
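
For illustration, the combined helper mentioned in the commit message
collapses the usual two-step mask setup into one call; a sketch of the
standard probe-time pattern (not literal code from this driver):

/* Try 64-bit DMA, fall back to 32-bit; one call sets both the
 * streaming and the coherent mask. Returns 0 on success. */
if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)) &&
    dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)))
	dev_err(&pdev->dev, "No usable DMA configuration\n");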
[PATCH mlx5-next v3 4/5] net/mlx5: Dynamically assign MSI-X vectors count
From: Leon Romanovsky

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device. Statically assigning an
amount of MSI-X vectors doesn't allow the newly created VF to be fully
utilized, because the future load and configuration under which that VF
will be used are not known to the device.

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors. Such a change immediately increases the amount of MSI-X vectors
for a system with @ VFs from 12 vectors per-VF to 32 vectors per-VF.

Before this patch:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

After this patch:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=32 Masked-

Signed-off-by: Leon Romanovsky
---
 .../net/ethernet/mellanox/mlx5/core/main.c    | 4 ++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   | 5 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 +++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 13 +++-
 4 files changed, 92 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index c08315b51fd3..8269cfbfc69d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -567,6 +567,10 @@ static int handle_hca_cap(struct mlx5_core_dev *dev, void *set_ctx)
 	if (MLX5_CAP_GEN_MAX(dev, mkey_by_name))
 		MLX5_SET(cmd_hca_cap, set_hca_cap, mkey_by_name, 1);
 
+	if (MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix))
+		MLX5_SET(cmd_hca_cap, set_hca_cap, num_total_dynamic_vf_msix,
+			 MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix));
+
 	return set_caps(dev, set_ctx, MLX5_SET_HCA_CAP_OP_MOD_GENERAL_DEVICE);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 0a0302ce7144..5babb4434a87 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -172,6 +172,11 @@ int mlx5_irq_attach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
 int mlx5_irq_detach_nb(struct mlx5_irq_table *irq_table, int vecidx,
 		       struct notifier_block *nb);
+
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int devfn,
+			    int msix_vec_count);
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs);
+
 struct cpumask *
 mlx5_irq_get_affinity_mask(struct mlx5_irq_table *irq_table, int vecidx);
 struct cpu_rmap *mlx5_irq_get_rmap(struct mlx5_irq_table *table);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
index 6fd974920394..2a35888fcff0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/pci_irq.c
@@ -55,6 +55,78 @@ static struct mlx5_irq *mlx5_irq_get(struct mlx5_core_dev *dev, int vecidx)
 	return &irq_table->irq[vecidx];
 }
 
+/**
+ * mlx5_get_default_msix_vec_count() - Get defaults of number of MSI-X vectors
+ *				       to be set
+ * @dev: PF to work on
+ * @num_vfs: Number of VFs was asked when SR-IOV was enabled
+ **/
+int mlx5_get_default_msix_vec_count(struct mlx5_core_dev *dev, int num_vfs)
+{
+	int num_vf_msix, min_msix, max_msix;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	min_msix = MLX5_CAP_GEN(dev, min_dynamic_vf_msix_table_size);
+	max_msix = MLX5_CAP_GEN(dev, max_dynamic_vf_msix_table_size);
+
+	/* Limit maximum number of MSI-X to leave some of them free in the
+	 * pool and ready to be assigned by the users without need to resize
+	 * other VFs.
+	 */
+	return max(min(num_vf_msix / num_vfs, max_msix / 2), min_msix);
+}
+
+/**
+ * mlx5_set_msix_vec_count() - Set dynamically allocated MSI-X to the VF
+ * @dev: PF to work on
+ * @function_id: Internal PCI VF function id
+ * @msix_vec_count: Number of MSI-X to set
+ **/
+int mlx5_set_msix_vec_count(struct mlx5_core_dev *dev, int function_id,
+			    int msix_vec_count)
+{
+	int sz = MLX5_ST_SZ_BYTES(set_hca_cap_in);
+	int num_vf_msix, min_msix, max_msix;
+	void *hca_cap, *cap;
+	int ret;
+
+	num_vf_msix = MLX5_CAP_GEN_MAX(dev, num_total_dynamic_vf_msix);
+	if (!num_vf_msix)
+		return 0;
+
+	if (!MLX5_CAP_GEN(dev, vport_group_manager) || !mlx5_core_is_pf(dev))
+
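
To illustrate the default policy with made-up numbers: assuming
num_vf_msix = 2048, num_vfs = 64, min_msix = 2 and max_msix = 64, the
expression evaluates as max(min(2048 / 64, 64 / 2), 2) = max(min(32, 32), 2)
= 32 vectors per VF, which matches the Count=32 visible in the lspci output
above, while deliberately keeping half of the per-VF maximum in the pool for
later manual assignment.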
[PATCH mlx5-next v3 1/5] PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV VFs
From: Leon Romanovsky

Extend the PCI sysfs interface with a new callback that allows
configuring the number of MSI-X vectors for a specific SR-IOV VF. This
is needed to optimize the performance of newly bound devices by
allocating the number of vectors based on the administrator's knowledge
of the targeted VM.

This function is applicable for SR-IOV VFs because such devices allocate
their MSI-X table before they will run on the VMs and the HW can't guess
the right number of vectors, so the HW allocates them statically and
equally.

The newly added /sys/bus/pci/devices/.../sriov_vf_msix_count file will be
seen for the VFs and is writable as long as a driver is not bound to the
VF.

The values accepted are:
 * > 0 - this will be the number reported by the VF's MSI-X capability
 * < 0 - not valid
 * = 0 - will reset to the device default value

Signed-off-by: Leon Romanovsky
---
 Documentation/ABI/testing/sysfs-bus-pci | 20 ++
 drivers/pci/iov.c                       | 82 +
 drivers/pci/msi.c                       | 47 ++
 drivers/pci/pci-sysfs.c                 | 1 +
 drivers/pci/pci.h                       | 2 +
 include/linux/pci.h                     | 3 +
 6 files changed, 155 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci
index 25c9c39770c6..34a8c6bcde70 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci
+++ b/Documentation/ABI/testing/sysfs-bus-pci
@@ -375,3 +375,23 @@ Description:
 		The value comes from the PCI kernel device state and can be
 		one of: "unknown", "error", "D0", D1", "D2", "D3hot", "D3cold".
 		The file is read only.
+
+What:		/sys/bus/pci/devices/.../sriov_vf_msix_count
+Date:		December 2020
+Contact:	Leon Romanovsky
+Description:
+		This file is associated with the SR-IOV VFs.
+		It allows configuration of the number of MSI-X vectors for
+		the VF. This is needed to optimize performance of newly bound
+		devices by allocating the number of vectors based on the
+		administrator's knowledge of the targeted VM.
+
+		The values accepted are:
+		 * > 0 - this will be the number reported by the VF's MSI-X
+			 capability
+		 * < 0 - not valid
+		 * = 0 - will reset to the device default value
+
+		The file is writable if the PF is bound to a driver that
+		set sriov_vf_total_msix > 0 and there is no driver bound
+		to the VF.
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 4afd4ee4f7f0..b8d7a3b26f87 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -31,6 +31,7 @@ int pci_iov_virtfn_devfn(struct pci_dev *dev, int vf_id)
 	return (dev->devfn + dev->sriov->offset +
 		dev->sriov->stride * vf_id) & 0xff;
 }
+EXPORT_SYMBOL(pci_iov_virtfn_devfn);
 
 /*
  * Per SR-IOV spec sec 3.3.10 and 3.3.11, First VF Offset and VF Stride may
@@ -426,6 +427,68 @@ const struct attribute_group sriov_dev_attr_group = {
 	.is_visible = sriov_attrs_are_visible,
 };
 
+#ifdef CONFIG_PCI_MSI
+static ssize_t sriov_vf_msix_count_show(struct device *dev,
+					struct device_attribute *attr,
+					char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	int count = pci_msix_vec_count(pdev);
+
+	if (count < 0)
+		return count;
+
+	return sprintf(buf, "%d\n", count);
+}
+
+static ssize_t sriov_vf_msix_count_store(struct device *dev,
+					 struct device_attribute *attr,
+					 const char *buf, size_t count)
+{
+	struct pci_dev *vf_dev = to_pci_dev(dev);
+	int val, ret;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	ret = pci_vf_set_msix_vec_count(vf_dev, val);
+	if (ret)
+		return ret;
+
+	return count;
+}
+static DEVICE_ATTR_RW(sriov_vf_msix_count);
+#endif
+
+static struct attribute *sriov_vf_dev_attrs[] = {
+#ifdef CONFIG_PCI_MSI
+	&dev_attr_sriov_vf_msix_count.attr,
+#endif
+	NULL,
+};
+
+static umode_t sriov_vf_attrs_are_visible(struct kobject *kobj,
+					  struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *vf_dev;
+
+	if (dev_is_pf(dev))
+		return 0;
+
+	vf_dev = to_pci_dev(dev);
+	if (!vf_dev->msix_cap)
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group sriov_vf_dev_attr_group = {
+	.attrs = sriov_vf_dev_attrs,
+	.is_visible = sriov_vf_attrs_are_visible,
+};
+
 int __weak pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
 	return 0;
@@ -1054,6 +1117,25 @@ int pci_sriov_get_totalvfs(struct pci_dev *dev)
 }
 EXPORT_SYM
[PATCH mlx5-next v3 0/5] Dynamically assign MSI-X vectors count
From: Leon Romanovsky

Changelog

v3:
 * Renamed pci_set_msix_vec_count to be pci_vf_set_msix_vec_count.
 * Added VF msix_cap check to hide the sysfs entry if the device doesn't
   support MSI-X.
 * Changed "-" to be ":" in the mlx5 patch to silence CI warnings about
   missing kdoc description.
 * Split differently error print in mlx5 driver to avoid checkpatch
   warning.
v2: https://lore.kernel.org/linux-pci/20210114103140.866141-1-l...@kernel.org
 * Patch 1:
   * Renamed vf_msix_vec sysfs knob to be sriov_vf_msix_count
   * Added PF and VF device locks during set MSI-X call to protect from
     parallel driver bind/unbind operations.
   * Removed extra checks when reading sriov_vf_msix, because users will
     be able to distinguish between supported/not supported by looking on
     sriov_vf_total_msix count.
   * Changed all occurrences of "numb" to be "count"
   * Changed returned error from EOPNOTSUPP to be EBUSY if user tries to
     set MSI-X count after driver already bound to the VF.
   * Added extra comment in pci_set_msix_vec_count() to emphasize that
     driver should not be bound.
 * Patch 2:
   * Changed vf_total_msix from int to be u32 and updated function
     signatures accordingly.
   * Improved patch title
v1: https://lore.kernel.org/linux-pci/20210110150727.1965295-1-l...@kernel.org
 * Improved wording and commit messages of first PCI patch
 * Added extra PCI patch to provide total number of MSI-X vectors
 * Prohibited read of vf_msix_vec sysfs file if driver doesn't support
   write
 * Removed extra function definition in pci.h
v0: https://lore.kernel.org/linux-pci/20210103082440.34994-1-l...@kernel.org

Hi,

The number of MSI-X vectors is a PCI property visible through lspci; the
field is read-only and configured by the device. Statically assigning an
amount of MSI-X vectors doesn't allow the newly created VF to be fully
utilized, because the future load and configuration under which that VF
will be used are not known to the device. The VFs are created on the
hypervisor and forwarded to VMs that have different properties (for
example, the number of CPUs).

To overcome the inefficiency in the spread of such MSI-X vectors, we
allow the kernel to instruct the device with the needed number of such
vectors, before the VF is initialized and bound to the driver.

Before this series:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=12 Masked-

Configuration script:
1. Start fresh
echo 0 > /sys/bus/pci/devices/\:08\:00.0/sriov_numvfs
modprobe -q -r mlx5_ib mlx5_core
2. Ensure that driver doesn't run and it is safe to change MSI-X
echo 0 > /sys/bus/pci/devices/\:08\:00.0/sriov_drivers_autoprobe
3. Load driver for the PF
modprobe mlx5_core
4. Configure one of the VFs with new number
echo 2 > /sys/bus/pci/devices/\:08\:00.0/sriov_numvfs
echo 21 > /sys/bus/pci/devices/\:08\:00.2/sriov_vf_msix_count

After this series:
[root@server ~]# lspci -vs :08:00.2
08:00.2 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
	Capabilities: [9c] MSI-X: Enable- Count=21 Masked-

Thanks

Leon Romanovsky (5):
  PCI: Add sysfs callback to allow MSI-X table size change of SR-IOV
    VFs
  PCI: Add SR-IOV sysfs entry to read total number of dynamic MSI-X
    vectors
  net/mlx5: Add dynamic MSI-X capabilities bits
  net/mlx5: Dynamically assign MSI-X vectors count
  net/mlx5: Allow to the users to configure number of MSI-X vectors

 Documentation/ABI/testing/sysfs-bus-pci       | 34 +++
 .../net/ethernet/mellanox/mlx5/core/main.c    | 5 +
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   | 6 ++
 .../net/ethernet/mellanox/mlx5/core/pci_irq.c | 72 ++
 .../net/ethernet/mellanox/mlx5/core/sriov.c   | 51 +-
 drivers/pci/iov.c                             | 94 +++
 drivers/pci/msi.c                             | 47 ++
 drivers/pci/pci-sysfs.c                       | 1 +
 drivers/pci/pci.h                             | 5 +
 include/linux/mlx5/mlx5_ifc.h                 | 11 ++-
 include/linux/pci.h                           | 5 +
 11 files changed, 328 insertions(+), 3 deletions(-)

-- 
2.29.2
[PATCH net-next 5/5] selftests: mlxsw: RED: Add selftests for the mark qevent
From: Petr Machata Add do_mark_test(), which is to do_ecn_test() like do_drop_test() is to do_red_test(): meant to test that actions on the RED mark qevent block are offloaded, and executed on ECN-marked packets. Signed-off-by: Petr Machata Signed-off-by: Ido Schimmel --- .../drivers/net/mlxsw/sch_red_core.sh | 82 +++ .../drivers/net/mlxsw/sch_red_ets.sh | 74 +++-- 2 files changed, 151 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh index d439ee0eaa8a..6c2ddf76b4b8 100644 --- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_core.sh @@ -544,6 +544,51 @@ do_mc_backlog_test() log_test "TC $((vlan - 10)): Qdisc reports MC backlog" } +do_mark_test() +{ + local vlan=$1; shift + local limit=$1; shift + local subtest=$1; shift + local fetch_counter=$1; shift + local should_fail=$1; shift + local base + + RET=0 + + start_tcp_traffic $h1.$vlan $(ipaddr 1 $vlan) $(ipaddr 3 $vlan) $h3_mac tos=0x01 + + # Create a bit of a backlog and observe no mirroring due to marks. + qevent_rule_install_$subtest + + build_backlog $vlan $((2 * limit / 3)) tcp tos=0x01 >/dev/null + + base=$($fetch_counter) + count=$(busywait 1100 until_counter_is ">= $((base + 1))" $fetch_counter) + check_fail $? "Spurious packets ($base -> $count) observed without buffer pressure" + + # Above limit, everything should be mirrored, we should see lots of + # packets. + build_backlog $vlan $((3 * limit / 2)) tcp tos=0x01 >/dev/null + busywait_for_counter 1100 +1 \ +$fetch_counter > /dev/null + check_err_fail "$should_fail" $? "ECN-marked packets $subtest'd" + + # When the rule is uninstalled, there should be no mirroring. + qevent_rule_uninstall_$subtest + busywait_for_counter 1100 +10 \ +$fetch_counter > /dev/null + check_fail $? 
"Spurious packets observed after uninstall" + + if ((should_fail)); then + log_test "TC $((vlan - 10)): marked packets not $subtest'd" + else + log_test "TC $((vlan - 10)): marked packets $subtest'd" + fi + + stop_traffic + sleep 1 +} + do_drop_test() { local vlan=$1; shift @@ -626,6 +671,22 @@ do_drop_mirror_test() tc filter del dev $h2 ingress pref 1 handle 101 flower } +do_mark_mirror_test() +{ + local vlan=$1; shift + local limit=$1; shift + + tc filter add dev $h2 ingress pref 1 handle 101 prot ip \ + flower skip_sw ip_proto tcp \ + action drop + + do_mark_test "$vlan" "$limit" mirror \ +qevent_counter_fetch_mirror \ +$(: should_fail=)0 + + tc filter del dev $h2 ingress pref 1 handle 101 flower +} + qevent_rule_install_trap() { tc filter add block 10 pref 1234 handle 102 matchall skip_sw \ @@ -653,3 +714,24 @@ do_drop_trap_test() do_drop_test "$vlan" "$limit" "$trap_name" trap \ "qevent_counter_fetch_trap $trap_name" } + +do_mark_trap_test() +{ + local vlan=$1; shift + local limit=$1; shift + local should_fail=$1; shift + + do_mark_test "$vlan" "$limit" trap \ +"qevent_counter_fetch_trap ecn_mark" \ +"$should_fail" +} + +do_mark_trap_test_pass() +{ + do_mark_trap_test "$@" 0 +} + +do_mark_trap_test_fail() +{ + do_mark_trap_test "$@" 1 +} diff --git a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh index 3f007c5f8361..566cad32f346 100755 --- a/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh +++ b/tools/testing/selftests/drivers/net/mlxsw/sch_red_ets.sh @@ -9,6 +9,8 @@ ALL_TESTS=" mc_backlog_test red_mirror_test red_trap_test + ecn_trap_test + ecn_mirror_test " : ${QDISC:=ets} source sch_red_core.sh @@ -21,28 +23,60 @@ source sch_red_core.sh BACKLOG1=20 BACKLOG2=50 -install_qdisc() +install_root_qdisc() { - local -a args=("$@") - tc qdisc add dev $swp3 root handle 10: $QDISC \ bands 8 priomap 7 6 5 4 3 2 1 0 +} + +install_qdisc_tc0() +{ + local -a args=("$@") + tc qdisc add dev $swp3 parent 10:8 handle 108: red \ limit 100 min $BACKLOG1 max $((BACKLOG1 + 1)) \ probability 1.0 avpkt 8000 burst 38 "${args[@]}" +} + +install_qdisc_tc1() +{ + local -a args=("$@") + tc qdisc add dev $swp3 parent 10:7 handle 107: red \ limit 100 min $BACKLOG2 max $((BACKLOG2 + 1)) \ probability 1.0 avpkt 8000 burst 63 "${args[@]}" +} + +install_qdisc() +{ + install_root_qdisc + install_qdisc_tc0 "$@" + inst
[PATCH mlx5-next v3 3/5] net/mlx5: Add dynamic MSI-X capabilities bits
From: Leon Romanovsky These new fields declare the number of MSI-X vectors that is possible to allocate on the VF through PF configuration. Value must be in range defined by min_dynamic_vf_msix_table_size and max_dynamic_vf_msix_table_size. The driver should continue to query its MSI-X table through PCI configuration header. Signed-off-by: Leon Romanovsky --- include/linux/mlx5/mlx5_ifc.h | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index 8c5d5fe58051..7ac614bc592a 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -1656,7 +1656,16 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_at_6e0[0x10]; u8 sf_base_id[0x10]; - u8 reserved_at_700[0x80]; + u8 reserved_at_700[0x8]; + u8 num_total_dynamic_vf_msix[0x18]; + u8 reserved_at_720[0x14]; + u8 dynamic_msix_table_size[0xc]; + u8 reserved_at_740[0xc]; + u8 min_dynamic_vf_msix_table_size[0x4]; + u8 reserved_at_750[0x4]; + u8 max_dynamic_vf_msix_table_size[0xc]; + + u8 reserved_at_760[0x20]; u8 vhca_tunnel_commands[0x40]; u8 reserved_at_7c0[0x40]; }; -- 2.29.2
[PATCH net] net: lapb: Add locking to the lapb module
In the lapb module, the timers may run concurrently with other code in this module, and there is currently no locking to prevent the code from racing on "struct lapb_cb". This patch adds locking to prevent racing. 1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and "spin_unlock_bh" to APIs, timer functions and notifier functions. 2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us able to ask running timers to abort; Modify "lapb_stop_t1timer" and "lapb_stop_t2timer" to make them able to abort running timers; Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer", and "lapb_start_t1timer", "lapb_start_t2timer". 3. In lapb_unregister, change "lapb_stop_t1timer" and "lapb_stop_t2timer" to "del_timer_sync" to make sure all running timers have exited. 4. In lapb_device_event, replace the "lapb_disconnect_request" call with the content of "lapb_disconnect_request", to avoid trying to hold the lock twice. When doing this, "lapb_start_t1timer" is removed because I don't think it's necessary to start the timer when "NETDEV_GOING_DOWN". Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: Martin Schiller Signed-off-by: Xie He --- include/net/lapb.h| 2 ++ net/lapb/lapb_iface.c | 54 +++ net/lapb/lapb_timer.c | 30 3 files changed, 78 insertions(+), 8 deletions(-) diff --git a/include/net/lapb.h b/include/net/lapb.h index ccc3d1f020b0..eee73442a1ba 100644 --- a/include/net/lapb.h +++ b/include/net/lapb.h @@ -92,6 +92,7 @@ struct lapb_cb { unsigned short n2, n2count; unsigned short t1, t2; struct timer_list t1timer, t2timer; + boolt1timer_stop, t2timer_stop; /* Internal control information */ struct sk_buff_head write_queue; @@ -103,6 +104,7 @@ struct lapb_cb { struct lapb_frame frmr_data; unsigned char frmr_type; + spinlock_t lock; refcount_t refcnt; }; diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c index 40961889e9c0..ad6197381d05 100644 --- a/net/lapb/lapb_iface.c +++ b/net/lapb/lapb_iface.c @@ -122,6 +122,8 @@ static struct lapb_cb *lapb_create_cb(void) timer_setup(&lapb->t1timer, NULL, 0); timer_setup(&lapb->t2timer, NULL, 0); + lapb->t1timer_stop = true; + lapb->t2timer_stop = true; lapb->t1 = LAPB_DEFAULT_T1; lapb->t2 = LAPB_DEFAULT_T2; @@ -129,6 +131,8 @@ static struct lapb_cb *lapb_create_cb(void) lapb->mode= LAPB_DEFAULT_MODE; lapb->window = LAPB_DEFAULT_WINDOW; lapb->state = LAPB_STATE_0; + + spin_lock_init(&lapb->lock); refcount_set(&lapb->refcnt, 1); out: return lapb; @@ -178,8 +182,8 @@ int lapb_unregister(struct net_device *dev) goto out; lapb_put(lapb); - lapb_stop_t1timer(lapb); - lapb_stop_t2timer(lapb); + del_timer_sync(&lapb->t1timer); + del_timer_sync(&lapb->t2timer); lapb_clear_queues(lapb); @@ -201,6 +205,8 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + parms->t1 = lapb->t1 / HZ; parms->t2 = lapb->t2 / HZ; parms->n2 = lapb->n2; @@ -219,6 +225,7 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) else parms->t2timer = (lapb->t2timer.expires - jiffies) / HZ; + spin_unlock_bh(&lapb->lock); lapb_put(lapb); rc = LAPB_OK; out: @@ -234,6 +241,8 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_INVALUE; if (parms->t1 < 1 || parms->t2 < 1 || parms->n2 < 1) goto out_put; @@ -256,6 +265,7 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) rc = 
LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -270,6 +280,8 @@ int lapb_connect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_OK; if (lapb->state == LAPB_STATE_1) goto out_put; @@ -285,6 +297,7 @@ int lapb_connect_request(struct net_device *dev) rc = LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -299,6 +312,8 @@ int lapb_disconnect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + switch (lapb->state) { case LAPB_STATE_0: rc = LAPB_NOTCONNECTED; @@ -330,6 +345,7 @@ int l
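
In rough sketch form, the stop/expiry interplay that points 1 and 2 of the
commit message describe (abridged for illustration; not the literal diff):

void lapb_stop_t1timer(struct lapb_cb *lapb)
{
	lapb->t1timer_stop = true;	/* ask a running timer to abort */
	del_timer(&lapb->t1timer);
}

static void lapb_t1timer_expiry(struct timer_list *t)
{
	struct lapb_cb *lapb = from_timer(lapb, t, t1timer);

	spin_lock_bh(&lapb->lock);
	if (lapb->t1timer_stop) {	/* raced with a stop request: bail out */
		spin_unlock_bh(&lapb->lock);
		return;
	}
	/* ... normal T1 expiry handling, now under lapb->lock ... */
	spin_unlock_bh(&lapb->lock);
}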
Re: [PATCH v2 net-next 19/21] net/mlx5e: NVMEoTCP, data-path for DDP offload
On 16/01/2021 6:57, David Ahern wrote: > I have not had time to review this version of the patches, but this > patch seems very similar to 13 of 15 from v1 and you did not respond to > my question on it ... > > On 1/14/21 8:10 AM, Boris Pismenny wrote: >> diff --git >> a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> new file mode 100644 >> index ..f446b5d56d64 >> --- /dev/null >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c >> @@ -0,0 +1,243 @@ >> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB >> +/* Copyright (c) 2021 Mellanox Technologies. */ >> + >> +#include "en_accel/nvmeotcp_rxtx.h" >> +#include "en_accel/nvmeotcp.h" >> +#include >> + >> +#define MLX5E_TC_FLOW_ID_MASK 0x00ff >> +static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue, >> + struct mlx5e_cqe128 *cqe128) >> +{ >> +const struct tcp_ddp_ulp_ops *ulp_ops; >> +u32 seq; >> + >> +seq = be32_to_cpu(cqe128->resync_tcp_sn); >> +ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops; >> +if (ulp_ops && ulp_ops->resync_request) >> +ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ); >> +} >> + >> +static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue >> *queue) >> +{ >> +struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid]; >> + >> +queue->ccoff += nqe->sgl[queue->ccsglidx].length; >> +queue->ccoff_inner = 0; >> +queue->ccsglidx++; >> +} >> + >> +static inline void >> +mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb, >> +struct mlx5e_nvmeotcp_queue *queue, >> +struct nvmeotcp_queue_entry *nqe, u32 fragsz) >> +{ >> +dma_sync_single_for_cpu(&netdev->dev, >> +nqe->sgl[queue->ccsglidx].offset + >> queue->ccoff_inner, >> +fragsz, DMA_FROM_DEVICE); >> +page_ref_inc(compound_head(sg_page(&nqe->sgl[queue->ccsglidx]))); >> +// XXX: consider reducing the truesize, as no new memory is consumed >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +sg_page(&nqe->sgl[queue->ccsglidx]), >> +nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner, >> +fragsz, >> +fragsz); >> +} >> + >> +static struct sk_buff* >> +mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue, >> + struct sk_buff *skb, skb_frag_t *org_frags, >> + int org_nr_frags, int frag_index) >> +{ >> +struct mlx5e_priv *priv = queue->priv; >> + >> +while (org_nr_frags != frag_index) { >> +if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { >> +dev_kfree_skb_any(skb); >> +return NULL; >> +} >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +skb_frag_page(&org_frags[frag_index]), >> +skb_frag_off(&org_frags[frag_index]), >> +skb_frag_size(&org_frags[frag_index]), >> +skb_frag_size(&org_frags[frag_index])); >> +page_ref_inc(skb_frag_page(&org_frags[frag_index])); >> +frag_index++; >> +} >> +return skb; >> +} >> + >> +static struct sk_buff* >> +mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff >> *skb, >> + int offset, int len) >> +{ >> +struct mlx5e_priv *priv = queue->priv; >> + >> +if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { >> +dev_kfree_skb_any(skb); >> +return NULL; >> +} >> +skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, >> +virt_to_page(skb->data), >> +offset, >> +len, >> +len); >> +page_ref_inc(virt_to_page(skb->data)); >> +return skb; >> +} >> + >> +static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb, >> + skb_frag_t *org_frags, >> + int *frag_index, >> + int remaining) >> +{ >> +unsigned int frag_size; >> +int nr_frags; >> + 
>> +/* skip @remaining bytes in frags */ >> +*frag_index = 0; >> +while (remaining) { >> +frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]); >> +if (frag_size > remaining) { >> +skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index], >> + remaining); >> +skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index], >> + remaining); >> +remaining = 0; >> +} else { >> +remaining -= fra
[PATCH net-next 0/5] mlxsw: Add support for RED qevent "mark"
From: Ido Schimmel

Petr says:

The RED qdisc currently supports two qevents: "early_drop" and "mark".
The filters added to the block bound to the "early_drop" qevent are
executed on packets for which the RED algorithm decides that they should
be early-dropped. The "mark" filters are similarly executed on ECT
packets that are marked as ECN-CE (Congestion Experienced).

A previous patchset has offloaded "early_drop" filters on Spectrum-2 and
later, provided that the classifier used is "matchall", that the action
used is either "trap" or "mirred", and a handful of further limitations.
This patchset similarly offloads "mark" filters.

Patch set overview:

Patches #1 and #2 add the trap, under which packets will be reported to
the CPU, if the qevent filter uses the action "trap".

Patch #3 then recognizes FLOW_BLOCK_BINDER_TYPE_RED_MARK as a binder
type, and offloads the attached filters similarly to _EARLY_DROP.

Patch #4 cleans up some unused variables in a selftest, and patch #5
adds a new selftest for the RED "mark" qevent offload.

Petr Machata (5):
  devlink: Add ecn_mark trap
  mlxsw: spectrum_trap: Add ecn_mark trap
  mlxsw: spectrum_qdisc: Offload RED qevent mark
  selftests: mlxsw: sch_red_core: Drop two unused variables
  selftests: mlxsw: RED: Add selftests for the mark qevent

 .../networking/devlink/devlink-trap.rst       | 4 +
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 2 +
 .../net/ethernet/mellanox/mlxsw/spectrum.h    | 2 +
 .../ethernet/mellanox/mlxsw/spectrum_qdisc.c  | 14 +++-
 .../ethernet/mellanox/mlxsw/spectrum_span.c   | 16
 .../ethernet/mellanox/mlxsw/spectrum_span.h   | 1 +
 .../ethernet/mellanox/mlxsw/spectrum_trap.c   | 9 ++
 include/net/devlink.h                         | 3 +
 net/core/devlink.c                            | 1 +
 .../drivers/net/mlxsw/sch_red_core.sh         | 84 ++-
 .../drivers/net/mlxsw/sch_red_ets.sh          | 74 ++--
 11 files changed, 200 insertions(+), 10 deletions(-)

-- 
2.29.2
[PATCH rdma-rc] Revert "RDMA/mlx5: Fix devlink deadlock on net namespace deletion"
From: Parav Pandit This reverts commit fbdd0049d98d44914fc57d4b91f867f4996c787b. Due to commit in fixes tag, netdevice events were received only in one net namespace of mlx5_core_dev. Due to this when netdevice events arrive in net namespace other than net namespace of mlx5_core_dev, they are missed. This results in empty GID table due to RDMA device being detached from its net device. Hence, revert back to receive netdevice events in all net namespaces to restore back RDMA functionality in non init_net net namespace. Fixes: fbdd0049d98d ("RDMA/mlx5: Fix devlink deadlock on net namespace deletion") Signed-off-by: Parav Pandit Signed-off-by: Leon Romanovsky --- drivers/infiniband/hw/mlx5/main.c | 6 ++ .../net/ethernet/mellanox/mlx5/core/lib/mlx5.h | 5 + include/linux/mlx5/driver.h| 18 -- 3 files changed, 7 insertions(+), 22 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index ed13c1bb031e..36f8ae4fe619 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -3335,8 +3335,7 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) int err; dev->port[port_num].roce.nb.notifier_call = mlx5_netdev_event; - err = register_netdevice_notifier_net(mlx5_core_net(dev->mdev), - &dev->port[port_num].roce.nb); + err = register_netdevice_notifier(&dev->port[port_num].roce.nb); if (err) { dev->port[port_num].roce.nb.notifier_call = NULL; return err; @@ -3348,8 +3347,7 @@ static int mlx5_add_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) static void mlx5_remove_netdev_notifier(struct mlx5_ib_dev *dev, u8 port_num) { if (dev->port[port_num].roce.nb.notifier_call) { - unregister_netdevice_notifier_net(mlx5_core_net(dev->mdev), - &dev->port[port_num].roce.nb); + unregister_netdevice_notifier(&dev->port[port_num].roce.nb); dev->port[port_num].roce.nb.notifier_call = NULL; } } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h index 3a9fa629503f..d046db7bb047 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h @@ -90,4 +90,9 @@ int mlx5_create_encryption_key(struct mlx5_core_dev *mdev, u32 key_type, u32 *p_key_id); void mlx5_destroy_encryption_key(struct mlx5_core_dev *mdev, u32 key_id); +static inline struct net *mlx5_core_net(struct mlx5_core_dev *dev) +{ + return devlink_net(priv_to_devlink(dev)); +} + #endif diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h index 4901b4fadabb..c4939b28ceed 100644 --- a/include/linux/mlx5/driver.h +++ b/include/linux/mlx5/driver.h @@ -1210,22 +1210,4 @@ static inline bool mlx5_is_roce_enabled(struct mlx5_core_dev *dev) return val.vbool; } -/** - * mlx5_core_net - Provide net namespace of the mlx5_core_dev - * @dev: mlx5 core device - * - * mlx5_core_net() returns the net namespace of mlx5 core device. - * This can be called only in below described limited context. - * (a) When a devlink instance for mlx5_core is registered and - * when devlink reload operation is disabled. - * or - * (b) during devlink reload reload_down() and reload_up callbacks - * where it is ensured that devlink instance's net namespace is - * stable. - */ -static inline struct net *mlx5_core_net(struct mlx5_core_dev *dev) -{ - return devlink_net(priv_to_devlink(dev)); -} - #endif /* MLX5_DRIVER_H */ -- 2.29.2
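
For context, the two registration APIs at the heart of this revert, as a
minimal contrast (nb stands for the roce notifier_block above):

/* Per-namespace variant used by the reverted commit: nb only sees
 * events for netdevs in the given namespace, so events arriving in
 * other namespaces were missed. */
register_netdevice_notifier_net(mlx5_core_net(dev->mdev), &nb);

/* Global variant restored by this revert: nb sees netdev events
 * from all network namespaces. */
register_netdevice_notifier(&nb);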
[PATCH net-next] net: core: init every ctl_table in netns_core_table
From: Menglong Dong

For now, there is only one element in netns_core_table, and it is initialized directly in sysctl_core_net_init(). To make this more flexible, initialize every element in a loop, just like ipv4_sysctl_init_net() does.

Signed-off-by: Menglong Dong
---
 net/core/sysctl_net_core.c | 12
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d86d8d11cfe4..966d976dee84 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -606,15 +606,19 @@ static __net_init int sysctl_core_net_init(struct net *net)
 	tbl = netns_core_table;
 	if (!net_eq(net, &init_net)) {
+		int i;
+
 		tbl = kmemdup(tbl, sizeof(netns_core_table), GFP_KERNEL);
 		if (tbl == NULL)
 			goto err_dup;
 
-		tbl[0].data = &net->core.sysctl_somaxconn;
+		/* Update the variables to point into the current struct net */
+		for (i = 0; i < ARRAY_SIZE(netns_core_table) - 1; i++) {
+			tbl[i].data += (void *)net - (void *)&init_net;
 
-		/* Don't export any sysctls to unprivileged users */
-		if (net->user_ns != &init_user_ns) {
-			tbl[0].procname = NULL;
+			/* Don't export any sysctls to unprivileged users */
+			if (net->user_ns != &init_user_ns)
+				tbl[i].procname = NULL;
 		}
 	}
--
2.30.0
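To illustrate the rebasing idiom used above: each .data pointer initially points at a field of init_net, so adding the byte offset between the new struct net and init_net retargets it at the same field of the new namespace. A minimal stand-alone sketch of the idea (illustrative types only, not the kernel's; written with char * arithmetic so it is standard C, whereas the kernel does the same thing on void *, a GNU C extension):

#include <stdio.h>
#include <stddef.h>

/* Illustrative stand-ins for struct net and a sysctl table entry. */
struct net {
	int sysctl_somaxconn;
};

struct ctl_entry {
	const char *procname;
	void *data;		/* initially points into init_net */
};

struct net init_net  = { .sysctl_somaxconn = 128 };
struct net other_net = { .sysctl_somaxconn = 4096 };

int main(void)
{
	struct ctl_entry tbl[] = {
		{ "somaxconn", &init_net.sysctl_somaxconn },
	};
	size_t i;

	/* Rebase each .data pointer from init_net into other_net,
	 * the same arithmetic as the loop in sysctl_core_net_init(). */
	for (i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
		tbl[i].data = (char *)tbl[i].data +
			      ((char *)&other_net - (char *)&init_net);

	printf("%d\n", *(int *)tbl[0].data);	/* prints 4096 */
	return 0;
}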
ibmvnic: Race condition in remove callback
Hello,

while working on some cleanup I stumbled over a problem in the ibmvnic's remove callback. Since commit 7d7195a026ba ("ibmvnic: Do not process device remove during device reset") there is the following code in the remove callback:

static int ibmvnic_remove(struct vio_dev *dev)
{
	...
	spin_lock_irqsave(&adapter->state_lock, flags);
	if (test_bit(0, &adapter->resetting)) {
		spin_unlock_irqrestore(&adapter->state_lock, flags);
		return -EBUSY;
	}

	adapter->state = VNIC_REMOVING;
	spin_unlock_irqrestore(&adapter->state_lock, flags);

	flush_work(&adapter->ibmvnic_reset);
	flush_delayed_work(&adapter->ibmvnic_delayed_reset);
	...
}

Unfortunately returning -EBUSY doesn't work as intended: the return value of this function is ignored[1], and the device is considered unbound by the device core (shortly) after ibmvnic_remove() returns.

While looking into fixing that I noticed a worse problem: if ibmvnic_reset() (e.g. called by the tx_timeout callback) calls

	schedule_work(&adapter->ibmvnic_reset);

just after the work queue is flushed above, the problem that 7d7195a026ba intends to fix will trigger, resulting in a use-after-free. Also, ibmvnic_reset() checks adapter->state without holding the lock, which might be racy, too.

Best regards
Uwe

[1] vio_bus_remove (in arch/powerpc/platforms/pseries/vio.c) records the return value and passes it on. But the driver core doesn't care about the return value (see __device_release_driver() in drivers/base/dd.c calling dev->bus->remove()).

--
Pengutronix e.K.           | Uwe Kleine-König            |
Industrial Linux Solutions | https://www.pengutronix.de/ |
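For what it's worth, one conceivable direction (only a sketch, untested, and abbreviating the driver's existing enqueue logic) would be to let ibmvnic_reset() take state_lock as well, so that checking the adapter state and queuing the work become atomic with respect to the removal path above:

/* Untested sketch: refuse to queue reset work once removal has begun,
 * doing the state check and the enqueue under the same lock that the
 * remove callback takes before flushing the work queues. */
static int ibmvnic_reset(struct ibmvnic_adapter *adapter,
			 enum ibmvnic_reset_reason reason)
{
	unsigned long flags;
	int ret = 0;

	spin_lock_irqsave(&adapter->state_lock, flags);
	if (adapter->state == VNIC_REMOVING ||
	    adapter->state == VNIC_REMOVED) {
		ret = -EBUSY;
		goto out_unlock;
	}

	/* ... build and queue the rwi entry as the driver does today ... */
	schedule_work(&adapter->ibmvnic_reset);

out_unlock:
	spin_unlock_irqrestore(&adapter->state_lock, flags);
	return ret;
}

With something like that, nothing can be scheduled after the flush_work()/flush_delayed_work() calls in ibmvnic_remove(), closing the window described above.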
Re: [PATCH net-next V2] net: ks8851: Fix mixed module/builtin build
On Sat, Jan 16, 2021 at 10:41 PM Heiner Kallweit wrote:
> >> It seems unlikely that a system uses both, the parallel *and* the SPI
> >> variant of the ks8851. So the additional memory necessary because of
> >> code duplication wouldn't matter in practice.
> >
> > I have a board with both options populated on my desk, sorry.
>
> Making the common part a separate module shouldn't be that hard.
> AFAICS it would just take:
> - export 4 functions from common
> - extend Kconfig
> - extend Makefile
> One similar configuration that comes to my mind and could be used as
> template is SPI_FSL_LIB.

There is no need to even change Kconfig, just simplify the Makefile to:

obj-$(CONFIG_KS8851) += ks8851_common.o ks8851_spi.o
obj-$(CONFIG_KS8851_MLL) += ks8851_common.o ks8851_par.o

This will do the right thing and build ks8851_common into vmlinux if at least one of the two front-ends is built-in, and otherwise build it as a loadable module if there is another module using it.

Arnd
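If the separate-module route were chosen instead, the common part would only need to export its handful of entry points. A sketch (assuming probe/remove/suspend/resume are the four functions Heiner counts; the exact names would follow whatever the common code actually exposes):

/* ks8851_common.c: sketch of the exports a standalone common module
 * would need so the SPI and parallel front-ends can link against it. */
EXPORT_SYMBOL_GPL(ks8851_probe_common);
EXPORT_SYMBOL_GPL(ks8851_remove_common);
EXPORT_SYMBOL_GPL(ks8851_suspend);
EXPORT_SYMBOL_GPL(ks8851_resume);

Either way, the front-ends stay thin and the register-access logic lives in one place.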
[PATCH net-next] net: core: Namespace-ify sysctl_wmem_default and sysctl_rmem_default
From: Menglong Dong

For now, sysctl_wmem_default and sysctl_rmem_default are globally unified. That's not convenient in some cases; for example, when using docker and trying to control the default udp socket receive buffer for each container. For that reason, make sysctl_wmem_default and sysctl_rmem_default per-namespace.

Signed-off-by: Menglong Dong
---
 include/net/netns/core.h   |  2 ++
 include/net/sock.h         |  3 ---
 net/core/net_namespace.c   |  2 ++
 net/core/sock.c            |  6 ++
 net/core/sysctl_net_core.c | 32
 net/ipv4/ip_output.c       |  2 +-
 6 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 36c2d998a43c..317b47df6d08 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -9,6 +9,8 @@ struct netns_core {
 	/* core sysctls */
 	struct ctl_table_header	*sysctl_hdr;
 
+	int	sysctl_wmem_default;
+	int	sysctl_rmem_default;
 	int	sysctl_somaxconn;
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/sock.h b/include/net/sock.h
index bdc4323ce53c..b846a6d24459 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2653,9 +2653,6 @@ extern __u32 sysctl_rmem_max;
 extern int sysctl_tstamp_allow_data;
 extern int sysctl_optmem_max;
 
-extern __u32 sysctl_wmem_default;
-extern __u32 sysctl_rmem_default;
-
 DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
 
 static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 2ef3b4557f40..eb4ea99131d6 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -374,6 +374,8 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 
 static int __net_init net_defaults_init_net(struct net *net)
 {
+	net->core.sysctl_rmem_default = SK_RMEM_MAX;
+	net->core.sysctl_wmem_default = SK_WMEM_MAX;
 	net->core.sysctl_somaxconn = SOMAXCONN;
 	return 0;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index bbcd4b97eddd..2421e4ea1915 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -270,8 +270,6 @@ __u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
 EXPORT_SYMBOL(sysctl_wmem_max);
 __u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
 EXPORT_SYMBOL(sysctl_rmem_max);
-__u32 sysctl_wmem_default __read_mostly = SK_WMEM_MAX;
-__u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
@@ -2970,8 +2968,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	timer_setup(&sk->sk_timer, NULL, 0);
 
 	sk->sk_allocation	=	GFP_KERNEL;
-	sk->sk_rcvbuf		=	sysctl_rmem_default;
-	sk->sk_sndbuf		=	sysctl_wmem_default;
+	sk->sk_rcvbuf		=	sock_net(sk)->core.sysctl_rmem_default;
+	sk->sk_sndbuf		=	sock_net(sk)->core.sysctl_wmem_default;
 	sk->sk_state		=	TCP_CLOSE;
 	sk_set_socket(sk, sock);
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 966d976dee84..5c1c75e42a09 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -326,22 +326,6 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &min_rcvbuf,
 	},
-	{
-		.procname	= "wmem_default",
-		.data		= &sysctl_wmem_default,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &min_sndbuf,
-	},
-	{
-		.procname	= "rmem_default",
-		.data		= &sysctl_rmem_default,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &min_rcvbuf,
-	},
 	{
 		.procname	= "dev_weight",
 		.data		= &weight_p,
@@ -584,6 +568,22 @@ static struct ctl_table netns_core_table[] = {
 		.extra1		= SYSCTL_ZERO,
 		.proc_handler	= proc_dointvec_minmax
 	},
+	{
+		.procname	= "wmem_default",
+		.data		= &init_net.core.sysctl_wmem_default,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &min_sndbuf,
+	},
+	{
+		.procname	= "rmem_default",
+		.data		= &init_net.core.sysctl_rmem_default,
+		.maxlen		= sizeof(int),
+		.mode
[PATCH net-next] net: core: Namespace-ify sysctl_rmem_max and sysctl_wmem_max
From: Menglong Dong

For now, sysctl_wmem_max and sysctl_rmem_max are globally unified. That's not convenient in some cases; for example, when using docker and trying to control the maximum udp socket receive buffer for each container. For that reason, make sysctl_wmem_max and sysctl_rmem_max per-namespace.

Signed-off-by: Menglong Dong
---
 include/net/netns/core.h        |  2 ++
 include/net/sock.h              |  3 ---
 net/core/filter.c               |  4 ++--
 net/core/net_namespace.c        |  2 ++
 net/core/sock.c                 | 12
 net/core/sysctl_net_core.c      | 32
 net/ipv4/tcp_output.c           |  2 +-
 net/netfilter/ipvs/ip_vs_sync.c |  4 ++--
 8 files changed, 29 insertions(+), 32 deletions(-)

diff --git a/include/net/netns/core.h b/include/net/netns/core.h
index 317b47df6d08..b4aecac6e8ce 100644
--- a/include/net/netns/core.h
+++ b/include/net/netns/core.h
@@ -11,6 +11,8 @@ struct netns_core {
 
 	int	sysctl_wmem_default;
 	int	sysctl_rmem_default;
+	int	sysctl_wmem_max;
+	int	sysctl_rmem_max;
 	int	sysctl_somaxconn;
 
 #ifdef CONFIG_PROC_FS
diff --git a/include/net/sock.h b/include/net/sock.h
index b846a6d24459..f6b0f2c482ad 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2647,9 +2647,6 @@ void sk_get_meminfo(const struct sock *sk, u32 *meminfo);
 #define SK_WMEM_MAX	(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 #define SK_RMEM_MAX	(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 
-extern __u32 sysctl_wmem_max;
-extern __u32 sysctl_rmem_max;
-
 extern int sysctl_tstamp_allow_data;
 extern int sysctl_optmem_max;
diff --git a/net/core/filter.c b/net/core/filter.c
index 255aeee72402..3dca58f6c40c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4717,13 +4717,13 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		/* Only some socketops are supported */
 		switch (optname) {
 		case SO_RCVBUF:
-			val = min_t(u32, val, sysctl_rmem_max);
+			val = min_t(u32, val, sock_net(sk)->core.sysctl_rmem_max);
 			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 			WRITE_ONCE(sk->sk_rcvbuf,
 				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
 			break;
 		case SO_SNDBUF:
-			val = min_t(u32, val, sysctl_wmem_max);
+			val = min_t(u32, val, sock_net(sk)->core.sysctl_wmem_max);
 			sk->sk_userlocks |= SOCK_SNDBUF_LOCK;
 			WRITE_ONCE(sk->sk_sndbuf,
 				   max_t(int, val * 2, SOCK_MIN_SNDBUF));
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index eb4ea99131d6..552e3c5b2a41 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -376,6 +376,8 @@ static int __net_init net_defaults_init_net(struct net *net)
 {
 	net->core.sysctl_rmem_default = SK_RMEM_MAX;
 	net->core.sysctl_wmem_default = SK_WMEM_MAX;
+	net->core.sysctl_rmem_max = SK_RMEM_MAX;
+	net->core.sysctl_wmem_max = SK_WMEM_MAX;
 	net->core.sysctl_somaxconn = SOMAXCONN;
 	return 0;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 2421e4ea1915..eb7eaaa840ce 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -265,12 +265,6 @@ static struct lock_class_key af_wlock_keys[AF_MAX];
 static struct lock_class_key af_elock_keys[AF_MAX];
 static struct lock_class_key af_kern_callback_keys[AF_MAX];
 
-/* Run time adjustable parameters. */
-__u32 sysctl_wmem_max __read_mostly = SK_WMEM_MAX;
-EXPORT_SYMBOL(sysctl_wmem_max);
-__u32 sysctl_rmem_max __read_mostly = SK_RMEM_MAX;
-EXPORT_SYMBOL(sysctl_rmem_max);
-
 /* Maximal space eaten by iovec or ancillary data plus some space */
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
@@ -877,7 +871,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		val = min_t(u32, val, sysctl_wmem_max);
+		val = min_t(u32, val, sock_net(sk)->core.sysctl_wmem_max);
set_sndbuf:
 		/* Ensure val * 2 fits into an int, to prevent max_t()
 		 * from treating it as a negative value.
@@ -909,7 +903,9 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
+		__sock_set_rcvbuf(sk,
+				  min_t(u32, val,
+					sock_net(sk)->core.sysctl_rmem_max));
 		break;
 
 	case SO_RCVBUFFORCE:
diff -
Re: [PATCH net-next V2 1/8] net: netdevice: Add operation ndo_sk_get_slave
On 1/17/2021 4:51 AM, Jakub Kicinski wrote: On Thu, 14 Jan 2021 20:01:28 +0200 Tariq Toukan wrote: ndo_sk_get_slave returns the slave that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest slave netdevice. Please don't add new uses of the word "slave" outside of the bond, and preferably even there. I'll fix this.
Re: [PATCH net-next V2 5/8] net/bonding: Implement TLS TX device offload
On 1/17/2021 4:54 AM, Jakub Kicinski wrote:

On Thu, 14 Jan 2021 20:01:32 +0200 Tariq Toukan wrote:

As the bond interface is being bypassed by the TLS module, interacting directly against the slaves, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the current offload status based on the logic under bond_sk_check().

In that case why set it in ->hw_features? IIRC features set only in ->features but not ->hw_features show up to userspace as "fixed", which I gather is what we want here, no?

On one hand, by showing "off [Fixed]" we might hide the fact that the bond driver now does support the TLS offload feature; you simply need to choose the proper mode/xmit_policy. On the other hand, as toggling the feature flag has no impact at all, I don't see a point in opening it for toggling. So yeah, I'll fix.

+#if IS_ENABLED(CONFIG_TLS_DEVICE)
+	bond_dev->hw_features |= BOND_TLS_FEATURES;
+	if (bond_sk_check(bond))
+		bond_dev->features |= BOND_TLS_FEATURES;
+#endif
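To spell out the "fixed" behaviour under discussion: ethtool can only toggle bits that the driver advertises in dev->hw_features; a bit present in dev->features alone is reported as "[fixed]". A sketch of the two registration styles (an illustrative fragment, not any specific driver's code):

/* Toggleable: advertised in hw_features, so "ethtool -K" may flip it. */
dev->hw_features |= NETIF_F_HW_TLS_TX;
dev->features    |= NETIF_F_HW_TLS_TX;	/* current state: on */

/* Driver-controlled only: not in hw_features, shown as "on [fixed]";
 * only driver code such as ndo_fix_features() can change it, which is
 * what the bonding series ends up relying on. */
dev->features |= NETIF_F_HW_TLS_TX;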
Re: [PATCH v2 net-next 01/14] net: mscc: ocelot: allow offloading of bridge on top of LAG
On Sat, Jan 16, 2021 at 05:26:23PM -0800, Jakub Kicinski wrote: > On Sat, 16 Jan 2021 02:59:30 +0200 Vladimir Oltean wrote: > > From: Vladimir Oltean > > > > Commit 7afb3e575e5a ("net: mscc: ocelot: don't handle netdev events for > > other netdevs") was too aggressive, and it made ocelot_netdevice_event > > react only to network interface events emitted for the ocelot switch > > ports. > > > > In fact, only the PRECHANGEUPPER should have had that check. > > > > When we ignore all events that are not for us, we miss the fact that the > > upper of the LAG changes, and the bonding interface gets enslaved to a > > bridge. This is an operation we could offload under certain conditions. > > I see the commit in question is in net, perhaps worth spelling out why > this is not a fix? Perhaps add some "in the future" to the last > sentence if it's the case that this will only matter with the following > patches applied? It is a fix. However, so is patch 13/14 "net: mscc: ocelot: rebalance LAGs on link up/down events", but I didn't see an easy way to backport that. Honestly the reasons why I did not attempt to split this series into a part for "net" and one for "net-next" are: (a) It would unnecessarily complicate my work for felix DSA, where this is considered a new feature as opposed to ocelot switchdev where it was supposedly already working (although.. not quite, due to the lack of rebalancing, a link down would throw off the LAG). I don't really think that anybody was seriously using LAG offload on ocelot so far. (b) Even if I were to split this patch, it can only be trivially backported as far as commit 9c90eea310f8 ("net: mscc: ocelot: move net_device related functions to ocelot_net.c") from June 2020 anyway. (c) I cannot test the mscc_ocelot.ko switchdev driver with traffic, since I don't have the hardware (I just have a local patch that I keep rebasing on top of net-next which makes me able to at least probe it and access its registers on a different switch revision, but the traffic I/O procedure there is completely different). So I can not really confirm what is the state I'm leaving the mscc_ocelot driver in, for stable kernels. At least now, I've made the entry points into the control code path very similar to those of DSA, and I've exercised the switchdev driver in blind (without traffic), so I have a bit more confidence that it should work. (d) Had the AUTOSEL guys picked up this patch, I would have probably had no objection (since my belief is that there's nothing to break and nothing to fix in stable kernels). That being said, if we want to engage in a rigid demonstration of procedures, sure we can do that. I have other patches anyway to fill the pipeline until "net" is merged back into "net-next" :)
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Jakub,

On 17/01/2021 01:46, Jakub Kicinski wrote:

On Sat, 9 Jan 2021 23:00:21 -0800 Pravin B Shelar wrote:

Following patch add support for flow based tunneling API to send and recv GTP tunnel packet over tunnel metadata API. This would allow this device integration with OVS or eBPF using flow based tunneling APIs.

Signed-off-by: Pravin B Shelar

Applied, thanks!

This patch hasn't received any ACKs from either the maintainers or anyone else providing review. The following issues remain unaddressed after review:

i) the patch contains several logically separate changes that would be better served as smaller patches
ii) functionality like the handling of end markers has been introduced without further explanation
iii) symmetry between the handling of GTPv0 and GTPv1 has been unnecessarily broken
iv) there are no available userspace tools to allow for testing this functionality

I have requested that this patch be reworked into a series of smaller changes. That would allow:

i) reasonable review
ii) the possibility to explain _why_ things are being done in the patch comment where this isn't obvious (like the handling of end markers)
iii) the chance to do a reasonable rebase of other ongoing work onto this patch (series): this one patch is invasive and difficult to rebase onto

I'm not sure what the hurry is to get this patch into mainline. Large and complicated patches like this take time to review; please revert this and allow that process to happen.

Thanks,
Jonas
[PATCH v1] ipv4: add ipv4_is_multicast() check in ip_mc_leave_group()
From: Yingjie Wang

There is no ipv4_is_multicast() check in ip_mc_leave_group() to verify that imr->imr_multiaddr.s_addr is a multicast address; if it is not, an error may result. Some callers of ip_mc_leave_group(), such as do_ip_setsockopt(), don't check whether the address is multicast before calling it. So I suggest adding an ipv4_is_multicast() check to ip_mc_leave_group() to prevent this from happening.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Yingjie Wang
---
 net/ipv4/igmp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 7b272bbed2b4..1b6f91271cfd 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -2248,6 +2248,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 	u32 ifindex;
 	int ret = -EADDRNOTAVAIL;
 
+	if (!ipv4_is_multicast(group))
+		return -EINVAL;
+
 	ASSERT_RTNL();
 
 	in_dev = ip_mc_find_dev(net, imr);
--
2.7.4
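For context, ipv4_is_multicast() is a cheap range check on the class D block (224.0.0.0/4), defined in include/linux/in.h along these lines:

/* True when addr falls within 224.0.0.0/4, i.e. the top nibble is 0xe. */
static inline bool ipv4_is_multicast(__be32 addr)
{
	return (addr & htonl(0xf0000000)) == htonl(0xe0000000);
}

so the added check rejects, for example, a unicast address passed in via IP_DROP_MEMBERSHIP before any interface lookup is attempted.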
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Pravin, Again, this patch is too big and contains too many disparate changes to allow for easy review. Please break this up into a series of logical changes for easier review. On 10/01/2021 08:00, Pravin B Shelar wrote: Following patch add support for flow based tunneling API to send and recv GTP tunnel packet over tunnel metadata API. This would allow this device integration with OVS or eBPF using flow based tunneling APIs. Signed-off-by: Pravin B Shelar --- v4-v5: - coding style changes v3-v4: - add check for non-zero dst port v2-v3: - Fixed coding style - changed IFLA_GTP_FD1 to optional param for LWT dev. v1-v2: - Fixed according to comments from Jonas Bonn drivers/net/gtp.c | 527 + include/uapi/linux/gtp.h | 12 + include/uapi/linux/if_link.h | 1 + include/uapi/linux/if_tunnel.h | 1 + tools/include/uapi/linux/if_link.h | 1 + 5 files changed, 398 insertions(+), 144 deletions(-) diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c index 4c04e271f184..851364314ecc 100644 --- a/drivers/net/gtp.c +++ b/drivers/net/gtp.c @@ -21,6 +21,7 @@ #include #include +#include #include #include #include @@ -73,6 +74,9 @@ struct gtp_dev { unsigned inthash_size; struct hlist_head *tid_hash; struct hlist_head *addr_hash; + /* Used by LWT tunnel. */ + boolcollect_md; + struct socket *collect_md_sock; }; Can't you use the 'sock' member of sk1u as the reference this socket? static unsigned int gtp_net_id __read_mostly; @@ -179,33 +183,121 @@ static bool gtp_check_ms(struct sk_buff *skb, struct pdp_ctx *pctx, return false; } -static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, - unsigned int hdrlen, unsigned int role) +static int gtp_set_tun_dst(struct gtp_dev *gtp, struct sk_buff *skb, + unsigned int hdrlen, u8 gtp_version, + __be64 tid, u8 flags) { - if (!gtp_check_ms(skb, pctx, hdrlen, role)) { - netdev_dbg(pctx->dev, "No PDP ctx for this MS\n"); - return 1; + struct metadata_dst *tun_dst; + int opts_len = 0; + + if (unlikely(flags & GTP1_F_MASK)) + opts_len = sizeof(struct gtpu_metadata); + + tun_dst = udp_tun_rx_dst(skb, gtp->sk1u->sk_family, TUNNEL_KEY, tid, opts_len); + if (!tun_dst) { + netdev_dbg(gtp->dev, "Failed to allocate tun_dst"); + goto err; } + netdev_dbg(gtp->dev, "attaching metadata_dst to skb, gtp ver %d hdrlen %d\n", + gtp_version, hdrlen); + if (unlikely(opts_len)) { + struct gtpu_metadata *opts; + struct gtp1_header *gtp1; + + opts = ip_tunnel_info_opts(&tun_dst->u.tun_info); + gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct udphdr)); + opts->ver = GTP_METADATA_V1; + opts->flags = gtp1->flags; + opts->type = gtp1->type; + netdev_dbg(gtp->dev, "recved control pkt: flag %x type: %d\n", + opts->flags, opts->type); + tun_dst->u.tun_info.key.tun_flags |= TUNNEL_GTPU_OPT; + tun_dst->u.tun_info.options_len = opts_len; + skb->protocol = htons(0x); /* Unknown */ + } /* Get rid of the GTP + UDP headers. 
*/ if (iptunnel_pull_header(skb, hdrlen, skb->protocol, -!net_eq(sock_net(pctx->sk), dev_net(pctx->dev - return -1; +!net_eq(sock_net(gtp->sk1u), dev_net(gtp->dev { + gtp->dev->stats.rx_length_errors++; + goto err; + } + + skb_dst_set(skb, &tun_dst->dst); + return 0; +err: + return -1; +} + +static int gtp_rx(struct gtp_dev *gtp, struct sk_buff *skb, + unsigned int hdrlen, u8 gtp_version, unsigned int role, + __be64 tid, u8 flags, u8 type) +{ + if (ip_tunnel_collect_metadata() || gtp->collect_md) { + int err; + + err = gtp_set_tun_dst(gtp, skb, hdrlen, gtp_version, tid, flags); + if (err) + goto err; + } else { + struct pdp_ctx *pctx; + + if (flags & GTP1_F_MASK) + hdrlen += 4; - netdev_dbg(pctx->dev, "forwarding packet from GGSN to uplink\n"); + if (type != GTP_TPDU) + return 1; + + if (gtp_version == GTP_V0) + pctx = gtp0_pdp_find(gtp, be64_to_cpu(tid)); + else + pctx = gtp1_pdp_find(gtp, be64_to_cpu(tid)); + if (!pctx) { + netdev_dbg(gtp->dev, "No PDP ctx to decap skb=%p\n", skb); +
Re: [PATCH net-next 1/1] stmmac: intel: change all EHL/TGL to auto detect phy addr
On Sat, Jan 16, 2021 at 04:59:14PM -0800, Jakub Kicinski wrote: > On Sat, 16 Jan 2021 10:12:21 +0100 Jan Kiszka wrote: > > On 06.11.20 10:43, Wong Vee Khee wrote: > > > From: Voon Weifeng > > > > > > Set all EHL/TGL phy_addr to -1 so that the driver will automatically > > > detect it at run-time by probing all the possible 32 addresses. > > > > > > Signed-off-by: Voon Weifeng > > > Signed-off-by: Wong Vee Khee > > > > This fixes PHY detection on one of our EHL-based boards. Can this also > > be applied to stable 5.10? > > Sure. > > Greg, we'd like to request a backport of the following commit to 5.10. > > commit bff6f1db91e330d7fba56f815cdbc412c75fe163 > Author: Voon Weifeng > Date: Fri Nov 6 17:43:41 2020 +0800 > > stmmac: intel: change all EHL/TGL to auto detect phy addr > > Set all EHL/TGL phy_addr to -1 so that the driver will automatically > detect it at run-time by probing all the possible 32 addresses. > > Signed-off-by: Voon Weifeng > Signed-off-by: Wong Vee Khee > Link: > https://lore.kernel.org/r/20201106094341.4241-1-vee.khee.w...@intel.com > Signed-off-by: Jakub Kicinski > > > It's relatively small, and Jan reports it makes his boards detect the > PHY. The change went in via -next and into Linus's tree during the 5.11 > merge window. Now queued up, thanks. greg k-h
[PATCH net-next V3 5/8] net/bonding: Implement TLS TX device offload
Implement TLS TX device offload for bonding interfaces. This allows kTLS sockets running on a bond to benefit from the device offload on capable lower devices. To allow a simple and fast maintenance of the TLS context in SW and lower devices, we bind the TLS socket to a specific lower dev. To achieve a behavior similar to SW kTLS, we support only balance-xor and 802.3ad modes, with xmit_hash_policy=layer3+4. This is enforced in bond_sk_check(), done in a previous patch. For the above configuration, the SW implementation keeps picking the same exact lower dev for all the socket's SKBs. The device offload behaves similarly, making the decision once at the connection creation. Per socket, the TLS module should work directly with the lowest netdev in chain, to call the tls_dev_ops operations. As the bond interface is being bypassed by the TLS module, interacting directly against the lower devs, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the current offload status based on the logic under bond_sk_check(). Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_main.c| 29 + drivers/net/bonding/bond_options.c | 27 +-- include/net/bonding.h | 2 ++ 3 files changed, 56 insertions(+), 2 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 09524f99c753..539c6bc218df 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -83,6 +83,9 @@ #include #include #include +#if IS_ENABLED(CONFIG_TLS_DEVICE) +#include +#endif #include "bonding_priv.h" @@ -1225,6 +1228,13 @@ static netdev_features_t bond_fix_features(struct net_device *dev, netdev_features_t mask; struct slave *slave; +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (bond_sk_check(bond)) + features |= BOND_TLS_FEATURES; + else + features &= ~BOND_TLS_FEATURES; +#endif + mask = features; features &= ~NETIF_F_ONE_FOR_ALL; @@ -4647,6 +4657,16 @@ static struct net_device *bond_sk_get_lower_dev(struct net_device *dev, return lower; } +#if IS_ENABLED(CONFIG_TLS_DEVICE) +static netdev_tx_t bond_tls_device_xmit(struct bonding *bond, struct sk_buff *skb, + struct net_device *dev) +{ + if (likely(bond_get_slave_by_dev(bond, tls_get_ctx(skb->sk)->netdev))) + return bond_dev_queue_xmit(bond, skb, tls_get_ctx(skb->sk)->netdev); + return bond_tx_drop(dev, skb); +} +#endif + static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct bonding *bond = netdev_priv(dev); @@ -4655,6 +4675,11 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev !bond_slave_override(bond, skb)) return NETDEV_TX_OK; +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (skb->sk && tls_is_sk_tx_device_offloaded(skb->sk)) + return bond_tls_device_xmit(bond, skb, dev); +#endif + switch (BOND_MODE(bond)) { case BOND_MODE_ROUNDROBIN: return bond_xmit_roundrobin(skb, dev); @@ -4855,6 +4880,10 @@ void bond_setup(struct net_device *bond_dev) if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) bond_dev->features |= BOND_XFRM_FEATURES; #endif /* CONFIG_XFRM_OFFLOAD */ +#if IS_ENABLED(CONFIG_TLS_DEVICE) + if (bond_sk_check(bond)) + bond_dev->features |= BOND_TLS_FEATURES; +#endif } /* Destroy a bonding device. 
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 7f0ad97926de..8fcbf7f9c7b2 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -758,6 +758,19 @@ static bool bond_set_xfrm_features(struct bonding *bond) return true; } +static bool bond_set_tls_features(struct bonding *bond) +{ + if (!IS_ENABLED(CONFIG_TLS_DEVICE)) + return false; + + if (bond_sk_check(bond)) + bond->dev->wanted_features |= BOND_TLS_FEATURES; + else + bond->dev->wanted_features &= ~BOND_TLS_FEATURES; + + return true; +} + static int bond_option_mode_set(struct bonding *bond, const struct bond_opt_value *newval) { @@ -784,9 +797,15 @@ static int bond_option_mode_set(struct bonding *bond, bond->params.arp_validate = BOND_ARP_VALIDATE_NONE; bond->params.mode = newval->value; - if (bond->dev->reg_state == NETREG_REGISTERED) - if (bond_set_xfrm_features(bond)) + if (bond->dev->reg_state == NETREG_REGISTERED) { + bool update = false; + + update |= bond_set_xfrm_features(bond); + update |= bond_set_tl
[PATCH net-next V3 8/8] net/tls: Except bond interface from some TLS checks
In the tls_dev_event handler, ignore the tlsdev_ops requirement for bond interfaces; these ops do not exist there, as the interaction is done directly with the lower device. Also, make the validate function pass when it is called with the upper bond interface.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 net/tls/tls_device.c          | 2 ++
 net/tls/tls_device_fallback.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index 75ceea0a41bf..d9cd229aa111 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -1329,6 +1329,8 @@ static int tls_dev_event(struct notifier_block *this, unsigned long event,
 	switch (event) {
 	case NETDEV_REGISTER:
 	case NETDEV_FEAT_CHANGE:
+		if (netif_is_bond_master(dev))
+			return NOTIFY_DONE;
 		if ((dev->features & NETIF_F_HW_TLS_RX) &&
 		    !dev->tlsdev_ops->tls_dev_resync)
 			return NOTIFY_BAD;
diff --git a/net/tls/tls_device_fallback.c b/net/tls/tls_device_fallback.c
index d946817ed065..cacf040872c7 100644
--- a/net/tls/tls_device_fallback.c
+++ b/net/tls/tls_device_fallback.c
@@ -424,7 +424,7 @@ struct sk_buff *tls_validate_xmit_skb(struct sock *sk,
 				      struct net_device *dev,
 				      struct sk_buff *skb)
 {
-	if (dev == tls_get_ctx(sk)->netdev)
+	if (dev == tls_get_ctx(sk)->netdev || netif_is_bond_master(dev))
 		return skb;
 
 	return tls_sw_fallback(sk, skb);
--
2.21.0
[PATCH net-next V3 7/8] net/tls: Device offload to use lowest netdevice in chain
Do not call the tls_dev_ops of upper devices. Instead, ask them for the proper lowest device and communicate with it directly. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- net/tls/tls_device.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index f7fb7d2c1de1..75ceea0a41bf 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -113,7 +113,7 @@ static struct net_device *get_netdev_for_sock(struct sock *sk) struct net_device *netdev = NULL; if (likely(dst)) { - netdev = dst->dev; + netdev = netdev_sk_get_lowest_dev(dst->dev, sk); dev_hold(netdev); } -- 2.21.0
[PATCH net-next V3 6/8] net/bonding: Declare TLS RX device offload support
Following the description in the previous patch (for TX):

As the bond interface is being bypassed by the TLS module, interacting directly against the lower devs, there is no way for the bond interface to disable its device offload capabilities, as long as the mode/policy config allows it. Hence, the feature flag is not directly controllable, but just reflects the offload status based on the logic under bond_sk_check().

Here we just declare RX device offload support, and expose it via the NETIF_F_HW_TLS_RX flag.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 include/net/bonding.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/bonding.h b/include/net/bonding.h
index 97fbec02df2d..019e998d944a 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -89,7 +89,7 @@
 #define BOND_XFRM_FEATURES (NETIF_F_HW_ESP | NETIF_F_HW_ESP_TX_CSUM | \
 			    NETIF_F_GSO_ESP)
 
-#define BOND_TLS_FEATURES (NETIF_F_HW_TLS_TX)
+#define BOND_TLS_FEATURES (NETIF_F_HW_TLS_TX | NETIF_F_HW_TLS_RX)
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
 extern atomic_t netpoll_block_tx;
--
2.21.0
[PATCH net-next V3 4/8] net/bonding: Take update_features call out of XFRM function
In preparation for more cases that call netdev_update_features(). While here, move the features logic to the stage where struct bond is already updated, and pass it as the only parameter to function bond_set_xfrm_features(). Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_options.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index a4e4e15f574d..7f0ad97926de 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -745,17 +745,17 @@ const struct bond_option *bond_opt_get(unsigned int option) return &bond_opts[option]; } -static void bond_set_xfrm_features(struct net_device *bond_dev, u64 mode) +static bool bond_set_xfrm_features(struct bonding *bond) { if (!IS_ENABLED(CONFIG_XFRM_OFFLOAD)) - return; + return false; - if (mode == BOND_MODE_ACTIVEBACKUP) - bond_dev->wanted_features |= BOND_XFRM_FEATURES; + if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP) + bond->dev->wanted_features |= BOND_XFRM_FEATURES; else - bond_dev->wanted_features &= ~BOND_XFRM_FEATURES; + bond->dev->wanted_features &= ~BOND_XFRM_FEATURES; - netdev_update_features(bond_dev); + return true; } static int bond_option_mode_set(struct bonding *bond, @@ -780,13 +780,14 @@ static int bond_option_mode_set(struct bonding *bond, if (newval->value == BOND_MODE_ALB) bond->params.tlb_dynamic_lb = 1; - if (bond->dev->reg_state == NETREG_REGISTERED) - bond_set_xfrm_features(bond->dev, newval->value); - /* don't cache arp_validate between modes */ bond->params.arp_validate = BOND_ARP_VALIDATE_NONE; bond->params.mode = newval->value; + if (bond->dev->reg_state == NETREG_REGISTERED) + if (bond_set_xfrm_features(bond)) + netdev_update_features(bond->dev); + return 0; } -- 2.21.0
[PATCH net-next V3 2/8] net/bonding: Take IP hash logic into a helper
Hash logic on L3 will be used in a downstream patch for one more use case. Move it into a helper for better code reuse.

Signed-off-by: Tariq Toukan
Reviewed-by: Boris Pismenny
---
 drivers/net/bonding/bond_main.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index ad5192ee1845..759ad22b7279 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3541,6 +3541,16 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	return true;
 }
 
+static u32 bond_ip_hash(u32 hash, struct flow_keys *flow)
+{
+	hash ^= (__force u32)flow_get_u32_dst(flow) ^
+		(__force u32)flow_get_u32_src(flow);
+	hash ^= (hash >> 16);
+	hash ^= (hash >> 8);
+	/* discard lowest hash bit to deal with the common even ports pattern */
+	return hash >> 1;
+}
+
 /**
  * bond_xmit_hash - generate a hash value based on the xmit policy
  * @bond: bonding device
@@ -3571,12 +3581,8 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 		else
 			memcpy(&hash, &flow.ports.ports, sizeof(hash));
 	}
-	hash ^= (__force u32)flow_get_u32_dst(&flow) ^
-		(__force u32)flow_get_u32_src(&flow);
-	hash ^= (hash >> 16);
-	hash ^= (hash >> 8);
 
-	return hash >> 1;
+	return bond_ip_hash(hash, &flow);
 }
 
 /*---------------- Device entry points ----------------*/
--
2.21.0
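As a quick sanity check of what the helper computes, here is a stand-alone rendition (user-space C with illustrative addresses and port seed; not the driver code itself, which takes a struct flow_keys):

#include <stdint.h>
#include <stdio.h>

/* Stand-alone rendition of bond_ip_hash(): XOR the L3 addresses into
 * the L4 seed, fold the high bits down, and drop the lowest bit so the
 * common even/odd port pattern does not bias slave selection. */
static uint32_t bond_ip_hash(uint32_t hash, uint32_t src, uint32_t dst)
{
	hash ^= dst ^ src;
	hash ^= (hash >> 16);
	hash ^= (hash >> 8);
	return hash >> 1;
}

int main(void)
{
	uint32_t ports = 0x1f90c350;	/* sport/dport packed as the seed */
	uint32_t h = bond_ip_hash(ports, 0xc0000201u /* 192.0.2.1 */,
				  0xc6336407u /* 198.51.100.7 */);

	printf("hash %#x -> slave %u of 4\n", (unsigned)h, (unsigned)(h % 4));
	return 0;
}

The same fold is reused by the socket-based hashing added later in the series, which is the point of extracting it.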
[PATCH net-next V3 3/8] net/bonding: Implement ndo_sk_get_lower_dev
Add ndo_sk_get_lower_dev() implementation for bond interfaces. Support only for the cases where the socket's and SKBs' hash yields identical value for the whole connection lifetime. Here we restrict it to L3+4 sockets only, with xmit_hash_policy==LAYER34 and bond modes xor/802.3ad. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- drivers/net/bonding/bond_main.c | 93 + include/net/bonding.h | 2 + 2 files changed, 95 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 759ad22b7279..09524f99c753 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -301,6 +301,19 @@ netdev_tx_t bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb, return dev_queue_xmit(skb); } +bool bond_sk_check(struct bonding *bond) +{ + switch (BOND_MODE(bond)) { + case BOND_MODE_8023AD: + case BOND_MODE_XOR: + if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34) + return true; + fallthrough; + default: + return false; + } +} + /*-- VLAN ---*/ /* In the following 2 functions, bond_vlan_rx_add_vid and bond_vlan_rx_kill_vid, @@ -4555,6 +4568,85 @@ static struct net_device *bond_xmit_get_slave(struct net_device *master_dev, return NULL; } +static void bond_sk_to_flow(struct sock *sk, struct flow_keys *flow) +{ + switch (sk->sk_family) { +#if IS_ENABLED(CONFIG_IPV6) + case AF_INET6: + if (sk->sk_ipv6only || + ipv6_addr_type(&sk->sk_v6_daddr) != IPV6_ADDR_MAPPED) { + flow->control.addr_type = FLOW_DISSECTOR_KEY_IPV6_ADDRS; + flow->addrs.v6addrs.src = inet6_sk(sk)->saddr; + flow->addrs.v6addrs.dst = sk->sk_v6_daddr; + break; + } + fallthrough; +#endif + default: /* AF_INET */ + flow->control.addr_type = FLOW_DISSECTOR_KEY_IPV4_ADDRS; + flow->addrs.v4addrs.src = inet_sk(sk)->inet_rcv_saddr; + flow->addrs.v4addrs.dst = inet_sk(sk)->inet_daddr; + break; + } + + flow->ports.src = inet_sk(sk)->inet_sport; + flow->ports.dst = inet_sk(sk)->inet_dport; +} + +/** + * bond_sk_hash_l34 - generate a hash value based on the socket's L3 and L4 fields + * @sk: socket to use for headers + * + * This function will extract the necessary field from the socket and use + * them to generate a hash based on the LAYER34 xmit_policy. + * Assumes that sk is a TCP or UDP socket. + */ +static u32 bond_sk_hash_l34(struct sock *sk) +{ + struct flow_keys flow; + u32 hash; + + bond_sk_to_flow(sk, &flow); + + /* L4 */ + memcpy(&hash, &flow.ports.ports, sizeof(hash)); + /* L3 */ + return bond_ip_hash(hash, &flow); +} + +static struct net_device *__bond_sk_get_lower_dev(struct bonding *bond, + struct sock *sk) +{ + struct bond_up_slave *slaves; + struct slave *slave; + unsigned int count; + u32 hash; + + slaves = rcu_dereference(bond->usable_slaves); + count = slaves ? 
READ_ONCE(slaves->count) : 0; + if (unlikely(!count)) + return NULL; + + hash = bond_sk_hash_l34(sk); + slave = slaves->arr[hash % count]; + + return slave->dev; +} + +static struct net_device *bond_sk_get_lower_dev(struct net_device *dev, + struct sock *sk) +{ + struct bonding *bond = netdev_priv(dev); + struct net_device *lower = NULL; + + rcu_read_lock(); + if (bond_sk_check(bond)) + lower = __bond_sk_get_lower_dev(bond, sk); + rcu_read_unlock(); + + return lower; +} + static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct bonding *bond = netdev_priv(dev); @@ -4691,6 +4783,7 @@ static const struct net_device_ops bond_netdev_ops = { .ndo_fix_features = bond_fix_features, .ndo_features_check = passthru_features_check, .ndo_get_xmit_slave = bond_xmit_get_slave, + .ndo_sk_get_lower_dev = bond_sk_get_lower_dev, }; static const struct device_type bond_type = { diff --git a/include/net/bonding.h b/include/net/bonding.h index adc3da776970..21497193c4a4 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -265,6 +265,8 @@ struct bond_vlan_tag { unsigned short vlan_id; }; +bool bond_sk_check(struct bonding *bond); + /** * Returns NULL if the net_device does not belong to any of the bond's slaves * -- 2.21.0
[PATCH net-next V3 0/8] TLS device offload for Bond
Hi,

This series opens TX and RX TLS device offload for bond interfaces. This allows bond interfaces to benefit from capable lower devices.

We add a new ndo_sk_get_lower_dev() to be used to get the lower dev that corresponds to a given socket. The TLS module uses it to interact directly with the lowest device in the chain, and invoke the control operations in tlsdev_ops. This means that the bond interface doesn't have its own struct tlsdev_ops instance and derived logic/callbacks.

To keep simple track of the HW and SW TLS contexts, we bind each socket to a specific lower device for the socket's whole lifetime. This is logically valid (and similar to the SW kTLS behavior) in the following bond configuration, so we restrict the offload support to it: ((mode == balance-xor) or (mode == 802.3ad)) and xmit_hash_policy == layer3+4.

In this design, the TLS TX/RX offload feature flags of the bond device are independent from the lower devices. They reflect the current feature state, but are not directly controllable. This is because the bond driver is bypassed by the call to ndo_sk_get_lower_dev(), without it knowing who the caller is. The bond TLS feature flags are set/cleared only according to the configuration of the mode and xmit_hash_policy.

The bypass applies only to the control flow. Packets in the fast path still go through the bond logic.

The design here differs from the xfrm/ipsec offload, where the bond driver has its own copy of struct xfrmdev_ops and callbacks.

Regards,
Tariq

V3:
- Use "lower device" instead of "slave".
- Make TLS TX/RX device offload feature flags non-controllable [Fixed].

V2:
- Declare RX support.
- Enhance the feature flags logic.
- Slight modifications for bond_set_xfrm_features().

RFC:
- New design for the tlsdev_ops calls, introduce and use ndo_sk_get_slave() to interact directly with the slave netdev.
- Remove bond copy of tlsdev_ops callbacks.
- In TLS module: Use netdev_sk_get_lowest_dev(), give exceptions to some checks to allow bond support.

Tariq Toukan (8):
  net: netdevice: Add operation ndo_sk_get_lower_dev
  net/bonding: Take IP hash logic into a helper
  net/bonding: Implement ndo_sk_get_lower_dev
  net/bonding: Take update_features call out of XFRM function
  net/bonding: Implement TLS TX device offload
  net/bonding: Declare TLS RX device offload support
  net/tls: Device offload to use lowest netdevice in chain
  net/tls: Except bond interface from some TLS checks

 drivers/net/bonding/bond_main.c    | 138 +++++++++++++++++++++++++++--
 drivers/net/bonding/bond_options.c |  42 +++++++--
 include/linux/netdevice.h          |   4 +
 include/net/bonding.h              |   4 +
 net/core/dev.c                     |  33 +++++++
 net/tls/tls_device.c               |   4 +-
 net/tls/tls_device_fallback.c      |   2 +-
 7 files changed, 211 insertions(+), 16 deletions(-)

--
2.21.0
[PATCH net-next V3 1/8] net: netdevice: Add operation ndo_sk_get_lower_dev
ndo_sk_get_lower_dev returns the lower netdev that corresponds to a given socket. Additionally, we implement a helper netdev_sk_get_lowest_dev() to get the lowest one in chain. Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- include/linux/netdevice.h | 4 net/core/dev.c| 33 + 2 files changed, 37 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 5b949076ed23..02dcef4d66e2 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1398,6 +1398,8 @@ struct net_device_ops { struct net_device* (*ndo_get_xmit_slave)(struct net_device *dev, struct sk_buff *skb, bool all_slaves); + struct net_device* (*ndo_sk_get_lower_dev)(struct net_device *dev, + struct sock *sk); netdev_features_t (*ndo_fix_features)(struct net_device *dev, netdev_features_t features); int (*ndo_set_features)(struct net_device *dev, @@ -2858,6 +2860,8 @@ int init_dummy_netdev(struct net_device *dev); struct net_device *netdev_get_xmit_slave(struct net_device *dev, struct sk_buff *skb, bool all_slaves); +struct net_device *netdev_sk_get_lowest_dev(struct net_device *dev, + struct sock *sk); struct net_device *dev_get_by_index(struct net *net, int ifindex); struct net_device *__dev_get_by_index(struct net *net, int ifindex); struct net_device *dev_get_by_index_rcu(struct net *net, int ifindex); diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1ae192..6b90520a01b1 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -8105,6 +8105,39 @@ struct net_device *netdev_get_xmit_slave(struct net_device *dev, } EXPORT_SYMBOL(netdev_get_xmit_slave); +static struct net_device *netdev_sk_get_lower_dev(struct net_device *dev, + struct sock *sk) +{ + const struct net_device_ops *ops = dev->netdev_ops; + + if (!ops->ndo_sk_get_lower_dev) + return NULL; + return ops->ndo_sk_get_lower_dev(dev, sk); +} + +/** + * netdev_sk_get_lowest_dev - Get the lowest device in chain given device and socket + * @dev: device + * @sk: the socket + * + * %NULL is returned if no lower device is found. + */ + +struct net_device *netdev_sk_get_lowest_dev(struct net_device *dev, + struct sock *sk) +{ + struct net_device *lower; + + lower = netdev_sk_get_lower_dev(dev, sk); + while (lower) { + dev = lower; + lower = netdev_sk_get_lower_dev(dev, sk); + } + + return dev; +} +EXPORT_SYMBOL(netdev_sk_get_lowest_dev); + static void netdev_adjacent_add_links(struct net_device *dev) { struct netdev_adjacent *iter; -- 2.21.0
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > For additional robustness in the face of Hyper-V errors or malicious > > > behavior, validate all values that originate from packets that Hyper-V > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > off the end of an array, or subvert an existing validation via integer > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > memory that has not been zeroed out. > > > > > > Reported-by: Juan Vazquez > > > Signed-off-by: Andrea Parri (Microsoft) > > > Cc: "David S. Miller" > > > Cc: Jakub Kicinski > > > Cc: Alexei Starovoitov > > > Cc: Daniel Borkmann > > > Cc: Andrii Nakryiko > > > Cc: Martin KaFai Lau > > > Cc: Song Liu > > > Cc: Yonghong Song > > > Cc: John Fastabend > > > Cc: KP Singh > > > Cc: netdev@vger.kernel.org > > > Cc: b...@vger.kernel.org > > > --- > > > Applies to 5.11-rc3 (and hyperv-next). > > > > So this is for hyperv-next or should we take it via netdev trees? > > No preference, either way is good for me. To be clear: There is no dependency on any patch in hyperv-next, right? That's my understanding, but I would like to confirm it. Wei. > > Thanks, > Andrea
[PATCH net] net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
With NETIF_F_HW_TLS_RX packets are decrypted in HW. This cannot be logically done when RXCSUM offload is off. Fixes: 14136564c8ee ("net: Add TLS RX offload feature") Signed-off-by: Tariq Toukan Reviewed-by: Boris Pismenny --- Documentation/networking/tls-offload.rst | 3 +++ net/core/dev.c | 5 + 2 files changed, 8 insertions(+) Hi, Please queue to -stable >= v4.19. Thanks, Tariq diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst index 9af3334d9ad0..5f0dea3d571e 100644 --- a/Documentation/networking/tls-offload.rst +++ b/Documentation/networking/tls-offload.rst @@ -534,3 +534,6 @@ offload. Hence, TLS TX device feature flag requires TX csum offload being set. Disabling the latter implies clearing the former. Disabling TX checksum offload should not affect old connections, and drivers should make sure checksum calculation does not break for them. +Similarly, device-offloaded TLS decryption implies doing RXCSUM. If the user +does not want to enable RX csum offload, TLS RX device feature is disabled +as well. diff --git a/net/core/dev.c b/net/core/dev.c index c360bb5367e2..a979b86dbacd 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -9672,6 +9672,11 @@ static netdev_features_t netdev_fix_features(struct net_device *dev, } } + if ((features & NETIF_F_HW_TLS_RX) && !(features & NETIF_F_RXCSUM)) { + netdev_dbg(dev, "Dropping TLS RX HW offload feature since no RXCSUM feature.\n"); + features &= ~NETIF_F_HW_TLS_RX; + } + return features; } -- 2.21.0
Re: [Patch net-next] net_sched: fix RTNL deadlock again caused by request_module()
On 2021-01-16 7:56 p.m., Cong Wang wrote:

From: Cong Wang

tcf_action_init_1() loads tc action modules automatically with request_module() after parsing the tc action names, and it drops the RTNL lock and re-holds it before and after request_module(). This causes a lot of trouble, as discovered by syzbot, because we can be in the middle of batch initializations when we create an array of tc actions.

One of the problems is deadlock:

CPU 0                                   CPU 1
rtnl_lock();
for (...) {
  tcf_action_init_1();
    -> rtnl_unlock();
    -> request_module();
                                        rtnl_lock();
                                        for (...) {
                                          tcf_action_init_1();
                                            -> tcf_idr_check_alloc();
                                            // Insert one action into idr,
                                            // but it is not committed until
                                            // tcf_idr_insert_many(), then drop
                                            // the RTNL lock in the _next_
                                            // iteration
                                            -> rtnl_unlock();
    -> rtnl_lock();
    -> a_o->init();
      -> tcf_idr_check_alloc();
      // Now waiting for the same index
      // to be committed
                                            -> request_module();
                                            -> rtnl_lock()
                                            // Now waiting for RTNL lock
                                        }
                                        rtnl_unlock();
}
rtnl_unlock();

This is not easy to solve; we can move the request_module() before this loop and pre-load all the modules we need for this netlink message, and then do the rest of the initializations. So the loop breaks down into two now:

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
	struct tc_action_ops *a_o;

	a_o = tc_action_load_ops(name, tb[i]...);
	ops[i - 1] = a_o;
}

for (i = 1; i <= TCA_ACT_MAX_PRIO && tb[i]; i++) {
	act = tcf_action_init_1(ops[i - 1]...);
}

Although this looks serious, it has only been reported by syzbot, so it seems hard for humans to trigger. And given the size of this patch, I'd suggest it go to net-next and not be backported to stable.

This patch has been tested by syzbot and tested with tdc.py by me.

LGTM. Initially I was worried about performance impact but I found nothing observable. We need to add a tdc test for batch (I can share how I did batch testing at the next meet).

Tested-by: Jamal Hadi Salim
Acked-by: Jamal Hadi Salim

cheers,
jamal
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
Hi Jonas, Jakub and others,

On Sun, Jan 17, 2021 at 02:23:52PM +0100, Jonas Bonn wrote:
> This patch hasn't received any ACKs from either the maintainers or anyone
> else providing review. The following issues remain unaddressed after
> review: [...]

Full ACK from my point of view. The patch is so massive that I, as the original co-author and co-maintainer of the GTP kernel module, have problems understanding what it is doing at all. Furthermore, I am actually wondering whether there is any commonality between the existing use cases and whatever the modified gtp.ko is trying to achieve, up to the point of questioning whether it makes sense to have both functionalities in the same driver/module at all.

> I'm not sure what the hurry is to get this patch into mainline. Large and
> complicated patches like this take time to review; please revert this and
> allow that process to happen.

Also acknowledged and supported from my side.

--
- Harald Welte          http://laforge.gnumonks.org/
"Privacy in residential applications is a desirable marketing option." (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH net-next 7/7] net: ipa: allow arbitrary number of interconnects
On 1/16/21 9:12 PM, Jakub Kicinski wrote: On Fri, 15 Jan 2021 06:50:50 -0600 Alex Elder wrote: Currently we assume that the IPA hardware has exactly three interconnects. But that won't be guaranteed for all platforms, so allow any number of interconnects to be specified in the configuration data. For each platform, define an array of interconnect data entries (still associated with the IPA clock structure), and record the number of entries initialized in that array. Loop over all entries in this array when initializing, enabling, disabling, or tearing down the set of interconnects. With this change we no longer need the ipa_interconnect_id enumerated type, so get rid of it. Okay, all the platforms supported as of the end of the series still have 3 interconnects, or there is no upstream user of this functionality, if you will. What's the story? The short answer is that there is another version of IPA that has four interconnects instead of three. (A DTS file for it is in "sdxprairie.dtsi" in the Qualcomm "downstream" code.) I hope to have that version supported this year, but it's not my top priority right now. Generalizing the interconnect definitions as done in this series improves the driver, but you're right, it is technically not required at this time. And some more background: The upstream IPA driver is derived from the Qualcomm downstream code published at codeaurora.org. The downstream driver is huge (it's well over 100,000 lines of code) and it supports lots of IPA versions and some features that are not present upstream. In order to have any hope of getting upstream support for the IPA hardware, the downstream driver functionality was reduced, removing support for filtering, routing, and NAT. I spent many months refactoring and reworking that code to make it more "upstreamable," and eventually posted the result for review. Now that there is an upstream driver, a long term goal is to add back functionality that got removed, matching the most important features and hardware support found in the downstream code. So in cases like this, even though this feature is not yet required, adding it now lays groundwork to make later work easier. Everything I do with the upstream IPA driver is aimed at in-tree support for additional IPA features and hardware versions. So even if an improvement isn't required *now*, there is at least a general plan to add support "soon" for something that will need it. Beyond even that, there are some things I intend to do that will improve the driver, even if they aren't technically required near-term. For example, I'd like to dynamically allocate endpoints based on what's needed, rather than having the driver support any number of them up to a maximum fixed at build time. Probably a longer response than you needed, but I thought it would help to provide more background. Besides, you *did* ask for "the story..." -Alex
Re: [EXT] Re: [PATCH net-next 1/3] qede: add netpoll support for qede driver
> On Fri, 15 Jan 2021 01:06:08 -0800 Bhaskar Upadhaya wrote:
>> Add net poll controller support to transmit kernel printks
>> over UDP
>
> Why do you need this patch? A couple of years back netpoll was taught
> how to pull NAPIs by itself, and all you do is schedule NAPIs.
>
> All the driver should do is to make sure that when napi is called
> with a budget of 0 it only processes Tx completions, not Rx traffic.

Hi Jakub,

Thanks for the hint, we were not aware of that. I see our driver may not handle a zero budget accordingly. Will check.

But then, all this means .ndo_poll_controller is basically deprecated?

Regards,
Igor
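For readers following along, the zero-budget pattern Jakub describes looks roughly like this (a sketch with hypothetical ring and helper names, not actual qede code):

/* Sketch of a budget-aware NAPI poll: Tx completions are always
 * reaped (netpoll invokes poll with budget == 0), while Rx is only
 * touched when there is budget to spend. */
struct my_ring {
	struct napi_struct napi;
	/* ... Tx/Rx queue state ... */
};

static void my_reap_tx_completions(struct my_ring *ring);
static int my_process_rx(struct my_ring *ring, int budget);

static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_ring *ring = container_of(napi, struct my_ring, napi);
	int work_done = 0;

	my_reap_tx_completions(ring);	/* safe even at budget 0 */

	if (budget) {
		work_done = my_process_rx(ring, budget);
		if (work_done < budget)
			napi_complete_done(napi, work_done);
	}

	return work_done;
}

With this in place a driver needs no .ndo_poll_controller at all; netpoll schedules the NAPI instances itself.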
Re: [EXT] Re: [PATCH net-next 3/3] qede: set default per queue rx/tx usecs coalescing parameters
>> Here we do the initialization of coalescing values on load. >> Although the default device values are the same - explicit >> config is better visible. > > Can you also make the driver store the settings across ifdown / ifup > and allow the settings to be checked while the interface is down, > while at it? Good suggestion, we'll implement that. Thanks, Igor
Re: [RFC net-next 1/7] net: bridge: switchdev: Refactor br_switchdev_fdb_notify
On Sat, Jan 16, 2021 at 02:25:09AM +0100, Tobias Waldekranz wrote: > Instead of having to add more and more arguments to > br_switchdev_fdb_call_notifiers, get rid of it and build the info > struct directly in br_switchdev_fdb_notify. > > Signed-off-by: Tobias Waldekranz > --- Reviewed-by: Vladimir Oltean
Re: [PATCH 17/18] net: iosm: readme file
Sorry about being much too late into this discussion. I don't have the bandwidth to read netdev anymore, and just stumbled across this now.

Andrew Lunn writes:
> So, this is what all the Ethernet nonsense is all about. You have a
> session ID you need to somehow represent to user space. And you
> decided to use VLANs. But to use VLANs, you need an Ethernet
> header. So you added a bogus Ethernet header.

Actually, the original reasoning was the other way around. The bogus ethernet header was added because I had seen the 3G modem vendors do that for a few years already, in the modem firmware. And I didn't think enough about it to realize that it was a really bad idea, or even that it was something I could change. Or should change.

I cannot blame the MBIM session to VLAN mapping idea on anyone else. As far as I can remember, that was just something that popped up in my head while working on the cdc_mbim driver. But it came as a consequence of already having the bogus ethernet header. And I didn't really understand that I could define a new wwan subsystem with new device types. I thought I had to use whatever was there already. I was young and stupid. Now I'm not that young anymore ;-)

Never ever imagined that this would be replicated in another driver, though. That doesn't really make much sense. We have learned by now, haven't we?

This subject has been discussed a few times in the past, and Johannes' summary is my understanding as well: "I don't think anyone likes that"

The DSS mapping sucks even more than the IPS mapping, BTW. I don't think there are any real users? Not that I know of, at least. DSS is much better implemented as some per-session character device, as requested by numerous people for years. Sorry for not listening. Looks like it is too late now.

> Is any of this VLAN stuff required by MBIM?

No. It's my fault and mine alone.

> I suggest you throw away the pretence this is an Ethernet device. It
> is not.

I completely agree. I wish I had gone for simple raw-ip devices in both qmi_wwan and cdc_mbim. But qmi_wwan got them later, so there is already support for such things in wwan userspace.

Bjørn
Re: [PATCH v2] hv_netvsc: Add (more) validation for untrusted Hyper-V values
On Sun, Jan 17, 2021 at 03:10:32PM +, Wei Liu wrote: > On Sat, Jan 16, 2021 at 02:02:01PM +0100, Andrea Parri wrote: > > On Fri, Jan 15, 2021 at 08:30:22PM -0800, Jakub Kicinski wrote: > > > On Thu, 14 Jan 2021 21:26:28 +0100 Andrea Parri (Microsoft) wrote: > > > > For additional robustness in the face of Hyper-V errors or malicious > > > > behavior, validate all values that originate from packets that Hyper-V > > > > has sent to the guest. Ensure that invalid values cannot cause indexing > > > > off the end of an array, or subvert an existing validation via integer > > > > overflow. Ensure that outgoing packets do not have any leftover guest > > > > memory that has not been zeroed out. > > > > > > > > Reported-by: Juan Vazquez > > > > Signed-off-by: Andrea Parri (Microsoft) > > > > Cc: "David S. Miller" > > > > Cc: Jakub Kicinski > > > > Cc: Alexei Starovoitov > > > > Cc: Daniel Borkmann > > > > Cc: Andrii Nakryiko > > > > Cc: Martin KaFai Lau > > > > Cc: Song Liu > > > > Cc: Yonghong Song > > > > Cc: John Fastabend > > > > Cc: KP Singh > > > > Cc: netdev@vger.kernel.org > > > > Cc: b...@vger.kernel.org > > > > --- > > > > Applies to 5.11-rc3 (and hyperv-next). > > > > > > So this is for hyperv-next or should we take it via netdev trees? > > > > No preference, either way is good for me. > > To be clear: There is no dependency on any patch in hyperv-next, right? > > That's my understanding, but I would like to confirm it. Well, I wrote that this *applies* to hyperv-next... but that's indeed the only 'dependency' I can think of. Hope this helps. Thanks, Andrea
How to port rtl8xxxu to new device...?
Hi all, I'm currently trying to port the rtl8xxxu driver to RTL8822u chips (one of which I own), but I'm finding it hard to figure out where to start and I figured that some wiser minds could maybe help steer me in the right direction. I've managed to get as far as reading and writing to various registers as part of the device setup, but the dongle appears to crash at the downloading firmware stage. At the moment it just feels like I'm changing random bits and pieces to see what happens. I'm using this downstream driver for reference: https://github.com/RinCat/RTL88x2BU-Linux-Driver Does anyone know of any documentation or anything that could be useful for working on this? Best, Alex
[PATCH] arcnet: fix macro name when DEBUG is defined
From: Tom Rix When DEBUG is defined this error occurs drivers/net/arcnet/com20020_cs.c:70:15: error: ‘com20020_REG_W_ADDR_HI’ undeclared (first use in this function); did you mean ‘COM20020_REG_W_ADDR_HI’? ioaddr, com20020_REG_W_ADDR_HI); ^~ From reviewing the context, the suggestion is what is meant. Fixes: 0fec65130b9f ("arcnet: com20020: Use arcnet_ routines") Signed-off-by: Tom Rix --- drivers/net/arcnet/com20020_cs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/arcnet/com20020_cs.c b/drivers/net/arcnet/com20020_cs.c index cf607ffcf358..81223f6bebcc 100644 --- a/drivers/net/arcnet/com20020_cs.c +++ b/drivers/net/arcnet/com20020_cs.c @@ -67,7 +67,7 @@ static void regdump(struct net_device *dev) /* set up the address register */ count = 0; arcnet_outb((count >> 8) | RDDATAflag | AUTOINCflag, - ioaddr, com20020_REG_W_ADDR_HI); + ioaddr, COM20020_REG_W_ADDR_HI); arcnet_outb(count & 0xff, ioaddr, COM20020_REG_W_ADDR_LO); for (count = 0; count < 256 + 32; count++) { -- 2.27.0
Re: [PATCH] arcnet: fix macro name when DEBUG is defined
On Sun, 2021-01-17 at 10:15 -0800, t...@redhat.com wrote: > From: Tom Rix > > When DEBUG is defined this error occurs > > drivers/net/arcnet/com20020_cs.c:70:15: error: ‘com20020_REG_W_ADDR_HI’ > undeclared (first use in this function); > did you mean ‘COM20020_REG_W_ADDR_HI’? > ioaddr, com20020_REG_W_ADDR_HI); > ^~ > > From reviewing the context, the suggestion is what is meant. > > Fixes: 0fec65130b9f ("arcnet: com20020: Use arcnet_ routines") Nice find thanks, especially seeing as how this hasn't been tested or compiled in 5+ years... commit 0fec65130b9f11a73d74f47025491f97f82ba070 Author: Joe Perches Date: Tue May 5 10:06:06 2015 -0700 Acked-by: Joe Perches > Signed-off-by: Tom Rix > --- > drivers/net/arcnet/com20020_cs.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/arcnet/com20020_cs.c > b/drivers/net/arcnet/com20020_cs.c > index cf607ffcf358..81223f6bebcc 100644 > --- a/drivers/net/arcnet/com20020_cs.c > +++ b/drivers/net/arcnet/com20020_cs.c > @@ -67,7 +67,7 @@ static void regdump(struct net_device *dev) > /* set up the address register */ > count = 0; > arcnet_outb((count >> 8) | RDDATAflag | AUTOINCflag, > - ioaddr, com20020_REG_W_ADDR_HI); > + ioaddr, COM20020_REG_W_ADDR_HI); > arcnet_outb(count & 0xff, ioaddr, COM20020_REG_W_ADDR_LO); > > > for (count = 0; count < 256 + 32; count++) {
[PATCH] net: hns: fix variable used when DEBUG is defined
From: Tom Rix When DEBUG is defined this error occurs drivers/net/ethernet/hisilicon/hns/hns_enet.c:1505:36: error: ‘struct net_device’ has no member named ‘ae_handle’; did you mean ‘rx_handler’? assert(skb->queue_mapping < ndev->ae_handle->q_num); ^ ae_handle is an element of struct hns_nic_priv, so change ndev to priv. Fixes: b5996f11ea54 ("net: add Hisilicon Network Subsystem basic ethernet support") Signed-off-by: Tom Rix --- drivers/net/ethernet/hisilicon/hns/hns_enet.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c b/drivers/net/ethernet/hisilicon/hns/hns_enet.c index 858cb293152a..5d7824d2b4d4 100644 --- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c +++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c @@ -1502,7 +1502,7 @@ static netdev_tx_t hns_nic_net_xmit(struct sk_buff *skb, { struct hns_nic_priv *priv = netdev_priv(ndev); - assert(skb->queue_mapping < ndev->ae_handle->q_num); + assert(skb->queue_mapping < priv->ae_handle->q_num); return hns_nic_net_xmit_hw(ndev, skb, &tx_ring_data(priv, skb->queue_mapping)); -- 2.27.0
Re: [RFC net-next 2/7] net: bridge: switchdev: Include local flag in FDB notifications
Hi Tobias, On Sat, Jan 16, 2021 at 02:25:10AM +0100, Tobias Waldekranz wrote: > Some switchdev drivers, notably DSA, ignore all dynamically learned > address notifications (!added_by_user) as these are autonomously added > by the switch. Previously, such a notification was indistinguishable > from a local address notification. Include a local bit in the > notification so that the two classes can be discriminated. > > This allows DSA-like devices to add local addresses to the hardware > FDB (with the CPU as the destination), thereby avoiding flows towards > the CPU being flooded by the switch as unknown unicast. > > Signed-off-by: Tobias Waldekranz > --- In an ideal world, the BR_FDB_LOCAL bit of an FDB entry is what you would probably want to use as an indication that the packet must be delivered upstream by the hardware, considering that this is what the software data path does:

br_handle_frame_finish:
	if (test_bit(BR_FDB_LOCAL, &dst->flags))
		return br_pass_frame_up(skb);

However, we are not in an ideal world, but in a cacophony of nonsensical flags that must be passed to the 'bridge fdb add' command. For example, I noticed this usage pattern in your patch 6/7:

   br0
  /   \
swp0   dummy0

$ bridge fdb add 02:00:de:ad:00:01 dev dummy0 vlan 1 master

Do you know that this command doesn't do what you think it does (assuming that 02:00:de:ad:00:01 is not the MAC address of dummy0)? The command you wrote will add a _local_ FDB entry on dummy0. I tried it on a DSA switch interface (swp0):

$ bridge fdb add 02:00:de:ad:00:01 dev swp0 vlan 1 master
[ 3162.165561] rtnl_fdb_add: addr 02:00:de:ad:00:01 vid 1 ndm_state NUD_NOARP|NUD_PERMANENT
[ 3162.172487] fdb_add_entry: fdb 02:00:de:ad:00:01 state NUD_NOARP|NUD_PERMANENT, fdb_to_nud NUD_REACHABLE, flags 0x0
[ 3162.180515] mscc_felix :00:00.5 swp0: local fdb: 02:00:de:ad:00:01 vid 1

So, after your patches, this command syntax will stop adding a front-facing FDB entry on swp0. It will add a CPU-facing FDB entry instead. Do you know why the bridge neighbour state is NUD_NOARP|NUD_PERMANENT in rtnl_fdb_add? Well, because iproute2 set it that way:

	/* Assume permanent */
	if (!(req.ndm.ndm_state & (NUD_PERMANENT | NUD_REACHABLE)))
		req.ndm.ndm_state |= NUD_PERMANENT;

See iproute2 commit 0849e60a10d0 ("bridge: manage VXLAN based forwarding tables"). I know so little about VXLAN's use of the bridge command that I cannot tell why it was decided to "assume permanent" (which seems to have changed the default behavior for everybody). Otherwise said, even if not mentioned in the man page, the default FDB entry type is NUD_PERMANENT (which, in short, means a "local" entry; see a more detailed explanation at the end). The man page just says:

	bridge fdb add - add a new fdb entry
	This command creates a new fdb entry.
	LLADDR the Ethernet MAC address.
	dev DEV the interface to which this address is associated.
	local - is a local permanent fdb entry
	static - is a static (no arp) fdb entry

which is utterly misleading and useless. It does not say: (a) what a local FDB entry is, or (b) that if neither "local" nor "static"/"dynamic" is specified, "local" is the default. This already creates pretty bad premises. You would have needed to explicitly add "static" to your command. Not only you, but in fact also thousands of other people who already have switchdev deployments using the 'bridge fdb' command. So, in short, if everybody with switchdev hardware had used the 'bridge fdb' command correctly so far, your series would have been great. But in fact, nobody did (me included).
So we need to be more creative. For example, there's that annoying "self" flag. As far as I understand, there is zero reason for a DSA driver to use the "self" flag, since that means "bypass the bridge FDB and just call the .ndo_fdb_add of the device driver, which in the case of DSA is dsa_legacy_fdb_add". Instead you would just use the "master" flag, which makes the operation propagate through br_fdb_add, so the software FDB has a chance to be updated. Except that there's no one preventing you from using the 'self' and 'master' flags together. Which means that the FDB entry would be offloaded to the device twice: once through the SWITCHDEV_FDB_ADD_TO_DEVICE notification emitted by br_fdb_add, and once through dsa_legacy_fdb_add. Correct me if I'm wrong, but I'm thinking that you may have missed this detail that bridge fdb addresses are implicitly 'local' because you also used some 'self master' commands, and the "to-CPU" address installed through switchdev was immediately overwritten by the correct one installed through dsa_legacy_fdb_add. So I think there is a very real issue in that FDB entries with the is_local bit were never specified to switchdev thus far, and now suddenly are. I'm sorry, but what you're saying in the commit message, that "!added_by_user ha
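To summarize the syntax point from the review above for readers: an entry that is meant to point at a front-panel port must carry the "static" keyword, because omitting it implicitly creates a "local" (host-destined) entry. A hedged example of the presumably intended command:

$ bridge fdb add 02:00:de:ad:00:01 dev swp0 vlan 1 master static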
Re: [PATCH] rtnetlink.7: Remove IPv4 from description
Hi Alex and Pali On 1/16/21 4:04 PM, Alejandro Colomar wrote: > From: Pali Rohár > > rtnetlink is not only used for IPv4 > > Signed-off-by: Pali Rohár > Signed-off-by: Alejandro Colomar Thanks. Patch applied. Cheers, Michael > --- > man7/rtnetlink.7 | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/man7/rtnetlink.7 b/man7/rtnetlink.7 > index cd6809320..aec005ff9 100644 > --- a/man7/rtnetlink.7 > +++ b/man7/rtnetlink.7 > @@ -13,7 +13,7 @@ > .\" > .TH RTNETLINK 7 2020-06-09 "Linux" "Linux Programmer's Manual" > .SH NAME > -rtnetlink \- Linux IPv4 routing socket > +rtnetlink \- Linux routing socket > .SH SYNOPSIS > .nf > .B #include > -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 5:46 AM Jonas Bonn wrote: > > Hi Pravin, > > Again, this patch is too big and contains too many disparate changes to > allow for easy review. Please break this up into a series of logical > changes for easier review. > > On 10/01/2021 08:00, Pravin B Shelar wrote: > > Following patch add support for flow based tunneling API > > to send and recv GTP tunnel packet over tunnel metadata API. > > This would allow this device integration with OVS or eBPF using > > flow based tunneling APIs. > > > > Signed-off-by: Pravin B Shelar > > --- > > v4-v5: > > - coding style changes > > v3-v4: > > - add check for non-zero dst port > > v2-v3: > > - Fixed coding style > > - changed IFLA_GTP_FD1 to optional param for LWT dev. > > v1-v2: > > - Fixed according to comments from Jonas Bonn > > > > drivers/net/gtp.c | 527 + > > include/uapi/linux/gtp.h | 12 + > > include/uapi/linux/if_link.h | 1 + > > include/uapi/linux/if_tunnel.h | 1 + > > tools/include/uapi/linux/if_link.h | 1 + > > 5 files changed, 398 insertions(+), 144 deletions(-) > > > > diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c > > index 4c04e271f184..851364314ecc 100644 > > --- a/drivers/net/gtp.c > > +++ b/drivers/net/gtp.c > > @@ -21,6 +21,7 @@ > > #include > > #include > > > > +#include > > #include > > #include > > #include > > @@ -73,6 +74,9 @@ struct gtp_dev { > > unsigned inthash_size; > > struct hlist_head *tid_hash; > > struct hlist_head *addr_hash; > > + /* Used by LWT tunnel. */ > > + boolcollect_md; > > + struct socket *collect_md_sock; > > }; > > Can't you use the 'sock' member of sk1u as the reference this socket? As discussed earlier I use it to destroy the LWT socket in the cleanup code path. I need to test a solution to avoid this reference. I can send an incremental patch if that works. > > > > > static unsigned int gtp_net_id __read_mostly; > > @@ -179,33 +183,121 @@ static bool gtp_check_ms(struct sk_buff *skb, struct > > pdp_ctx *pctx, > > return false; > > } > > > > -static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, > > - unsigned int hdrlen, unsigned int role) > > +static int gtp_set_tun_dst(struct gtp_dev *gtp, struct sk_buff *skb, > > +unsigned int hdrlen, u8 gtp_version, > > +__be64 tid, u8 flags) > > { > > - if (!gtp_check_ms(skb, pctx, hdrlen, role)) { > > - netdev_dbg(pctx->dev, "No PDP ctx for this MS\n"); > > - return 1; > > + struct metadata_dst *tun_dst; > > + int opts_len = 0; > > + > > + if (unlikely(flags & GTP1_F_MASK)) > > + opts_len = sizeof(struct gtpu_metadata); > > + > > + tun_dst = udp_tun_rx_dst(skb, gtp->sk1u->sk_family, TUNNEL_KEY, tid, > > opts_len); > > + if (!tun_dst) { > > + netdev_dbg(gtp->dev, "Failed to allocate tun_dst"); > > + goto err; > > } > > > > + netdev_dbg(gtp->dev, "attaching metadata_dst to skb, gtp ver %d > > hdrlen %d\n", > > +gtp_version, hdrlen); > > + if (unlikely(opts_len)) { > > + struct gtpu_metadata *opts; > > + struct gtp1_header *gtp1; > > + > > + opts = ip_tunnel_info_opts(&tun_dst->u.tun_info); > > + gtp1 = (struct gtp1_header *)(skb->data + sizeof(struct > > udphdr)); > > + opts->ver = GTP_METADATA_V1; > > + opts->flags = gtp1->flags; > > + opts->type = gtp1->type; > > + netdev_dbg(gtp->dev, "recved control pkt: flag %x type: %d\n", > > +opts->flags, opts->type); > > + tun_dst->u.tun_info.key.tun_flags |= TUNNEL_GTPU_OPT; > > + tun_dst->u.tun_info.options_len = opts_len; > > + skb->protocol = htons(0x); /* Unknown */ > > + } > > /* Get rid of the GTP + UDP headers. 
*/ > > if (iptunnel_pull_header(skb, hdrlen, skb->protocol, > > - !net_eq(sock_net(pctx->sk), > > dev_net(pctx->dev > > - return -1; > > + !net_eq(sock_net(gtp->sk1u), > > dev_net(gtp->dev { > > + gtp->dev->stats.rx_length_errors++; > > + goto err; > > + } > > + > > + skb_dst_set(skb, &tun_dst->dst); > > + return 0; > > +err: > > + return -1; > > +} > > + > > +static int gtp_rx(struct gtp_dev *gtp, struct sk_buff *skb, > > + unsigned int hdrlen, u8 gtp_version, unsigned int role, > > + __be64 tid, u8 flags, u8 type) > > +{ > > + if (ip_tunnel_collect_metadata() || gtp->collect_md) { > > + int err; > > + > > + err = gtp_set_tun_dst(gtp, skb, hdrlen, gtp_version, tid, > > flags); > > + if (err) > > + goto err; > > + }
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 5:25 AM Jonas Bonn wrote: > > Hi Jakub, > > On 17/01/2021 01:46, Jakub Kicinski wrote: > > On Sat, 9 Jan 2021 23:00:21 -0800 Pravin B Shelar wrote: > >> Following patch add support for flow based tunneling API > >> to send and recv GTP tunnel packet over tunnel metadata API. > >> This would allow this device integration with OVS or eBPF using > >> flow based tunneling APIs. > >> > >> Signed-off-by: Pravin B Shelar > > > > Applied, thanks! > > > > This patch hasn't received any ACK's from either the maintainers or > anyone else providing review. The following issues remain unaddressed > after review: > This patch was first sent out on Dec 10 and you responded on Dec 11. I think there was a reasonable amount of time given for reviews. > i) the patch contains several logically separate changes that would be > better served as smaller patches Given that this patch adds a single feature, LWT support, I sent a single patch. I am not sure how much benefit there would be in dividing it into smaller patches. In future I will be more careful about dividing patches, since that is THE objection you have presented here. > ii) functionality like the handling of end markers has been introduced > without further explanation This is the first time you are raising this question. End marker handling is introduced so that these packets can be handled in a single LWT device. This way the control plane can use a single device to program all GTP user-plane functionality. > iii) symmetry between the handling of GTPv0 and GTPv1 has been > unnecessarily broken This was already discussed in a previous review: once we add LWT support for v0, we can get symmetry between v1 and v0. At this point there is no use case for v0 in LWT, so I do not see the point of introducing that support. > iv) there are no available userspace tools to allow for testing this > functionality > This is not true. I have mentioned and provided open source tools multiple times in the review thread: Patch for iproute2 to add support for LWT GTP devices: https://github.com/pshelar/iproute2/commit/d6e99f8342672e6e9ce0b71e153296f8e2b41cfc OVS support with integration test: https://github.com/pshelar/ovs/blob/6ec6a2a86adc56c7c9dcab7b3a7b70bb6dad35c9/tests/system-layer3-tunnels.at#L158 > I have requested that this patch be reworked into a series of smaller > changes. That would allow: > > i) reasonable review > ii) the possibility to explain _why_ things are being done in the patch > comment where this isn't obvious (like the handling of end markers) > iii) the chance to do a reasonable rebase of other ongoing work onto > this patch (series): this one patch is invasive and difficult to rebase > onto > > I'm not sure what the hurry is to get this patch into mainline. Large > and complicated patches like this take time to review; please revert > this and allow that process to happen. > Rather than reverting this patch, I can address the comments you have posted. Those can be fixed by minor incremental patches. Let me know if you find any regression introduced by this patch set; I can quickly fix it. Thanks, Pravin.
Re: [PATCH net-next v5] GTP: add support for flow based tunneling API
On Sun, Jan 17, 2021 at 7:31 AM Harald Welte wrote: > > Hi Jonas, Jakub and others, > > On Sun, Jan 17, 2021 at 02:23:52PM +0100, Jonas Bonn wrote: > > This patch hasn't received any ACK's from either the maintainers or anyone > > else providing review. The following issues remain unaddressed after > > review: > > [...] > > Full ACK from my point of view. The patch is so massive that I > as the original co-author and co-maintainer of the GTP kernel module > have problems understanding what it is doing at all. Furthermore, > I am actually wondering if there is any commonality between the existing > use cases and whatever the modified gtp.ko is trying to achieve. Up to > the point on whether or not it makes sense to have both functionalities > in the same driver/module at all. > This is not modifying existing functionality. This patch adds an LWT tunneling API; existing functionality remains the same. Let me know if you find any regression, and I can fix it. LWT is a well-known method for implementing scalable tunneling, which most of the tunneling modules (GENEVE, GRE, VXLAN, etc.) in the Linux kernel already support. If we separate gtp.ko from its LWT implementation, we will need to duplicate a bunch of existing code as well as the code that Jonas is adding to improve performance using the UDP tunnel offloading APIs. I don't think that is the right approach. Existing tunneling modules also use the unified-module approach to implement both traditional and LWT-based tunnel devices. > > I'm not sure what the hurry is to get this patch into mainline. Large and > > complicated patches like this take time to review; please revert this and > > allow that process to happen. > > Also acknowledged and supported from my side. > > -- > - Harald Welte  http://laforge.gnumonks.org/ > > "Privacy in residential applications is a desirable marketing option." > (ETSI EN 300 175-7 Ch. A6)
Re: [PATCH 0/2] net: dsa: mv88e6xxx: fix vlan filtering for 6250
Hi Rasmus, On Sat, Jan 16, 2021 at 03:39:34AM +0100, Rasmus Villemoes wrote: > I finally managed to figure out why enabling VLAN filtering on the > 6250 broke all (ingressing) traffic, > cf. > https://lore.kernel.org/netdev/6424c14e-bd25-2a06-cf0b-f1a07f9a3...@prevas.dk/ > . > > The first patch is the minimal fix and for net, while the second one > is a little cleanup for net-next. > > Rasmus Villemoes (2): > net: dsa: mv88e6xxx: also read STU state in mv88e6250_g1_vtu_getnext > net: dsa: mv88e6xxx: use mv88e6185_g1_vtu_getnext() for the 6250 It's strange to put a patch for net and one for net-next in the same series. Nobody will keep a note for you to apply the second patch after net has been merged back into net-next. So if you want to keep the two-patch approach, you'd have to send just the "net" patch now, and the "net-next" patch later. But is there any reason why you don't just apply the second patch to "net"?
Re: [PATCH v2 net-next 1/1] net: dsa: hellcreek: Add TAPRIO offloading support
On Sat, Jan 16, 2021 at 01:49:22PM +0100, Kurt Kanzenbach wrote: > The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic > schedules may be configured individually on each front port. Each port has > eight > egress queues. The traffic is mapped to a traffic class respectively via the > PCP > field of a VLAN tagged frame. > > The TAPRIO Qdisc already implements that. Therefore, this interface can simply > be reused. Add .port_setup_tc() accordingly. > > The activation of a schedule on a port is split into two parts: > > * Programming the necessary gate control list (GCL) > * Setup delayed work for starting the schedule > > The hardware supports starting a schedule up to eight seconds in the future. > The > TAPRIO interface provides an absolute base time. Therefore, periodic delayed > work is leveraged to check whether a schedule may be started or not. > > Signed-off-by: Kurt Kanzenbach > --- Reviewed-by: Vladimir Oltean
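As a rough illustration of the delayed activation described in the commit message above, the periodic check could look like this sketch; the demo_* names and the one-second re-check interval are assumptions, not taken from the driver:

static void demo_taprio_work(struct work_struct *work)
{
	struct demo_port *port = container_of(work, struct demo_port,
					      taprio_work.work);
	ktime_t now = ktime_get_real();	/* clock domain is an assumption */

	/* The hardware can only arm a start time up to 8 s ahead. The gate
	 * control list itself was already programmed when the offload was
	 * accepted.
	 */
	if (ktime_sub(port->base_time, now) <= 8 * NSEC_PER_SEC) {
		demo_start_schedule_hw(port);
		return;
	}

	/* Base time is still too far out; check again later. */
	schedule_delayed_work(&port->taprio_work, HZ);
}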
RE: [PATCH bpf-next] xsk: build skb by page
Xuan Zhuo wrote: > This patch is used to construct skb based on page to save memory copy > overhead. > > This has one problem: > > We construct the skb by fill the data page as a frag into the skb. In > this way, the linear space is empty, and the header information is also > in the frag, not in the linear space, which is not allowed for some > network cards. For example, Mellanox Technologies MT27710 Family > [ConnectX-4 Lx] will get the following error message: > > mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 0x1dbb, > opcode 0xd, syndrome 0x1, vendor syndrome 0x68 > : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2 > WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64 > : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00 > 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00 > 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb > > I also tried to use build_skb to construct skb, but because of the > existence of skb_shinfo, it must be behind the linear space, so this > method is not working. We can't put skb_shinfo on desc->addr, it will be > exposed to users, this is not safe. > > Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the > network card supports the header information of the packet in the frag > and not in the linear space. > > Performance Testing > > The test environment is Aliyun ECS server. > Test cmd: > ``` > xdpsock -i eth0 -t -S -s > ``` > > Test result data:
>
> size     64       512      1024     1500
> copy     1916747  1775988  1600203  1440054
> page     1974058  1953655  1945463  1904478
> percent  3.0%     10.0%    21.58%   32.3%

Looks like a good perf bump. Some easy suggestions below. > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs, > + struct xdp_desc *desc, int *err) > +{ Passing an 'int *err' here is ugly IMO; use the ERR_PTR/PTR_ERR macros and roll it into the return value, or maybe use the out: pattern used in the kernel. Just doing direct returns like now, but with ERR_PTR(), would also be fine. > + struct sk_buff *skb ; struct sk_buff *skb = NULL; err = -ENOMEM; > + > + if (xs->dev->features & NETIF_F_SKB_NO_LINEAR) { > + skb = xsk_build_skb_zerocopy(xs, desc); > + if (unlikely(!skb)) { goto out > + *err = -ENOMEM; > + return NULL; > + } > + } else { > + char *buffer; > + u64 addr; > + u32 len; > + int err; > + > + len = desc->len; > + skb = sock_alloc_send_skb(&xs->sk, len, 1, &err); > + if (unlikely(!skb)) { goto out; > + *err = -ENOMEM; > + return NULL; > + } > + > + skb_put(skb, len); > + addr = desc->addr; > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr); > + err = skb_store_bits(skb, 0, buffer, len); > + > + if (unlikely(err)) { > + kfree_skb(skb); err = -EINVAL; goto out > + *err = -EINVAL; > + return NULL; > + } > + } > + > + skb->dev = xs->dev; > + skb->priority = xs->sk.sk_priority; > + skb->mark = xs->sk.sk_mark; > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr; > + skb->destructor = xsk_destruct_skb; > + > + return skb; out: kfree_skb(skb) return ERR_PTR(err); > +} > + Otherwise looks good, thanks.
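Putting John's suggestions together, the reworked helper could look roughly like the following sketch. This is an interpretation of the review comments, not the patch author's actual respin:

static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
				     struct xdp_desc *desc)
{
	struct sk_buff *skb = NULL;
	int err = -ENOMEM;

	if (xs->dev->features & NETIF_F_SKB_NO_LINEAR) {
		skb = xsk_build_skb_zerocopy(xs, desc);
		if (unlikely(!skb))
			goto out;
	} else {
		u32 len = desc->len;
		char *buffer;

		skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
		if (unlikely(!skb))
			goto out;

		skb_put(skb, len);
		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
		err = skb_store_bits(skb, 0, buffer, len);
		if (unlikely(err)) {
			err = -EINVAL;
			goto out;
		}
	}

	skb->dev = xs->dev;
	skb->priority = xs->sk.sk_priority;
	skb->mark = xs->sk.sk_mark;
	skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
	skb->destructor = xsk_destruct_skb;

	return skb;

out:
	kfree_skb(skb);		/* kfree_skb(NULL) is a safe no-op */
	return ERR_PTR(err);
}

The caller would then test the result with IS_ERR()/PTR_ERR() instead of a separate error out-parameter.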
[PATCH v2 bpf-next 0/4] libbpf: BTF dumper support for typed data
Add a libbpf dumper function that supports dumping a representation of data passed in using the BTF id associated with the data in a manner similar to the bpf_snprintf_btf helper. Default output format is identical to that dumped by bpf_snprintf_btf(), for example a "struct sk_buff" representation would look like this: (struct sk_buff){ (union){ (struct){ .next = (struct sk_buff *)0x, .prev = (struct sk_buff *)0x, (union){ .dev = (struct net_device *)0x, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... Patches 1 and 2 make functions available that are needed during dump operations. Patch 3 implements the dump functionality in a manner similar to that in kernel/bpf/btf.c, but with a view to fitting into libbpf more naturally. For example, rather than using flags, boolean dump options are used to control output. Patch 4 is a selftest that utilizes a dump printf function to snprintf the dump output to a string for comparison with expected output. Tests deliberately mirror those in snprintf_btf helper test to keep output consistent. Changes since RFC [1] - The initial approach explored was to share the kernel code with libbpf using #defines to paper over the different needs; however it makes more sense to try and fit in with libbpf code style for maintenance. A comment in the code points at the implementation in kernel/bpf/btf.c and notes that any issues found in it should be fixed there or vice versa; mirroring the tests should help with this also (Andrii) [1] https://lore.kernel.org/bpf/1610386373-24162-1-git-send-email-alan.magu...@oracle.com/T/#t Alan Maguire (4): libbpf: add btf_has_size() and btf_int() inlines libbpf: make skip_mods_and_typedefs available internally in libbpf libbpf: BTF dumper support for typed data selftests/bpf: add dump type data tests to btf dump tests tools/lib/bpf/btf.h | 36 + tools/lib/bpf/btf_dump.c | 974 ++ tools/lib/bpf/libbpf.c| 4 +- tools/lib/bpf/libbpf.map | 5 + tools/lib/bpf/libbpf_internal.h | 2 + tools/testing/selftests/bpf/prog_tests/btf_dump.c | 233 ++ 6 files changed, 1251 insertions(+), 3 deletions(-) -- 1.8.3.1
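A hedged sketch of how a consumer might drive the proposed API, based on the signatures shown in this series (the callback and setup mirror the selftest in patch 4; error handling is illustrative):

#include <stdio.h>
#include <stdarg.h>
#include <errno.h>
#include <bpf/btf.h>

static void dump_printf(void *ctx, const char *fmt, va_list args)
{
	vfprintf(stdout, fmt, args);
}

int dump_typed(struct btf *btf, const char *type_name, void *data)
{
	struct btf_dump_emit_type_data_opts dopts = {
		.sz = sizeof(dopts),
		.compact = true,	/* omit newlines/indentation */
	};
	struct btf_dump_opts opts = { .ctx = NULL };
	struct btf_dump *d;
	__s32 id;
	int err;

	d = btf_dump__new(btf, NULL, &opts, dump_printf);
	if (!d)
		return -ENOMEM;

	id = btf__find_by_name(btf, type_name);	/* e.g. "sk_buff" */
	if (id < 0) {
		err = id;
		goto out;
	}

	err = btf_dump__emit_type_data(d, id, &dopts, data);
out:
	btf_dump__free(d);
	return err;
}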
[PATCH v2 bpf-next 2/4] libbpf: make skip_mods_and_typedefs available internally in libbpf
btf_dump.c will need it for type-based data display. Signed-off-by: Alan Maguire --- tools/lib/bpf/libbpf.c | 4 +--- tools/lib/bpf/libbpf_internal.h | 2 ++ 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 2abbc38..4ef84e1 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -73,8 +73,6 @@ #define __printf(a, b) __attribute__((format(printf, a, b))) static struct bpf_map *bpf_object__add_map(struct bpf_object *obj); -static const struct btf_type * -skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id); static int __base_pr(enum libbpf_print_level level, const char *format, va_list args) @@ -1885,7 +1883,7 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict) return 0; } -static const struct btf_type * +const struct btf_type * skip_mods_and_typedefs(const struct btf *btf, __u32 id, __u32 *res_id) { const struct btf_type *t = btf__type_by_id(btf, id); diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 969d0ac..c25d2df 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -108,6 +108,8 @@ static inline void *libbpf_reallocarray(void *ptr, size_t nmemb, size_t size) void *btf_add_mem(void **data, size_t *cap_cnt, size_t elem_sz, size_t cur_cnt, size_t max_cnt, size_t add_cnt); int btf_ensure_mem(void **data, size_t *cap_cnt, size_t elem_sz, size_t need_cnt); +const struct btf_type *skip_mods_and_typedefs(const struct btf *btf, __u32 id, + __u32 *res_id); static inline bool libbpf_validate_opts(const char *opts, size_t opts_sz, size_t user_sz, -- 1.8.3.1
[PATCH v2 bpf-next 4/4] selftests/bpf: add dump type data tests to btf dump tests
Test various type data dumping operations by comparing expected format with the dumped string; an snprintf-style printf function is used to record the string dumped. Signed-off-by: Alan Maguire --- tools/testing/selftests/bpf/prog_tests/btf_dump.c | 233 ++ 1 file changed, 233 insertions(+) diff --git a/tools/testing/selftests/bpf/prog_tests/btf_dump.c b/tools/testing/selftests/bpf/prog_tests/btf_dump.c index c60091e..262561f4 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_dump.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_dump.c @@ -232,6 +232,237 @@ void test_btf_dump_incremental(void) btf__free(btf); } +#define STRSIZE 2048 +#define EXPECTED_STRSIZE 256 + +void btf_dump_snprintf(void *ctx, const char *fmt, va_list args) +{ + char *s = ctx, new[STRSIZE]; + + vsnprintf(new, STRSIZE, fmt, args); + strncat(s, new, STRSIZE); + vfprintf(ctx, fmt, args); +} + +/* skip "enum "/"struct " prefixes */ +#define SKIP_PREFIX(_typestr, _prefix) \ + do {\ + if (strstr(_typestr, _prefix) == _typestr) \ + _typestr += strlen(_prefix) + 1;\ + } while (0) + +int btf_dump_data(struct btf *btf, struct btf_dump *d, + char *ptrtype, __u64 flags, void *ptr, + char *str, char *expectedval) +{ + struct btf_dump_emit_type_data_opts opts = { 0 }; + int ret = 0, cmp; + __s32 type_id; + + opts.sz = sizeof(opts); + opts.compact = true; + if (flags & BTF_F_NONAME) + opts.noname = true; + if (flags & BTF_F_ZERO) + opts.zero = true; + SKIP_PREFIX(ptrtype, "enum"); + SKIP_PREFIX(ptrtype, "struct"); + SKIP_PREFIX(ptrtype, "union"); + type_id = btf__find_by_name(btf, ptrtype); + if (CHECK(type_id <= 0, "find type id", + "no '%s' in BTF: %d\n", ptrtype, type_id)) { + ret = -ENOENT; + goto err; + } + str[0] = '\0'; + ret = btf_dump__emit_type_data(d, type_id, &opts, ptr); + if (CHECK(ret < 0, "btf_dump__emit_type_data", + "failed: %d\n", ret)) + goto err; + + cmp = strncmp(str, expectedval, EXPECTED_STRSIZE); + if (CHECK(cmp, "ensure expected/actual match", + "'%s' does not match expected '%s': %d\n", + str, expectedval, cmp)) + ret = -EFAULT; + +err: + if (ret) + btf_dump__free(d); + return ret; +} + +#define TEST_BTF_DUMP_DATA(_b, _d, _str, _type, _flags, _expected, ...) \ + do {\ + char _expectedval[EXPECTED_STRSIZE] = _expected;\ + char __ptrtype[64] = #_type;\ + char *_ptrtype = (char *)__ptrtype; \ + static _type _ptrdata = __VA_ARGS__;\ + void *_ptr = &_ptrdata; \ + \ + if (btf_dump_data(_b, _d, _ptrtype, _flags, _ptr, \ + _str, _expectedval)) \ + return; \ + } while (0) + +/* Use where expected data string matches its stringified declaration */ +#define TEST_BTF_DUMP_DATA_C(_b, _d, _str, _type, _opts, ...) \ + TEST_BTF_DUMP_DATA(_b, _d, _str, _type, _opts, \ + "(" #_type ")" #__VA_ARGS__, __VA_ARGS__) + +void test_btf_dump_data(void) +{ + struct btf *btf = libbpf_find_kernel_btf(); + char str[STRSIZE]; + struct btf_dump_opts opts = { .ctx = str }; + struct btf_dump *d; + + if (CHECK(!btf, "get kernel BTF", "no kernel BTF found")) + return; + + d = btf_dump__new(btf, NULL, &opts, btf_dump_snprintf); + + if (CHECK(!d, "new dump", "could not create BTF dump")) + return; + + /* Verify type display for various types. 
*/ + + /* simple int */ + TEST_BTF_DUMP_DATA_C(btf, d, str, int, 0, 1234); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "1234", 1234); + + /* zero value should be printed at toplevel */ + TEST_BTF_DUMP_DATA(btf, d, str, int, 0, "(int)0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_ZERO, "(int)0", 0); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME | BTF_F_ZERO, + "0", 0); + TEST_BTF_DUMP_DATA_C(btf, d, str, int, 0, -4567); + TEST_BTF_DUMP_DATA(btf, d, str, int, BTF_F_NONAME, "-4567", -456
[PATCH v2 bpf-next 3/4] libbpf: BTF dumper support for typed data
Add a BTF dumper for typed data, so that the user can dump a typed version of the data provided. The API is int btf_dump__emit_type_data(struct btf_dump *d, __u32 id, const struct btf_dump_emit_type_data_opts *opts, void *data); ...where the id is the BTF id of the data pointed to by the "void *" argument; for example the BTF id of "struct sk_buff" for a "struct skb *" data pointer. Options supported are - a starting indent level (indent_lvl) - a set of boolean options to control dump display, similar to those used for BPF helper bpf_snprintf_btf(). Options are - compact : omit newlines and other indentation - noname: omit member names - zero: show zero-value members Default output format is identical to that dumped by bpf_snprintf_btf(), for example a "struct sk_buff" representation would look like this: struct sk_buff){ (union){ (struct){ .next = (struct sk_buff *)0x, .prev = (struct sk_buff *)0x, (union){ .dev = (struct net_device *)0x, .dev_scratch = (long unsigned int)18446744073709551615, }, }, ... Signed-off-by: Alan Maguire --- tools/lib/bpf/btf.h | 17 + tools/lib/bpf/btf_dump.c | 974 +++ tools/lib/bpf/libbpf.map | 5 + 3 files changed, 996 insertions(+) diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 0c48f2e..7937124 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -180,6 +180,23 @@ struct btf_dump_emit_type_decl_opts { btf_dump__emit_type_decl(struct btf_dump *d, __u32 id, const struct btf_dump_emit_type_decl_opts *opts); + +struct btf_dump_emit_type_data_opts { + /* size of this struct, for forward/backward compatibility */ + size_t sz; + int indent_level; + /* below match "show" flags for bpf_show_snprintf() */ + bool compact; + bool noname; + bool zero; +}; +#define btf_dump_emit_type_data_opts__last_field zero + +LIBBPF_API int +btf_dump__emit_type_data(struct btf_dump *d, __u32 id, +const struct btf_dump_emit_type_data_opts *opts, +void *data); + /* * A set of helpers for easier BTF types handling */ diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index 2f9d685..04d604f 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -10,6 +10,8 @@ #include #include #include +#include +#include #include #include #include @@ -19,14 +21,31 @@ #include "libbpf.h" #include "libbpf_internal.h" +#define BITS_PER_BYTE 8 +#define BITS_PER_U128 (sizeof(__u64) * BITS_PER_BYTE * 2) +#define BITS_PER_BYTE_MASK (BITS_PER_BYTE - 1) +#define BITS_PER_BYTE_MASKED(bits) ((bits) & BITS_PER_BYTE_MASK) +#define BITS_ROUNDDOWN_BYTES(bits) ((bits) >> 3) +#define BITS_ROUNDUP_BYTES(bits) \ + (BITS_ROUNDDOWN_BYTES(bits) + !!BITS_PER_BYTE_MASKED(bits)) + static const char PREFIXES[] = "\t\t\t\t\t\t\t\t\t\t\t\t\t"; static const size_t PREFIX_CNT = sizeof(PREFIXES) - 1; + static const char *pfx(int lvl) { return lvl >= PREFIX_CNT ? PREFIXES : &PREFIXES[PREFIX_CNT - lvl]; } +static const char SPREFIXES[] = " "; +static const size_t SPREFIX_CNT = sizeof(SPREFIXES) - 1; + +static const char *spfx(int lvl) +{ + return lvl >= SPREFIX_CNT ? SPREFIXES : &SPREFIXES[SPREFIX_CNT - lvl]; +} + enum btf_dump_type_order_state { NOT_ORDERED, ORDERING, @@ -53,6 +72,49 @@ struct btf_dump_type_aux_state { __u8 referenced: 1; }; +#define BTF_DUMP_DATA_MAX_NAME_LEN 256 + +/* + * Common internal data for BTF type data dump operations. + * + * The implementation here is similar to that in kernel/bpf/btf.c + * that supports the bpf_snprintf_btf() helper, so any bugs in + * type data dumping here are likely in that code also. 
+ * + * One challenge with showing nested data is we want to skip 0-valued + * data, but in order to figure out whether a nested object is all zeros + * we need to walk through it. As a result, we need to make two passes + * when handling structs, unions and arrays; the first pass simply looks + * for nonzero data, while the second actually does the display. The first + * pass is signalled by state.depth_check being set, and if we + * encounter a non-zero value we set state.depth_to_show to the depth + * at which we encountered it. When we have completed the first pass, + * we will know if anything needs to be displayed if + * state.depth_to_show > state.depth. See btf_dump_emit_[struct,array]_data() + * for the implementation of this. + * + */ +struct btf_dump_data { + bool compact; + bool noname; + bool zero; + __u8 indent_lvl;/* base indent level */ + /* below are used during iteration */ + struct { + __u8 dept
[PATCH v2 bpf-next 1/4] libbpf: add btf_has_size() and btf_int() inlines
BTF type data dumping will use them in later patches, and they are useful generally when handling BTF data. Signed-off-by: Alan Maguire --- tools/lib/bpf/btf.h | 19 +++ 1 file changed, 19 insertions(+) diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 1237bcd..0c48f2e 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -294,6 +294,20 @@ static inline bool btf_is_datasec(const struct btf_type *t) return btf_kind(t) == BTF_KIND_DATASEC; } +static inline bool btf_has_size(const struct btf_type *t) +{ + switch (BTF_INFO_KIND(t->info)) { + case BTF_KIND_INT: + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + case BTF_KIND_ENUM: + case BTF_KIND_DATASEC: + return true; + default: + return false; + } +} + static inline __u8 btf_int_encoding(const struct btf_type *t) { return BTF_INT_ENCODING(*(__u32 *)(t + 1)); @@ -309,6 +323,11 @@ static inline __u8 btf_int_bits(const struct btf_type *t) return BTF_INT_BITS(*(__u32 *)(t + 1)); } +static inline __u32 btf_int(const struct btf_type *t) +{ + return *(__u32 *)(t + 1); +} + static inline struct btf_array *btf_array(const struct btf_type *t) { return (struct btf_array *)(t + 1); -- 1.8.3.1
[PATCH net v2] net: lapb: Add locking to the lapb module
In the lapb module, the timers may run concurrently with other code in this module, and there is currently no locking to prevent the code from racing on "struct lapb_cb". This patch adds locking to prevent racing. 1. Add "spinlock_t lock" to "struct lapb_cb"; Add "spin_lock_bh" and "spin_unlock_bh" to APIs, timer functions and notifier functions. 2. Add "bool t1timer_stop, t2timer_stop" to "struct lapb_cb" to make us able to ask running timers to abort; Modify "lapb_stop_t1timer" and "lapb_stop_t2timer" to make them able to abort running timers; Modify "lapb_t2timer_expiry" and "lapb_t1timer_expiry" to make them abort after they are stopped by "lapb_stop_t1timer", "lapb_stop_t2timer", and "lapb_start_t1timer", "lapb_start_t2timer". 3. In lapb_unregister, change "lapb_stop_t1timer" and "lapb_stop_t2timer" to "del_timer_sync" to make sure all running timers have exited. 4. In lapb_device_event, replace the "lapb_disconnect_request" call with the content of "lapb_disconnect_request", to avoid trying to hold the lock twice. When I do this, I removed "lapb_start_t1timer" because I don't think it's necessary to start the timer when "NETDEV_GOING_DOWN". Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: Martin Schiller Signed-off-by: Xie He --- include/net/lapb.h| 2 ++ net/lapb/lapb_iface.c | 56 +++ net/lapb/lapb_timer.c | 30 +++ 3 files changed, 80 insertions(+), 8 deletions(-) diff --git a/include/net/lapb.h b/include/net/lapb.h index ccc3d1f020b0..eee73442a1ba 100644 --- a/include/net/lapb.h +++ b/include/net/lapb.h @@ -92,6 +92,7 @@ struct lapb_cb { unsigned short n2, n2count; unsigned short t1, t2; struct timer_list t1timer, t2timer; + boolt1timer_stop, t2timer_stop; /* Internal control information */ struct sk_buff_head write_queue; @@ -103,6 +104,7 @@ struct lapb_cb { struct lapb_frame frmr_data; unsigned char frmr_type; + spinlock_t lock; refcount_t refcnt; }; diff --git a/net/lapb/lapb_iface.c b/net/lapb/lapb_iface.c index 40961889e9c0..45f332607685 100644 --- a/net/lapb/lapb_iface.c +++ b/net/lapb/lapb_iface.c @@ -122,6 +122,8 @@ static struct lapb_cb *lapb_create_cb(void) timer_setup(&lapb->t1timer, NULL, 0); timer_setup(&lapb->t2timer, NULL, 0); + lapb->t1timer_stop = true; + lapb->t2timer_stop = true; lapb->t1 = LAPB_DEFAULT_T1; lapb->t2 = LAPB_DEFAULT_T2; @@ -129,6 +131,8 @@ static struct lapb_cb *lapb_create_cb(void) lapb->mode= LAPB_DEFAULT_MODE; lapb->window = LAPB_DEFAULT_WINDOW; lapb->state = LAPB_STATE_0; + + spin_lock_init(&lapb->lock); refcount_set(&lapb->refcnt, 1); out: return lapb; @@ -178,8 +182,8 @@ int lapb_unregister(struct net_device *dev) goto out; lapb_put(lapb); - lapb_stop_t1timer(lapb); - lapb_stop_t2timer(lapb); + del_timer_sync(&lapb->t1timer); + del_timer_sync(&lapb->t2timer); lapb_clear_queues(lapb); @@ -201,6 +205,8 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + parms->t1 = lapb->t1 / HZ; parms->t2 = lapb->t2 / HZ; parms->n2 = lapb->n2; @@ -219,6 +225,7 @@ int lapb_getparms(struct net_device *dev, struct lapb_parms_struct *parms) else parms->t2timer = (lapb->t2timer.expires - jiffies) / HZ; + spin_unlock_bh(&lapb->lock); lapb_put(lapb); rc = LAPB_OK; out: @@ -234,6 +241,8 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_INVALUE; if (parms->t1 < 1 || parms->t2 < 1 || parms->n2 < 1) goto out_put; @@ -256,6 +265,7 @@ int lapb_setparms(struct net_device *dev, struct lapb_parms_struct *parms) rc = 
LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -270,6 +280,8 @@ int lapb_connect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + rc = LAPB_OK; if (lapb->state == LAPB_STATE_1) goto out_put; @@ -285,6 +297,7 @@ int lapb_connect_request(struct net_device *dev) rc = LAPB_OK; out_put: + spin_unlock_bh(&lapb->lock); lapb_put(lapb); out: return rc; @@ -299,6 +312,8 @@ int lapb_disconnect_request(struct net_device *dev) if (!lapb) goto out; + spin_lock_bh(&lapb->lock); + switch (lapb->state) { case LAPB_STATE_0: rc = LAPB_NOTCONNECTED; @@ -330,6 +345,7 @@ int lapb
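To illustrate item 2 of the changelog, here is a condensed sketch of the stop-flag pattern (simplified; the real patch's handlers differ in detail):

static void lapb_stop_t1timer(struct lapb_cb *lapb)
{
	/* Ask a handler that has already fired to abort once it gets
	 * the lock.
	 */
	lapb->t1timer_stop = true;
	del_timer(&lapb->t1timer);
}

static void lapb_t1timer_expiry(struct timer_list *t)
{
	struct lapb_cb *lapb = from_timer(lapb, t, t1timer);

	spin_lock_bh(&lapb->lock);
	if (lapb->t1timer_stop) {
		/* lapb_stop_t1timer() ran while we were waiting for the
		 * lock; do not act on stale state.
		 */
		spin_unlock_bh(&lapb->lock);
		return;
	}

	/* ...normal T1 expiry handling, all under lapb->lock... */

	spin_unlock_bh(&lapb->lock);
}

lapb_start_t1timer() would correspondingly clear t1timer_stop before re-arming the timer.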
[PATCH iproute2 1/2] vrf: print BPF log buffer if bpf_program_load fails
Necessary to understand what is going on when bpf_program_load fails Signed-off-by: Luca Boccassi --- ip/ipvrf.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/ip/ipvrf.c b/ip/ipvrf.c index 42779e5c..91578031 100644 --- a/ip/ipvrf.c +++ b/ip/ipvrf.c @@ -278,8 +278,8 @@ static int vrf_configure_cgroup(const char *path, int ifindex) */ prog_fd = prog_load(ifindex); if (prog_fd < 0) { - fprintf(stderr, "Failed to load BPF prog: '%s'\n", - strerror(errno)); + fprintf(stderr, "Failed to load BPF prog: '%s'\n%s", + strerror(errno), bpf_log_buf); if (errno != EPERM) { fprintf(stderr, -- 2.29.2
[PATCH iproute2 2/2] vrf: fix ip vrf exec with libbpf
The size of bpf_insn is passed to bpf_load_program instead of the number of elements as it expects, so ip vrf exec fails with: $ sudo ip link add vrf-blue type vrf table 10 $ sudo ip link set dev vrf-blue up $ sudo ip/ip vrf exec vrf-blue ls Failed to load BPF prog: 'Invalid argument' last insn is not an exit or jmp processed 0 insns (limit 100) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 Kernel compiled with CGROUP_BPF enabled? https://bugs.debian.org/980046 Reported-by: Emmanuel DECAEN Signed-off-by: Luca Boccassi --- lib/bpf_glue.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/lib/bpf_glue.c b/lib/bpf_glue.c index fa609bfe..d00a0dc1 100644 --- a/lib/bpf_glue.c +++ b/lib/bpf_glue.c @@ -14,7 +14,8 @@ int bpf_program_load(enum bpf_prog_type type, const struct bpf_insn *insns, size_t size_log) { #ifdef HAVE_LIBBPF - return bpf_load_program(type, insns, size_insns, license, 0, log, size_log); + return bpf_load_program(type, insns, size_insns / sizeof(struct bpf_insn), + license, 0, log, size_log); #else return bpf_prog_load_dev(type, insns, size_insns, license, 0, log, size_log); #endif -- 2.29.2
RE: [PATCHv14 bpf-next 1/6] bpf: run devmap xdp_prog on flush instead of bulk enqueue
Hangbin Liu wrote: > From: Jesper Dangaard Brouer > > This changes the devmap XDP program support to run the program when the > bulk queue is flushed instead of before the frame is enqueued. This has > a couple of benefits: > > - It "sorts" the packets by destination devmap entry, and then runs the > same BPF program on all the packets in sequence. This ensures that we > keep the XDP program and destination device properties hot in I-cache. > > - It makes the multicast implementation simpler because it can just > enqueue packets using bq_enqueue() without having to deal with the > devmap program at all. > > The drawback is that if the devmap program drops the packet, the enqueue > step is redundant. However, arguably this is mostly visible in a > micro-benchmark, and with more mixed traffic the I-cache benefit should > win out. The performance impact of just this patch is as follows: > > Using xdp_redirect_map (with a 2nd xdp_prog patch[1]) in sample/bpf and sending > pkts via the pktgen cmd: > ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 > -s 64 > > There is about +/- 0.1M deviation for native testing; the performance > improved for the base case, but dropped back some with an xdp devmap prog attached.
>
> Version          | Test                        | Generic | Native | Native + 2nd xdp_prog
> 5.10 rc6         | xdp_redirect_map i40e->i40e |  2.0M   |  9.1M  |  8.0M
> 5.10 rc6         | xdp_redirect_map i40e->veth |  1.7M   | 11.0M  |  9.7M
> 5.10 rc6 + patch | xdp_redirect_map i40e->i40e |  2.0M   |  9.5M  |  7.5M
> 5.10 rc6 + patch | xdp_redirect_map i40e->veth |  1.7M   | 11.6M  |  9.1M
>
> [1] > https://patchwork.ozlabs.org/project/netdev/patch/20201208120159.2278277-1-liuhang...@gmail.com/ > > Signed-off-by: Jesper Dangaard Brouer > Signed-off-by: Hangbin Liu > > -- > v14: no update, only rebase the code > v13: pass in xdp_prog through __xdp_enqueue() > v2-v12: no such patch > --- > kernel/bpf/devmap.c | 115 +++- > 1 file changed, 72 insertions(+), 43 deletions(-) > > diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c > index f6e9c68afdd4..84fe15950e44 100644 > --- a/kernel/bpf/devmap.c > +++ b/kernel/bpf/devmap.c > @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue { > struct list_head flush_node; > struct net_device *dev; > struct net_device *dev_rx; > + struct bpf_prog *xdp_prog; > unsigned int count; > }; > > @@ -327,40 +328,92 @@ bool dev_map_can_have_prog(struct bpf_map *map) > return false; > } > > +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog, > + struct xdp_frame **frames, int n, > + struct net_device *dev) > +{ > + struct xdp_txq_info txq = { .dev = dev }; > + struct xdp_buff xdp; > + int i, nframes = 0; > + > + for (i = 0; i < n; i++) { > + struct xdp_frame *xdpf = frames[i]; > + u32 act; > + int err; > + > + xdp_convert_frame_to_buff(xdpf, &xdp); Hi, a slightly higher level question about the design. How come we have to bounce the frame back and forth between an xdp_buff and an xdp_frame? Seems a bit wasteful. > + xdp.txq = &txq; > + > + act = bpf_prog_run_xdp(xdp_prog, &xdp); > + switch (act) { > + case XDP_PASS: > + err = xdp_update_frame_from_buff(&xdp, xdpf); xdp_update_frame_from_buff will then convert it back from the xdp_buff?
struct xdp_buff {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	struct xdp_rxq_info *rxq;
	struct xdp_txq_info *txq;
	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
};

struct xdp_frame {
	void *data;
	u16 len;
	u16 headroom;
	u32 metasize:8;
	u32 frame_sz:24;
	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
	 * while mem info is valid on remote CPU.
	 */
	struct xdp_mem_info mem;
	struct net_device *dev_rx; /* used by cpumap */
};

It looks like we could embed xdp_buff in xdp_frame and then keep the metadata at the end. Because you are working on performance here, wdyt? <- @Jesper as well. > + if (unlikely(err < 0)) > + xdp_return_frame_rx_napi(xdpf); > + else > + frames[nframes++] = xdpf; > + break; > + default: > + bpf_warn_invalid_xdp_action(act); > + fallthrough; > + case XDP_ABORTED: > + trace_xdp_exception(dev, xdp_prog, act); > + fallthrough; > + case XDP_DROP: > + xdp_return_frame_rx_napi(xdpf); > + break; > + } > + } > + return n - nframes; /* d
RE: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On 2021-01-16 02:12, Alexander Lobakin wrote: > From: Dongseok Yi > Date: Fri, 15 Jan 2021 22:20:35 +0900 > > > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > > forwarding. Only the header of head_skb from ip_finish_output_gso -> > > skb_gso_segment is updated but following frag_skbs are not updated. > > > > A call path skb_mac_gso_segment -> inet_gso_segment -> > > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > > does not try to update UDP/IP header of the segment list but copy > > only the MAC header. > > > > Update dport, daddr and checksums of each skb of the segment list > > in __udp_gso_segment_list. It covers both SNAT and DNAT. > > > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > > Signed-off-by: Dongseok Yi > > --- > > v1: > > Steffen Klassert said, there could be 2 options. > > https://lore.kernel.org/patchwork/patch/1362257/ > > I was trying to write a quick fix, but it was not easy to forward > > the segmented list. Currently, assuming DNAT only. > > > > v2: > > Per Steffen Klassert request, move the procedure from > > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > > > To Alexander Lobakin, I've checked your email late. Just use this > > patch as a reference. It supports SNAT too, but does not support IPv6 > > yet. I cannot make IPv6 header changes in __udp_gso_segment_list because > > the file is in the IPv4 directory. > > I used another approach, tried to make fraglist GRO closer to plain > in terms of checksumming, as it is confusing to me why GSO packet > should have CHECKSUM_UNNECESSARY. Just let Netfilter do its mangling, > and then use classic UDP GSO magic at the end of segmentation. > I also see the idea of explicit comparing and editing of IP and UDP > headers right in __udp_gso_segment_list() rather unacceptable. If I understand UDP GRO fraglist correctly, it keeps the length of each skb of the fraglist. But your approach might change the lengths by gso_size. What if each skb of the fraglist had different lengths? For CHECKSUM_UNNECESSARY, the GROed head_skb might have an invalid checksum. But finally, the fraglist will be segmented to queue to sk_receive_queue with head_skb. We could pass the GROed head_skb with CHECKSUM_UNNECESSARY. > > Dongseok, Steffen, please test this WIP diff and tell if this one > works for you, so I could clean up the code and make a patch. > For me, it works now in any configurations, with and without > checksum/GSO/fraglist offload. 
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index c1a6f262636a..646a42e88e83 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -3674,6 +3674,7 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, >unsigned int offset) > { > struct sk_buff *list_skb = skb_shinfo(skb)->frag_list; > + unsigned int doffset = skb->data - skb_mac_header(skb); > unsigned int tnl_hlen = skb_tnl_header_len(skb); > unsigned int delta_truesize = 0; > unsigned int delta_len = 0; > @@ -3681,7 +3682,7 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, > struct sk_buff *nskb, *tmp; > int err; > > - skb_push(skb, -skb_network_offset(skb) + offset); > + skb_push(skb, doffset); > > skb_shinfo(skb)->frag_list = NULL; > > @@ -3716,12 +3717,11 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb, > delta_len += nskb->len; > delta_truesize += nskb->truesize; > > - skb_push(nskb, -skb_network_offset(nskb) + offset); > + skb_push(nskb, skb_headroom(nskb) - skb_headroom(skb)); > > skb_release_head_state(nskb); > - __copy_skb_header(nskb, skb); > + __copy_skb_header(nskb, skb); > > - skb_headers_offset_update(nskb, skb_headroom(nskb) - > skb_headroom(skb)); > skb_copy_from_linear_data_offset(skb, -tnl_hlen, >nskb->data - tnl_hlen, >offset + tnl_hlen); > diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c > index ff39e94781bf..61665fcd8c85 100644 > --- a/net/ipv4/udp_offload.c > +++ b/net/ipv4/udp_offload.c > @@ -190,13 +190,58 @@ EXPORT_SYMBOL(skb_udp_tunnel_segment); > static struct sk_buff *__udp_gso_segment_list(struct sk_buff *skb, > netdev_features_t features) > { > - unsigned int mss = skb_shinfo(skb)->gso_size; > + struct sk_buff *seg; > + struct udphdr *uh; > + unsigned int mss; > + __be16 newlen; > + __sum16 check; > + > + mss = skb_shinfo(skb)->gso_size; > + if (skb->len <= sizeof(*uh) + mss) > + return ERR_PTR(-EINVAL); > > - skb = skb_segment_list(skb, features, skb_mac_header_len(skb)); > + skb_pull(skb, sizeof(*uh)); > + > + skb = skb_segment_list(skb, features, skb->data - skb_mac_header(skb)); > if (IS_ERR(skb)) >
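For reference, here is a minimal sketch of the per-segment fixup that Dongseok's v2 changelog above describes, in the DNAT direction: propagate the NATed head skb's destination address/port into each fraglist segment with incremental checksum updates. The function name and hook point are illustrative, and a zero UDP checksum (checksum disabled) would need special-casing:

static void demo_fixup_fraglist_nat(struct sk_buff *head)
{
	__be32 daddr = ip_hdr(head)->daddr;
	__be16 dport = udp_hdr(head)->dest;
	struct sk_buff *seg;

	skb_walk_frags(head, seg) {
		struct iphdr *iph = ip_hdr(seg);
		struct udphdr *uh = udp_hdr(seg);

		if (uh->dest != dport) {
			inet_proto_csum_replace2(&uh->check, seg, uh->dest,
						 dport, false);
			uh->dest = dport;
		}
		if (iph->daddr != daddr) {
			/* The UDP pseudo-header covers the address too. */
			inet_proto_csum_replace4(&uh->check, seg, iph->daddr,
						 daddr, true);
			csum_replace4(&iph->check, iph->daddr, daddr);
			iph->daddr = daddr;
		}
	}
}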
RE: [PATCHv14 bpf-next 3/6] xdp: add a new helper for dev map multicast support
Hangbin Liu wrote: > This patch is for xdp multicast support, which has been discussed > before[0]. The goal is to be able to implement an OVS-like data plane in > XDP, i.e., a software switch that can forward XDP frames to multiple ports. > > To achieve this, an application needs to specify a group of interfaces > to forward a packet to. It is also common to want to exclude one or more > physical interfaces from the forwarding operation - e.g., to forward a > packet to all interfaces in the multicast group except the interface it > arrived on. While this could be done simply by adding more groups, this > quickly leads to a combinatorial explosion in the number of groups an > application has to maintain. > > To avoid the combinatorial explosion, we propose to include the ability > to specify an "exclude group" as part of the forwarding operation. This > needs to be a group (instead of just a single port index), because a > physical interface can be part of a logical grouping, such as a bond > device. > > Thus, the logical forwarding operation becomes a "set difference" > operation, i.e. "forward to all ports in group A that are not also in > group B". This series implements such an operation using device maps to > represent the groups. This means that the XDP program specifies two > device maps, one containing the list of netdevs to redirect to, and the > other containing the exclude list. > > To achieve this, I re-implement a new helper bpf_redirect_map_multi() > to accept two maps, the forwarding map and exclude map. The forwarding > map could be DEVMAP or DEVMAP_HASH, but the exclude map *must* be > DEVMAP_HASH to get better performance. If users don't want to use an exclude > map and simply want to stop redirecting back to the ingress device, they > can use the flag BPF_F_EXCLUDE_INGRESS. > > As both bpf_xdp_redirect_map() and this new helper are using struct > bpf_redirect_info, I add a new ex_map and set tgt_value to NULL in the > new helper to distinguish it from bpf_xdp_redirect_map(). > > Also I keep the general data path in net/core/filter.c and the native data > path in kernel/bpf/devmap.c so we can use direct calls to get better > performance. [...] > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 0cf3976ce77c..0e6468cd0ab9 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -164,6 +164,7 @@ void xdp_warn(const char *msg, const char *func, const > int line); > #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__) > > struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp); > +struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf); > > static inline > void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp) > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index a1ad32456f89..ecf5d117b96a 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3830,6 +3830,27 @@ union bpf_attr { > * Return > * A pointer to a struct socket on success or NULL if the file is > * not a socket. > + * > + * long bpf_redirect_map_multi(struct bpf_map *map, struct bpf_map *ex_map, > u64 flags) > + * Description > + * This is a multicast implementation for XDP redirect. It will > + * redirect the packet to ALL the interfaces in *map*, but > + * exclude the interfaces in *ex_map*. > + * > + * The forwarding *map* could be either BPF_MAP_TYPE_DEVMAP or > + * BPF_MAP_TYPE_DEVMAP_HASH. But the *ex_map* must be > + * BPF_MAP_TYPE_DEVMAP_HASH to get better performance. 
It would be good to add a note that ex_map _must_ be keyed by ifindex for the helper to work. It's the obvious way to key a hashmap, but not required IIRC. > + * > + * Currently the *flags* only supports *BPF_F_EXCLUDE_INGRESS*, > + * which additionally excludes the current ingress device. > + * > + * See also bpf_redirect_map() as a unicast implementation, > + * which supports redirecting packet to a specific ifindex > + * in the map. As both helpers use struct bpf_redirect_info > + * to store the redirect info, we will use a NULL tgt_value > + * to distinguish multicast and unicast redirecting. > + * Return > + * **XDP_REDIRECT** on success, or **XDP_ABORTED** on error. > */ [...] > + > +int dev_map_enqueue_multi(struct xdp_buff *xdp, struct net_device *dev_rx, > + struct bpf_map *map, struct bpf_map *ex_map, > + u32 flags) > +{ > + struct bpf_dtab_netdev *obj = NULL, *next_obj = NULL; > + struct xdp_frame *xdpf, *nxdpf; > + bool last_one = false; > + int ex_ifindex; > + u32 key, next_key; > + > + ex_ifindex = flags & BPF_F_EXCLUDE_INGRESS ? dev_rx->ifindex : 0; > + > + /* Find first available obj */ > + obj = devmap_get_next_obj(xdp, map, ex_map, NULL, &key, ex_ifindex); > + if (!obj) > + return 0
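For illustration, a minimal XDP program using the proposed helper could look like the sketch below (map names and sizes are hypothetical; the exclude map is keyed by ifindex as suggested above; bpf_redirect_map_multi() is the helper added by this series, not an existing mainline API):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 32);
	__type(key, __u32);
	__type(value, __u32);
} forward_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(max_entries, 32);
	__type(key, __u32);	/* keyed by ifindex, per the note above */
	__type(value, __u32);
} exclude_map SEC(".maps");

SEC("xdp")
int xdp_multicast(struct xdp_md *ctx)
{
	/* Forward to every port in forward_map except the ports in
	 * exclude_map and the ingress device itself.
	 */
	return bpf_redirect_map_multi(&forward_map, &exclude_map,
				      BPF_F_EXCLUDE_INGRESS);
}

char _license[] SEC("license") = "GPL";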
Re: [PATCH v3 1/3] bpf,x64: pad NOPs to make images converge more easily
On Fri, Jan 15, 2021 at 08:04:06AM -0800, Alexei Starovoitov wrote: > On Fri, Jan 15, 2021 at 1:41 AM Gary Lin wrote: > > > > On Thu, Jan 14, 2021 at 10:37:33PM -0800, Alexei Starovoitov wrote: > > > On Thu, Jan 14, 2021 at 1:54 AM Gary Lin wrote: > > > > * pass to emit the final image. > > > > */ > > > > - for (pass = 0; pass < 20 || image; pass++) { > > > > - proglen = do_jit(prog, addrs, image, oldproglen, &ctx); > > > > + for (pass = 0; pass < MAX_PASSES || image; pass++) { > > > > + if (!padding && pass >= PADDING_PASSES) > > > > + padding = true; > > > > + proglen = do_jit(prog, addrs, image, oldproglen, &ctx, > > > > padding); > > > > > > I'm struggling to reconcile the discussion we had before holidays with > > > the discussion you guys had in v2: > > > > > > >> What is the rationale for the latter when JIT is called again for > > > >> subprog to fill in relative > > > >> call locations? > > > >> > > > > Hmm, my thinking was that we only enable padding for those programs > > > > which are already padded before. But, you're right. For the programs > > > > converging without padding, enabling padding won't change the final > > > > image, so it's safe to always set "padding" to true for the extra pass. > > > > > > > > Will remove the "padded" flag in v3. > > > > > > I'm not following why "enabling padding won't change the final image" > > > is correct. > > > Say the subprog image converges without padding. > > > Then for subprog we call JIT again. > > > Now extra_pass==true and padding==true. > > > The JITed image will be different. > > Actually no. > > > > > The test in patch 3 should have caught it, but it didn't, > > > because it checks for a subprog that needed padding. > > > The extra_pass needs to emit insns exactly in the right spots. > > > Otherwise jump targets will be incorrect. > > > The saved addrs[] array is crucial. > > > If extra_pass emits different things the instruction starts won't align > > > to places where addrs[] expects them to be. > > > > > When calculating padding bytes, if the image already converges, the > > emitted instruction size just matches (addrs[i] - addrs[i-1]), so > > emit_nops() emits 0 bytes, and the image doesn't change. > > I see. You're right. That's very tricky. > > The patch set doesn't apply cleanly. > Could you please rebase and add a detailed comment about this logic? > > Also please add comments why we check: > nops != 0 && nops != 4 > nops != 0 && nops != 2 && nops != 5 > nops != 0 && nops != 3 > None of it is obvious. Sure, I'll add comments for them. > > Does your single test cover all combinations of numbers? The test case only covers the NOP JUMP for nops == 0 and nops == 2. I have to figure out how to create a large enough program to trigger the transition from imm32 jump to imm8 jump. Gary Lin
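To make the convergence argument concrete, the padding step amounts to something like the following sketch (illustrative names, not the literal patch code; emit_nops() is the helper this series adds to the x86 JIT):

/* Pad the instruction just emitted for insn i out to the size that
 * addrs[] recorded in the previous pass. Once the image has
 * converged, ins_size already equals addrs[i] - addrs[i - 1], so
 * nops is 0 and nothing is emitted, which is why enabling padding
 * for the extra pass cannot change an already-converged image.
 */
static void pad_to_prev_size(u8 **pprog, int ins_size, int *addrs, int i)
{
	int nops = (addrs[i] - addrs[i - 1]) - ins_size;

	if (nops > 0)
		emit_nops(pprog, nops);
}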
[PATCH v3 12/21] net: stmmac: dwmac-sun8i: Prepare for second EMAC clock register
The Allwinner H616 SoC has two EMAC controllers, with the second one being tied to the internal PHY, but also using a separate EMAC clock register. To tell the driver about which clock register to use, we add a parameter to our syscon phandle. The driver will use this value as an index into the regmap, so that we can address more than the first register, if needed. Signed-off-by: Andre Przywara --- drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c index 58e0511badba..00c10ec7b693 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sun8i.c @@ -1129,6 +1129,8 @@ static int sun8i_dwmac_probe(struct platform_device *pdev) struct stmmac_priv *priv; struct net_device *ndev; struct regmap *regmap; + struct reg_field syscon_field; + u32 syscon_idx = 0; ret = stmmac_get_platform_resources(pdev, &stmmac_res); if (ret) @@ -1190,8 +1192,12 @@ static int sun8i_dwmac_probe(struct platform_device *pdev) return ret; } - gmac->regmap_field = devm_regmap_field_alloc(dev, regmap, - *gmac->variant->syscon_field); + syscon_field = *gmac->variant->syscon_field; + ret = of_property_read_u32_index(pdev->dev.of_node, "syscon", 1, +&syscon_idx); + if (!ret) + syscon_field.reg += syscon_idx * sizeof(u32); + gmac->regmap_field = devm_regmap_field_alloc(dev, regmap, syscon_field); if (IS_ERR(gmac->regmap_field)) { ret = PTR_ERR(gmac->regmap_field); dev_err(dev, "Unable to map syscon register: %d\n", ret); @@ -1263,6 +1269,8 @@ static const struct of_device_id sun8i_dwmac_match[] = { .data = &emac_variant_a64 }, { .compatible = "allwinner,sun50i-h6-emac", .data = &emac_variant_h6 }, + { .compatible = "allwinner,sun50i-h616-emac", + .data = &emac_variant_h6 }, { } }; MODULE_DEVICE_TABLE(of, sun8i_dwmac_match); -- 2.17.5
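On the DT side, the extra cell in the syscon phandle then selects which EMAC clock register to use; a fragment for the second EMAC could look like this (node name and unit address are illustrative, not taken from this series):

emac1: ethernet@5030000 {
	compatible = "allwinner,sun50i-h616-emac";
	/* The second cell is read via of_property_read_u32_index(..., 1, ...)
	 * above and offsets the regmap field by one u32 register. */
	syscon = <&syscon 1>;
};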
Re: [RFC PATCH 0/7] Support for virtio-net hash reporting
On 2021/1/17 下午3:57, Yuri Benditovich wrote: [...] One other approach that has been used within linear call stacks is out of band. Like percpu variables softnet_data.xmit.more and mirred_rec_level. But that is perhaps a bit overwrought for this use case. Yes, and if we go that way then eBPF turns out to be a burden since we need to invent helpers to access those auxiliary data structures.
It would be better then to hard-code the RSS in the kernel. Instead, you could just run the flow_dissector in tun_put_user if the feature is negotiated. Indeed, the flow dissector seems more apt to me than BPF here. Note that the flow dissector internally can be overridden by a BPF program if the admin so chooses. While this set of patches relates to hash delivery in the virtio-net packet in general, it was prepared in the context of the RSS feature implementation as defined in the virtio spec [1]. In the case of RSS it is not enough to run the flow_dissector in tun_put_user: in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash, hash type and queue index according to the (mapped) parameters (key, hash types, indirection table) received from the guest. TUNSETSTEERINGEBPF was added to support more diverse queue selection than the default in case of multiqueue tun. Not sure what the exact use cases are. But RSS is exactly the purpose of the flow dissector. It is used for that purpose in the software variant RPS. The flow dissector implements a superset of the RSS spec, and certainly computes a four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC has already computed a 4-tuple hash. What it does not give is a type indication, such as VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used. In datapaths where the NIC has already computed the four-tuple hash and stored i
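For reference, the header this thread keeps circling around is the hash-report extension defined by the virtio spec, and the producer side in tun_put_user would reduce to something like the sketch below (the struct layout follows the spec; the helper and the skb hash_report_type field are this series' proposal, not existing kernel API):

/* Extended virtio-net header defined by the hash report feature. */
struct virtio_net_hdr_v1_hash {
	struct virtio_net_hdr_v1 hdr;
	__le32 hash_value;	/* e.g. taken from skb->hash */
	__le16 hash_report;	/* e.g. VIRTIO_NET_HASH_REPORT_TCPv6 */
	__le16 padding;
};

/* Sketch: what tun_put_user would do with the values the steering
 * program stored earlier (hash in skb->hash, hash type in the skb
 * field proposed by this series).
 */
static void tun_fill_hash_report(const struct sk_buff *skb,
				 struct virtio_net_hdr_v1_hash *hdr)
{
	hdr->hash_value  = cpu_to_le32(skb->hash);
	hdr->hash_report = cpu_to_le16(skb->hash_report_type);
}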
Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky wrote: > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote: > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky wrote: > > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote: > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote: > > > > > > > > > That said, it only works at the driver level. So if the firmware is > > > > > the one that is having to do this it also occurred to me that if this > > > > > update happened on FLR that would probably be preferred. > > > > > > > > FLR is not free, I'd prefer not to require it just for some > > > > philosophical reason. > > > > > > > > > Since the mlx5 already supports devlink I don't see any reason why the > > > > > driver couldn't be extended to also support the devlink resource > > > > > interface and apply it to interrupts. > > > > > > > > So you are OK with the PF changing the VF as long as it is devlink not > > > > sysfs? Seems rather arbitrary? > > > > > > > > Leon knows best, but if I recall devlink becomes wonky when the VF > > > > driver doesn't provide a devlink instance. How does it do reload of a > > > > VF then? > > > > > > > > I think you end up with essentially the same logic as presented here > > > > with sysfs. > > > > > > The reasons why I decided to go with sysfs are: > > > 1. This MSI-X table size change is applicable to ALL devices in the world, > > > and not only netdev. > > > > In the PCI world MSI-X table size is a read only value. That is why I > > am pushing back on this as a PCI interface. > > And it stays read-only. Only if you come at it directly. What this is adding is a back door that is visible as a part of the VF sysfs. > > > > > 2. This is purely PCI field and apply equally with same logic to all > > > subsystems and not to netdev only. > > > > Again, calling this "purely PCI" is the sort of wording that has me > > concerned. I would prefer it if we avoid that wording. There is much > > more to this than just modifying the table size field. The firmware is > > having to shift resources between devices and this potentially has an > > effect on the entire part, not just one VF. > > It is internal to HW implementation, dumb device can solve it differently. That is my point. I am worried about "dumb devices" that may follow. I would like to see the steps that should be taken to prevent these sorts of things called out specifically. Basically this isn't just modifying the PCIe config space, it is actually resizing the PBA and MSI-X table. > > > > > 3. The sysfs interface is the standard way of configuring PCI/core, not > > > devlink. > > > > This isn't PCI core that is being configured. It is the firmware for > > the device. You are working with resources that are shared between > > multiple functions. > > I'm ensuring that "lspci -vv .." will work correctly after such change. > It is PCI core responsibility. The current code doesn't work on anything with a driver loaded on it. In addition the messaging provided is fairly minimal, which results in an interface that will be difficult to understand when it doesn't work. In addition there is currently only one piece of hardware that works with this interface, which is the mlx5. My concern is that this is adding overhead to all VFs that will not be used by most SR-IOV capable devices. In my view it would make much more sense to have a top-down approach instead of bottom-up, where the PF is registering interfaces for the VFs.
If you want yet another compromise, I would be much happier with the PF registering the sysfs interfaces on the VFs rather than the VFs registering the interface and hoping the PF supports it. At least with that you are guaranteed the PF will respond to the interface when it is registered. > > > > > 4. This is how orchestration software provisions VFs already. It fits > > > real world usage of SR-IOV, not the artificial one that is proposed during > > > the discussion. > > > > What do you mean this is how they are doing it already? Do you have > > something out-of-tree and that is why you are fighting to keep the > > sysfs? If so that isn't a valid argument. > > I have Kubernetes and OpenStack, indeed they are not part of the kernel tree. > They already use the sriov_driver_autoprobe sysfs knob to disable autobind > before even starting. They configure MACs and bind VFs through sysfs/netlink > already. For them, the read/write of sysfs that is going to be bound to > the already created VM with known CPU properties, fits perfectly. By that argument the same could be said about netlink. What I don't get is why it is okay to configure the MAC through netlink but suddenly when we are talking about interrupts it is out of the question. As far as the binding goes, that is the driver interface, which is more or less grandfathered in anyway, as there aren't too many ways to deal with them and there isn't an alternate interface for the drivers to define support. > > > > > So the idea to
Re: [PATCH bpf 1/2] samples/bpf: Set flag __SANE_USERSPACE_TYPES__ for MIPS to fix build warnings
On 01/14/2021 01:12 AM, Yonghong Song wrote: On 1/13/21 2:57 AM, Tiezhu Yang wrote: MIPS needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select 'int-ll64.h' in arch/mips/include/uapi/asm/types.h and avoid compile warnings when printing __u64 with %llu, %llx or %lld. Could you mention which command produces the following warning? make M=samples/bpf printf("0x%02x : %llu\n", key, value); ~~~^ ~ %lu printf("%s/%llx;", sym->name, addr); ~~~^ %lx printf(";%s %lld\n", key->waker, count); ~~~^ ~ %ld Signed-off-by: Tiezhu Yang --- samples/bpf/Makefile | 4 tools/include/linux/types.h | 3 +++ 2 files changed, 7 insertions(+) diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 26fc96c..27de306 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -183,6 +183,10 @@ BPF_EXTRA_CFLAGS := $(ARM_ARCH_SELECTOR) TPROGS_CFLAGS += $(ARM_ARCH_SELECTOR) endif +ifeq ($(ARCH), mips) +TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__ +endif + This change looks okay based on the description in arch/mips/include/uapi/asm/types.h ''' /* * We don't use int-l64.h for the kernel anymore but still use it for * userspace to avoid code changes. * * However, some user programs (e.g. perf) may not want this. They can * flag __SANE_USERSPACE_TYPES__ to get int-ll64.h here. */ ''' TPROGS_CFLAGS += -Wall -O2 TPROGS_CFLAGS += -Wmissing-prototypes TPROGS_CFLAGS += -Wstrict-prototypes diff --git a/tools/include/linux/types.h b/tools/include/linux/types.h index 154eb4e..e9c5a21 100644 --- a/tools/include/linux/types.h +++ b/tools/include/linux/types.h @@ -6,7 +6,10 @@ #include #include +#ifndef __SANE_USERSPACE_TYPES__ #define __SANE_USERSPACE_TYPES__ /* For PPC64, to get LL64 types */ +#endif What problem does this patch fix? If we add "TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__" in samples/bpf/Makefile, the following error appears: Auto-detecting system features: ...libelf: [ on ] ... zlib: [ on ] ... bpf: [ OFF ] BPF API too old make[3]: *** [Makefile:293: bpfdep] Error 1 make[2]: *** [Makefile:156: all] Error 2 With #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h, the above error is gone. If this header is used, you can just change the comment from "PPC64" to "PPC64/MIPS", right? If we include <linux/types.h> in the source files which have compile warnings when printing __u64 with %llu, %llx or %lld, it has no effect, because it actually includes usr/include/linux/types.h instead of tools/include/linux/types.h; the include directories in samples/bpf/Makefile are searched in order, and -I./usr/include comes before -I./tools/include. So I think defining __SANE_USERSPACE_TYPES__ for MIPS in samples/bpf/Makefile is proper; at the same time, adding #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h can avoid the build error and has no side effect. I will send v2 later and mention in the commit message that this is MIPS related. Thanks, Tiezhu + #include #include
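The underlying type mismatch is easy to reproduce in a few lines (a minimal illustration, not code from the patch):

#include <linux/types.h>
#include <stdio.h>

int main(void)
{
	__u64 value = 42;

	/* With the MIPS userspace default of int-l64.h, __u64 is
	 * 'unsigned long', so %llu triggers a -Wformat warning.
	 * Defining __SANE_USERSPACE_TYPES__ before <linux/types.h>
	 * selects int-ll64.h instead, making __u64 'unsigned long
	 * long' and this printf clean.
	 */
	printf("%llu\n", value);
	return 0;
}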
Re: [PATCH iproute2] iplink: work around rtattr length limits for IFLA_VFINFO_LIST
On 1/16/21 6:21 PM, Jakub Kicinski wrote: > > I wonder. There is something inherently risky about making > a precedent for user space depending on invalid kernel output. > > _If_ we want to fix the kernel, IMO we should only fix the kernel. > IMHO this is a kernel bug that should be fixed. An easy fix is to check for the overflow in nla_nest_end and return an error. Sadly, the nla_nest_end return code is ignored and backporting any change to fix that will be a nightmare. A warning will identify places that need to be fixed. We can at least catch and fix this overflow, which is by far the primary known victim of the rollover.
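For concreteness, the check being suggested would look something like this sketch (illustrative only; today's nla_nest_end() just returns the total message length, and callers largely ignore the return value):

/* Close a nest but fail instead of letting nla_len wrap:
 * nlattr::nla_len is a __u16, so a nest longer than USHRT_MAX
 * silently overflows in the current code.
 */
static int nla_nest_end_checked(struct sk_buff *skb, struct nlattr *start)
{
	int len = skb_tail_pointer(skb) - (unsigned char *)start;

	if (len > USHRT_MAX)
		return -EMSGSIZE;

	start->nla_len = len;
	return skb->len;
}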
Re: [PATCH iproute2 2/2] vrf: fix ip vrf exec with libbpf
On 1/17/21 3:54 PM, Luca Boccassi wrote: > The size of bpf_insn is passed to bpf_load_program instead of the number > of elements as it expects, so ip vrf exec fails with: > > $ sudo ip link add vrf-blue type vrf table 10 > $ sudo ip link set dev vrf-blue up > $ sudo ip/ip vrf exec vrf-blue ls > Failed to load BPF prog: 'Invalid argument' > last insn is not an exit or jmp > processed 0 insns (limit 100) max_states_per_insn 0 total_states 0 > peak_states 0 mark_read 0 > Kernel compiled with CGROUP_BPF enabled? > > https://bugs.debian.org/980046 > > Reported-by: Emmanuel DECAEN > > Signed-off-by: Luca Boccassi > --- > lib/bpf_glue.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern
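For context, the bug is a bytes-vs-count mix-up: libbpf's bpf_load_program() takes the number of instructions, not sizeof() of the array. A sketch of the corrected call (the trivial program and variable names are illustrative, not the exact ipvrf code):

/* A trivial "allow" cgroup/sock program. */
struct bpf_insn prog[] = {
	BPF_MOV64_IMM(BPF_REG_0, 1),	/* r0 = 1 (allow) */
	BPF_EXIT_INSN(),
};
char log_buf[4096];
/* Passing sizeof(prog) here is the bug this patch fixes. */
size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);

int fd = bpf_load_program(BPF_PROG_TYPE_CGROUP_SOCK, prog, insns_cnt,
			  "GPL", 0, log_buf, sizeof(log_buf));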
Re: [PATCH iproute2 1/2] vrf: print BPF log buffer if bpf_program_load fails
On 1/17/21 3:54 PM, Luca Boccassi wrote: > Necessary to understand what is going on when bpf_program_load fails > > Signed-off-by: Luca Boccassi > --- > ip/ipvrf.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH iproute2-next v2 0/7] dcb: Support APP, DCBX objects
On 1/1/21 5:03 PM, Petr Machata wrote: > Add support to the dcb tool for the following two DCB objects: > > - APP, which allows configuration of traffic prioritization rules based on > several possible packet headers. > > - DCBX, which is a 1-byte bitfield of flags that configure whether the DCBX > protocol is implemented in the device or in the host, and which version > of the protocol should be used. > > Patch #1 adds a new helper for finding a name of a given dsfield value. > This is useful for APP DSCP-to-priority rules, which can use human-readable > DSCP names. > > Patches #2, #3 and #4 extend existing interfaces for, respectively, parsing > of the X:Y mappings, for setting a DCB object, and for getting a DCB > object. > > In patch #5, support for the command line argument -N / --Numeric is > added. The APP tool later uses it to decide whether to format DSCP values > as human-readable strings or as plain numbers. > > Patches #6 and #7 add the subtools themselves and their man pages. > applied to iproute2-next.
[PATCH net-next] net/sched: cls_flower add CT_FLAGS_INVALID flag support
From: wenxu This patch adds the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag so that the ct_state can match invalid conntrack state. Signed-off-by: wenxu --- include/linux/skbuff.h | 4 ++-- include/net/sch_generic.h| 1 + include/uapi/linux/pkt_cls.h | 1 + net/core/dev.c | 2 ++ net/core/flow_dissector.c| 13 + net/sched/act_ct.c | 1 + net/sched/cls_flower.c | 6 +- 7 files changed, 21 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c9568cf..e22ccf0 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1353,8 +1353,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, skb_flow_dissect_ct(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, - u16 *ctinfo_map, - size_t mapsize); + u16 *ctinfo_map, size_t mapsize, + bool post_ct); void skb_flow_dissect_tunnel_info(const struct sk_buff *skb, struct flow_dissector *flow_dissector, diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index 639e465..e7bee99 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -388,6 +388,7 @@ struct qdisc_skb_cb { #define QDISC_CB_PRIV_LEN 20 unsigned char data[QDISC_CB_PRIV_LEN]; u16 mru; + bool post_ct; }; typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv); diff --git a/include/uapi/linux/pkt_cls.h b/include/uapi/linux/pkt_cls.h index ee95f42..709668e 100644 --- a/include/uapi/linux/pkt_cls.h +++ b/include/uapi/linux/pkt_cls.h @@ -591,6 +591,7 @@ enum { TCA_FLOWER_KEY_CT_FLAGS_ESTABLISHED = 1 << 1, /* Part of an existing connection. */ TCA_FLOWER_KEY_CT_FLAGS_RELATED = 1 << 2, /* Related to an established connection. */ TCA_FLOWER_KEY_CT_FLAGS_TRACKED = 1 << 3, /* Conntrack has occurred. */ + TCA_FLOWER_KEY_CT_FLAGS_INVALID = 1 << 4, /* Conntrack is invalid. */ }; enum { diff --git a/net/core/dev.c b/net/core/dev.c index bae35c1..9dce3f7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3878,6 +3878,7 @@ int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff *skb) /* qdisc_skb_cb(skb)->pkt_len was already set by the caller.
*/ qdisc_skb_cb(skb)->mru = 0; + qdisc_skb_cb(skb)->post_ct = false; mini_qdisc_bstats_cpu_update(miniq, skb); switch (tcf_classify(skb, miniq->filter_list, &cl_res, false)) { @@ -4960,6 +4961,7 @@ static __latent_entropy void net_tx_action(struct softirq_action *h) qdisc_skb_cb(skb)->pkt_len = skb->len; qdisc_skb_cb(skb)->mru = 0; + qdisc_skb_cb(skb)->post_ct = false; skb->tc_at_ingress = 1; mini_qdisc_bstats_cpu_update(miniq, skb); diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c index 2d70ded..c565c7a 100644 --- a/net/core/flow_dissector.c +++ b/net/core/flow_dissector.c @@ -237,9 +237,8 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, void skb_flow_dissect_ct(const struct sk_buff *skb, struct flow_dissector *flow_dissector, - void *target_container, - u16 *ctinfo_map, - size_t mapsize) + void *target_container, u16 *ctinfo_map, + size_t mapsize, bool post_ct) { #if IS_ENABLED(CONFIG_NF_CONNTRACK) struct flow_dissector_key_ct *key; @@ -251,13 +250,19 @@ void skb_flow_dissect_meta(const struct sk_buff *skb, return; ct = nf_ct_get(skb, &ctinfo); - if (!ct) + if (!ct && !post_ct) return; key = skb_flow_dissector_target(flow_dissector, FLOW_DISSECTOR_KEY_CT, target_container); + if (!ct) { + key->ct_state = TCA_FLOWER_KEY_CT_FLAGS_TRACKED | + TCA_FLOWER_KEY_CT_FLAGS_INVALID; + return; + } + if (ctinfo < mapsize) key->ct_state = ctinfo_map[ctinfo]; #if IS_ENABLED(CONFIG_NF_CONNTRACK_ZONES) diff --git a/net/sched/act_ct.c b/net/sched/act_ct.c index 83a5c67..b344207 100644 --- a/net/sched/act_ct.c +++ b/net/sched/act_ct.c @@ -1030,6 +1030,7 @@ static int tcf_ct_act(struct sk_buff *skb, const struct tc_action *a, out: tcf_action_update_bstats(&c->common, skb); + qdisc_skb_cb(skb)->post_ct = true; if (defrag) qdisc_skb_cb(skb)->pkt_len = skb->len; return retval; diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c index 1319986..e8653d9 100644 --- a/net/sched/cls_flower.c +++ b/net/sched/cls_flower.c @@ -305,6 +305,9 @@ static int fl_classify(struct sk_buff *skb, const struct tcf_proto *tp,
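Once iproute2 grows matching support, the new bit would be exercised with something like the following command (illustrative; it assumes the tc flower ct_state keyword for the new flag is 'inv'):

tc filter add dev eth0 ingress protocol ip flower ct_state +trk+inv action drop

i.e. drop packets that went through conntrack but could not be associated with a valid connection.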
[PATCH] net: usb: qmi_wwan: added support for Thales Cinterion PLSx3 modem family
Bus 003 Device 009: ID 1e2d:006f Device Descriptor: bLength18 bDescriptorType 1 bcdUSB 2.00 bDeviceClass 239 Miscellaneous Device bDeviceSubClass 2 ? bDeviceProtocol 1 Interface Association bMaxPacketSize064 idVendor 0x1e2d idProduct 0x006f bcdDevice0.00 iManufacturer 3 Cinterion Wireless Modules iProduct2 PLSx3 iSerial 4 fa3c1419 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 303 bNumInterfaces 9 bConfigurationValue 1 iConfiguration 1 Cinterion Configuration bmAttributes 0xe0 Self Powered Remote Wakeup MaxPower 500mA Interface Association: bLength 8 bDescriptorType11 bFirstInterface 0 bInterfaceCount 2 bFunctionClass 2 Communications bFunctionSubClass 2 Abstract (modem) bFunctionProtocol 1 AT-commands (v.25ter) iFunction 0 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber0 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 2 Communications bInterfaceSubClass 2 Abstract (modem) bInterfaceProtocol 1 AT-commands (v.25ter) iInterface 0 CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x02 line coding and serial state CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 1 CDC Union: bMasterInterface0 bSlaveInterface 1 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x81 EP 1 IN bmAttributes3 Transfer TypeInterrupt Synch Type None Usage Type Data wMaxPacketSize 0x0040 1x 64 bytes bInterval 5 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber1 bAlternateSetting 0 bNumEndpoints 2 bInterfaceClass10 CDC Data bInterfaceSubClass 0 Unused bInterfaceProtocol 0 iInterface 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x82 EP 2 IN bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0200 1x 512 bytes bInterval 0 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x01 EP 1 OUT bmAttributes2 Transfer TypeBulk Synch Type None Usage Type Data wMaxPacketSize 0x0200 1x 512 bytes bInterval 0 Interface Association: bLength 8 bDescriptorType11 bFirstInterface 2 bInterfaceCount 2 bFunctionClass 2 Communications bFunctionSubClass 2 Abstract (modem) bFunctionProtocol 1 AT-commands (v.25ter) iFunction 0 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber2 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 2 Communications bInterfaceSubClass 2 Abstract (modem) bInterfaceProtocol 1 AT-commands (v.25ter) iInterface 0 CDC Header: bcdCDC 1.10 CDC ACM: bmCapabilities 0x02 line coding and serial state CDC Call Management: bmCapabilities 0x03 call management use DataInterface bDataInterface 3 CDC Union: bMasterInterface2 bSlaveInterface 3 Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x83 EP 3 IN bmAttributes3 Transfer TypeInterrupt Synch Type None Usage Type Data wMaxPacketSize 0x0040 1x 64 bytes bInterval 5 Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber3 bAlternateSett
[PATCH net] tcp: Fix potential use-after-free due to double kfree().
When receiving an ACK with a valid SYN cookie, cookie_v4_check() allocates struct request_sock and then can allocate inet_rsk(req)->ireq_opt. After that, tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full socket into ehash and sets ireq_opt to NULL. Otherwise, tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full socket. The commit 01770a1661657 ("tcp: fix race condition when creating child sockets from syncookies") added a new path, in which more than one core creates full sockets for the same SYN cookie. Currently, the core which loses the race frees the full socket without resetting inet_opt, so that both sock_put() and reqsk_put() call kfree() for the same memory:

sock_put
  sk_free
    __sk_free
      sk_destruct
        __sk_destruct
          sk->sk_destruct/inet_sock_destruct
            kfree(rcu_dereference_protected(inet->inet_opt, 1));

reqsk_put
  reqsk_free
    __reqsk_free
      req->rsk_ops->destructor/tcp_v4_reqsk_destructor
        kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));

Calling kmalloc() between the double kfree() can lead to use-after-free, so this patch fixes it by setting inet_opt to NULL before sock_put(). As a side note, this kind of issue does not happen for IPv6. This is because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts, which correspond to ireq_opt in IPv4. Fixes: 01770a166165 ("tcp: fix race condition when creating child sockets from syncookies") CC: Ricardo Dias Signed-off-by: Kuniyuki Iwashima Reviewed-by: Benjamin Herrenschmidt --- net/ipv4/tcp_ipv4.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 58207c7769d0..87eb614dab27 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1595,6 +1595,8 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, tcp_move_syn(newtp, req); ireq->ireq_opt = NULL; } else { + newinet->inet_opt = NULL; + if (!req_unhash && found_dup_sk) { /* This code path should only be executed in the * syncookie case only @@ -1602,8 +1604,6 @@ struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb, bh_unlock_sock(newsk); sock_put(newsk); newsk = NULL; - } else { - newinet->inet_opt = NULL; } } return newsk; -- 2.17.2 (Apple Git-113)
Re: [PATCH net-next v2 0/7] virtio-net support xdp socket zero copy xmit
On 2021/1/16 上午10:59, Xuan Zhuo wrote: XDP socket is an excellent kernel-bypass network transmission framework. The zero copy feature of xsk (XDP socket) needs to be supported by the driver; the performance of zero copy is very good. mlx5 and Intel ixgbe already support this feature. This patch set allows virtio-net to support xsk's zerocopy xmit feature. xsk's zerocopy rx involves major changes to virtio-net, and I hope to submit it after this patch set is accepted. Compared with other drivers, virtio-net does not directly obtain the DMA address, so I first obtain the xsk page, and then pass the page to virtio. When recycling the sent packets, we previously had to distinguish between skb and xdp; now we have to distinguish between skb, xdp and xsk, so the second patch solves this problem first. The last four patches are used to support xsk zerocopy in virtio-net: 1. support xsk enable/disable 2. realize the function of xsk packet sending 3. implement xsk wakeup callback 4. set xsk completed when packet sent done

Performance testing: a UDP packet tool implemented on the xsk interface vs sockperf (kernel UDP):

xsk zero copy in virtio-net:
CPU     PPS      MSGSIZE
28.7%   3833857  64
38.5%   3689491  512
38.9%   2787096  1456

Some questions on the results: 1) What's the setup on the vhost? 2) What's the setup of the mitigation in both host and guest? 3) Any analysis of the possible bottleneck via perf or other tools? Thanks

xsk without zero copy in virtio-net:
CPU     PPS      MSGSIZE
100%    1916747  64
100%    1775988  512
100%    1440054  1456

sockperf:
CPU     PPS      MSGSIZE
100%    713274   64
100%    701024   512
100%    695832   1456

Xuan Zhuo (7):
  xsk: support get page for drv
  virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
  xsk, virtio-net: prepare for support xsk zerocopy xmit
  virtio-net, xsk: support xsk enable/disable
  virtio-net, xsk: realize the function of xsk packet sending
  virtio-net, xsk: implement xsk wakeup callback
  virtio-net, xsk: set xsk completed when packet sent done

 drivers/net/virtio_net.c    | 559 +++-
 include/linux/netdevice.h   |   1 +
 include/net/xdp_sock_drv.h  |  10 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk_buff_pool.c     |  10 +-
 5 files changed, 523 insertions(+), 58 deletions(-)

-- 1.8.3.1
Re: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On Fri, Jan 15, 2021 at 05:12:33PM +0000, Alexander Lobakin wrote: > From: Dongseok Yi > Date: Fri, 15 Jan 2021 22:20:35 +0900 > > > UDP/IP header of UDP GROed frag_skbs are not updated even after NAT > > forwarding. Only the header of head_skb from ip_finish_output_gso -> > > skb_gso_segment is updated but following frag_skbs are not updated. > > > > A call path skb_mac_gso_segment -> inet_gso_segment -> > > udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list > > does not try to update UDP/IP header of the segment list but copy > > only the MAC header. > > > > Update dport, daddr and checksums of each skb of the segment list > > in __udp_gso_segment_list. It covers both SNAT and DNAT. > > > > Fixes: 9fd1ff5d2ac7 (udp: Support UDP fraglist GRO/GSO.) > > Signed-off-by: Dongseok Yi > > --- > > v1: > > Steffen Klassert said, there could be 2 options. > > https://lore.kernel.org/patchwork/patch/1362257/ > > I was trying to write a quick fix, but it was not easy to forward > > the segmented list. Currently, assuming DNAT only. > > > > v2: > > Per Steffen Klassert's request, move the procedure from > > udp4_ufo_fragment to __udp_gso_segment_list and support SNAT. > > > > To Alexander Lobakin, I've checked your email late. Just use this > > patch as a reference. It supports SNAT too, but does not support IPv6 > > yet. I cannot make IPv6 header changes in __udp_gso_segment_list because > > the file is in the IPv4 directory. > > I used another approach, tried to make fraglist GRO closer to plain > in terms of checksumming, as it is confusing to me why a GSO packet > should have CHECKSUM_UNNECESSARY. This is intentional. With fraglist GRO, we don't mangle packets in the standard (non NAT) case. So the checksum is still correct after segmentation. That is one reason why it has good forwarding performance when software segmentation is needed. Checksumming touches the whole packet and has a lot of overhead, so it is helpful to avoid it whenever possible. We should find a way to do the checksum only when we really need it. I.e. only if the headers of the head skb changed.
Re: [PATCH net-next v2 2/7] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
On 2021/1/16 上午10:59, Xuan Zhuo wrote: If xsk is supported, a new kind of ptr will be encountered while freeing the old ptrs. To distinguish ctx sent by XDP_TX from ctx sent by xsk, a struct is added here: virtnet_xdp_type.type is used to distinguish the different ctx kinds, and virtnet_xdp_type.offset records the offset between the "true ctx" and virtnet_xdp_type. The newly added virtnet_xsk_hdr will be used for xsk. Signed-off-by: Xuan Zhuo Any reason that you can't simply encode the type in the pointer itself as we used to do? #define VIRTIO_XSK_FLAG BIT(1) ? --- drivers/net/virtio_net.c | 75 ++-- 1 file changed, 60 insertions(+), 15 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index ba8e637..e707c31 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -94,6 +94,22 @@ struct virtnet_rq_stats { u64 kicks; }; +enum { + XDP_TYPE_XSK, + XDP_TYPE_TX, +}; + +struct virtnet_xdp_type { + int offset:24; + unsigned type:8; +}; + +struct virtnet_xsk_hdr { + struct virtnet_xdp_type type; + struct virtio_net_hdr_mrg_rxbuf hdr; + u32 len; +}; + #define VIRTNET_SQ_STAT(m)offsetof(struct virtnet_sq_stats, m) #define VIRTNET_RQ_STAT(m)offsetof(struct virtnet_rq_stats, m) @@ -251,14 +267,19 @@ static bool is_xdp_frame(void *ptr) return (unsigned long)ptr & VIRTIO_XDP_FLAG; } -static void *xdp_to_ptr(struct xdp_frame *ptr) +static void *xdp_to_ptr(struct virtnet_xdp_type *ptr) { return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG); } -static struct xdp_frame *ptr_to_xdp(void *ptr) +static struct virtnet_xdp_type *ptr_to_xtype(void *ptr) +{ + return (struct virtnet_xdp_type *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); +} + +static void *xtype_get_ptr(struct virtnet_xdp_type *xdptype) { - return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG); + return (char *)xdptype + xdptype->offset; } /* Converting between virtqueue no. and kernel tx/rx queue no. @@ -459,11 +480,16 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, struct xdp_frame *xdpf) { struct virtio_net_hdr_mrg_rxbuf *hdr; + struct virtnet_xdp_type *xdptype; int err; - if (unlikely(xdpf->headroom < vi->hdr_len)) + if (unlikely(xdpf->headroom < vi->hdr_len + sizeof(*xdptype))) return -EOVERFLOW; + xdptype = (struct virtnet_xdp_type *)(xdpf + 1); + xdptype->offset = (char *)xdpf - (char *)xdptype; + xdptype->type = XDP_TYPE_TX; + /* Make room for virtqueue hdr (also change xdpf->headroom?) */ xdpf->data -= vi->hdr_len; /* Zero header and leave csum up to XDP layers */ @@ -473,7 +499,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi, sg_init_one(sq->sg, xdpf->data, xdpf->len); - err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf), + err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdptype), GFP_ATOMIC); if (unlikely(err)) return -ENOSPC; /* Caller handle free/refcnt */ @@ -523,8 +549,11 @@ static int virtnet_xdp_xmit(struct net_device *dev, /* Free up any pending old buffers before queueing new ones.
*/ while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(is_xdp_frame(ptr))) { - struct xdp_frame *frame = ptr_to_xdp(ptr); + struct virtnet_xdp_type *xtype; + struct xdp_frame *frame; + xtype = ptr_to_xtype(ptr); + frame = xtype_get_ptr(xtype); bytes += frame->len; xdp_return_frame(frame); } else { @@ -1373,24 +1402,34 @@ static int virtnet_receive(struct receive_queue *rq, int budget, static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi) { - unsigned int len; unsigned int packets = 0; unsigned int bytes = 0; - void *ptr; + unsigned int len; + struct virtnet_xdp_type *xtype; + struct xdp_frame*frame; + struct virtnet_xsk_hdr *xskhdr; + struct sk_buff *skb; + void*ptr; while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) { if (likely(!is_xdp_frame(ptr))) { - struct sk_buff *skb = ptr; + skb = ptr; pr_debug("Sent skb %p\n", skb); bytes += skb->len; napi_consume_skb(skb, in_napi); } else { - struct xdp_frame *frame = ptr_to_xdp(ptr); + xtype = ptr_to_xtype(ptr); - bytes += frame->len; - xdp_return_fra
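For reference, the alternative suggested above (encoding the type in the pointer's low bits instead of a side struct) would simply mirror the driver's existing VIRTIO_XDP_FLAG trick, roughly as follows (a sketch of the suggestion, not code from this series):

#define VIRTIO_XDP_FLAG	BIT(0)	/* existing: ptr is an xdp_frame */
#define VIRTIO_XSK_FLAG	BIT(1)	/* suggested: ptr is an xsk tx buffer */

static bool is_xsk_ptr(void *ptr)
{
	return (unsigned long)ptr & VIRTIO_XSK_FLAG;
}

static void *xsk_to_ptr(struct virtnet_xsk_hdr *hdr)
{
	return (void *)((unsigned long)hdr | VIRTIO_XSK_FLAG);
}

static struct virtnet_xsk_hdr *ptr_to_xsk(void *ptr)
{
	return (struct virtnet_xsk_hdr *)((unsigned long)ptr & ~VIRTIO_XSK_FLAG);
}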
Re: [PATCH mlx5-next v1 2/5] PCI: Add SR-IOV sysfs entry to read number of MSI-X vectors
On Sun, Jan 17, 2021 at 07:16:30PM -0800, Alexander Duyck wrote: > On Sat, Jan 16, 2021 at 12:20 AM Leon Romanovsky wrote: > > > > On Fri, Jan 15, 2021 at 05:48:59PM -0800, Alexander Duyck wrote: > > > On Fri, Jan 15, 2021 at 7:53 AM Leon Romanovsky wrote: > > > > > > > > On Fri, Jan 15, 2021 at 10:06:19AM -0400, Jason Gunthorpe wrote: > > > > > On Thu, Jan 14, 2021 at 05:56:20PM -0800, Alexander Duyck wrote: > > > > > > > > > > > That said, it only works at the driver level. So if the firmware is > > > > > > the one that is having to do this it also occurred to me that if this > > > > > > update happened on FLR that would probably be preferred. > > > > > > > > > > FLR is not free, I'd prefer not to require it just for some > > > > > philosophical reason. > > > > > > > > > > > Since the mlx5 already supports devlink I don't see any reason why > > > > > > the > > > > > > driver couldn't be extended to also support the devlink resource > > > > > > interface and apply it to interrupts. > > > > > > > > > > So you are OK with the PF changing the VF as long as it is devlink not > > > > > sysfs? Seems rather arbitrary? > > > > > > > > > > Leon knows best, but if I recall devlink becomes wonky when the VF > > > > > driver doesn't provide a devlink instance. How does it do reload of a > > > > > VF then? > > > > > > > > > > I think you end up with essentially the same logic as presented here > > > > > with sysfs. > > > > > > > > The reasons why I decided to go with sysfs are: > > > > 1. This MSI-X table size change is applicable to ALL devices in the > > > > world, > > > > and not only netdev. > > > > > > In the PCI world MSI-X table size is a read only value. That is why I > > > am pushing back on this as a PCI interface. > > > > And it stays read-only. > > Only if you come at it directly. What this is adding is a back door > that is visible as a part of the VF sysfs. > > > > > > > > 2. This is purely PCI field and apply equally with same logic to all > > > > subsystems and not to netdev only. > > > > > > Again, calling this "purely PCI" is the sort of wording that has me > > > concerned. I would prefer it if we avoid that wording. There is much > > > more to this than just modifying the table size field. The firmware is > > > having to shift resources between devices and this potentially has an > > > effect on the entire part, not just one VF. > > > > It is internal to HW implementation, dumb device can solve it differently. > > That is my point. I am worried about "dumb devices" that may follow. I > would like to see the steps that should be taken to prevent these sorts > of things called out specifically. Basically this isn't just modifying > the PCIe config space, it is actually resizing the PBA and MSI-X > table. Exactly - the last line is what a dumb device can implement differently. The request is simple: configure the MSI-X table size to be the new size. > > > > > > > > 3. The sysfs interface is the standard way of configuring PCI/core, not > > > > devlink. > > > > > > This isn't PCI core that is being configured. It is the firmware for > > > the device. You are working with resources that are shared between > > > multiple functions. > > > > I'm ensuring that "lspci -vv .." will work correctly after such change. > > It is PCI core responsibility. > > The current code doesn't work on anything with a driver loaded on it. The problem is that no one cares about this case, because in contrast to other devices, which usually operate in the hypervisor and are probed during boot, the VFs are used differently.
They run in VMs, are probed there, and are (usually) not needed in the hypervisor. A driver reload would make sense if the PF MSI-X table was changed. > In addition the messaging provided is fairly minimal, which results in > an interface that will be difficult to understand when it doesn't > work. I'm fond of simple interfaces: 0, EBUSY and EINVAL are a common way to inform the user. We must remember that this interface is for a low-level PCI property and is needed for expert users who need to squeeze the maximum for their VMs out of an expensive high-speed network card that supports SR-IOV. According to eBay, the CX6 card costs between 1000 and 1700 USD, not really home equipment. > In addition there is currently only one piece of hardware that > works with this interface, which is the mlx5. It is not different from any other feature; someone has to be first. This has a very clear purpose, is well scoped, and it is understandable when and why it is needed. The kernel is full of devices and features that exist in one device only. > My concern is that this is > adding overhead to all VFs that will not be used by most SR-IOV > capable devices. In my view it would make much more sense to have a > top-down approach instead of bottom-up, where the PF is registering > interfaces for the VFs. > > If you want yet another compromise, I would be much happier with the PF > registering the sysfs interfaces on the VFs rather than the VFs > registering the interface and hopi
RE: [PATCH net v2] udp: ipv4: manipulate network header of NATed UDP GRO fraglist
On 2021-01-18 15:37, Steffen Klassert wrote: > [...] > We should find a way to do the checksum only when we really > need it. I.e. only if the headers of the head skb changed. It would not be easy to detect if the skb is mangled by netfilter. I think the v2 patch has little impact on performance. Can you suggest another version? If not, I can make v3 including a fix for the 80-column warning.
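One conceivable shape for "checksum only when the headers of the head skb changed" is to snapshot the relevant fields at GRO time and compare them before segmentation. Purely an illustration of the idea; the saved-header struct is hypothetical, not something proposed in this thread:

/* Hypothetical: fields saved from the head skb when the fraglist
 * GRO chain was built.
 */
struct udp_saved_hdr {
	__be32 saddr, daddr;
	__be16 source, dest;
};

/* Recompute per-segment UDP/IP checksums only if NAT (or anything
 * else) rewrote the head skb between GRO and segmentation.
 */
static bool udp_hdr_was_mangled(const struct sk_buff *head,
				const struct udp_saved_hdr *saved)
{
	const struct iphdr *iph = ip_hdr(head);
	const struct udphdr *uh = udp_hdr(head);

	return iph->saddr != saved->saddr || iph->daddr != saved->daddr ||
	       uh->source != saved->source || uh->dest != saved->dest;
}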
Re: [PATCH net-next 0/4] i40e: small improvements on XDP path
On Thu, Jan 14, 2021 at 3:34 PM Cristian Dumitrescu wrote: > > This patchset introduces some small and straightforward improvements > to the Intel i40e driver XDP path. Each improvement is fully described > in its associated patch. > > Cristian Dumitrescu (4): > i40e: remove unnecessary memory writes of the next to clean pointer > i40e: remove unnecessary cleaned_count updates > i40e: remove the redundant buffer info updates > i40e: consolidate handling of XDP program actions > > drivers/net/ethernet/intel/i40e/i40e_xsk.c | 149 +++-- > 1 file changed, 79 insertions(+), 70 deletions(-) > > -- > 2.25.1 > Thank you for these cleanups, Cristian! For the series: Acked-by: Magnus Karlsson