Re: [PATCH] selftests: hugetlb_dio: Fixup Check for initial conditions to skip in the start
On Sun, 10 Nov 2024 12:22:12 +0530 Donet Tom wrote: > > Fixes: 0268d4579901 ("selftests: hugetlb_dio: check for initial conditions > > to skip in the start") > > ... > > Would you prefer I send this fixup patch as a new series, or is it okay as is? All looks good, thanks. I added cc:stable, because 0268d4579901 was cc:stable.
[PATCH] ipvs: fix UB due to uninitialized stack access in ip_vs_protocol_init()
Under certain kernel configurations when building with Clang/LLVM, the compiler does not generate a return or jump as the terminator instruction for ip_vs_protocol_init(), triggering the following objtool warning during build time: vmlinux.o: warning: objtool: ip_vs_protocol_init() falls through to next function __initstub__kmod_ip_vs_rr__935_123_ip_vs_rr_init6() At runtime, this either causes an oops when trying to load the ipvs module or a boot-time panic if ipvs is built-in. This same issue has been reported by the Intel kernel test robot previously. Digging deeper into both LLVM and the kernel code reveals this to be a undefined behavior problem. ip_vs_protocol_init() uses a on-stack buffer of 64 chars to store the registered protocol names and leaves it uninitialized after definition. The function calls strnlen() when concatenating protocol names into the buffer. With CONFIG_FORTIFY_SOURCE strnlen() performs an extra step to check whether the last byte of the input char buffer is a null character (commit 3009f891bb9f ("fortify: Allow strlen() and strnlen() to pass compile-time known lengths")). This, together with possibly other configurations, cause the following IR to be generated: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #5 section ".init.text" align 16 !kcfi_type !29 { %1 = alloca [64 x i8], align 16 ... 14: ; preds = %11 %15 = getelementptr inbounds i8, ptr %1, i64 63 %16 = load i8, ptr %15, align 1 %17 = tail call i1 @llvm.is.constant.i8(i8 %16) %18 = icmp eq i8 %16, 0 %19 = select i1 %17, i1 %18, i1 false br i1 %19, label %20, label %23 20: ; preds = %14 %21 = call i64 @strlen(ptr noundef nonnull dereferenceable(1) %1) #23 ... 23: ; preds = %14, %11, %20 %24 = call i64 @strnlen(ptr noundef nonnull dereferenceable(1) %1, i64 noundef 64) #24 ... } The above code calculates the address of the last char in the buffer (value %15) and then loads from it (value %16). Because the buffer is never initialized, the LLVM GVN pass marks value %16 as undefined: %13 = getelementptr inbounds i8, ptr %1, i64 63 br i1 undef, label %14, label %17 This gives later passes (SCCP, in particular) to more DCE opportunities by propagating the undef value further, and eventually removes everything after the load on the uninitialized stack location: define hidden i32 @ip_vs_protocol_init() local_unnamed_addr #0 section ".init.text" align 16 !kcfi_type !11 { %1 = alloca [64 x i8], align 16 ... 12: ; preds = %11 %13 = getelementptr inbounds i8, ptr %1, i64 63 unreachable } In this way, the generated native code will just fall through to the next function, as LLVM does not generate any code for the unreachable IR instruction and leaves the function without a terminator. Zero the on-stack buffer to avoid this possible UB. Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202402100205.pwxiz1zk-...@intel.com/ Co-developed-by: Ruowen Qin Signed-off-by: Ruowen Qin Signed-off-by: Jinghao Jia --- net/netfilter/ipvs/ip_vs_proto.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/net/netfilter/ipvs/ip_vs_proto.c b/net/netfilter/ipvs/ip_vs_proto.c index f100da4ba3bc..a9fd1d3fc2cb 100644 --- a/net/netfilter/ipvs/ip_vs_proto.c +++ b/net/netfilter/ipvs/ip_vs_proto.c @@ -340,7 +340,7 @@ void __net_exit ip_vs_protocol_net_cleanup(struct netns_ipvs *ipvs) int __init ip_vs_protocol_init(void) { - char protocols[64]; + char protocols[64] = { 0 }; #define REGISTER_PROTOCOL(p) \ do {\ register_ip_vs_protocol(p); \ @@ -348,8 +348,6 @@ int __init ip_vs_protocol_init(void) strcat(protocols, (p)->name); \ } while (0) - protocols[0] = '\0'; - protocols[2] = '\0'; #ifdef CONFIG_IP_VS_PROTO_TCP REGISTER_PROTOCOL(&ip_vs_protocol_tcp); #endif -- 2.47.0
Re: [PATCH] selftest: drivers: Add support to check duplicate hwirq
On 2024/10/19 3:34 AM, Bjorn Helgaas wrote: On Tue, Sep 03, 2024 at 06:44:26PM -0700, Joseph Jang wrote: Validate there are no duplicate hwirq from the irq debug file system /sys/kernel/debug/irq/irqs/* per chip name. One example log show 2 duplicated hwirq in the irq debug file system. $ sudo cat /sys/kernel/debug/irq/irqs/163 handler: handle_fasteoi_irq device: 0019:00:00.0 node: 1 affinity: 72-143 effectiv: 76 domain: irqchip@0x10002204-3 hwirq: 0xc800 chip:ITS-MSI flags: 0x20 $ sudo cat /sys/kernel/debug/irq/irqs/174 handler: handle_fasteoi_irq device: 0039:00:00.0 node: 3 affinity: 216-287 effectiv: 221 domain: irqchip@0x30002204-3 hwirq: 0xc800 chip:ITS-MSI flags: 0x20 The irq-check.sh can help to collect hwirq and chip name from /sys/kernel/debug/irq/irqs/* and print error log when find duplicate hwirq per chip name. Kernel patch ("PCI/MSI: Fix MSI hwirq truncation") [1] fix above issue. [1]: https://lore.kernel.org/all/20240115135649.708536-1-vid...@nvidia.com/ I don't know enough about this issue to understand the details. It seems like you look for duplicate hwirqs in chips with the same name, e.g., "ITS-MSI" in this case? That name seems too generic to me (might there be several instances of "ITS-MSI" in a system?) As I know, each PCIe device typically has only one ITS-MSI controller. Having multiple ITS-MSI instances for the same device would lead to confusion and potential conflicts in interrupt routing. Also, the name may come from chip->irq_print_chip(), so it apparently relies on irqchip drivers to make the names unique if there are multiple instances? I would have expected looking for duplicates inside something more specific, like "irqchip@0x30002204-3". But again, I don't know enough about the problem to speak confidently here. In our case, If we look for duplicates by different irq domains like "irqchip@0x10002204-3" and "irqchip@0x30002204-3" as following example. $ sudo cat /sys/kernel/debug/irq/irqs/163 handler: handle_fasteoi_irq device: 0019:00:00.0 node: 1 affinity: 72-143 effectiv: 76 domain: irqchip@0x10002204-3 hwirq: 0xc800 chip:ITS-MSI flags: 0x20 $ sudo cat /sys/kernel/debug/irq/irqs/174 handler: handle_fasteoi_irq device: 0039:00:00.0 node: 3 affinity: 216-287 effectiv: 221 domain: irqchip@0x30002204-3 hwirq: 0xc800 chip:ITS-MSI flags: 0x20 We could not detect the duplicated hwirq number (0xc800) in this case. Cosmetic nits: - Tweak subject to match history (use "git log --oneline tools/testing/selftests/drivers/" to see it), e.g., selftests: irq: Add check for duplicate hwirq - Rewrap commit log to fill 75 columns. No point in using shorter lines. - Indent the "$ sudu cat ..." block by a couple spaces since it's effectively a quotation, not part of the main text body. - Possibly include sample output of irq-check.sh (also indented as a quote) when run on the system where you manually found the duplicate via "sudo cat /sys/kernel/debug/irq/irqs/..." - Reword "The irq-check.sh can help ..." to something like this: Add an irq-check.sh test to report errors when there are duplicate hwirqs per chip name. - Since the kernel patch has already been merged, cite it like this instead of using the https://lore URL: db744ddd59be ("PCI/MSI: Prevent MSI hardware interrupt number truncation") If you agree to use irq chip name ("ITS-MSI") to scan duplicate hwirq, I could send version 2 patch to fix above suggestions. Thank you, Joseph. Signed-off-by: Joseph Jang Reviewed-by: Matthew R. Ochs --- tools/testing/selftests/drivers/irq/Makefile | 5 +++ tools/testing/selftests/drivers/irq/config| 2 + .../selftests/drivers/irq/irq-check.sh| 39 +++ 3 files changed, 46 insertions(+) create mode 100644 tools/testing/selftests/drivers/irq/Makefile create mode 100644 tools/testing/selftests/drivers/irq/config create mode 100755 tools/testing/selftests/drivers/irq/irq-check.sh diff --git a/tools/testing/selftests/drivers/irq/Makefile b/tools/testing/selftests/drivers/irq/Makefile new file mode 100644 index ..d6998017c861 --- /dev/null +++ b/tools/testing/selftests/drivers/irq/Makefile @@ -0,0 +1,5 @@ +# SPDX-License-Identifier: GPL-2.0 + +TEST_PROGS := irq-check.sh + +include ../../lib.mk diff --git a/tools/testing/selftests/drivers/irq/config b/tools/testing/selftests/drivers/irq/config new file mode 100644 index ..a53d3b713728 --- /dev/null +++ b/tools/testing/selftests/drivers/irq/config @@ -0,0 +1,2 @@ +CONFIG_GENERIC_IRQ_DEBUGFS=y +CONFIG_GENERIC_IRQ_INJECTION=y diff --git a/tools/testing/selftests/drivers/irq/irq-check.sh b/tool
Re: [PATCH net 0/4] virtio_net: Make RSS interact properly with queue number
On Wed, Nov 6, 2024 at 5:00 PM Xuan Zhuo wrote: > > Hi Jason, could you review this firstly? > > Thanks. > It looks like the series has been merged. Anyhow it looks good to me. Thanks
[PATCHv2 net-next] selftests: wireguards: use nft by default
Use nft by default if it's supported, as nft is the replacement for iptables, which is used by default in some releases. Additionally, iptables is dropped in some releases. Signed-off-by: Hangbin Liu --- v2: use one nft table for testing (Phil Sutter) --- tools/testing/selftests/wireguard/netns.sh | 63 ++ 1 file changed, 53 insertions(+), 10 deletions(-) diff --git a/tools/testing/selftests/wireguard/netns.sh b/tools/testing/selftests/wireguard/netns.sh index 405ff262ca93..be4e3b13ed22 100755 --- a/tools/testing/selftests/wireguard/netns.sh +++ b/tools/testing/selftests/wireguard/netns.sh @@ -44,6 +44,7 @@ sleep() { read -t "$1" -N 1 || true; } waitiperf() { pretty "${1//*-}" "wait for iperf:${3:-5201} pid $2"; while [[ $(ss -N "$1" -tlpH "sport = ${3:-5201}") != *\"iperf3\",pid=$2,fd=* ]]; do sleep 0.1; done; } waitncatudp() { pretty "${1//*-}" "wait for udp: pid $2"; while [[ $(ss -N "$1" -ulpH 'sport = ') != *\"ncat\",pid=$2,fd=* ]]; do sleep 0.1; done; } waitiface() { pretty "${1//*-}" "wait for $2 to come up"; ip netns exec "$1" bash -c "while [[ \$(< \"/sys/class/net/$2/operstate\") != up ]]; do read -t .1 -N 0 || true; done;"; } +use_nft() { nft --version &> /dev/null; } cleanup() { set +e @@ -75,6 +76,12 @@ pp ip netns add $netns1 pp ip netns add $netns2 ip0 link set up dev lo +if use_nft; then + n0 nft add table ip wgtest + n1 nft add table ip wgtest + n2 nft add table ip wgtest +fi + ip0 link add dev wg0 type wireguard ip0 link set wg0 netns $netns1 ip0 link add dev wg0 type wireguard @@ -196,13 +203,22 @@ ip1 link set wg0 mtu 1300 ip2 link set wg0 mtu 1300 n1 wg set wg0 peer "$pub2" endpoint 127.0.0.1:2 n2 wg set wg0 peer "$pub1" endpoint 127.0.0.1:1 -n0 iptables -A INPUT -m length --length 1360 -j DROP +if use_nft; then + n0 nft add chain ip wgtest INPUT { type filter hook input priority filter \; policy accept \; } + n0 nft add rule ip wgtest INPUT meta length 1360 counter drop +else + n0 iptables -A INPUT -m length --length 1360 -j DROP +fi n1 ip route add 192.168.241.2/32 dev wg0 mtu 1299 n2 ip route add 192.168.241.1/32 dev wg0 mtu 1299 n2 ping -c 1 -W 1 -s 1269 192.168.241.1 n2 ip route delete 192.168.241.1/32 dev wg0 mtu 1299 n1 ip route delete 192.168.241.2/32 dev wg0 mtu 1299 -n0 iptables -F INPUT +if use_nft; then + n0 nft flush table ip wgtest +else + n0 iptables -F INPUT +fi ip1 link set wg0 mtu $orig_mtu ip2 link set wg0 mtu $orig_mtu @@ -334,7 +350,12 @@ waitiface $netns2 veths n0 bash -c 'printf 1 > /proc/sys/net/ipv4/ip_forward' n0 bash -c 'printf 2 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout' n0 bash -c 'printf 2 > /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream' -n0 iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 10.0.0.0/24 -j SNAT --to 10.0.0.1 +if use_nft; then + n0 nft add chain ip wgtest POSTROUTING { type nat hook postrouting priority srcnat\; policy accept \; } + n0 nft add rule ip wgtest POSTROUTING ip saddr 192.168.1.0/24 ip daddr 10.0.0.0/24 counter snat to 10.0.0.1 +else + n0 iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 10.0.0.0/24 -j SNAT --to 10.0.0.1 +fi n1 wg set wg0 peer "$pub2" endpoint 10.0.0.100:2 persistent-keepalive 1 n1 ping -W 1 -c 1 192.168.241.2 @@ -348,10 +369,19 @@ n1 wg set wg0 peer "$pub2" persistent-keepalive 0 # Test that sk_bound_dev_if works n1 ping -I wg0 -c 1 -W 1 192.168.241.2 # What about when the mark changes and the packet must be rerouted? -n1 iptables -t mangle -I OUTPUT -j MARK --set-xmark 1 +if use_nft; then + n1 nft add chain ip wgtest OUTPUT { type route hook output priority mangle\; policy accept \; } + n1 nft add rule ip wgtest OUTPUT counter meta mark set 0x1 +else + n1 iptables -t mangle -I OUTPUT -j MARK --set-xmark 1 +fi n1 ping -c 1 -W 1 192.168.241.2 # First the boring case n1 ping -I wg0 -c 1 -W 1 192.168.241.2 # Then the sk_bound_dev_if case -n1 iptables -t mangle -D OUTPUT -j MARK --set-xmark 1 +if use_nft; then + n1 nft flush table ip wgtest +else + n1 iptables -t mangle -D OUTPUT -j MARK --set-xmark 1 +fi # Test that onion routing works, even when it loops n1 wg set wg0 peer "$pub3" allowed-ips 192.168.242.2/32 endpoint 192.168.241.2:5 @@ -385,16 +415,29 @@ n1 ping -W 1 -c 100 -f 192.168.99.7 n1 ping -W 1 -c 100 -f abab:: # Have ns2 NAT into wg0 packets from ns0, but return an icmp error along the right route. -n2 iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -d 192.168.241.0/24 -j SNAT --to 192.168.241.2 -n0 iptables -t filter -A INPUT \! -s 10.0.0.0/24 -i vethrs -j DROP # Manual rpfilter just to be explicit. +if use_nft; then + n2 nft add chain ip wgtest POSTROUTING { type nat hook postrouting priority srcnat\; policy accept \; } + n2 nft add rule ip wgtest POSTROUTING ip saddr 10.0.0.0/24 ip daddr 192.168.241.0/24 counter snat to 192.168.241.2 + + n0 nft add chain ip wgtest
Re: [PATCH] vhost/net: Set num_buffers for virtio 1.0
On Wed, Nov 6, 2024 at 4:54 PM Michael S. Tsirkin wrote: > > On Sun, Sep 15, 2024 at 10:35:53AM +0900, Akihiko Odaki wrote: > > The specification says the device MUST set num_buffers to 1 if > > VIRTIO_NET_F_MRG_RXBUF has not been negotiated. > > > > Fixes: 41e3e42108bc ("vhost/net: enable virtio 1.0") > > Signed-off-by: Akihiko Odaki > > True, this is out of spec. But, qemu is also out of spec :( > > Given how many years this was out there, I wonder whether > we should just fix the spec, instead of changing now. > > Jason, what's your take? Fixing the spec (if you mean release the requirement) seems to be less risky. Thanks > > > > --- > > drivers/vhost/net.c | 5 - > > 1 file changed, 4 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c > > index f16279351db5..d4d97fa9cc8f 100644 > > --- a/drivers/vhost/net.c > > +++ b/drivers/vhost/net.c > > @@ -1107,6 +1107,7 @@ static void handle_rx(struct vhost_net *net) > > size_t vhost_hlen, sock_hlen; > > size_t vhost_len, sock_len; > > bool busyloop_intr = false; > > + bool set_num_buffers; > > struct socket *sock; > > struct iov_iter fixup; > > __virtio16 num_buffers; > > @@ -1129,6 +1130,8 @@ static void handle_rx(struct vhost_net *net) > > vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? > > vq->log : NULL; > > mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); > > + set_num_buffers = mergeable || > > + vhost_has_feature(vq, VIRTIO_F_VERSION_1); > > > > do { > > sock_len = vhost_net_rx_peek_head_len(net, sock->sk, > > @@ -1205,7 +1208,7 @@ static void handle_rx(struct vhost_net *net) > > /* TODO: Should check and handle checksum. */ > > > > num_buffers = cpu_to_vhost16(vq, headcount); > > - if (likely(mergeable) && > > + if (likely(set_num_buffers) && > > copy_to_iter(&num_buffers, sizeof num_buffers, > >&fixup) != sizeof num_buffers) { > > vq_err(vq, "Failed num_buffers write"); > > > > --- > > base-commit: 46a0057a5853cbdb58211c19e89badc6fd50 > > change-id: 20240908-v1-90fc83ff8b09 > > > > Best regards, > > -- > > Akihiko Odaki >
Re: [PATCH net-next v11 09/23] ovpn: implement basic RX path (UDP)
On 29.10.2024 12:47, Antonio Quartulli wrote: Packets received over the socket are forwarded to the user device. Implementation is UDP only. TCP will be added by a later patch. Note: no decryption/decapsulation exists yet, packets are forwarded as they arrive without much processing. Signed-off-by: Antonio Quartulli --- drivers/net/ovpn/io.c | 66 ++- drivers/net/ovpn/io.h | 2 + drivers/net/ovpn/main.c | 13 +- drivers/net/ovpn/ovpnstruct.h | 3 ++ drivers/net/ovpn/proto.h | 75 ++ drivers/net/ovpn/socket.c | 24 ++ drivers/net/ovpn/udp.c| 104 +- drivers/net/ovpn/udp.h| 3 +- 8 files changed, 286 insertions(+), 4 deletions(-) diff --git a/drivers/net/ovpn/io.c b/drivers/net/ovpn/io.c index 77ba4d33ae0bd2f52e8bd1c06a182d24285297b4..791a1b117125118b179cb13cdfd5fbab6523a360 100644 --- a/drivers/net/ovpn/io.c +++ b/drivers/net/ovpn/io.c @@ -9,15 +9,79 @@ #include #include +#include #include -#include "io.h" #include "ovpnstruct.h" #include "peer.h" +#include "io.h" +#include "netlink.h" +#include "proto.h" #include "udp.h" #include "skb.h" #include "socket.h" +/* Called after decrypt to write the IP packet to the device. + * This method is expected to manage/free the skb. + */ +static void ovpn_netdev_write(struct ovpn_peer *peer, struct sk_buff *skb) +{ + unsigned int pkt_len; + + /* we can't guarantee the packet wasn't corrupted before entering the +* VPN, therefore we give other layers a chance to check that +*/ + skb->ip_summed = CHECKSUM_NONE; + + /* skb hash for transport packet no longer valid after decapsulation */ + skb_clear_hash(skb); + + /* post-decrypt scrub -- prepare to inject encapsulated packet onto the +* interface, based on __skb_tunnel_rx() in dst.h +*/ + skb->dev = peer->ovpn->dev; + skb_set_queue_mapping(skb, 0); + skb_scrub_packet(skb, true); + The skb->protocol field is going to be updated in the upcoming patch in the caller (ovpn_decrypt_post). Shall we put a comment here clarifying, why do not touch the protocol field here? + skb_reset_network_header(skb); ovpn_decrypt_post() already reseted the network header. Why do we need it here again? + skb_reset_transport_header(skb); + skb_probe_transport_header(skb); + skb_reset_inner_headers(skb); + + memset(skb->cb, 0, sizeof(skb->cb)); Why do we need to zero the control buffer here? + /* cause packet to be "received" by the interface */ + pkt_len = skb->len; + if (likely(gro_cells_receive(&peer->ovpn->gro_cells, +skb) == NET_RX_SUCCESS)) + /* update RX stats with the size of decrypted packet */ + dev_sw_netstats_rx_add(peer->ovpn->dev, pkt_len); +} + +static void ovpn_decrypt_post(struct sk_buff *skb, int ret) +{ + struct ovpn_peer *peer = ovpn_skb_cb(skb)->peer; + + if (unlikely(ret < 0)) + goto drop; + + ovpn_netdev_write(peer, skb); + /* skb is passed to upper layer - don't free it */ + skb = NULL; +drop: + if (unlikely(skb)) + dev_core_stats_rx_dropped_inc(peer->ovpn->dev); + ovpn_peer_put(peer); + kfree_skb(skb); +} + +/* pick next packet from RX queue, decrypt and forward it to the device */ The function now receives packets from externel callers. Should we update the above comment? +void ovpn_recv(struct ovpn_peer *peer, struct sk_buff *skb) +{ + ovpn_skb_cb(skb)->peer = peer; + ovpn_decrypt_post(skb, 0); +} + static void ovpn_encrypt_post(struct sk_buff *skb, int ret) { struct ovpn_peer *peer = ovpn_skb_cb(skb)->peer; diff --git a/drivers/net/ovpn/io.h b/drivers/net/ovpn/io.h index aa259be66441f7b0262f39da12d6c3dce0a9b24c..9667a0a470e0b4b427524fffb5b9b395007e5a2f 100644 --- a/drivers/net/ovpn/io.h +++ b/drivers/net/ovpn/io.h @@ -12,4 +12,6 @@ netdev_tx_t ovpn_net_xmit(struct sk_buff *skb, struct net_device *dev); +void ovpn_recv(struct ovpn_peer *peer, struct sk_buff *skb); + #endif /* _NET_OVPN_OVPN_H_ */ diff --git a/drivers/net/ovpn/main.c b/drivers/net/ovpn/main.c index 5492ce07751d135c1484fe1ed8227c646df94969..73348765a8cf24321aa6be78e75f607d6dbffb1d 100644 --- a/drivers/net/ovpn/main.c +++ b/drivers/net/ovpn/main.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include @@ -32,7 +33,16 @@ static void ovpn_struct_free(struct net_device *net) static int ovpn_net_init(struct net_device *dev) { - return 0; + struct ovpn_struct *ovpn = netdev_priv(dev); + + return gro_cells_init(&ovpn->gro_cells, dev); +} + +static void ovpn_net_uninit(struct net_device *dev) +{ + struct ovpn_struct *ovpn = netdev_priv(dev); + +
Re: [PATCH net-next v7] ipv6: Fix soft lockups in fib6_select_path under high next hop churn
On 11/5/24 6:02 PM, Omid Ehtemam-Haghighi wrote: > Soft lockups have been observed on a cluster of Linux-based edge routers > located in a highly dynamic environment. Using the `bird` service, these > routers continuously update BGP-advertised routes due to frequently > changing nexthop destinations, while also managing significant IPv6 > traffic. The lockups occur during the traversal of the multipath > circular linked-list in the `fib6_select_path` function, particularly > while iterating through the siblings in the list. The issue typically > arises when the nodes of the linked list are unexpectedly deleted > concurrently on a different core—indicated by their 'next' and > 'previous' elements pointing back to the node itself and their reference > count dropping to zero. This results in an infinite loop, leading to a > soft lockup that triggers a system panic via the watchdog timer. > > Apply RCU primitives in the problematic code sections to resolve the > issue. Where necessary, update the references to fib6_siblings to > annotate or use the RCU APIs. > > Include a test script that reproduces the issue. The script > periodically updates the routing table while generating a heavy load > of outgoing IPv6 traffic through multiple iperf3 clients. It > consistently induces infinite soft lockups within a couple of minutes. > > Kernel log: > > 0 [bd13003e8d30] machine_kexec at 8ceaf3eb > 1 [bd13003e8d90] __crash_kexec at 8d0120e3 > 2 [bd13003e8e58] panic at 8cef65d4 > 3 [bd13003e8ed8] watchdog_timer_fn at 8d05cb03 > 4 [bd13003e8f08] __hrtimer_run_queues at 8cfec62f > 5 [bd13003e8f70] hrtimer_interrupt at 8cfed756 > 6 [bd13003e8fd0] __sysvec_apic_timer_interrupt at 8cea01af > 7 [bd13003e8ff0] sysvec_apic_timer_interrupt at 8df1b83d > -- -- > 8 [bd13003d3708] asm_sysvec_apic_timer_interrupt at 8e000ecb > [exception RIP: fib6_select_path+299] > RIP: 8ddafe7b RSP: bd13003d37b8 RFLAGS: 0287 > RAX: 975850b43600 RBX: 975850b40200 RCX: > RDX: 3fff RSI: 51d383e4 RDI: 975850b43618 > RBP: bd13003d3800 R8: R9: 975850b40200 > R10: R11: R12: bd13003d3830 > R13: 975850b436a8 R14: 975850b43600 R15: 0007 > ORIG_RAX: CS: 0010 SS: 0018 > 9 [bd13003d3808] ip6_pol_route at 8ddb030c > 10 [bd13003d3888] ip6_pol_route_input at 8ddb068c > 11 [bd13003d3898] fib6_rule_lookup at 8ddf02b5 > 12 [bd13003d3928] ip6_route_input at 8ddb0f47 > 13 [bd13003d3a18] ip6_rcv_finish_core.constprop.0 at 8dd950d0 > 14 [bd13003d3a30] ip6_list_rcv_finish.constprop.0 at 8dd96274 > 15 [bd13003d3a98] ip6_sublist_rcv at 8dd96474 > 16 [bd13003d3af8] ipv6_list_rcv at 8dd96615 > 17 [bd13003d3b60] __netif_receive_skb_list_core at 8dc16fec > 18 [bd13003d3be0] netif_receive_skb_list_internal at 8dc176b3 > 19 [bd13003d3c50] napi_gro_receive at 8dc565b9 > 20 [bd13003d3c80] ice_receive_skb at c087e4f5 [ice] > 21 [bd13003d3c90] ice_clean_rx_irq at c0881b80 [ice] > 22 [bd13003d3d20] ice_napi_poll at c088232f [ice] > 23 [bd13003d3d80] __napi_poll at 8dc18000 > 24 [bd13003d3db8] net_rx_action at 8dc18581 > 25 [bd13003d3e40] __do_softirq at 8df352e9 > 26 [bd13003d3eb0] run_ksoftirqd at 8ceffe47 > 27 [bd13003d3ec0] smpboot_thread_fn at 8cf36a30 > 28 [bd13003d3ee8] kthread at 8cf2b39f > 29 [bd13003d3f28] ret_from_fork at 8ce5fa64 > 30 [bd13003d3f50] ret_from_fork_asm at 8ce03cbb > > Fixes: 66f5d6ce53e6 ("ipv6: replace rwlock with rcu and spinlock in > fib6_table") > Reported-by: Adrian Oliver > Signed-off-by: Omid Ehtemam-Haghighi > Cc: David S. Miller > Cc: David Ahern > Cc: Eric Dumazet > Cc: Jakub Kicinski > Cc: Paolo Abeni > Cc: Shuah Khan > Cc: Ido Schimmel > Cc: Kuniyuki Iwashima > Cc: Simon Horman > Cc: Omid Ehtemam-Haghighi > Cc: net...@vger.kernel.org > Cc: linux-kselft...@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > --- > v6 -> v7: > * Rebased on top of 'net-next' > > v5 -> v6: > * Adjust the comment line lengths in the test script to a maximum of > 80 characters > * Change memory allocation in inet6_rt_notify from gfp_any() to > GFP_ATOMIC for > atomic allocation in non-blocking contexts, as suggested by Ido > Schimmel > * NOTE: I have executed the test script on both bare-metal servers and > virtualized environments such as QEMU and vng. In the case of > bare-metal, it > consistently triggers a soft lockup in under a minute on unpatched > kernels. > For the virtualized environments, a
Re: [PATCH] virtio_ring: skip cpu sync when mapping fails
On Mon, 11 Nov 2024 10:55:38 +0800, Jason Wang wrote: > There's no need to sync DMA for CPU on mapping errors. So this patch > skips the CPU sync in the error handling path of DMA mapping. > > Signed-off-by: Jason Wang Reviewed-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 98 +--- > 1 file changed, 57 insertions(+), 41 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index be7309b1e860..b422b5fb22db 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -441,8 +441,10 @@ static void virtqueue_init(struct vring_virtqueue *vq, > u32 num) > */ > > static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, > -const struct vring_desc *desc) > +const struct vring_desc *desc, > +bool skip_sync) > { > + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; > u16 flags; > > if (!vq->do_unmap) > @@ -450,16 +452,18 @@ static void vring_unmap_one_split_indirect(const struct > vring_virtqueue *vq, > > flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > > - dma_unmap_page(vring_dma_dev(vq), > -virtio64_to_cpu(vq->vq.vdev, desc->addr), > -virtio32_to_cpu(vq->vq.vdev, desc->len), > -(flags & VRING_DESC_F_WRITE) ? > -DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_page_attrs(vring_dma_dev(vq), > + virtio64_to_cpu(vq->vq.vdev, desc->addr), > + virtio32_to_cpu(vq->vq.vdev, desc->len), > + (flags & VRING_DESC_F_WRITE) ? > + DMA_FROM_DEVICE : DMA_TO_DEVICE, > + attrs); > } > > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > - unsigned int i) > + unsigned int i, bool skip_sync) > { > + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; > struct vring_desc_extra *extra = vq->split.desc_extra; > u16 flags; > > @@ -469,20 +473,22 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > if (!vq->use_dma_api) > goto out; > > - dma_unmap_single(vring_dma_dev(vq), > - extra[i].addr, > - extra[i].len, > - (flags & VRING_DESC_F_WRITE) ? > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_single_attrs(vring_dma_dev(vq), > +extra[i].addr, > +extra[i].len, > +(flags & VRING_DESC_F_WRITE) ? > +DMA_FROM_DEVICE : DMA_TO_DEVICE, > +attrs); > } else { > if (!vq->do_unmap) > goto out; > > - dma_unmap_page(vring_dma_dev(vq), > -extra[i].addr, > -extra[i].len, > -(flags & VRING_DESC_F_WRITE) ? > -DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_page_attrs(vring_dma_dev(vq), > + extra[i].addr, > + extra[i].len, > + (flags & VRING_DESC_F_WRITE) ? > + DMA_FROM_DEVICE : DMA_TO_DEVICE, > + attrs); > } > > out: > @@ -717,10 +723,10 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > if (i == err_idx) > break; > if (indirect) { > - vring_unmap_one_split_indirect(vq, &desc[i]); > + vring_unmap_one_split_indirect(vq, &desc[i], true); > i = virtio16_to_cpu(_vq->vdev, desc[i].next); > } else > - i = vring_unmap_one_split(vq, i); > + i = vring_unmap_one_split(vq, i, true); > } > > free_indirect: > @@ -775,12 +781,12 @@ static void detach_buf_split(struct vring_virtqueue > *vq, unsigned int head, > i = head; > > while (vq->split.vring.desc[i].flags & nextflag) { > - vring_unmap_one_split(vq, i); > + vring_unmap_one_split(vq, i, false); > i = vq->split.desc_extra[i].next; > vq->vq.num_free++; > } > > - vring_unmap_one_split(vq, i); > + vring_unmap_one_split(vq, i, false); > vq->split.desc_extra[i].next = vq->free_head; > vq->free_head = head; > > @@ -804,7 +810,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, > unsigned i
[PATCH] virtio_ring: skip cpu sync when mapping fails
There's no need to sync DMA for CPU on mapping errors. So this patch skips the CPU sync in the error handling path of DMA mapping. Signed-off-by: Jason Wang --- drivers/virtio/virtio_ring.c | 98 +--- 1 file changed, 57 insertions(+), 41 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index be7309b1e860..b422b5fb22db 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -441,8 +441,10 @@ static void virtqueue_init(struct vring_virtqueue *vq, u32 num) */ static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, - const struct vring_desc *desc) + const struct vring_desc *desc, + bool skip_sync) { + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; u16 flags; if (!vq->do_unmap) @@ -450,16 +452,18 @@ static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); - dma_unmap_page(vring_dma_dev(vq), - virtio64_to_cpu(vq->vq.vdev, desc->addr), - virtio32_to_cpu(vq->vq.vdev, desc->len), - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); + dma_unmap_page_attrs(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE, +attrs); } static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, - unsigned int i) + unsigned int i, bool skip_sync) { + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; struct vring_desc_extra *extra = vq->split.desc_extra; u16 flags; @@ -469,20 +473,22 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, if (!vq->use_dma_api) goto out; - dma_unmap_single(vring_dma_dev(vq), -extra[i].addr, -extra[i].len, -(flags & VRING_DESC_F_WRITE) ? -DMA_FROM_DEVICE : DMA_TO_DEVICE); + dma_unmap_single_attrs(vring_dma_dev(vq), + extra[i].addr, + extra[i].len, + (flags & VRING_DESC_F_WRITE) ? + DMA_FROM_DEVICE : DMA_TO_DEVICE, + attrs); } else { if (!vq->do_unmap) goto out; - dma_unmap_page(vring_dma_dev(vq), - extra[i].addr, - extra[i].len, - (flags & VRING_DESC_F_WRITE) ? - DMA_FROM_DEVICE : DMA_TO_DEVICE); + dma_unmap_page_attrs(vring_dma_dev(vq), +extra[i].addr, +extra[i].len, +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE, +attrs); } out: @@ -717,10 +723,10 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, if (i == err_idx) break; if (indirect) { - vring_unmap_one_split_indirect(vq, &desc[i]); + vring_unmap_one_split_indirect(vq, &desc[i], true); i = virtio16_to_cpu(_vq->vdev, desc[i].next); } else - i = vring_unmap_one_split(vq, i); + i = vring_unmap_one_split(vq, i, true); } free_indirect: @@ -775,12 +781,12 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, i = head; while (vq->split.vring.desc[i].flags & nextflag) { - vring_unmap_one_split(vq, i); + vring_unmap_one_split(vq, i, false); i = vq->split.desc_extra[i].next; vq->vq.num_free++; } - vring_unmap_one_split(vq, i); + vring_unmap_one_split(vq, i, false); vq->split.desc_extra[i].next = vq->free_head; vq->free_head = head; @@ -804,7 +810,8 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, if (vq->do_unmap) { for (j = 0; j < len / sizeof(struct vring_desc); j++) -
[PATCH v2 3/3] kunit: qemu_configs: loongarch: Enable shutdown
QEMU for LoongArch does not yet support shutdown/restart through ACPI. Use the pvpanic driver to enable shutdowns. This requires 9.1.0 for shutdown support in pvpanic, but that is the requirement of kunit on LoongArch anyways. Signed-off-by: Thomas Weißschuh Reviewed-by: Bibo Mao Reviewed-by: David Gow --- tools/testing/kunit/qemu_configs/loongarch.py | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/testing/kunit/qemu_configs/loongarch.py b/tools/testing/kunit/qemu_configs/loongarch.py index a874a2156e0f791caaaf30e5f7b17fc175bb55ee..a92422967d1da9f1658ef1e80d0d7365ddbae307 100644 --- a/tools/testing/kunit/qemu_configs/loongarch.py +++ b/tools/testing/kunit/qemu_configs/loongarch.py @@ -6,13 +6,16 @@ QEMU_ARCH = QemuArchParams(linux_arch='loongarch', kconfig=''' CONFIG_EFI_STUB=n CONFIG_PCI_HOST_GENERIC=y +CONFIG_PVPANIC=y +CONFIG_PVPANIC_PCI=y CONFIG_SERIAL_8250=y CONFIG_SERIAL_8250_CONSOLE=y CONFIG_SERIAL_OF_PLATFORM=y ''', qemu_arch='loongarch64', kernel_path='arch/loongarch/boot/vmlinux.elf', - kernel_command_line='console=ttyS0', + kernel_command_line='console=ttyS0 kunit_shutdown=poweroff', extra_qemu_params=[ '-machine', 'virt', + '-device', 'pvpanic-pci', '-cpu', 'max',]) -- 2.47.0
[PATCH v2 0/3] kunit: Add support for LoongArch
Enable LoongArch support in kunit. Example: $ ./tools/testing/kunit/kunit.py run --arch=loongarch --cross_compile=$CROSS_COMPILE [13:32:45] Configuring KUnit Kernel ... [13:32:45] Building KUnit Kernel ... Populating config with: $ make ARCH=loongarch olddefconfig CROSS_COMPILE=$CROSS_COMPILE Building with: $ make all compile_commands.json ARCH=loongarch --jobs=8 CROSS_COMPILE=$CROSS_COMPILE [13:32:48] Starting KUnit Kernel (1/1)... [13:32:48] Running tests with: $ qemu-system-loongarch64 -nodefaults -m 1024 -kernel .kunit/arch/loongarch/boot/vmlinux.elf -append 'kunit.enable=1 console=ttyS0 kunit_shutdown=poweroff' -no-reboot -nographic -serial stdio -machine virt -device pvpanic-pci -cpu max ... [13:33:14] [13:33:14] Testing complete. Ran 493 tests: passed: 453, skipped: 40 [13:33:14] Elapsed time: 28.862s total, 0.002s configuring, 2.526s building, 26.302s running This series is meant to be merged through the kselftest tree. Signed-off-by: Thomas Weißschuh --- Changes in v2: - Drop already applied patch - Pick up review tags - Add SPDX header - Link to v1: https://lore.kernel.org/r/20241014-kunit-loongarch-v1-0-1699b2ad6...@linutronix.de --- Thomas Weißschuh (3): kunit: qemu_configs: Add LoongArch config kunit: tool: Allow overriding the shutdown mode from qemu config kunit: qemu_configs: loongarch: Enable shutdown tools/testing/kunit/kunit_kernel.py | 4 +++- tools/testing/kunit/qemu_configs/loongarch.py | 21 + 2 files changed, 24 insertions(+), 1 deletion(-) --- base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623 change-id: 20241014-kunit-loongarch-98a5b756e818 Best regards, -- Thomas Weißschuh
[PATCH v2 1/3] kunit: qemu_configs: Add LoongArch config
Add a basic config to run kunit tests on LoongArch. This requires QEMU 9.1.0 or later for the necessary direct kernel boot support. Signed-off-by: Thomas Weißschuh Reviewed-by: Bibo Mao Reviewed-by: David Gow --- tools/testing/kunit/qemu_configs/loongarch.py | 18 ++ 1 file changed, 18 insertions(+) diff --git a/tools/testing/kunit/qemu_configs/loongarch.py b/tools/testing/kunit/qemu_configs/loongarch.py new file mode 100644 index ..a874a2156e0f791caaaf30e5f7b17fc175bb55ee --- /dev/null +++ b/tools/testing/kunit/qemu_configs/loongarch.py @@ -0,0 +1,18 @@ +# SPDX-License-Identifier: GPL-2.0 + +from ..qemu_config import QemuArchParams + +QEMU_ARCH = QemuArchParams(linux_arch='loongarch', + kconfig=''' +CONFIG_EFI_STUB=n +CONFIG_PCI_HOST_GENERIC=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_OF_PLATFORM=y +''', + qemu_arch='loongarch64', + kernel_path='arch/loongarch/boot/vmlinux.elf', + kernel_command_line='console=ttyS0', + extra_qemu_params=[ + '-machine', 'virt', + '-cpu', 'max',]) -- 2.47.0
[PATCH v2 2/3] kunit: tool: Allow overriding the shutdown mode from qemu config
Not all platforms support machine reboot. If it a proper reboot is not supported the machine will hang. Allow the QEMU configuration to override the necessary shutdown mode for the specific system under test. Signed-off-by: Thomas Weißschuh Reviewed-by: David Gow --- tools/testing/kunit/kunit_kernel.py | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/kunit/kunit_kernel.py b/tools/testing/kunit/kunit_kernel.py index 61931c4926fd6645f2c62dd13f9842a432ec4167..e76d7894b6c5195ece49f0d8c7ac35130df428a9 100644 --- a/tools/testing/kunit/kunit_kernel.py +++ b/tools/testing/kunit/kunit_kernel.py @@ -105,7 +105,9 @@ class LinuxSourceTreeOperationsQemu(LinuxSourceTreeOperations): self._kconfig = qemu_arch_params.kconfig self._qemu_arch = qemu_arch_params.qemu_arch self._kernel_path = qemu_arch_params.kernel_path - self._kernel_command_line = qemu_arch_params.kernel_command_line + ' kunit_shutdown=reboot' + self._kernel_command_line = qemu_arch_params.kernel_command_line + if 'kunit_shutdown=' not in self._kernel_command_line: + self._kernel_command_line += ' kunit_shutdown=reboot' self._extra_qemu_params = qemu_arch_params.extra_qemu_params self._serial = qemu_arch_params.serial -- 2.47.0
Re: [PATCH] virtio_ring: skip cpu sync when mapping fails
On Mon, Nov 11, 2024 at 10:55:38AM +0800, Jason Wang wrote: > There's no need to sync DMA for CPU on mapping errors. So this patch > skips the CPU sync in the error handling path of DMA mapping. > > Signed-off-by: Jason Wang DMA sync is idempotent. Extra work for slow path. Why do we bother? > --- > drivers/virtio/virtio_ring.c | 98 +--- > 1 file changed, 57 insertions(+), 41 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index be7309b1e860..b422b5fb22db 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -441,8 +441,10 @@ static void virtqueue_init(struct vring_virtqueue *vq, > u32 num) > */ > > static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq, > -const struct vring_desc *desc) > +const struct vring_desc *desc, > +bool skip_sync) > { > + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; > u16 flags; > > if (!vq->do_unmap) > @@ -450,16 +452,18 @@ static void vring_unmap_one_split_indirect(const struct > vring_virtqueue *vq, > > flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); > > - dma_unmap_page(vring_dma_dev(vq), > -virtio64_to_cpu(vq->vq.vdev, desc->addr), > -virtio32_to_cpu(vq->vq.vdev, desc->len), > -(flags & VRING_DESC_F_WRITE) ? > -DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_page_attrs(vring_dma_dev(vq), > + virtio64_to_cpu(vq->vq.vdev, desc->addr), > + virtio32_to_cpu(vq->vq.vdev, desc->len), > + (flags & VRING_DESC_F_WRITE) ? > + DMA_FROM_DEVICE : DMA_TO_DEVICE, > + attrs); > } > > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > - unsigned int i) > + unsigned int i, bool skip_sync) > { > + unsigned long attrs = skip_sync ? DMA_ATTR_SKIP_CPU_SYNC : 0; > struct vring_desc_extra *extra = vq->split.desc_extra; > u16 flags; > > @@ -469,20 +473,22 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > if (!vq->use_dma_api) > goto out; > > - dma_unmap_single(vring_dma_dev(vq), > - extra[i].addr, > - extra[i].len, > - (flags & VRING_DESC_F_WRITE) ? > - DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_single_attrs(vring_dma_dev(vq), > +extra[i].addr, > +extra[i].len, > +(flags & VRING_DESC_F_WRITE) ? > +DMA_FROM_DEVICE : DMA_TO_DEVICE, > +attrs); > } else { > if (!vq->do_unmap) > goto out; > > - dma_unmap_page(vring_dma_dev(vq), > -extra[i].addr, > -extra[i].len, > -(flags & VRING_DESC_F_WRITE) ? > -DMA_FROM_DEVICE : DMA_TO_DEVICE); > + dma_unmap_page_attrs(vring_dma_dev(vq), > + extra[i].addr, > + extra[i].len, > + (flags & VRING_DESC_F_WRITE) ? > + DMA_FROM_DEVICE : DMA_TO_DEVICE, > + attrs); > } > > out: > @@ -717,10 +723,10 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > if (i == err_idx) > break; > if (indirect) { > - vring_unmap_one_split_indirect(vq, &desc[i]); > + vring_unmap_one_split_indirect(vq, &desc[i], true); > i = virtio16_to_cpu(_vq->vdev, desc[i].next); > } else > - i = vring_unmap_one_split(vq, i); > + i = vring_unmap_one_split(vq, i, true); > } > > free_indirect: > @@ -775,12 +781,12 @@ static void detach_buf_split(struct vring_virtqueue > *vq, unsigned int head, > i = head; > > while (vq->split.vring.desc[i].flags & nextflag) { > - vring_unmap_one_split(vq, i); > + vring_unmap_one_split(vq, i, false); > i = vq->split.desc_extra[i].next; > vq->vq.num_free++; > } > > - vring_unmap_one_split(vq, i); > + vring_unmap_one_split(vq, i, false); > vq->split.desc_extra[i].next = vq->free_head; > vq->free_head = head; > > @@ -804,7 +81
Re: [PATCH 2/2] rcu/nocb: Fix missed RCU barrier on deoffloading
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h > index 16865475120b..2605dd234a13 100644 > --- a/kernel/rcu/tree_nocb.h > +++ b/kernel/rcu/tree_nocb.h > @@ -891,7 +891,18 @@ static void nocb_cb_wait(struct rcu_data *rdp) > swait_event_interruptible_exclusive(rdp->nocb_cb_wq, > nocb_cb_wait_cond(rdp)); > if (kthread_should_park()) { > - kthread_parkme(); > + /* > + * kthread_park() must be preceded by an rcu_barrier(). > + * But yet another rcu_barrier() might have sneaked in between > + * the barrier callback execution and the callbacks counter > + * decrement. > + */ > + if (rdp->nocb_cb_sleep) { Is READ_ONCE() not required here? - Neeraj > + rcu_nocb_lock_irqsave(rdp, flags); > + WARN_ON_ONCE(rcu_segcblist_n_cbs(&rdp->cblist)); > + rcu_nocb_unlock_irqrestore(rdp, flags); > + kthread_parkme(); > + } > } else if (READ_ONCE(rdp->nocb_cb_sleep)) { > WARN_ON(signal_pending(current)); > trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WokeEmpty"));
Re: [syzbot] [acpi?] [nvdimm?] KASAN: vmalloc-out-of-bounds Read in acpi_nfit_ctl (2)
#syz test On Tue, Nov 5, 2024 at 8:58 PM syzbot < syzbot+7534f060ebda6b8b5...@syzkaller.appspotmail.com> wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit:2e1b3cc9d7f7 Merge tag 'arm-fixes-6.12-2' of > git://git.ker.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=12418e3058 > kernel config: https://syzkaller.appspot.com/x/.config?x=11254d3590b16717 > dashboard link: > https://syzkaller.appspot.com/bug?extid=7534f060ebda6b8b51b3 > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for > Debian) 2.40 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12170f4058 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=16418e3058 > > Downloadable assets: > disk image (non-bootable): > https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-2e1b3cc9.raw.xz > vmlinux: > https://storage.googleapis.com/syzbot-assets/2f2588b04ae9/vmlinux-2e1b3cc9.xz > kernel image: > https://storage.googleapis.com/syzbot-assets/2c9324cf16df/bzImage-2e1b3cc9.xz > > IMPORTANT: if you fix the issue, please add the following tag to the > commit: > Reported-by: syzbot+7534f060ebda6b8b5...@syzkaller.appspotmail.com > > == > BUG: KASAN: vmalloc-out-of-bounds in cmd_to_func > drivers/acpi/nfit/core.c:416 [inline] > BUG: KASAN: vmalloc-out-of-bounds in acpi_nfit_ctl+0x20e8/0x24a0 > drivers/acpi/nfit/core.c:459 > Read of size 4 at addr c9e0e038 by task syz-executor229/5316 > > CPU: 0 UID: 0 PID: 5316 Comm: syz-executor229 Not tainted > 6.12.0-rc6-syzkaller-00077-g2e1b3cc9d7f7 #0 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS > 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 > Call Trace: > > __dump_stack lib/dump_stack.c:94 [inline] > dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 > print_address_description mm/kasan/report.c:377 [inline] > print_report+0x169/0x550 mm/kasan/report.c:488 > kasan_report+0x143/0x180 mm/kasan/report.c:601 > cmd_to_func drivers/acpi/nfit/core.c:416 [inline] > acpi_nfit_ctl+0x20e8/0x24a0 drivers/acpi/nfit/core.c:459 > __nd_ioctl drivers/nvdimm/bus.c:1186 [inline] > nd_ioctl+0x1844/0x1fd0 drivers/nvdimm/bus.c:1264 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:907 [inline] > __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 > entry_SYSCALL_64_after_hwframe+0x77/0x7f > RIP: 0033:0x7fb399ccda79 > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 c1 17 00 00 90 48 89 f8 48 89 f7 > 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff > ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 > RSP: 002b:7ffcf6cb8d88 EFLAGS: 0246 ORIG_RAX: 0010 > RAX: ffda RBX: RCX: 7fb399ccda79 > RDX: 2180 RSI: c008640a RDI: 0003 > RBP: 7fb399d405f0 R08: 0006 R09: 0006 > R10: 0006 R11: 0246 R12: 0001 > R13: 431bde82d7b634db R14: 0001 R15: 0001 > > > The buggy address belongs to the virtual mapping at > [c9e0e000, c9e1) created by: > __nd_ioctl drivers/nvdimm/bus.c:1169 [inline] > nd_ioctl+0x1594/0x1fd0 drivers/nvdimm/bus.c:1264 > > The buggy address belongs to the physical page: > page: refcount:1 mapcount:0 mapping: > index:0x8880401b9a80 pfn:0x401b9 > flags: 0x4fff000(node=1|zone=1|lastcpupid=0x7ff) > raw: 04fff000 dead0122 > raw: 8880401b9a80 0001 > page dumped because: kasan: bad access detected > page_owner tracks the page as allocated > page last allocated via order 0, migratetype Unmovable, gfp_mask > 0x2cc2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_NOWARN), pid 5316, tgid 5316 > (syz-executor229), ts 69039468240, free_ts 68666765389 > set_page_owner include/linux/page_owner.h:32 [inline] > post_alloc_hook+0x1f3/0x230 mm/page_alloc.c:1537 > prep_new_page mm/page_alloc.c:1545 [inline] > get_page_from_freelist+0x303f/0x3190 mm/page_alloc.c:3457 > __alloc_pages_noprof+0x292/0x710 mm/page_alloc.c:4733 > alloc_pages_bulk_noprof+0x729/0xd40 mm/page_alloc.c:4681 > alloc_pages_bulk_array_mempolicy_noprof+0x8ea/0x1600 mm/mempolicy.c:2556 > vm_area_alloc_pages mm/vmalloc.c:3542 [inline] > __vmalloc_area_node mm/vmalloc.c:3646 [inline] > __vmalloc_node_range_noprof+0x752/0x13f0 mm/vmalloc.c:3828 > __vmalloc_node_noprof mm/vmalloc.c:3893 [inline] > vmalloc_noprof+0x79/0x90 mm/vmalloc.c:3926 > __nd_ioctl drivers/nvdimm/bus.c:1169 [inline] > nd_ioctl+0x1594/0x1fd0 drivers/nvdimm/bus.c:1264 > vfs_ioctl fs/ioctl.c:51 [inline] > __do_sys_ioctl fs/ioctl.c:907 [inline] > __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 > do_syscall_x64 arch/x86/entry/common.c:52 [inline] > do_syscall_6
Re: [syzbot] [acpi?] [nvdimm?] KASAN: vmalloc-out-of-bounds Read in acpi_nfit_ctl (2)
Hello, syzbot has tested the proposed patch and the reproducer did not trigger any issue: Reported-by: syzbot+7534f060ebda6b8b5...@syzkaller.appspotmail.com Tested-by: syzbot+7534f060ebda6b8b5...@syzkaller.appspotmail.com Tested on: commit: de2f378f Merge tag 'nfsd-6.12-4' of git://git.kernel.o.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=102594e858 kernel config: https://syzkaller.appspot.com/x/.config?x=64aa0d9945bd5c1 dashboard link: https://syzkaller.appspot.com/bug?extid=7534f060ebda6b8b51b3 compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40 patch: https://syzkaller.appspot.com/x/patch.diff?x=11629ea798 Note: testing is done by a robot and is best-effort only.
[PATCH] kmod: verify module name before invoking modprobe
Sometimes when kernel calls request_module to load a module into kernel space, it doesn't pass the module name appropriately, and request_module doesn't verify it as well. As a result, modprobe is invoked anyway and spend a lot of time searching a nonsense name. For example reported from a customer, he runs a user space process to call ioctl(fd, SIOCGIFINDEX, &ifr), the callstack in kernel is like that: dev_ioctl(net/core/dev_iovtl.c) dev_load request_module("netdev-%s", name); or request_module("%s", name); However if name of NIC is empty, neither dev_load nor request_module checks it at the first place, modprobe will search module "netdev-" in its default path, env path and path configured in etc for nothing, increase a lot system overhead. To address this problem, this patch copies va_list and introduces a helper is_module_name_valid to verify the parameters validity one by one, either null or empty. if it fails, no modprobe invoked. Signed-off-by: Song Chen --- kernel/module/kmod.c | 29 - 1 file changed, 28 insertions(+), 1 deletion(-) diff --git a/kernel/module/kmod.c b/kernel/module/kmod.c index 0800d9891692..161ad41b864e 100644 --- a/kernel/module/kmod.c +++ b/kernel/module/kmod.c @@ -113,6 +113,27 @@ static int call_modprobe(char *orig_module_name, int wait) return -ENOMEM; } +static inline bool is_module_name_valid(const char *fmt, va_list args) +{ + va_list args_verify; + bool ret = true; + const char *p, *arg; + + va_copy(args_verify, args); + for (p = fmt; *p; p++) { + if (*p == '%' && *(++p) == 's') { + arg = va_arg(args_verify, const char *); + if (!arg || arg[0] == '\0') { + ret = false; + break; + } + } + } + va_end(args_verify); + + return ret; +} + /** * __request_module - try to load a kernel module * @wait: wait (or not) for the operation to complete @@ -147,7 +168,13 @@ int __request_module(bool wait, const char *fmt, ...) return -ENOENT; va_start(args, fmt); - ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args); + if (is_module_name_valid(fmt, args)) + ret = vsnprintf(module_name, MODULE_NAME_LEN, fmt, args); + else { + pr_warn_ratelimited("request_module: modprobe cannot be processed due to invalid module name"); + va_end(args); + return -EINVAL; + } va_end(args); if (ret >= MODULE_NAME_LEN) return -ENAMETOOLONG; -- 2.25.1
[PATCH v2] rcu: Remove READ_ONCE() for rdp->gpwrap access in __note_gp_changes()
There is one access to rdp->gpwrap in the __note_gp_changes() function that does not use READ_ONCE() for protection, while other accesses to rdp->gpwrap do use READ_ONCE(). When using the 8*TREE03 and CONFIG_NR_CPUS=8 configuration, KCSAN found no data races at that point. This is because other functions should hold rnp->lock when calling __note_gp_changes(), which ensures proper exclusion of writes to the rdp->gpwrap fields for all CPUs associated with that leaf rcu_node structure. Therefore, using READ_ONCE() to protect rdp->gpwrap within the __note_gp_changes() function is unnecessary. Signed-off-by: Zilin Guan --- v1 -> v2: Remove READ_ONCE() from accesses to rdp->gpwrap. kernel/rcu/tree.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 77b5b39e19a8..68eb0f746575 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1275,7 +1275,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) /* Handle the ends of any preceding grace periods first. */ if (rcu_seq_completed_gp(rdp->gp_seq, rnp->gp_seq) || - unlikely(READ_ONCE(rdp->gpwrap))) { + unlikely(rdp->gpwrap)) { if (!offloaded) ret = rcu_advance_cbs(rnp, rdp); /* Advance CBs. */ rdp->core_needs_qs = false; @@ -1289,7 +1289,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) /* Now handle the beginnings of any new-to-this-CPU grace periods. */ if (rcu_seq_new_gp(rdp->gp_seq, rnp->gp_seq) || - unlikely(READ_ONCE(rdp->gpwrap))) { + unlikely(rdp->gpwrap)) { /* * If the current grace period is waiting for this CPU, * set up to detect a quiescent state, otherwise don't @@ -1304,7 +1304,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) rdp->gp_seq = rnp->gp_seq; /* Remember new grace-period state. */ if (ULONG_CMP_LT(rdp->gp_seq_needed, rnp->gp_seq_needed) || rdp->gpwrap) WRITE_ONCE(rdp->gp_seq_needed, rnp->gp_seq_needed); - if (IS_ENABLED(CONFIG_PROVE_RCU) && READ_ONCE(rdp->gpwrap)) + if (IS_ENABLED(CONFIG_PROVE_RCU) && rdp->gpwrap) WRITE_ONCE(rdp->last_sched_clock, jiffies); WRITE_ONCE(rdp->gpwrap, false); rcu_gpnum_ovf(rnp, rdp); -- 2.34.1
[PATCH bpf v3 1/2] bpf: fix recursive lock when verdict program return SK_PASS
When the stream_verdict program returns SK_PASS, it places the received skb into its own receive queue, but a recursive lock eventually occurs, leading to an operating system deadlock. This issue has been present since v6.9. ''' sk_psock_strp_data_ready write_lock_bh(&sk->sk_callback_lock) strp_data_ready strp_read_sock read_sock -> tcp_read_sock strp_recv cb.rcv_msg -> sk_psock_strp_read # now stream_verdict return SK_PASS without peer sock assign __SK_PASS = sk_psock_map_verd(SK_PASS, NULL) sk_psock_verdict_apply sk_psock_skb_ingress_self sk_psock_skb_ingress_enqueue sk_psock_data_ready read_lock_bh(&sk->sk_callback_lock) <= dead lock ''' This topic has been discussed before, but it has not been fixed. Previous discussion: https://lore.kernel.org/all/6684a5864ec86_403d20898@john.notmuch Fixes: 6648e613226e ("bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue") Reported-by: Vincent Whitchurch Signed-off-by: Jiayuan Chen Signed-off-by: John Fastabend --- net/core/skmsg.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index b1dcbd3be89e..e90fbab703b2 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -1117,9 +1117,9 @@ static void sk_psock_strp_data_ready(struct sock *sk) if (tls_sw_has_ctx_rx(sk)) { psock->saved_data_ready(sk); } else { - write_lock_bh(&sk->sk_callback_lock); + read_lock_bh(&sk->sk_callback_lock); strp_data_ready(&psock->strp); - write_unlock_bh(&sk->sk_callback_lock); + read_unlock_bh(&sk->sk_callback_lock); } } rcu_read_unlock(); -- 2.43.5
[PATCH bpf v3 2/2] selftests/bpf: Add some tests with sockmap SK_PASS
1. Add a new tests in sockmap_basic.c to test SK_PASS for sockmap 2. The return value of 'sk_skb/stream_parser' is used as a length, but the current eBPF program returns SK_PASS, which is semantically incorrect. This change modifies it to return skb->len. All tests related to this eBPF program have been tested (currently only in sockmap_basic.c). All tests are passed. Signed-off-by: Jiayuan Chen --- test result 310/1 sockmap_basic/sockmap create_update_free:OK 310/2 sockmap_basic/sockhash create_update_free:OK 310/3 sockmap_basic/sockmap sk_msg load helpers:OK 310/4 sockmap_basic/sockhash sk_msg load helpers:OK 310/5 sockmap_basic/sockmap update:OK 310/6 sockmap_basic/sockhash update:OK 310/7 sockmap_basic/sockmap update in unsafe context:OK 310/8 sockmap_basic/sockmap copy:OK 310/9 sockmap_basic/sockhash copy:OK 310/10 sockmap_basic/sockmap skb_verdict attach:OK 310/11 sockmap_basic/sockmap skb_verdict attach_with_link:OK 310/12 sockmap_basic/sockmap msg_verdict progs query:OK 310/13 sockmap_basic/sockmap stream_parser progs query:OK 310/14 sockmap_basic/sockmap stream_verdict progs query:OK 310/15 sockmap_basic/sockmap skb_verdict progs query:OK 310/16 sockmap_basic/sockmap skb_verdict shutdown:OK 310/17 sockmap_basic/sockmap stream_parser and stream_verdict pass:OK 310/18 sockmap_basic/sockmap skb_verdict fionread:OK 310/19 sockmap_basic/sockmap skb_verdict fionread on drop:OK 310/20 sockmap_basic/sockmap skb_verdict msg_f_peek:OK 310/21 sockmap_basic/sockmap skb_verdict msg_f_peek with link:OK 310/22 sockmap_basic/sockmap unconnected af_unix:OK 310/23 sockmap_basic/sockmap one socket to many map entries:OK 310/24 sockmap_basic/sockmap one socket to many maps:OK 310/25 sockmap_basic/sockmap same socket replace:OK 310/26 sockmap_basic/sockmap sk_msg attach sockmap helpers with link:OK 310/27 sockmap_basic/sockhash sk_msg attach sockhash helpers with link:OK 310 sockmap_basic:OK --- .../selftests/bpf/prog_tests/sockmap_basic.c | 54 +++ .../bpf/progs/test_sockmap_pass_prog.c| 2 +- 2 files changed, 55 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c index 82bfb266741c..59eafd0115df 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c @@ -501,6 +501,58 @@ static void test_sockmap_skb_verdict_shutdown(void) test_sockmap_pass_prog__destroy(skel); } +static void test_sockmap_stream_pass(void) +{ + int zero = 0, sent, recvd; + int verdict, parser; + int err, map; + int c = -1, p = -1; + struct test_sockmap_pass_prog *pass = NULL; + char snd[256] = "0123456789"; + char rcv[256] = "0"; + + pass = test_sockmap_pass_prog__open_and_load(); + verdict = bpf_program__fd(pass->progs.prog_skb_verdict); + parser = bpf_program__fd(pass->progs.prog_skb_parser); + map = bpf_map__fd(pass->maps.sock_map_rx); + + err = bpf_prog_attach(parser, map, BPF_SK_SKB_STREAM_PARSER, 0); + if (!ASSERT_OK(err, "bpf_prog_attach stream parser")) + goto out; + + err = bpf_prog_attach(verdict, map, BPF_SK_SKB_STREAM_VERDICT, 0); + if (!ASSERT_OK(err, "bpf_prog_attach stream verdict")) + goto out; + + err = create_pair(AF_INET, SOCK_STREAM, &c, &p); + if (err) + goto out; + + /* sk_data_ready of 'p' will be replaced by strparser handler */ + err = bpf_map_update_elem(map, &zero, &p, BPF_NOEXIST); + if (!ASSERT_OK(err, "bpf_map_update_elem(p)")) + goto out_close; + + /* +* as 'prog_skb_parser' return the original skb len and +* 'prog_skb_verdict' return SK_PASS, the kernel will just +* pass it through to original socket 'p' +*/ + sent = xsend(c, snd, sizeof(snd), 0); + ASSERT_EQ(sent, sizeof(snd), "xsend(c)"); + + recvd = recv_timeout(p, rcv, sizeof(rcv), SOCK_NONBLOCK, +IO_TIMEOUT_SEC); + ASSERT_EQ(recvd, sizeof(rcv), "recv_timeout(p)"); + +out_close: + close(c); + close(p); + +out: + test_sockmap_pass_prog__destroy(pass); +} + static void test_sockmap_skb_verdict_fionread(bool pass_prog) { int err, map, verdict, c0 = -1, c1 = -1, p0 = -1, p1 = -1; @@ -923,6 +975,8 @@ void test_sockmap_basic(void) test_sockmap_progs_query(BPF_SK_SKB_VERDICT); if (test__start_subtest("sockmap skb_verdict shutdown")) test_sockmap_skb_verdict_shutdown(); + if (test__start_subtest("sockmap stream_parser and stream_verdict pass")) + test_sockmap_stream_pass(); if (test__start_subtest("sockmap skb_verdict fionread")) test_sockmap_skb_verdict_fionread(true); if (test__start_subtest("sockmap skb_verdict fionread on dr
[PATCH bpf v3 0/2] bpf: fix recursive lock and add test
1. fix recursive lock when ebpf prog return SK_PASS. 2. add selftest to reproduce recursive lock. Note that if just the selftest merged without first patch, the test case will definitely fail, because the issue of deadlock is inevitable. --- v2->v3: fix line length reported by patchwork. (max_line_length is set to 80 in patchwork but default is 100 in kernel tree) v1->v2: 1.inspired by martin.lau to add selftest to reproduce the issue. 2. follow the community rules for patch. v1: https://lore.kernel.org/bpf/55fc6114-7e64-4b65-86d2-92cfd1e9e...@linux.dev/T/#u --- Jiayuan Chen (2): bpf: fix recursive lock when verdict program return SK_PASS selftests/bpf: Add some tests with sockmap SK_PASS net/core/skmsg.c | 4 +- .../selftests/bpf/prog_tests/sockmap_basic.c | 54 +++ .../bpf/progs/test_sockmap_pass_prog.c| 2 +- 3 files changed, 57 insertions(+), 3 deletions(-) -- 2.43.5
Re: [PATCH net-next v11 07/23] ovpn: introduce the ovpn_socket object
On 29.10.2024 12:47, Antonio Quartulli wrote: This specific structure is used in the ovpn kernel module to wrap and carry around a standard kernel socket. ovpn takes ownership of passed sockets and therefore an ovpn specific objects is attached to them for status tracking purposes. Initially only UDP support is introduced. TCP will come in a later patch. Signed-off-by: Antonio Quartulli [...] diff --git a/drivers/net/ovpn/socket.c b/drivers/net/ovpn/socket.c new file mode 100644 index ..090a3232ab0ec19702110f1a90f45c7f10889f6f --- /dev/null +++ b/drivers/net/ovpn/socket.c @@ -0,0 +1,120 @@ +// SPDX-License-Identifier: GPL-2.0 +/* OpenVPN data channel offload + * + * Copyright (C) 2020-2024 OpenVPN, Inc. + * + * Author:James Yonan + * Antonio Quartulli + */ + +#include +#include + +#include "ovpnstruct.h" +#include "main.h" +#include "io.h" +#include "peer.h" +#include "socket.h" +#include "udp.h" + +static void ovpn_socket_detach(struct socket *sock) +{ + if (!sock) + return; + + sockfd_put(sock); +} + +/** + * ovpn_socket_release_kref - kref_put callback + * @kref: the kref object + */ +void ovpn_socket_release_kref(struct kref *kref) +{ + struct ovpn_socket *sock = container_of(kref, struct ovpn_socket, + refcount); + + ovpn_socket_detach(sock->sock); + kfree_rcu(sock, rcu); +} + +static bool ovpn_socket_hold(struct ovpn_socket *sock) +{ + return kref_get_unless_zero(&sock->refcount); Why do we need to wrap this kref acquiring call into the function. Why we cannot simply call kref_get_unless_zero() from ovpn_socket_get()? +} + +static struct ovpn_socket *ovpn_socket_get(struct socket *sock) +{ + struct ovpn_socket *ovpn_sock; + + rcu_read_lock(); + ovpn_sock = rcu_dereference_sk_user_data(sock->sk); + if (!ovpn_socket_hold(ovpn_sock)) { + pr_warn("%s: found ovpn_socket with ref = 0\n", __func__); Should we be more specific here and print warning with netdev_warn(ovpn_sock->ovpn->dev, ...)? And, BTW, how we can pick-up a half-destroyed socket? + ovpn_sock = NULL; + } + rcu_read_unlock(); + + return ovpn_sock; +} + +static int ovpn_socket_attach(struct socket *sock, struct ovpn_peer *peer) +{ + int ret = -EOPNOTSUPP; + + if (!sock || !peer) + return -EINVAL; + + if (sock->sk->sk_protocol == IPPROTO_UDP) + ret = ovpn_udp_socket_attach(sock, peer->ovpn); + + return ret; +} + +/** + * ovpn_socket_new - create a new socket and initialize it + * @sock: the kernel socket to embed + * @peer: the peer reachable via this socket + * + * Return: an openvpn socket on success or a negative error code otherwise + */ +struct ovpn_socket *ovpn_socket_new(struct socket *sock, struct ovpn_peer *peer) +{ + struct ovpn_socket *ovpn_sock; + int ret; + + ret = ovpn_socket_attach(sock, peer); + if (ret < 0 && ret != -EALREADY) + return ERR_PTR(ret); + + /* if this socket is already owned by this interface, just increase the +* refcounter and use it as expected. +* +* Since UDP sockets can be used to talk to multiple remote endpoints, +* openvpn normally instantiates only one socket and shares it among all +* its peers. For this reason, when we find out that a socket is already +* used for some other peer in *this* instance, we can happily increase +* its refcounter and use it normally. +*/ + if (ret == -EALREADY) { + /* caller is expected to increase the sock refcounter before +* passing it to this function. For this reason we drop it if +* not needed, like when this socket is already owned. +*/ + ovpn_sock = ovpn_socket_get(sock); + sockfd_put(sock); + return ovpn_sock; + } + + ovpn_sock = kzalloc(sizeof(*ovpn_sock), GFP_KERNEL); + if (!ovpn_sock) + return ERR_PTR(-ENOMEM); + + ovpn_sock->ovpn = peer->ovpn; + ovpn_sock->sock = sock; + kref_init(&ovpn_sock->refcount); + + rcu_assign_sk_user_data(sock->sk, ovpn_sock); + + return ovpn_sock; +} diff --git a/drivers/net/ovpn/socket.h b/drivers/net/ovpn/socket.h new file mode 100644 index ..5ad9c5073b085482da95ee8ebf40acf20bf2e4b3 --- /dev/null +++ b/drivers/net/ovpn/socket.h @@ -0,0 +1,48 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* OpenVPN data channel offload + * + * Copyright (C) 2020-2024 OpenVPN, Inc. + * + * Author:James Yonan + * Antonio Quartulli + */ + +#ifndef _NET_OVPN_SOCK_H_ +#define _NET_OVPN_SOCK_H_ + +#include +#include +#include + +struct ovpn_struct; +struct ovpn_peer; + +/** + * struct ovpn_socket - a kernel socket ref
Re: [PATCH v4 3/6] iio: light: stk3310: Implement vdd and leda supplies
On Mon, Nov 04, 2024 at 10:35:42AM +0200, Andy Shevchenko wrote: > On Sat, Nov 02, 2024 at 03:50:39PM -0400, Aren Moynihan wrote: > > The vdd and leda supplies must be powered on for the chip to function > > and can be powered off during system suspend. > > > > This was originally based on a patch by Ondrej Jirman[1], but has been > > rewritten since. > > > > 1: > > https://codeberg.org/megi/linux/commit/a933aff8b7a0e6e3c9cf1d832dcba07022bbfa82 > > Make it a Link tag... > > > > > ...here > > Link: > https://codeberg.org/megi/linux/commit/a933aff8b7a0e6e3c9cf1d832dcba07022bbfa82 > [1] Makes sense > > Signed-off-by: Aren Moynihan > > ... > > > + ret = devm_regulator_bulk_get(&client->dev, ARRAY_SIZE(data->supplies), > > + data->supplies); > > + if (ret) > > + return dev_err_probe(&client->dev, ret, "get regulators > > failed\n"); > > With previously introduced temporary 'dev' variable these become: > > ret = devm_regulator_bulk_get(dev, ARRAY_SIZE(data->supplies), > data->supplies); > if (ret) > return dev_err_probe(dev, ret, "get regulators failed\n"); > > ... > > > + ret = stk3310_regulators_enable(data); > > + if (ret) > > + return dev_err_probe(&client->dev, ret, > > +"regulator enable failed\n"); > > + > > + ret = devm_add_action_or_reset(&client->dev, > > stk3310_regulators_disable, data); > > + if (ret) > > + return dev_err_probe(&client->dev, ret, > > +"failed to register regulator cleanup\n"); > > So do these... > Cleaning all of these up > > + ret = regulator_bulk_disable(ARRAY_SIZE(data->supplies), > > data->supplies); > > Is array_size.h included? It's not, it looks like it came one of the headers that's already included. I'll add it explicitly. Thanks - Aren
Re: [PATCH v4 5/6] iio: light: stk3310: log error if reading the chip id fails
On Mon, Nov 04, 2024 at 10:41:11AM +0200, Andy Shevchenko wrote: > On Sat, Nov 02, 2024 at 03:50:43PM -0400, Aren Moynihan wrote: > > If the chip isn't powered, this call is likely to return an error. > > Without a log here the driver will silently fail to probe. Potential > > errors include ENXIO (when the chip isn't powered) and ETIMEDOUT (when > > the i2c bus isn't powered). > > > > This function is only called from stk3310_probe, and this condition > > should return an error, which fits what dev_err_probe is designed for. > > ... > > > + return dev_err_probe(dev, ret, "failed to read chip id\n"); > > Please, make sure you have consistent style in the messages. Most of what > I have seen use period at the end. This one doesn't. All but two log messages in this driver don't have a period at the end. I'll correct those two in the next revision. - Aren
Re: [PATCH v4 4/6] iio: light: stk3310: use dev_err_probe where possible
On Mon, Nov 04, 2024 at 10:40:16AM +0200, Andy Shevchenko wrote: > On Sat, Nov 02, 2024 at 03:50:41PM -0400, Aren Moynihan wrote: > > Using dev_err_probe instead of dev_err and return makes the errors > > easier to understand by including the error name, and saves a little > > code. > > ... > > > #define STK3310_REGFIELD(name) > > \ > > do {\ > > data->reg_##name = \ > > - devm_regmap_field_alloc(&client->dev, regmap, \ > > + devm_regmap_field_alloc(dev, regmap,\ > > stk3310_reg_field_##name); \ > > - if (IS_ERR(data->reg_##name)) { \ > > - dev_err(&client->dev, "reg field alloc failed.\n"); \ > > - return PTR_ERR(data->reg_##name); \ > > - } \ > > + if (IS_ERR(data->reg_##name)) \ > > > + return dev_err_probe(dev, \ > > + PTR_ERR(data->reg_##name), \ > > AFAICS these two can be put on one. This doesn't leave room for whitespace between the end of line and "\", replacing "do { } while (0)" with "({ })" and deindenting could make enough room to clean this up the formatting of this macro though. > > + "reg field alloc failed.\n"); \ > > } while (0) > > > ... > > > @@ -654,12 +652,11 @@ static int stk3310_probe(struct i2c_client *client) > > int ret; > > struct iio_dev *indio_dev; > > struct stk3310_data *data; > > + struct device *dev = &client->dev; > > This should has been done a few patches earlier... Moving it there now - Aren
Re: [PATCH v4 2/6] iio: light: stk3310: handle all remove logic with devm callbacks
Sun, Nov 10, 2024 at 01:38:39PM -0500, Aren kirjoitti: > On Mon, Nov 04, 2024 at 10:32:08AM +0200, Andy Shevchenko wrote: > > On Sat, Nov 02, 2024 at 03:50:37PM -0400, Aren Moynihan wrote: ... > > > + ret = devm_add_action_or_reset(&client->dev, stk3310_set_state_disable, > > > data); > > > > Why not simply 'dev' as in below call? > > I was trying to avoid refactoring the entire function to replace > &client->dev with dev, I'll add a patch for that to the next revision. I'm not talking about refactoring, I'm talking only about the lines that you have touched / added. > > > + if (ret) > > > + return dev_err_probe(dev, ret, "failed to register cleanup > > > function\n"); -- With Best Regards, Andy Shevchenko
Re: [PATCH net-next v11 06/23] ovpn: introduce the ovpn_peer object
On 29.10.2024 12:47, Antonio Quartulli wrote: [...] +static void ovpn_peer_release(struct ovpn_peer *peer) +{ + ovpn_bind_reset(peer, NULL); + nit: this empty line after ovpn_bind_reset() is removed in the 'implement basic TX path (UDP)' patch. What tricks git and it produces a sensless diff with 'ovpn_bind_reset(...)' line beeing removed and then introduced again. If you do not like this empty line then remove it here, please :) + dst_cache_destroy(&peer->dst_cache); + netdev_put(peer->ovpn->dev, &peer->ovpn->dev_tracker); + kfree_rcu(peer, rcu); +} -- Sergey
Re: [PATCH v4 4/6] iio: light: stk3310: use dev_err_probe where possible
Sun, Nov 10, 2024 at 02:14:24PM -0500, Aren kirjoitti: > On Mon, Nov 04, 2024 at 10:40:16AM +0200, Andy Shevchenko wrote: > > On Sat, Nov 02, 2024 at 03:50:41PM -0400, Aren Moynihan wrote: ... > > > #define STK3310_REGFIELD(name) > > > \ > > > do {\ > > > data->reg_##name = \ > > > - devm_regmap_field_alloc(&client->dev, regmap, \ > > > + devm_regmap_field_alloc(dev, regmap,\ > > > stk3310_reg_field_##name); \ > > > - if (IS_ERR(data->reg_##name)) { \ > > > - dev_err(&client->dev, "reg field alloc failed.\n"); \ > > > - return PTR_ERR(data->reg_##name); \ > > > - } \ > > > + if (IS_ERR(data->reg_##name)) \ > > > > > + return dev_err_probe(dev, \ > > > + PTR_ERR(data->reg_##name), \ > > > > AFAICS these two can be put on one. > > This doesn't leave room for whitespace between the end of line and "\", Is it a problem? > replacing "do { } while (0)" with "({ })" and deindenting could make > enough room to clean this up the formatting of this macro though. do {} while (0) is C standard, ({}) is not. > > > + "reg field alloc failed.\n"); \ > > > } while (0) -- With Best Regards, Andy Shevchenko
Re: [PATCH v4 4/6] iio: light: stk3310: use dev_err_probe where possible
On Sun, Nov 10, 2024 at 09:52:32PM +0200, Andy Shevchenko wrote: > Sun, Nov 10, 2024 at 02:14:24PM -0500, Aren kirjoitti: > > On Mon, Nov 04, 2024 at 10:40:16AM +0200, Andy Shevchenko wrote: > > > On Sat, Nov 02, 2024 at 03:50:41PM -0400, Aren Moynihan wrote: > > ... > > > > > #define STK3310_REGFIELD(name) > > > > \ > > > > do { > > > > \ > > > > data->reg_##name = > > > > \ > > > > - devm_regmap_field_alloc(&client->dev, regmap, > > > > \ > > > > + devm_regmap_field_alloc(dev, regmap, > > > > \ > > > > stk3310_reg_field_##name); > > > > \ > > > > - if (IS_ERR(data->reg_##name)) { > > > > \ > > > > - dev_err(&client->dev, "reg field alloc > > > > failed.\n"); \ > > > > - return PTR_ERR(data->reg_##name); > > > > \ > > > > - } > > > > \ > > > > + if (IS_ERR(data->reg_##name)) > > > > \ > > > > > > > + return dev_err_probe(dev, > > > > \ > > > > + PTR_ERR(data->reg_##name), > > > > \ > > > > > > AFAICS these two can be put on one. > > > > This doesn't leave room for whitespace between the end of line and "\", > > Is it a problem? It feels a bit camped and not as readable to me: #define STK3310_REGFIELD(name) \ do {\ data->reg_##name = \ devm_regmap_field_alloc(dev, regmap,\ stk3310_reg_field_##name); \ if (IS_ERR(data->reg_##name)) \ return dev_err_probe(dev, PTR_ERR(data->reg_##name),\ "reg field alloc failed.\n"); \ } while (0) Removing a level of indentation makes it much better #define STK3310_REGFIELD(name) ({ \ data->reg_##name = devm_regmap_field_alloc(dev, regmap, \ stk3310_reg_field_##name); \ if (IS_ERR(data->reg_##name)) \ return dev_err_probe(dev, PTR_ERR(data->reg_##name), \ "reg field alloc failed\n"); \ }) > > replacing "do { } while (0)" with "({ })" and deindenting could make > > enough room to clean this up the formatting of this macro though. > > do {} while (0) is C standard, ({}) is not. ({ }) is used throughout the kernel, and is documented as such[1]. I don't see a reason to avoid it, if it helps readability. 1: the "GNU Extensions" section of Documentation/kernel-hacking/hacking.rst - Aren
Re: [PATCH v4 2/6] iio: light: stk3310: handle all remove logic with devm callbacks
On Mon, Nov 04, 2024 at 10:32:08AM +0200, Andy Shevchenko wrote: > On Sat, Nov 02, 2024 at 03:50:37PM -0400, Aren Moynihan wrote: > > Using devm callbacks helps to make the ordering of probe / remove > > operations easier to reason about and removes some duplicate code > > between the probe error path and driver remove. > > Where is SoB? Oops that got lost in a rebase > ... > > > + ret = devm_add_action_or_reset(&client->dev, stk3310_set_state_disable, > > data); > > Why not simply 'dev' as in below call? I was trying to avoid refactoring the entire function to replace &client->dev with dev, I'll add a patch for that to the next revision. > > + if (ret) > > + return dev_err_probe(dev, ret, "failed to register cleanup > > function\n"); > > ... > > > - mutex_init(&data->lock); > > + devm_mutex_init(&client->dev, &data->lock); > > Missed error check, otherwise what's the point? > > > Also can add a temporary variable for 'dev'. Yup, fixing... I need to read the docs / function type more carefully sometimes. Thanks - Aren
Re: [PATCH net-next v11 04/23] ovpn: add basic interface creation/destruction/management routines
Missed the most essential note regarding this patch :) On 29.10.2024 12:47, Antonio Quartulli wrote: +static int ovpn_net_open(struct net_device *dev) +{ + netif_tx_start_all_queues(dev); + return 0; +} + +static int ovpn_net_stop(struct net_device *dev) +{ + netif_tx_stop_all_queues(dev); Here we stop a user generated traffic in downlink. Shall we take care about other kinds of traffic: keepalive, uplink? I believe we should remove all the peers here or at least stop the keepalive generation. But peers removing is better since administratively down is administratively down, meaning user expected full traffic stop in any direction. And even if we only stop the keepalive generation then peer(s) anyway will destroy the tunnel on their side. This way we even should not care about peers removing on the device unregistering. What do you think? + return 0; +}
Re: [PATCH net-next v11 06/23] ovpn: introduce the ovpn_peer object
On 29.10.2024 12:47, Antonio Quartulli wrote: An ovpn_peer object holds the whole status of a remote peer (regardless whether it is a server or a client). This includes status for crypto, tx/rx buffers, napi, etc. Only support for one peer is introduced (P2P mode). Multi peer support is introduced with a later patch. Reviewing the peer creation/destroying code I came to a generic question. Did you consider keeping a single P2P peer in the peers table as well? Looks like such approach can greatly simply the code by dropping all these 'switch (ovpn->mode)' checks and implementing a unified peer management. The 'peer' field in the main private data structure can be kept to accelerate lookups, still using peers table for management tasks like removing all the peers on the interface teardown. Along with the ovpn_peer, also the ovpn_bind object is introcued as the two are strictly related. An ovpn_bind object wraps a sockaddr representing the local coordinates being used to talk to a specific peer. Signed-off-by: Antonio Quartulli --- drivers/net/ovpn/Makefile | 2 + drivers/net/ovpn/bind.c | 58 +++ drivers/net/ovpn/bind.h | 117 ++ Why do we need these bind.c/bind.h files? They contains a minimum of code and still anyway references the peer object. Can we merge these definitions and code into peer.c/peer.h? drivers/net/ovpn/main.c | 11 ++ drivers/net/ovpn/main.h | 2 + drivers/net/ovpn/ovpnstruct.h | 4 + drivers/net/ovpn/peer.c | 354 ++ drivers/net/ovpn/peer.h | 79 ++ 8 files changed, 627 insertions(+) [...] diff --git a/drivers/net/ovpn/bind.c b/drivers/net/ovpn/bind.c new file mode 100644 index ..b4d2ccec2ceddf43bc445b489cc62a578ef0ad0a --- /dev/null +++ b/drivers/net/ovpn/bind.c @@ -0,0 +1,58 @@ +// SPDX-License-Identifier: GPL-2.0 +/* OpenVPN data channel offload + * + * Copyright (C) 2012-2024 OpenVPN, Inc. + * + * Author:James Yonan + * Antonio Quartulli + */ + +#include +#include + +#include "ovpnstruct.h" +#include "bind.h" +#include "peer.h" + +/** + * ovpn_bind_from_sockaddr - retrieve binding matching sockaddr + * @ss: the sockaddr to match + * + * Return: the bind matching the passed sockaddr if found, NULL otherwise The function returns ERR_PTR() in case of error, the comment should be updated. + */ +struct ovpn_bind *ovpn_bind_from_sockaddr(const struct sockaddr_storage *ss) +{ + struct ovpn_bind *bind; + size_t sa_len; + + if (ss->ss_family == AF_INET) + sa_len = sizeof(struct sockaddr_in); + else if (ss->ss_family == AF_INET6) + sa_len = sizeof(struct sockaddr_in6); + else + return ERR_PTR(-EAFNOSUPPORT); + + bind = kzalloc(sizeof(*bind), GFP_ATOMIC); + if (unlikely(!bind)) + return ERR_PTR(-ENOMEM); + + memcpy(&bind->remote, ss, sa_len); + + return bind; +} + +/** + * ovpn_bind_reset - assign new binding to peer + * @peer: the peer whose binding has to be replaced + * @new: the new bind to assign + */ +void ovpn_bind_reset(struct ovpn_peer *peer, struct ovpn_bind *new) +{ + struct ovpn_bind *old; + + spin_lock_bh(&peer->lock); + old = rcu_replace_pointer(peer->bind, new, true); + spin_unlock_bh(&peer->lock); Locking will be removed from this function in the subsequent patch. Should we move the peer->lock usage to ovpn_peer_release() now? + + kfree_rcu(old, rcu); +} diff --git a/drivers/net/ovpn/bind.h b/drivers/net/ovpn/bind.h new file mode 100644 index ..859213d5040deb36c416eafcf5c6ab31c4d52c7a --- /dev/null +++ b/drivers/net/ovpn/bind.h @@ -0,0 +1,117 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* OpenVPN data channel offload + * + * Copyright (C) 2012-2024 OpenVPN, Inc. + * + * Author:James Yonan + * Antonio Quartulli + */ + +#ifndef _NET_OVPN_OVPNBIND_H_ +#define _NET_OVPN_OVPNBIND_H_ + +#include +#include +#include +#include +#include +#include + +struct ovpn_peer; + +/** + * union ovpn_sockaddr - basic transport layer address Why do we need this dedicated named union? Can we merge this union into the ovpn_bind struct as already done for the local address? + * @in4: IPv4 address + * @in6: IPv6 address + */ +union ovpn_sockaddr { Family type can be putted here as a dedicated element to make address type check simple: unsigned short int sa_family; + struct sockaddr_in in4; + struct sockaddr_in6 in6; +};> + +/** + * struct ovpn_bind - remote peer binding + * @remote: the remote peer sockaddress + * @local: local endpoint used to talk to the peer + * @local.ipv4: local IPv4 used to talk to the peer + * @local.ipv6: local IPv6 used to talk to the peer + * @rcu: used to schedule RCU cleanup job + */ +struct ovpn_bind { +
Re: [PATCH net] selftests: wireguard: load nf_conntrack if it's not present
On Thu, Nov 07, 2024 at 02:44:18AM +, Hangbin Liu wrote: > Some distros may not load nf_conntrack by default, which will cause > subsequent nf_conntrack settings to fail. Let's load this module if it's > not loaded by default. > > Fixes: e7096c131e51 ("net: WireGuard secure network tunnel") > Signed-off-by: Hangbin Liu > --- > tools/testing/selftests/wireguard/netns.sh | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/tools/testing/selftests/wireguard/netns.sh > b/tools/testing/selftests/wireguard/netns.sh > index 405ff262ca93..508b391e8d9a 100755 > --- a/tools/testing/selftests/wireguard/netns.sh > +++ b/tools/testing/selftests/wireguard/netns.sh > @@ -66,6 +66,7 @@ cleanup() { > orig_message_cost="$(< /proc/sys/net/core/message_cost)" > trap cleanup EXIT > printf 0 > /proc/sys/net/core/message_cost > +lsmod | grep -q nf_conntrack || modprobe nf_conntrack Hi Hangbin, As modprobe should be idempotent both for the case were nf_conntrack is built-in (I'm unsure if that case can ever occur) and the module has already been inserted, I think you simply use: modprobe nf_conntrack Of course, if nf_conntrack isn't available at all, then this will fail. But that was the case with your patch too. And so I assume it is intended. > > ip netns del $netns0 2>/dev/null || true > ip netns del $netns1 2>/dev/null || true > -- > 2.46.0 > >
Re: [PATCH net-next v11 08/23] ovpn: implement basic TX path (UDP)
On 29.10.2024 12:47, Antonio Quartulli wrote: Packets sent over the ovpn interface are processed and transmitted to the connected peer, if any. Implementation is UDP only. TCP will be added by a later patch. Note: no crypto/encapsulation exists yet. packets are just captured and sent. Signed-off-by: Antonio Quartulli --- drivers/net/ovpn/io.c | 138 +++- drivers/net/ovpn/peer.c | 37 +++- drivers/net/ovpn/peer.h | 4 + drivers/net/ovpn/skb.h | 51 +++ drivers/net/ovpn/udp.c | 232 drivers/net/ovpn/udp.h | 8 ++ 6 files changed, 468 insertions(+), 2 deletions(-) diff --git a/drivers/net/ovpn/io.c b/drivers/net/ovpn/io.c index ad3813419c33cbdfe7e8ad6f5c8b444a3540a69f..77ba4d33ae0bd2f52e8bd1c06a182d24285297b4 100644 --- a/drivers/net/ovpn/io.c +++ b/drivers/net/ovpn/io.c @@ -9,14 +9,150 @@ #include #include +#include #include "io.h" +#include "ovpnstruct.h" +#include "peer.h" +#include "udp.h" +#include "skb.h" +#include "socket.h" + +static void ovpn_encrypt_post(struct sk_buff *skb, int ret) +{ + struct ovpn_peer *peer = ovpn_skb_cb(skb)->peer; + + if (unlikely(ret < 0)) + goto err; + + skb_mark_not_on_list(skb); + + switch (peer->sock->sock->sk->sk_protocol) { + case IPPROTO_UDP: + ovpn_udp_send_skb(peer->ovpn, peer, skb); + break; + default: + /* no transport configured yet */ + goto err; + } Did you consider calling protocol specific sending function indirectly? E.g.: peer->sock->send(peer, skb); + /* skb passed down the stack - don't free it */ + skb = NULL; +err: + if (unlikely(skb)) + dev_core_stats_tx_dropped_inc(peer->ovpn->dev); + ovpn_peer_put(peer); + kfree_skb(skb); +} + +static bool ovpn_encrypt_one(struct ovpn_peer *peer, struct sk_buff *skb) +{ + ovpn_skb_cb(skb)->peer = peer; + + /* take a reference to the peer because the crypto code may run async. +* ovpn_encrypt_post() will release it upon completion +*/ + if (unlikely(!ovpn_peer_hold(peer))) { + DEBUG_NET_WARN_ON_ONCE(1); + return false; + } + + ovpn_encrypt_post(skb, 0); + return true; +} + +/* send skb to connected peer, if any */ +static void ovpn_send(struct ovpn_struct *ovpn, struct sk_buff *skb, + struct ovpn_peer *peer) +{ + struct sk_buff *curr, *next; + + if (likely(!peer)) + /* retrieve peer serving the destination IP of this packet */ + peer = ovpn_peer_get_by_dst(ovpn, skb); + if (unlikely(!peer)) { + net_dbg_ratelimited("%s: no peer to send data to\n", + ovpn->dev->name); + dev_core_stats_tx_dropped_inc(ovpn->dev); + goto drop; + } The function is called only from ovpn_xmit_special() and from ovpn_net_xmit(). The keepalive always provides a peer object, while ovpn_net_xmit() never do it. If we move the peer lookup call into ovpn_net_xmit() then we can eliminate all the above peer checks. + + /* this might be a GSO-segmented skb list: process each skb +* independently +*/ + skb_list_walk_safe(skb, curr, next) + if (unlikely(!ovpn_encrypt_one(peer, curr))) { + dev_core_stats_tx_dropped_inc(ovpn->dev); + kfree_skb(curr); + } + + /* skb passed over, no need to free */ + skb = NULL; +drop: + if (likely(peer)) + ovpn_peer_put(peer); + kfree_skb_list(skb); +} /* Send user data to the network */ netdev_tx_t ovpn_net_xmit(struct sk_buff *skb, struct net_device *dev) { + struct ovpn_struct *ovpn = netdev_priv(dev); + struct sk_buff *segments, *curr, *next; + struct sk_buff_head skb_list; + __be16 proto; + int ret; + + /* reset netfilter state */ + nf_reset_ct(skb); + + /* verify IP header size in network packet */ + proto = ovpn_ip_check_protocol(skb); + if (unlikely(!proto || skb->protocol != proto)) { + net_err_ratelimited("%s: dropping malformed payload packet\n", + dev->name); + dev_core_stats_tx_dropped_inc(ovpn->dev); + goto drop; + } + + if (skb_is_gso(skb)) { + segments = skb_gso_segment(skb, 0); + if (IS_ERR(segments)) { + ret = PTR_ERR(segments); + net_err_ratelimited("%s: cannot segment packet: %d\n", + dev->name, ret); + dev_core_stats_tx_dropped_inc(ovpn->dev); + goto drop; + } + + consume_skb(skb); + skb
Re: [PATCH v4 2/6] iio: light: stk3310: handle all remove logic with devm callbacks
On Sun, Nov 10, 2024 at 09:51:04PM +0200, Andy Shevchenko wrote: > Sun, Nov 10, 2024 at 01:38:39PM -0500, Aren kirjoitti: > > On Mon, Nov 04, 2024 at 10:32:08AM +0200, Andy Shevchenko wrote: > > > On Sat, Nov 02, 2024 at 03:50:37PM -0400, Aren Moynihan wrote: > > ... > > > > > + ret = devm_add_action_or_reset(&client->dev, > > > > stk3310_set_state_disable, data); > > > > > > Why not simply 'dev' as in below call? > > > > I was trying to avoid refactoring the entire function to replace > > &client->dev with dev, I'll add a patch for that to the next revision. > > I'm not talking about refactoring, I'm talking only about the lines that you > have touched / added. Ah right, this one makes sense, my comment should have been on the next patch in this series which is a little more complex. For that patch it seemed inconsistent to use dev only in new code and mix it with calls using &client->dev. - Aren
Re: [PATCH net-next v11 08/23] ovpn: implement basic TX path (UDP)
Another one forgotten question, sorry about this. Please find the question inlined. On 29.10.2024 12:47, Antonio Quartulli wrote: /* Send user data to the network */ netdev_tx_t ovpn_net_xmit(struct sk_buff *skb, struct net_device *dev) { + struct ovpn_struct *ovpn = netdev_priv(dev); + struct sk_buff *segments, *curr, *next; + struct sk_buff_head skb_list; + __be16 proto; + int ret; + + /* reset netfilter state */ + nf_reset_ct(skb); + + /* verify IP header size in network packet */ + proto = ovpn_ip_check_protocol(skb); + if (unlikely(!proto || skb->protocol != proto)) { + net_err_ratelimited("%s: dropping malformed payload packet\n", + dev->name); + dev_core_stats_tx_dropped_inc(ovpn->dev); + goto drop; + } The above check implies that kernel can feed a network device with skb->protocol value mismatches actual skb content. Can you share any example of such case? If you just want to be sure that the user packet is either IPv4 or IPv6 then it can be done like this and without error messages: /* Support only IPv4 or IPv6 traffic transporting */ if (unlikely(skb->protocol == ETH_P_IP || skb->protocol == ETH_P_IPV6)) goto drop; + + if (skb_is_gso(skb)) { + segments = skb_gso_segment(skb, 0); + if (IS_ERR(segments)) { + ret = PTR_ERR(segments); + net_err_ratelimited("%s: cannot segment packet: %d\n", + dev->name, ret); + dev_core_stats_tx_dropped_inc(ovpn->dev); + goto drop; + } + + consume_skb(skb); + skb = segments; + } + + /* from this moment on, "skb" might be a list */ + + __skb_queue_head_init(&skb_list); + skb_list_walk_safe(skb, curr, next) { + skb_mark_not_on_list(curr); + + curr = skb_share_check(curr, GFP_ATOMIC); + if (unlikely(!curr)) { + net_err_ratelimited("%s: skb_share_check failed\n", + dev->name); + dev_core_stats_tx_dropped_inc(ovpn->dev); + continue; + } + + __skb_queue_tail(&skb_list, curr); + } + skb_list.prev->next = NULL; + + ovpn_send(ovpn, skb_list.next, NULL); + + return NETDEV_TX_OK; + +drop: skb_tx_error(skb); - kfree_skb(skb); + kfree_skb_list(skb); return NET_XMIT_DROP; } -- Sergey