On Tuesday 09/09 at 07:05 -0700, Breno Leitao wrote:
> On Mon, Sep 08, 2025 at 01:47:24PM -0700, Calvin Owens wrote:
> > On Friday 09/05 at 10:25 -0700, Breno Leitao wrote:
> > > commit efa95b01da18 ("netpoll: fix use after free") incorrectly
> > > ignored the refcount and prematurely set dev->npinfo to NULL during
> > > netpoll cleanup, leading to improper behavior and memory leaks.
> > >
> > > Scenario causing lack of proper cleanup:
> > >
> > > 1) A netpoll is associated with a NIC (e.g., eth0) and netdev->npinfo is
> > >    allocated, and refcnt = 1
> > >    - Keep in mind that npinfo is shared among all netpoll instances. In
> > >      this case, there is just one.
> > >
> > > 2) Another netpoll is also associated with the same NIC and
> > >    npinfo->refcnt += 1.
> > >    - Now dev->npinfo->refcnt = 2;
> > >    - There is just one npinfo associated to the netdev.
> > >
> > > 3) When the first netpolls goes to clean up:
> > >    - The first cleanup succeeds and clears np->dev->npinfo, ignoring
> > >      refcnt.
> > >      - It basically calls `RCU_INIT_POINTER(np->dev->npinfo, NULL);`
> > >    - Set dev->npinfo = NULL, without proper cleanup
> > >    - No ->ndo_netpoll_cleanup() is either called
> > >
> > > 4) Now the second target tries to clean up
> > >    - The second cleanup fails because np->dev->npinfo is already NULL.
> > >      * In this case, ops->ndo_netpoll_cleanup() was never called, and
> > >        the skb pool is not cleaned as well (for the second netpoll
> > >        instance)
> > >    - This leaks npinfo and skbpool skbs, which is clearly reported by
> > >      kmemleak.
> > >
> > > Revert commit efa95b01da18 ("netpoll: fix use after free") and adds
> > > clarifying comments emphasizing that npinfo cleanup should only happen
> > > once the refcount reaches zero, ensuring stable and correct netpoll
> > > behavior.
> >
> > This makes sense to me.
> >
> > Just curious, did you try the original OOPS reproducer?
> > https://lore.kernel.org/lkml/96b940137a50e5c387687bb4f57de8b0435a653f.1404857349.git.de...@googlers.com/
>
> Yes, but I have not been able to reproduce the problem at all.
> I've have tested it using netdevsim, and here is a quick log of what I
> run:

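For readers following along, the refcount rule from the quoted commit
message can be modelled in a few lines of userspace C. This is only a toy
sketch under stated assumptions: toy_netdev, toy_npinfo and both helpers
are hypothetical names made up for illustration, not the kernel's netpoll
code, and the RCU and locking that the real code has to deal with are
ignored entirely. It only shows the shared npinfo being torn down by
whichever netpoll drops the last reference.

```c
#include <stdlib.h>

/* Toy userspace model. Every name here is hypothetical, not kernel API. */
struct toy_npinfo {
	int refcnt;			/* number of netpolls sharing this npinfo */
};

struct toy_netdev {
	struct toy_npinfo *npinfo;	/* one npinfo per device, shared */
	int cleanup_calls;		/* stands in for ->ndo_netpoll_cleanup() */
};

/* Attach one more netpoll: allocate npinfo on first use, then take a ref. */
static int toy_netpoll_setup(struct toy_netdev *dev)
{
	if (!dev->npinfo) {
		dev->npinfo = calloc(1, sizeof(*dev->npinfo));
		if (!dev->npinfo)
			return -1;
	}
	dev->npinfo->refcnt++;
	return 0;
}

/* Detach one netpoll: only the last reference clears and frees npinfo. */
static void toy_netpoll_cleanup(struct toy_netdev *dev)
{
	struct toy_npinfo *npinfo = dev->npinfo;

	if (!npinfo)
		return;

	/* Other netpolls still use npinfo, so leave dev->npinfo alone. */
	if (--npinfo->refcnt > 0)
		return;

	/* Last user: run the driver-style cleanup hook once, then free. */
	dev->cleanup_calls++;
	dev->npinfo = NULL;
	free(npinfo);
}
```

With two toy_netpoll_setup() calls on the same device, the first
toy_netpoll_cleanup() leaves dev->npinfo in place and the second one frees
it. Clearing the pointer unconditionally in the first cleanup would be the
toy equivalent of steps 3) and 4) in the scenario quoted above.
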
Nice, thanks for clarifying.

I also tried reverting a few commits like [1] around the time that smell
vaguely related, on top of your fix, but the repro still never triggers
anything for me either. I was using virtio interfaces in KVM.

The world may never know :)

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69b0216ac255

> + modprobe netconsole
> + modprobe bonding mode=4
> [ 86.540950] Warning: miimon must be specified, otherwise bonding
> will not detect link failure, speed and duplex which are essential for
> 802.3ad operation
> [ 86.541617] Forcing miimon to 100msec
> [ 86.541893] MII link monitoring set to 100 ms
> + echo +bond0
> [ 86.547802] bonding: bond0 is being created...
> + ifconfig bond0 192.168.56.3 up
> + mkdir /sys/kernel/config/netconsole/blah
> + echo 0
> [ 86.614772] netconsole: network logging has already stopped
> ./run.sh: line 19: echo: write error: Invalid argument
> + echo bond0
> + echo 192.168.56.42
> + echo 1
> [ 86.622318] netconsole: netconsole: local port 6665
> [ 86.622550] netconsole: netconsole: local IPv4 address 0.0.0.0
> [ 86.622819] netconsole: netconsole: interface name 'bond0'
> [ 86.623038] netconsole: netconsole: local ethernet address
> '00:00:00:00:00:00'
> [ 86.623466] netconsole: netconsole: remote port 6666
> [ 86.623675] netconsole: netconsole: remote IPv4 address 192.168.56.42
> [ 86.623924] netconsole: netconsole: remote ethernet address
> ff:ff:ff:ff:ff:ff
> [ 86.624264] netpoll: netconsole: local IP 192.168.56.3
> [ 86.643174] netconsole: network logging started
> + ifenslave bond0 eth1
> [ 86.659899] bond0: (slave eth1): Enslaving as a backup interface
> with a down link
> + ifenslave bond0 eth2
> [ 86.687630] bond0: (slave eth2): Enslaving as a backup interface
> with a down link
> + sleep 3
> + ifenslave -d bond0 eth1
> [ 89.735701] bond0: (slave eth1): Releasing backup interface
> [ 89.737239] bond0: (slave eth1): the permanent HWaddr of slave -
> 06:44:84:94:87:c7 - is still in use by bond - set the HWaddr of slave to a
> different address to avoid conflicts
> + sleep 1
> + echo -bond0
> [ 90.798676] bonding: bond0 is being deleted...
> [ 90.815595] netconsole: network logging stopped on interface bond0
> as it unregistered
> [ 90.816416] bond0 (unregistering): (slave eth2): Releasing backup
> interface
> [ 90.863054] bond0 (unregistering): Released all slaves
> + ls -lR /
> + tail -30
> <snip>
>
> + echo +bond0
> ./run.sh: line 39: /sys/class/net/bonding_masters: Permission denied

I don't get -EACCES here like you seem to, but nothing interesting
happens either.

> + ifconfig bond0 192.168.56.3 up
> SIOCSIFADDR: No such device
> bond0: ERROR while getting interface flags: No such device
> bond0: ERROR while