BTW, have you tried your previous proposed patch and confirmed it
would fix the issue?


Yes, we shared this with the customer and the refcount mismatch still occurred, so this doesn't seem sufficient either.

Could we further distinguish between dsts added to the uncached list by
icmp6_dst_alloc() and those added by xfrm6_fill_dst(), and confirm which
ones are leaking references?
I suspect it would be the xfrm ones, but I think it is worth verifying.


After digging into the DST allocation/destroy paths a bit more, it seems there are cases where a DST's refcount never reaches zero, so the DST is never freed and never releases the references it holds. One case comes from the IPv6 packet output path (these DST structs hold references to both the inet6_dev and the netdevice):

  ip6_pol_route_output+0x20/0x2c
  -> ip6_pol_route+0x1dc/0x34c
  -> rt6_make_pcpu_route+0x18/0xf4
  -> ip6_rt_pcpu_alloc+0xb4/0x19c

We also see two DSTs, stored as the xdst->rt entry on the XFRM path, that do not get released. One is allocated by the same path as above, and the other by this path:

  xfrm6_esp_err+0x7c/0xd4
  -> esp6_err+0xc8/0x100
  -> ip6_update_pmtu+0xc8/0x100
  -> __ip6_rt_update_pmtu+0x248/0x434
  -> ip6_rt_cache_alloc+0xa0/0x1dc

From those alloc paths it seems like the problem might not be coming from the uncached list after all.
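
To make the failure mode above concrete, here is a rough user-space sketch of what I think we are seeing: a dst pins both the device and the inet6_dev, and only the final release unpins them. Every name in it (model_dst, model_dev, model_idev, model_run) is made up for illustration and is not the kernel API; the idea that the extra holder corresponds to xdst->rt is only an assumption based on the traces above.

```c
/*
 * User-space model of the leak pattern. All names here are hypothetical
 * stand-ins; this is not the kernel's dst_entry, inet6_dev or net_device API.
 */
#include <assert.h>
#include <stdlib.h>

struct model_dev  { int refcnt; };   /* stands in for the netdevice */
struct model_idev { int refcnt; };   /* stands in for the inet6_dev */

struct model_dst {
	int refcnt;
	struct model_dev *dev;
	struct model_idev *idev;
};

/* Allocate a dst with one reference; it pins both dev and idev. */
static struct model_dst *model_dst_alloc(struct model_dev *dev,
					 struct model_idev *idev)
{
	struct model_dst *dst = malloc(sizeof(*dst));

	if (!dst)
		return NULL;
	dst->refcnt = 1;
	dst->dev = dev;
	dst->idev = idev;
	dev->refcnt++;
	idev->refcnt++;
	return dst;
}

static void model_dst_hold(struct model_dst *dst)
{
	dst->refcnt++;
}

/* Drop one reference; only the final drop unpins dev and idev. */
static void model_dst_release(struct model_dst *dst)
{
	assert(dst->refcnt > 0);
	if (--dst->refcnt > 0)
		return;
	dst->dev->refcnt--;
	dst->idev->refcnt--;
	free(dst);
}

/*
 * Take a second reference (think: the dst stored as xdst->rt), then drop
 * the allocator's reference. If balanced is zero, the second holder never
 * releases, which models the leak. Returns the device refcount afterwards.
 */
int model_run(struct model_dev *dev, struct model_idev *idev, int balanced)
{
	struct model_dst *dst = model_dst_alloc(dev, idev);

	if (!dst)
		return -1;
	model_dst_hold(dst);
	model_dst_release(dst);
	if (balanced)
		model_dst_release(dst);
	return dev->refcnt;
}
```

With both counts starting at 1, model_run(&dev, &idev, 1) leaves them at 1, while model_run(&dev, &idev, 0) leaves them at 2. That leftover count is the kind of mismatch that would keep the device from ever being released.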


Finally found the reference:

tools/testing/selftests/net/l2tp.sh at one point was triggering a
refcount leak:

https://lore.kernel.org/netdev/[email protected]/

And then Colin found more problems with it:

https://lore.kernel.org/netdev/[email protected]/


Running that on a 5.8 kernel on Ubuntu 20.10 did not trigger the
problem. Neither did Ubuntu 20.04 with 5.4.0-51-generic.

Can you run it on your 5.4 version and see?

We let that run for two days on our setup and didn't see anything, unfortunately.

Thanks,
Sean
