> On Oct 25, 2023, at 11:27 PM, Kristof Provost <k...@freebsd.org> wrote:
>
> Hi,
>
> Several pfSense users report IPv6-related panics when an interface is deleted.
> The relevant bug reports are https://redmine.pfsense.org/issues/14164 and
> https://redmine.pfsense.org/issues/14431.
> The latest report is for a build that includes commits up to
> 1a18383a52bc373e316d224cef1298debf6f7e25 (“libcrypto: link engines and the
> legacy provider to libcrypto”, September 15th).
>
> I believe all reports are for users running PPPoE, via netgraph, but that
> might be coincidental, as that’s the most likely way for interfaces to be
> destroyed (when PPP disconnects and reconnects).
>
> There are a few different backtraces, but they appear to have the same root
> cause, so I’ll focus on one of them:
>
> db:1:pfs> bt
> Tracing pid 2 tid 100041 td 0xfffffe0085264560
> kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910
> vpanic() at vpanic+0x183/frame 0xfffffe00850ad960
> panic() at panic+0x43/frame 0xfffffe00850ad9c0
> trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20
> trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80
> calltrap() at calltrap+0x8/frame 0xfffffe00850ada80
> --- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 0xfffffe00850adb80 ---
> in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80
> tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70
> tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0
> tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10
> softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0
> softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0
> fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
>
> This happens in the TCP output path, where we look up the hop limit for a
> specific destination. I’ve obtained a core dump for such a crash, and I
> believe the panic happens on line
> https://cgit.freebsd.org/src/tree/sys/netinet6/in6_src.c#n861
> The call in tcp_default_output() is in6_selecthlim(inp, NULL);, so we don’t
> get an ifp from the caller, but instead perform a route lookup, and try to
> obtain the hop limit through ND_IFINFO(nh->nh_ifp). This panics because the
> afdata[AF_INET6] pointer is NULL. The core dump shows a deleted structure
> ifnet:
>
`egrep -r 'if_afdata\[AF_INET6\]\s*[!=]=\s*NULL' sys/netinet6` shows that many places already do this NULL check. I think we can do the same in in6_selecthlim() as a workaround.

> (kgdb) p *(struct ifnet *)0xfffff80203712800
> $3 = {
>   if_link = {
>     cstqe_next = 0x0
>   },
>   if_clones = {
>     le_next = 0x0,
>     le_prev = 0x0
>   },
>   if_groups = {
>     cstqh_first = 0x0,
>     cstqh_last = 0xfffff80203712818
>   },
>   if_alloctype = 53 '5',
>   if_numa_domain = 255 '\377',
>   if_softc = 0xfffff80103447a00,
>   if_llsoftc = 0x0,
>   if_l2com = 0x0,
>   if_dname = 0xffffffff81492f70 "ng",
>   if_dunit = 0,
>   if_index = 14,
>   if_idxgen = 2,
>   if_xname = "pppoe0\000\000\000\000\000\000\000\000\000",
>   if_description = 0xfffff8003a5f83d0 "WAN",
>   if_flags = 2132112,
>   if_drv_flags = 0,
>   if_capabilities = 0,
>   if_capabilities2 = 0,
>   …
>   if_afdata = {0x0 <repeats 44 times>},
>   …
>   if_output = 0xffffffff80e29c60 <ifdead_output>,
>   if_input = 0xffffffff80e29c80 <ifdead_input>,
>   if_bridge_input = 0x0,
>   if_bridge_output = 0x0,
>   if_bridge_linkstate = 0x0,
>   if_start = 0xffffffff80e29c90 <ifdead_start>,
>   if_ioctl = 0xffffffff80e29ca0 <ifdead_ioctl>,
>   …
>
> My understanding is that the fib table should get updated whenever we change
> the routing table (such as during interface cleanup in if_detach_internal()).
> Some quick experimentation with epair and dtrace also shows:
>
> 20 20388 sync_algo_end_cb:entry Stage 1
>     kernel`setup_fd_instance+0x41f
>     kernel`rebuild_fd_flm+0x99
>     kernel`rebuild_fd+0x136
>     kernel`rib_notify+0x50
>     kernel`rt_delete_conditional+0xf1
>     kernel`rib_del_route+0x1fc
>     kernel`rib_handle_ifaddr_info+0xd9
>     kernel`nd6_prefix_offlink+0x1ce
>     kernel`nd6_prefix_del+0x94
>     kernel`if_purgeaddrs+0x148
>     kernel`if_detach_internal+0x1e8
>     kernel`if_detach+0x71
>     if_epair.ko`epair_clone_destroy+0x62
>     kernel`if_clone_destroyif_flags+0x6a
>     kernel`if_clone_destroy+0x100
>     kernel`ifioctl+0x8a5
>     kernel`kern_ioctl+0x286
>     kernel`sys_ioctl+0x152
>     kernel`amd64_syscall+0x153
>     kernel`0xffffffff8102315b
>
> In other words, when we delete the interface if_detach_internal() purges the
> interface addresses, which ends up rebuilding the fib (rebuild_fd()) via
> rib_del_route().
> That ought to ensure that we cannot end up finding this struct ifnet through
> fib6_lookup(), as the purging of the addresses (and thus the rebuilding of
> the fib) is done before we if_domdetach() at the end of if_detach_internal(),
> and the NULL afdata[AF_INET6] demonstrates that we’ve gotten there.
>

Intuitively, fib6_lookup() should not return an **INVALID** next hop (one whose interface is being detached) unless explicitly requested.

> We’ve also gone through if_free(), as the ifindex_table no longer contains
> the struct ifnet pointer for the relevant interface.
> We appear to have not yet called if_free_deferred() (and indeed,
> ifp->if_refcount is 4, so we wouldn’t have called that yet).
>
> I’m confused as to how this can happen, and would appreciate hints.
>

I believe Alexander has insight on this.

> Thanks,
> Kristof