> On Oct 25, 2023, at 11:27 PM, Kristof Provost <k...@freebsd.org> wrote:
> 
> Hi,
> 
> Several pfSense users report IPv6-related panics when an interface is deleted.
> The relevant bug reports are https://redmine.pfsense.org/issues/14164 and
> https://redmine.pfsense.org/issues/14431.
> The latest report is for a build that includes commits up to 
> 1a18383a52bc373e316d224cef1298debf6f7e25 (“libcrypto: link engines and the 
> legacy provider to libcrypto”, September 15th).
> 
> I believe all reports are for users running PPPoE, via netgraph, but that 
> might be coincidental, as that’s the most likely way for interfaces to be 
> destroyed (when PPP disconnects and reconnects).
> 
> There are a few different backtraces, but they appear to have the same root 
> cause, so I’ll focus on one of them:
> 
> db:1:pfs> bt
> Tracing pid 2 tid 100041 td 0xfffffe0085264560
> kdb_enter() at kdb_enter+0x32/frame 0xfffffe00850ad910
> vpanic() at vpanic+0x183/frame 0xfffffe00850ad960
> panic() at panic+0x43/frame 0xfffffe00850ad9c0
> trap_fatal() at trap_fatal+0x409/frame 0xfffffe00850ada20
> trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00850ada80
> calltrap() at calltrap+0x8/frame 0xfffffe00850ada80
> --- trap 0xc, rip = 0xffffffff80f5a036, rsp = 0xfffffe00850adb50, rbp = 
> 0xfffffe00850adb80 ---
> in6_selecthlim() at in6_selecthlim+0x96/frame 0xfffffe00850adb80
> tcp_default_output() at tcp_default_output+0x1ded/frame 0xfffffe00850add70
> tcp_timer_rexmt() at tcp_timer_rexmt+0x514/frame 0xfffffe00850addd0
> tcp_timer_enter() at tcp_timer_enter+0x102/frame 0xfffffe00850ade10
> softclock_call_cc() at softclock_call_cc+0x13c/frame 0xfffffe00850adec0
> softclock_thread() at softclock_thread+0xe9/frame 0xfffffe00850adef0
> fork_exit() at fork_exit+0x7d/frame 0xfffffe00850adf30
> fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00850adf30
> --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
> 
> This happens in the TCP output path, where we look up the hop limit for a 
> specific destination. I’ve obtained a core dump for such a crash, and I 
> believe the panic happens on line 
> https://cgit.freebsd.org/src/tree/sys/netinet6/in6_src.c#n861.
> The call in tcp_default_output() is in6_selecthlim(inp, NULL);, so we don’t 
> get an ifp from the caller, but instead perform a route lookup, and try to 
> obtain the hop limit through ND_IFINFO(nh->nh_ifp). This panics because the 
> afdata[AF_INET6] pointer is NULL. The core dump shows a deleted structure 
> ifnet:
> 
> 

`egrep -r 'if_afdata\[AF_INET6\]\s*[!=]=\s*NULL' sys/netinet6` shows that 
many places already do this NULL check. I think we can do the same in 
in6_selecthlim() as a workaround.
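
To make that suggestion concrete, here is a minimal userland sketch of the 
pattern, not a kernel patch. It only models "check the per-family afdata slot 
for NULL before dereferencing it, otherwise fall back to a default hop limit". 
The struct layouts, the name select_hlim_checked(), the constants 
AF_SLOT_INET6/AF_SLOT_MAX and the fallback value 64 are all assumptions made 
for illustration; in the kernel the fallback would presumably be 
V_ip6_defhlim.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; NOT the real layouts. */
#define AF_SLOT_INET6 28   /* AF_INET6 on FreeBSD */
#define AF_SLOT_MAX   44   /* array size taken from the kgdb dump */
#define DEFAULT_HLIM  64   /* assumed fallback, stands in for V_ip6_defhlim */

struct mock_nd_ifinfo {
	int chlim;             /* current hop limit for the interface */
};

struct mock_ifnet {
	void *if_afdata[AF_SLOT_MAX];
};

/*
 * Return the interface hop limit, or the default when the interface has
 * already lost its IPv6 per-family data (as in the crash dump).
 */
int
select_hlim_checked(const struct mock_ifnet *ifp)
{
	const struct mock_nd_ifinfo *ndi;

	if (ifp == NULL || ifp->if_afdata[AF_SLOT_INET6] == NULL)
		return (DEFAULT_HLIM);

	ndi = ifp->if_afdata[AF_SLOT_INET6];
	return (ndi->chlim);
}
```

Note that a check like this only avoids the NULL dereference; it does not 
explain why the lookup still returns a next hop on a detached interface.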
 
> (kgdb) p *(struct ifnet *)0xfffff80203712800
> $3 = {
>   if_link = {
>     cstqe_next = 0x0
>   },
>   if_clones = {
>     le_next = 0x0,
>     le_prev = 0x0
>   },
>   if_groups = {
>     cstqh_first = 0x0,
>     cstqh_last = 0xfffff80203712818
>   },
>   if_alloctype = 53 '5',
>   if_numa_domain = 255 '\377',
>   if_softc = 0xfffff80103447a00,
>   if_llsoftc = 0x0,
>   if_l2com = 0x0,
>   if_dname = 0xffffffff81492f70 "ng",
>   if_dunit = 0,
>   if_index = 14,
>   if_idxgen = 2,
>   if_xname = "pppoe0\000\000\000\000\000\000\000\000\000",
>   if_description = 0xfffff8003a5f83d0 "WAN",
>   if_flags = 2132112,
>   if_drv_flags = 0,
>   if_capabilities = 0,
>   if_capabilities2 = 0,
> …
>   if_afdata = {0x0 <repeats 44 times>},
> …
>   if_output = 0xffffffff80e29c60 <ifdead_output>,
>   if_input = 0xffffffff80e29c80 <ifdead_input>,
>   if_bridge_input = 0x0,
>   if_bridge_output = 0x0,
>   if_bridge_linkstate = 0x0,
>   if_start = 0xffffffff80e29c90 <ifdead_start>,
>   if_ioctl = 0xffffffff80e29ca0 <ifdead_ioctl>,
> …
> 
> My understanding is that the fib table should get updated whenever we change 
> the routing table (such as during interface cleanup in if_detach_internal()). 
> Some quick experimentation with epair and dtrace also shows:
> 
>  20  20388           sync_algo_end_cb:entry Stage 1
>               kernel`setup_fd_instance+0x41f
>               kernel`rebuild_fd_flm+0x99
>               kernel`rebuild_fd+0x136
>               kernel`rib_notify+0x50
>               kernel`rt_delete_conditional+0xf1
>               kernel`rib_del_route+0x1fc
>               kernel`rib_handle_ifaddr_info+0xd9
>               kernel`nd6_prefix_offlink+0x1ce
>               kernel`nd6_prefix_del+0x94
>               kernel`if_purgeaddrs+0x148
>               kernel`if_detach_internal+0x1e8
>               kernel`if_detach+0x71
>               if_epair.ko`epair_clone_destroy+0x62
>               kernel`if_clone_destroyif_flags+0x6a
>               kernel`if_clone_destroy+0x100
>               kernel`ifioctl+0x8a5
>               kernel`kern_ioctl+0x286
>               kernel`sys_ioctl+0x152
>               kernel`amd64_syscall+0x153
>               kernel`0xffffffff8102315b
> 
> In other words, when we delete the interface if_detach_internal() purges the 
> interface addresses, which ends up rebuilding the fib (rebuild_fd()) via 
> rib_del_route().
> That ought to ensure that we cannot end up finding this struct ifnet through 
> fib6_lookup(), as the purging of the addresses (and thus the rebuilding of 
> the fib) is done before we if_domdetach() at the end of if_detach_internal(), 
> and the NULL afdata[AF_INET6] demonstrates that we’ve gotten there.
> 
> 

Intuitively, fib6_lookup() should not return an **INVALID** next hop (one 
that points at a detaching interface) unless explicitly requested.
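
As a rough illustration of that expectation, the userland sketch below 
filters out a next hop whose interface is marked as dying. Everything in it 
is hypothetical: the sketch_* types, the helper nhop_if_valid(), and the 
assumption that SKETCH_IFF_DYING matches the value of IFF_DYING (0x200000) in 
sys/net/if.h. It does not claim that fib6_lookup() has or should have such a 
filter. For what it is worth, the if_flags value in the dump 
(2132112 = 0x208890) does have that bit set.

```c
#include <stddef.h>

/* Assumed to match IFF_DYING in sys/net/if.h; verify before relying on it. */
#define SKETCH_IFF_DYING 0x200000

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct sketch_ifnet {
	int if_flags;
};

struct sketch_nhop {
	struct sketch_ifnet *nh_ifp;
};

/*
 * Return nh only when it references an interface that is not being
 * detached; return NULL otherwise so the caller can use a default.
 */
struct sketch_nhop *
nhop_if_valid(struct sketch_nhop *nh)
{
	if (nh == NULL || nh->nh_ifp == NULL)
		return (NULL);
	if ((nh->nh_ifp->if_flags & SKETCH_IFF_DYING) != 0)
		return (NULL);
	return (nh);
}
```
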
> We’ve also gone through if_free(), as the ifindex_table no longer contains 
> the struct ifnet pointer for the relevant interface.
> We appear to have not yet called if_free_deferred() (and indeed, 
> ifp->if_refcount is 4, so we wouldn’t have called that yet).
> 
> I’m confused as to how this can happen, and would appreciate hints.
> 
> 

I believe Alexander has insight on this.
> Thanks,
> Kristof
> 


