I believe I've found a way to reproduce this: simply run rtsol against two interfaces in isolated loops, which elicits a broadcast RA. I am able to deadlock the system fairly quickly, which eventually results in a core once __rw_wunlock_hard steps in. We have seen this deadlock in one other case, but the preceding code path is the same as in most of the others, so hopefully I'm still on the right track. I tested backing out r243148, without success.

load: 51.27 cmd: rtsol 7678 [*Giant] 832.44r 0.00u 0.00s 0% 1868k (prior to core)

https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L654

Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 02
fault virtual address   = 0x30
fault code              = supervisor read data, page not present
instruction pointer     = 0x20:0xffffffff807ada7c
stack pointer           = 0x28:0xfffffe201edcff60
frame pointer           = 0x28:0xfffffe201edcff90
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = resume, IOPL = 0
current process         = 12 (irq279: ix0:que 1)
trap number             = 12
panic: page fault
cpuid = 1

(kgdb) bt
#0  doadump (textdump=1) at pcpu.h:219
#1  0xffffffff8075fa07 in kern_reboot (howto=16644) at /usr/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff8075fe05 in vpanic (fmt=<value optimized out>, ap=<value optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:758
#3  0xffffffff8075fc93 in panic (fmt=0x0) at /usr/src/sys/kern/kern_shutdown.c:687
#4  0xffffffff80ace3cb in trap_fatal (frame=<value optimized out>, eva=<value optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:851
#5  0xffffffff80ace6cd in trap_pfault (frame=0xfffffe201edcfeb0, usermode=<value optimized out>)
    at /usr/src/sys/amd64/amd64/trap.c:674
#6  0xffffffff80acdd6a in trap (frame=0xfffffe201edcfeb0) at /usr/src/sys/amd64/amd64/trap.c:440
#7  0xffffffff80ab3d62 in calltrap () at /usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff807ada7c in turnstile_broadcast (ts=0x0, queue=1) at /usr/src/sys/kern/subr_turnstile.c:838
#9  0xffffffff8075e080 in __rw_wunlock_hard (c=0xfffffe1f29f2f3f8, tid=1, file=0x1 <Address 0x1 out of bounds>, line=1)
    at /usr/src/sys/kern/kern_rwlock.c:988
#10 0xffffffff808a7136 in defrouter_select () at /usr/src/sys/netinet6/nd6_rtr.c:654
#11 0xffffffff808a5a58 in nd6_ra_input (m=<value optimized out>, off=<value optimized out>, icmp6len=<value optimized out>)
    at /usr/src/sys/netinet6/nd6_rtr.c:804
#12 0xffffffff8087f31f in icmp6_input (mp=<value optimized out>, offp=0xfffffe201edd064c, proto=<value optimized out>)
    at /usr/src/sys/netinet6/icmp6.c:808
#13 0xffffffff80894adc in ip6_input (m=0xfffff8038cab7000) at /usr/src/sys/netinet6/ip6_input.c:1019
#14 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized out>, source=<value optimized out>, m=0x10)
    at /usr/src/sys/net/netisr.c:976
#15 0xffffffff8082a226 in ether_demux (ifp=<value optimized out>, m=0xfffff8038cab7000)
    at /usr/src/sys/net/if_ethersubr.c:851
#16 0xffffffff8082aece in ether_nh_input (m=<value optimized out>) at /usr/src/sys/net/if_ethersubr.c:646
#17 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized out>, source=<value optimized out>, m=0x10)
    at /usr/src/sys/net/netisr.c:976
#18 0xffffffff804c7e89 in ixgbe_rxeof (que=0xfffff8011680d470) at /usr/src/sys/dev/ixgbe/ix_txrx.c:1681
#19 0xffffffff804c1e0a in ixgbe_msix_que (arg=0xfffff8011680d470) at /usr/src/sys/dev/ixgbe/if_ix.c:1391
#20 0xffffffff8072ff9b in intr_event_execute_handlers (p=<value optimized out>, ie=0xfffff80116806600)
    at /usr/src/sys/kern/kern_intr.c:1264
#21 0xffffffff80730936 in ithread_loop (arg=0xfffff8011681e480) at /usr/src/sys/kern/kern_intr.c:1277
#22 0xffffffff8072dbba in fork_exit (callout=0xffffffff807308a0 <ithread_loop>, arg=0xfffff8011681e480,
    frame=0xfffffe201edd0ac0) at /usr/src/sys/kern/kern_fork.c:1018
#23 0xffffffff80ab429e in fork_trampoline () at /usr/src/sys/amd64/amd64/exception.S:611
#24 0x0000000000000000 in ?? ()
(kgdb) frame 10
#10 0xffffffff808a7136 in defrouter_select () at /usr/src/sys/netinet6/nd6_rtr.c:654
654                     IF_AFDATA_UNLOCK(dr->ifp);
(kgdb) list
649                     if (selected_dr == NULL &&
650                         (ln = nd6_lookup(&dr->rtaddr, 0, dr->ifp)) &&
651                         ND6_IS_LLINFO_PROBREACH(ln)) {
652                             selected_dr = dr;
653                     }
654                     IF_AFDATA_UNLOCK(dr->ifp);
655                     if (ln != NULL) {
656                             LLE_RUNLOCK(ln);
657                             ln = NULL;
658                     }
(kgdb) print *dr
Cannot access memory at address 0xa00000001
(kgdb) print dr->ifp
Cannot access memory at address 0xa00000031

Thanks again,
Jason

On 2015-12-07 22:32, Jason wrote:
Hi,

It appears the IPv6 router advertisement code paths were written
fairly lockless, on the assumption that multiple advertisements would
never be processed concurrently.  We are seeing page faults in
various places while processing these messages and modifying the
routing table.  We have multiple L3 devices and multiple v6 blocks
broadcasting these messages to hardware with dual uplinks in the same
VLAN, which I believe is what makes us susceptible.  I believe the
dual uplink alone is sufficient, though, as the problem can also be
seen in configurations with a single v6 block.

We are running stable/10 @ r285800, and it doesn't appear anything
relevant has changed since then.  Our other widely deployed version is
8.3-RELEASE, which does not see this issue.  Upon bumping a machine
from 8.3 -> 10 we can see it start to exhibit this behavior.  The only
change I see that might be relevant is r243148, but these cores are
relatively rare, so testing is tough without a considerable
deployment.  So basically I'm hoping someone with a trained eye can
send us in the right direction before we go down that road.

Every backtrace looks pretty much like this, with the location in
nd6_rtr differing:

panic: page fault
#0  doadump (textdump=1) at pcpu.h:219
#1  0xffffffff8075fa07 in kern_reboot (howto=260) at
/usr/src/sys/kern/kern_shutdown.c:451
#2  0xffffffff8075fe05 in vpanic (fmt=<value optimized out>, ap=<value
optimized out>) at /usr/src/sys/kern/kern_shutdown.c:758
#3  0xffffffff8075fc93 in panic (fmt=0x0) at
/usr/src/sys/kern/kern_shutdown.c:687
#4  0xffffffff80acdf9b in trap_fatal (frame=<value optimized out>,
eva=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:851
#5  0xffffffff80ace29d in trap_pfault (frame=0xfffffe0f959b0ff0,
usermode=<value optimized out>) at /usr/src/sys/amd64/amd64/trap.c:674
#6  0xffffffff80acd93a in trap (frame=0xfffffe0f959b0ff0) at
/usr/src/sys/amd64/amd64/trap.c:440
#7  0xffffffff80ab3932 in calltrap () at
/usr/src/sys/amd64/amd64/exception.S:236
#8  0xffffffff808a5550 in nd6_ra_input (m=<value optimized out>,
off=<value optimized out>, icmp6len=<value optimized out>)
    at /usr/src/sys/netinet6/nd6_rtr.c:739
#9  0xffffffff8087f31f in icmp6_input (mp=<value optimized out>,
offp=0xfffffe0f959b167c, proto=<value optimized out>)
    at /usr/src/sys/netinet6/icmp6.c:808
#10 0xffffffff808949fc in ip6_input (m=0xfffff8002e743200) at
/usr/src/sys/netinet6/ip6_input.c:1019
#11 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized
out>, source=<value optimized out>, m=0x1)
    at /usr/src/sys/net/netisr.c:976
#12 0xffffffff8082a226 in ether_demux (ifp=<value optimized out>,
m=0xfffff8002e743200) at /usr/src/sys/net/if_ethersubr.c:851
#13 0xffffffff8082aece in ether_nh_input (m=<value optimized out>) at
/usr/src/sys/net/if_ethersubr.c:646
#14 0xffffffff80832f02 in netisr_dispatch_src (proto=<value optimized
out>, source=<value optimized out>, m=0x1)
    at /usr/src/sys/net/netisr.c:976

I'll link to GH for the various relevant bits, because I know everyone
can agree it's the superior RCS.  It appears that most of these are
caused by the dr struct being freed by concurrent processing:

https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L578
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L654
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L728
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L739
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L800
https://github.com/freebsd/freebsd/blob/e5ee1c2b414851b17663cb491e2f2317a0af9bda/sys/netinet6/nd6_rtr.c#L1312
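
To illustrate the kind of protection that seems to be missing, here is a
minimal userland sketch, assuming a single lock around the router list
plus a per-entry reference count.  It is not the nd6_rtr.c code and not
a proposed patch; struct drouter, dr_lock, dr_add, dr_lookup_ref,
dr_release and dr_remove are all hypothetical names, and a pthread
mutex stands in for whatever kernel lock would actually be appropriate.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userland stand-in for a default-router entry. */
struct drouter {
	struct drouter *next;
	int id;      /* stands in for the router address */
	int refcnt;  /* protected by dr_lock */
};

static pthread_mutex_t dr_lock = PTHREAD_MUTEX_INITIALIZER;
static struct drouter *dr_head;

/* Add an entry; the list itself holds one reference. */
static struct drouter *
dr_add(int id)
{
	struct drouter *dr = calloc(1, sizeof(*dr));

	if (dr == NULL)
		return (NULL);
	dr->id = id;
	dr->refcnt = 1;
	pthread_mutex_lock(&dr_lock);
	dr->next = dr_head;
	dr_head = dr;
	pthread_mutex_unlock(&dr_lock);
	return (dr);
}

/* Find an entry and take a reference before dropping the lock. */
static struct drouter *
dr_lookup_ref(int id)
{
	struct drouter *dr;

	pthread_mutex_lock(&dr_lock);
	for (dr = dr_head; dr != NULL; dr = dr->next) {
		if (dr->id == id) {
			dr->refcnt++;
			break;
		}
	}
	pthread_mutex_unlock(&dr_lock);
	return (dr);
}

/* Drop a reference; free on the last one.  Returns 1 if freed. */
static int
dr_release(struct drouter *dr)
{
	int last;

	pthread_mutex_lock(&dr_lock);
	assert(dr->refcnt > 0);
	last = (--dr->refcnt == 0);
	pthread_mutex_unlock(&dr_lock);
	if (last)
		free(dr);
	return (last);
}

/* Unlink an entry and drop the list's reference.  Returns 1 if found. */
static int
dr_remove(int id)
{
	struct drouter **pp, *dr = NULL;

	pthread_mutex_lock(&dr_lock);
	for (pp = &dr_head; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->id == id) {
			dr = *pp;
			*pp = dr->next;
			break;
		}
	}
	pthread_mutex_unlock(&dr_lock);
	if (dr == NULL)
		return (0);
	dr_release(dr);
	return (1);
}
```

With this shape, a thread that obtained an entry through dr_lookup_ref()
can keep dereferencing it after another thread calls dr_remove(); the
memory is only freed when the last holder calls dr_release().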

Thanks for any assistance,
Jason
_______________________________________________
freebsd-net@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"