tl;dr --> https://reviews.freebsd.org/D6845

Navdeep and I have been poking at an LOR that seems to be popping up in
-current that is related to lagg(4) and lagg_get_counter().

root@sysdev07:~ # ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
lagg0: link state changed to DOWN
root@sysdev07:~ # ifconfig ix0 up
lock order reversal:
 1st 0xfffff8002d7c9190 if_addr_lock (if_addr_lock) @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
 2nd 0xfffff800271a5808 if_lagg rmlock (if_lagg rmlock) @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1057
stack backtrace:
#0 0xffffffff80aa5ab0 at witness_debugger+0x70
#1 0xffffffff80aa59a4 at witness_checkorder+0xe54
#2 0xffffffff80a42521 at _rm_rlock_debug+0x111
#3 0xffffffff82222b2c at lagg_get_counter+0x4c
#4 0xffffffff80b2ebd1 at if_data_copy+0xa1
#5 0xffffffff80b533bc at sysctl_rtsock+0x56c
#6 0xffffffff80a53f0a at sysctl_root_handler_locked+0x8a
#7 0xffffffff80a536c8 at sysctl_root+0x188
#8 0xffffffff80a53cbe at userland_sysctl+0x16e
#9 0xffffffff80a53b14 at sys___sysctl+0x74
#10 0xffffffff80eb5b3b at amd64_syscall+0x2db
#11 0xffffffff80e95c4b at Xfast_syscall+0xfb

Running netstat -w 1 in the background while repeatedly creating and
destroying the interface lagg0 will lead to either a panic or a
deadlock, e.g.:

netstat -w 1 > /dev/null &
while [ 1 ];
do
	ifconfig lagg0 destroy
	ifconfig lagg0 create laggport ix0 laggproto lacp 192.168.100.11/24
done

When the system deadlocks, kdb on the console shows the locks held like
this:

KDB: enter: Break to debugger
[ thread pid 11 tid 100007 ]
Stopped at kdb_alt_break_internal+0x18e: movq $0,kdb_why
db> show allocks
No such command
db> show alllocks
Process 2173 (ifconfig) thread 0xfffff8002d125a00 (100186)
exclusive rm if_lagg rmlock (if_lagg rmlock) r = 0 (0xfffff8002717e408) locked @ /usr/home/sbruno/fbsd_head/sys/modules/if_lagg/../../net/if_lagg.c:1530
exclusive sleep mutex in6_multi_mtx (in6_multi_mtx) r = 0 (0xffffffff81d7e288) locked @ /usr/home/sbruno/fbsd_head/sys/netinet6/in6_mcast.c:1142
Process 792 (netstat) thread 0xfffff80027e67a00 (100167)
shared rw if_addr_lock (if_addr_lock) r = 0 (0xfffff80103e95190) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1717
shared rw ifnet_rw (ifnet_rw) r = 0 (0xffffffff81d7b760) locked @ /usr/home/sbruno/fbsd_head/sys/net/rtsock.c:1713
exclusive sleep mutex Giant (Giant) r = 0 (0xffffffff81d55e08) locked @ /usr/home/sbruno/fbsd_head/sys/kern/kern_sysctl.c:164

It looks like netstat is causing a call into the counter function while
the destruction or creation of the interface is ongoing.

Removing the LAGG_RLOCK() calls from lagg_get_counter() makes the
deadlock, LOR and panic go away; however, it can't be that easy. I'm
unsure what the RLOCK is for in lagg_get_counter(). It appears that
there is a higher-level lock in the ifnet access path that already
protects against simultaneous access, but I'm very ignorant of what's
going on here.

I don't see any other driver with locks in its get_counter() function,
so I'm not sure what the best course of action is here.

Sean