I haven't had a chance to investigate this fully, but there seems to be
a possible deadlock in the dpdk netdev, specifically in the vhostuser
class.  I've seen it intermittently with 2.4, 2.5, and now master.  It
is quite hard to reproduce.

Below is a sample stack trace.  In it, notice that thread 12
(dpdk_watchdog) is blocked in pthread_mutex_unlock(), even though the
mutex object itself reports an unlocked state (see the "p mutex" output
at the end).  I don't know exactly how this could happen.  My guess is
that the device was removed while dpdk_watchdog was running, but that's
only a guess.  The thread has been stuck on this unlock since last
night, on the same device (which no longer has an entry in the
database).  I need more time to investigate, but I'm already working on
two patch sets and can't take on another yet.
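
To make the guess above concrete, here is a minimal sketch of the
lifetime rule that a watchdog of this general shape depends on.  It is
not OVS code: struct dev, dev_add(), dev_remove(), watchdog_pass() and
list_mutex are hypothetical names, and whether netdev-dpdk follows or
breaks this rule is exactly what remains to be investigated.  The
assumption illustrated is that a device must be unlinked from the list,
under the list lock, before its own mutex is destroyed and its memory
freed.  Otherwise a watchdog pass could end up locking or unlocking a
mutex that lives in freed memory.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-port device; not the OVS structure. */
struct dev {
    pthread_mutex_t mutex;      /* Protects 'checks'. */
    unsigned int checks;        /* Number of watchdog visits. */
    struct dev *next;           /* Protected by 'list_mutex'. */
};

static pthread_mutex_t list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct dev *dev_list;    /* Protected by 'list_mutex'. */

/* Allocates a device and links it into the list. */
static struct dev *dev_add(void)
{
    struct dev *d = calloc(1, sizeof *d);
    if (!d) {
        return NULL;
    }
    pthread_mutex_init(&d->mutex, NULL);
    pthread_mutex_lock(&list_mutex);
    d->next = dev_list;
    dev_list = d;
    pthread_mutex_unlock(&list_mutex);
    return d;
}

/* One watchdog pass: visits every listed device under the list lock.
 * Returns the number of devices visited. */
static int watchdog_pass(void)
{
    int n = 0;

    pthread_mutex_lock(&list_mutex);
    for (struct dev *d = dev_list; d; d = d->next) {
        pthread_mutex_lock(&d->mutex);
        d->checks++;
        pthread_mutex_unlock(&d->mutex);
        n++;
    }
    pthread_mutex_unlock(&list_mutex);
    return n;
}

/* Unlinks 'victim' under the list lock, then destroys it.  In this sketch
 * only watchdog_pass() takes a device mutex, and only while holding
 * 'list_mutex', so once the unlink is done no pass can still hold or
 * later reach 'victim'. */
static void dev_remove(struct dev *victim)
{
    assert(victim != NULL);

    pthread_mutex_lock(&list_mutex);
    for (struct dev **p = &dev_list; *p; p = &(*p)->next) {
        if (*p == victim) {
            *p = victim->next;
            break;
        }
    }
    pthread_mutex_unlock(&list_mutex);

    pthread_mutex_destroy(&victim->mutex);
    free(victim);
}
```

If dev_remove() destroyed and freed the device without taking
list_mutex first, a concurrent watchdog_pass() could still be between
its lock and unlock of that device's mutex, which is the kind of window
the guess above describes.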

I probably won't get to this in the next month or two for other reasons,
so I'm posting it here so that it doesn't get lost (and also because
someone may already have a fix, or see one immediately).

[root@wsfd-netdev9 ovs]# gdb ./vswitchd/ovs-vswitchd
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /root/git/ovs/vswitchd/ovs-vswitchd...done.
(gdb) attach 6041
Attaching to program: /root/git/ovs/./vswitchd/ovs-vswitchd, process 6041
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 6065]
[New LWP 6064]
[New LWP 6063]
[New LWP 6062]
[New LWP 6061]
[New LWP 6060]
[New LWP 6059]
[New LWP 6058]
[New LWP 6045]
[New LWP 6044]
[New LWP 6043]
[New LWP 6042]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libnuma.so.1...Reading symbols from /lib64/libnuma.so.1...(no debugging symbols found)...done.
(no debugging symbols found)...done.
Loaded symbols for /lib64/libnuma.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64
(gdb) thread apply all bt

Thread 13 (Thread 0x7f0e055f9700 (LWP 6042)):
#0  0x00007f0e471c17a3 in epoll_wait () from /lib64/libc.so.6
#1  0x000000000052b7b4 in eal_intr_thread_main ()
#2  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 12 (Thread 0x7f0e04df8700 (LWP 6043)):
#0  0x00007f0e47bacbdc in pthread_mutex_unlock () from /lib64/libpthread.so.0
#1  0x000000000062d038 in ovs_mutex_unlock (l_=l_@entry=0x7f0e07574ca8)
    at lib/ovs-thread.c:125
#2  0x0000000000671b28 in dpdk_watchdog (dummy=<optimized out>)
    at lib/netdev-dpdk.c:570
#3  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#4  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#5  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 11 (Thread 0x7f0e045f7700 (LWP 6044)):
#0  0x00007f0e471b88f3 in select () from /lib64/libc.so.6
#1  0x0000000000543b49 in fdset_event_dispatch ()
#2  0x0000000000542fde in rte_vhost_driver_session_start ()
#3  0x000000000066f01b in start_vhost_loop (dummy=<optimized out>)
    at lib/netdev-dpdk.c:2451
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 10 (Thread 0x7f0e03df6700 (LWP 6045)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0df80008c0, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0e03df5aac) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x000000000062bef1 in ovsrcu_postpone_thread (arg=<optimized out>)
    at lib/ovs-rcu.c:310
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 9 (Thread 0x7f0e01df2700 (LWP 6058)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0df00049a0, 
    n_pollfds=3, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0e01df1a8c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a0c8f in udpif_upcall_handler (arg=0x10afbe0)
    at ofproto/ofproto-dpif-upcall.c:725
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 8 (Thread 0x7f0e025f3700 (LWP 6059)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd80049a0, 
    n_pollfds=3, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0e025f2a8c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a0c8f in udpif_upcall_handler (arg=0x10afbf8)
    at ofproto/ofproto-dpif-upcall.c:725
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 7 (Thread 0x7f0e02df4700 (LWP 6060)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dec000c50, 
    n_pollfds=4, handles=handles@entry=0x0, timeout_when=65562769, 
    elapsed=elapsed@entry=0x7f0e02df3a4c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a041b in udpif_revalidator (arg=0x10afb20)
    at ofproto/ofproto-dpif-upcall.c:921
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7f0e035f5700 (LWP 6061)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0ddc0008e0, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0e035f4a2c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x000000000062d7d6 in ovs_barrier_block (barrier=barrier@entry=0x107fcf8)
    at lib/ovs-thread.c:307
#4  0x00000000005a02e0 in udpif_revalidator (arg=0x10afb38)
    at ofproto/ofproto-dpif-upcall.c:873
#5  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#6  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 5 (Thread 0x7f0deb7fe700 (LWP 6062)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd00008c0, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0deb7fda8c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a0c8f in udpif_upcall_handler (arg=0x10635c0)
    at ofproto/ofproto-dpif-upcall.c:725
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7f0debfff700 (LWP 6063)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd40008c0, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0debffea8c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a0c8f in udpif_upcall_handler (arg=0x10635d8)
    at ofproto/ofproto-dpif-upcall.c:725
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7f0e009ef700 (LWP 6064)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0de4000c50, 
    n_pollfds=4, handles=handles@entry=0x0, timeout_when=65562771, 
    elapsed=elapsed@entry=0x7f0e009eea4c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x00000000005a041b in udpif_revalidator (arg=0x1063490)
    at ofproto/ofproto-dpif-upcall.c:921
#4  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#5  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f0e011f0700 (LWP 6065)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dcc0008e0, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807, 
    elapsed=elapsed@entry=0x7f0e011efa2c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x000000000062d7d6 in ovs_barrier_block (barrier=barrier@entry=0x10a39c8)
    at lib/ovs-thread.c:307
#4  0x00000000005a02e0 in udpif_revalidator (arg=0x10634a8)
    at ofproto/ofproto-dpif-upcall.c:873
#5  0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
    at lib/ovs-thread.c:342
#6  0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
#7  0x00007f0e471c11cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f0e481d7b40 (LWP 6041)):
#0  0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
#1  0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x10ab540, 
    n_pollfds=2, handles=handles@entry=0x0, timeout_when=66006888, 
    elapsed=elapsed@entry=0x7ffcbb24145c) at lib/timeval.c:305
#2  0x000000000063cecc in poll_block () at lib/poll-loop.c:364
#3  0x000000000062bc63 in ovsrcu_synchronize () at lib/ovs-rcu.c:230
#4  0x00000000005a9240 in xlate_txn_commit ()
    at ofproto/ofproto-dpif-xlate.c:909
#5  0x00000000005955fb in type_run (type=<optimized out>)
    at ofproto/ofproto-dpif.c:646
#6  0x000000000058353f in ofproto_type_run (datapath_type=<optimized out>, 
    datapath_type@entry=0x10773f0 "netdev") at ofproto/ofproto.c:1706
#7  0x0000000000573a25 in bridge_run__ () at vswitchd/bridge.c:2891
#8  0x000000000057865e in bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x10a8c30)
    at vswitchd/bridge.c:680
#9  0x0000000000579b14 in bridge_run () at vswitchd/bridge.c:2976
#10 0x000000000040fed5 in main (argc=11, argv=0x7ffcbb241aa8)
    at vswitchd/ovs-vswitchd.c:112
(gdb) p mutex
$1 = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
      __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
    __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>, 
    __align = 0}, where = 0x6d8117 "<unlocked>"}
_______________________________________________
dev mailing list
dev@openvswitch.org
http://openvswitch.org/mailman/listinfo/dev
