Hi Aaron,

Thanks for the report. I haven't been able to reproduce the problem; could you post ovs-vswitchd.log, please?

Daniele

2016-07-19 9:38 GMT-07:00 Aaron Conole <acon...@redhat.com>:
> I haven't fully had a chance to investigate this, but it seems there's a
> possible deadlock with the dpdk netdev - specifically vhostuser class.
> I've seen it intermittently with 2.4, 2.5, and now master. It's quite
> rare to reproduce.
>
> Below is a sample stack trace. In it, notice that thread 12
> (dpdk_watchdog) is blocked in a pthread_mutex_unlock; the mutex object
> itself is indicating unlocked status. I don't know how exactly this
> would happen: guess is a possible removal of the device while the
> dpdk_watchdog is running, but that's just a guess. It has been stuck on
> this unlock since last night, on the same device (which no longer has an
> entry in the database). I need more time to investigate, but I have two
> patch sets I'm working on already, and can't take on another, yet.
>
> I probably won't get to this in the next month or two for other reasons,
> so I'm posting it here so it doesn't get lost (and also because someone
> may have or see a fix immediately).
>
> [root@wsfd-netdev9 ovs]# gdb ./vswitchd/ovs-vswitchd
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /root/git/ovs/vswitchd/ovs-vswitchd...done.
> (gdb) attach 6041
> Attaching to program: /root/git/ovs/./vswitchd/ovs-vswitchd, process 6041
> Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libdl.so.2
> Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
> [New LWP 6065]
> [New LWP 6064]
> [New LWP 6063]
> [New LWP 6062]
> [New LWP 6061]
> [New LWP 6060]
> [New LWP 6059]
> [New LWP 6058]
> [New LWP 6045]
> [New LWP 6044]
> [New LWP 6043]
> [New LWP 6042]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Loaded symbols for /lib64/libpthread.so.0
> Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib64/librt.so.1
> Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libm.so.6
> Reading symbols from /lib64/libnuma.so.1...Reading symbols from /lib64/libnuma.so.1...(no debugging symbols found)...done.
> (no debugging symbols found)...done.
> Loaded symbols for /lib64/libnuma.so.1
> Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libc.so.6
> Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
> Loaded symbols for /lib64/ld-linux-x86-64.so.2
> Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
> Loaded symbols for /lib64/libgcc_s.so.1
> 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64
> (gdb) thread apply all bt
>
> Thread 13 (Thread 0x7f0e055f9700 (LWP 6042)):
> #0 0x00007f0e471c17a3 in epoll_wait () from /lib64/libc.so.6
> #1 0x000000000052b7b4 in eal_intr_thread_main ()
> #2 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #3 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 12 (Thread 0x7f0e04df8700 (LWP 6043)):
> #0 0x00007f0e47bacbdc in pthread_mutex_unlock () from /lib64/libpthread.so.0
> #1 0x000000000062d038 in ovs_mutex_unlock (l_=l_@entry=0x7f0e07574ca8)
>     at lib/ovs-thread.c:125
> #2 0x0000000000671b28 in dpdk_watchdog (dummy=<optimized out>)
>     at lib/netdev-dpdk.c:570
> #3 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #4 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #5 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 11 (Thread 0x7f0e045f7700 (LWP 6044)):
> #0 0x00007f0e471b88f3 in select () from /lib64/libc.so.6
> #1 0x0000000000543b49 in fdset_event_dispatch ()
> #2 0x0000000000542fde in rte_vhost_driver_session_start ()
> #3 0x000000000066f01b in start_vhost_loop (dummy=<optimized out>)
>     at lib/netdev-dpdk.c:2451
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 10 (Thread 0x7f0e03df6700 (LWP 6045)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0df80008c0,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0e03df5aac) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x000000000062bef1 in ovsrcu_postpone_thread (arg=<optimized out>)
>     at lib/ovs-rcu.c:310
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 9 (Thread 0x7f0e01df2700 (LWP 6058)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0df00049a0,
>     n_pollfds=3, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0e01df1a8c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a0c8f in udpif_upcall_handler (arg=0x10afbe0)
>     at ofproto/ofproto-dpif-upcall.c:725
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 8 (Thread 0x7f0e025f3700 (LWP 6059)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd80049a0,
>     n_pollfds=3, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0e025f2a8c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a0c8f in udpif_upcall_handler (arg=0x10afbf8)
>     at ofproto/ofproto-dpif-upcall.c:725
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 7 (Thread 0x7f0e02df4700 (LWP 6060)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dec000c50,
>     n_pollfds=4, handles=handles@entry=0x0, timeout_when=65562769,
>     elapsed=elapsed@entry=0x7f0e02df3a4c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a041b in udpif_revalidator (arg=0x10afb20)
>     at ofproto/ofproto-dpif-upcall.c:921
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 6 (Thread 0x7f0e035f5700 (LWP 6061)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0ddc0008e0,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0e035f4a2c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x000000000062d7d6 in ovs_barrier_block (barrier=barrier@entry=0x107fcf8)
>     at lib/ovs-thread.c:307
> #4 0x00000000005a02e0 in udpif_revalidator (arg=0x10afb38)
>     at ofproto/ofproto-dpif-upcall.c:873
> #5 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #6 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #7 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 5 (Thread 0x7f0deb7fe700 (LWP 6062)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd00008c0,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0deb7fda8c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a0c8f in udpif_upcall_handler (arg=0x10635c0)
>     at ofproto/ofproto-dpif-upcall.c:725
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 4 (Thread 0x7f0debfff700 (LWP 6063)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dd40008c0,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0debffea8c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a0c8f in udpif_upcall_handler (arg=0x10635d8)
>     at ofproto/ofproto-dpif-upcall.c:725
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 3 (Thread 0x7f0e009ef700 (LWP 6064)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0de4000c50,
>     n_pollfds=4, handles=handles@entry=0x0, timeout_when=65562771,
>     elapsed=elapsed@entry=0x7f0e009eea4c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x00000000005a041b in udpif_revalidator (arg=0x1063490)
>     at ofproto/ofproto-dpif-upcall.c:921
> #4 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #5 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #6 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 2 (Thread 0x7f0e011f0700 (LWP 6065)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x7f0dcc0008e0,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=9223372036854775807,
>     elapsed=elapsed@entry=0x7f0e011efa2c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x000000000062d7d6 in ovs_barrier_block (barrier=barrier@entry=0x10a39c8)
>     at lib/ovs-thread.c:307
> #4 0x00000000005a02e0 in udpif_revalidator (arg=0x10634a8)
>     at ofproto/ofproto-dpif-upcall.c:873
> #5 0x000000000062cdd4 in ovsthread_wrapper (aux_=<optimized out>)
>     at lib/ovs-thread.c:342
> #6 0x00007f0e47ba9dc5 in start_thread () from /lib64/libpthread.so.0
> #7 0x00007f0e471c11cd in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7f0e481d7b40 (LWP 6041)):
> #0 0x00007f0e471b6b7d in poll () from /lib64/libc.so.6
> #1 0x000000000064f9c4 in time_poll (pollfds=pollfds@entry=0x10ab540,
>     n_pollfds=2, handles=handles@entry=0x0, timeout_when=66006888,
>     elapsed=elapsed@entry=0x7ffcbb24145c) at lib/timeval.c:305
> #2 0x000000000063cecc in poll_block () at lib/poll-loop.c:364
> #3 0x000000000062bc63 in ovsrcu_synchronize () at lib/ovs-rcu.c:230
> #4 0x00000000005a9240 in xlate_txn_commit ()
>     at ofproto/ofproto-dpif-xlate.c:909
> #5 0x00000000005955fb in type_run (type=<optimized out>)
>     at ofproto/ofproto-dpif.c:646
> #6 0x000000000058353f in ofproto_type_run (datapath_type=<optimized out>,
>     datapath_type@entry=0x10773f0 "netdev") at ofproto/ofproto.c:1706
> #7 0x0000000000573a25 in bridge_run__ () at vswitchd/bridge.c:2891
> #8 0x000000000057865e in bridge_reconfigure (ovs_cfg=ovs_cfg@entry=0x10a8c30)
>     at vswitchd/bridge.c:680
> #9 0x0000000000579b14 in bridge_run () at vswitchd/bridge.c:2976
> #10 0x000000000040fed5 in main (argc=11, argv=0x7ffcbb241aa8)
>     at vswitchd/ovs-vswitchd.c:112
> (gdb) p mutex
> $1 = {lock = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
>       __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}},
>     __size = '\000' <repeats 16 times>, "\002", '\000' <repeats 22 times>,
>     __align = 0}, where = 0x6d8117 "<unlocked>"}
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> http://openvswitch.org/mailman/listinfo/dev