Hi Maxime,

On 30 April 2018 17:45:52 GMT+02:00, Maxime Coquelin <maxime.coque...@redhat.com> wrote:
>Hi Olivier,
>
>On 04/24/2018 04:46 PM, Olivier Matz wrote:
>> Some parts of DPDK use their own management threads. Most of the time,
>> the affinity of the thread is not properly set: it should not be
>> scheduled on the dataplane cores, because interrupting them can cause
>> packet losses.
>>
>> This patchset introduces a new wrapper for thread creation that does
>> the job automatically, avoiding code duplication.
>>
>> v3:
>> * new patch: use this API in examples when relevant.
>> * replace pthread_kill by pthread_cancel. Note that pthread_join()
>>   is still needed.
>> * rebase: vfio and pdump do not have control pthreads anymore, and eal
>>   has 2 new pthreads
>> * remove all calls to snprintf/strlcpy that truncate the thread name:
>>   all string lengths are already < 16.
>>
>> v2:
>> * set affinity to master core if no core is off, as suggested by
>>   Anatoly
>>
>> Olivier Matz (5):
>>   eal: use sizeof to avoid a double use of a define
>>   eal: new function to create control threads
>>   eal: set name when creating a control thread
>>   eal: set affinity for control threads
>>   examples: use new API to create control threads
>>
>>  drivers/net/kni/Makefile                     |  1 +
>>  drivers/net/kni/rte_eth_kni.c                |  3 +-
>>  examples/tep_termination/main.c              | 16 +++----
>>  examples/vhost/main.c                        | 19 +++-----
>>  lib/librte_eal/bsdapp/eal/eal.c              |  4 +-
>>  lib/librte_eal/bsdapp/eal/eal_thread.c       |  2 +-
>>  lib/librte_eal/common/eal_common_proc.c      | 15 ++----
>>  lib/librte_eal/common/eal_common_thread.c    | 72 ++++++++++++++++++++++++++++
>>  lib/librte_eal/common/include/rte_lcore.h    | 26 ++++++++++
>>  lib/librte_eal/linuxapp/eal/eal.c            |  4 +-
>>  lib/librte_eal/linuxapp/eal/eal_interrupts.c | 17 ++-----
>>  lib/librte_eal/linuxapp/eal/eal_thread.c     |  2 +-
>>  lib/librte_eal/linuxapp/eal/eal_timer.c      | 12 +----
>>  lib/librte_eal/rte_eal_version.map           |  1 +
>>  lib/librte_vhost/socket.c                    | 25 ++--------
>>  15 files changed, 135 insertions(+), 84 deletions(-)
>>
>
>I face a deadlock issue with your series, which Jianfeng's patch
>("eal: fix threads block on barrier") does not resolve. Reverting the
>series and Jianfeng's patch makes the issue disappear.
>
>I hit the problem in a VM (it is not seen on the host):
># ./install/bin/testpmd -l 0,1,2 --socket-mem 1024 -n 4 --proc-type auto
>--file-prefix pg -- --portmask=3 --forward-mode=macswap
>--port-topology=chained --disable-rss -i --rxq=1 --txq=1 --rxd=256
>--txd=256 --nb-cores=2 --auto-start
>EAL: Detected 3 lcore(s)
>EAL: Detected 1 NUMA nodes
>EAL: Auto-detected process type: PRIMARY
>EAL: Multi-process socket /var/run/.pg_unix
>
>Then it gets stuck. Attaching with GDB, I get the backtrace
>information below:
>
>(gdb) info threads
>  Id   Target Id         Frame
>  3    Thread 0x7f63e1f9f700 (LWP 8808) "rte_mp_handle"
>0x00007f63e2591bfd in recvmsg () at
>../sysdeps/unix/syscall-template.S:81
>  2    Thread 0x7f63e179e700 (LWP 8809) "rte_mp_async"
>pthread_barrier_wait () at
>../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>* 1    Thread 0x7f63e32cec00 (LWP 8807) "testpmd" pthread_barrier_wait
>() at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>(gdb) bt full
>#0  pthread_barrier_wait () at
>../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>No locals.
>#1  0x0000000000520c54 in rte_ctrl_thread_create
>(thread=thread@entry=0x7ffe5c895020, name=name@entry=0x869d86
>"rte_mp_async", attr=attr@entry=0x0,
>start_routine=start_routine@entry=0x521030 <async_reply_handle>,
>arg=arg@entry=0x0)
>    at /root/src/dpdk/lib/librte_eal/common/eal_common_thread.c:207
>        params = 0x17b1e40
>        lcore_id = <optimized out>
>        cpuset = {__bits = {1, 0 <repeats 15 times>}}
>        cpu_found = <optimized out>
>        ret = 0
>#2  0x00000000005220b6 in rte_mp_channel_init () at
>/root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:674
>        path = "/var/run\000.pg_unix_*", '\000' <repeats 1301 times>...
>        dir_fd = 4
>        mp_handle_tid = 140066969745152
>        async_reply_handle_tid = 140066961352448
>#3  0x000000000050c227 in rte_eal_init (argc=argc@entry=23,
>argv=argv@entry=0x7ffe5c896378) at
>/root/src/dpdk/lib/librte_eal/linuxapp/eal/eal.c:775
>        i = <optimized out>
>        fctret = 11
>        ret = <optimized out>
>        thread_id = 140066989861888
>        run_once = {cnt = 1}
>        logid = 0x17b1e00 "testpmd"
>        cpuset = "T}\211\\\376\177", '\000' <repeats 117 times>,
>"\020", '\000' <repeats 116 times>...
>        thread_name = "X}\211\\\376\177\000\000\226\301\036\342c\177\000"
>        __func__ = "rte_eal_init"
>#4  0x0000000000473214 in main (argc=23, argv=0x7ffe5c896378) at
>/root/src/dpdk/app/test-pmd/testpmd.c:2597
>        diag = <optimized out>
>        port_id = <optimized out>
>        ret = <optimized out>
>        __func__ = "main"
>(gdb) thread 2
>[Switching to thread 2 (Thread 0x7f63e179e700 (LWP 8809))]
>#0  pthread_barrier_wait () at
>../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>71      cmpl    %edx, (%rdi)
>(gdb) bt full
>#0  pthread_barrier_wait () at
>../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_barrier_wait.S:71
>No locals.
>#1  0x0000000000520777 in rte_thread_init (arg=<optimized out>) at
>/root/src/dpdk/lib/librte_eal/common/eal_common_thread.c:156
>        params = <optimized out>
>        start_routine = 0x521030 <async_reply_handle>
>        routine_arg = 0x0
>#2  0x00007f63e258add5 in start_thread (arg=0x7f63e179e700) at
>pthread_create.c:308
>        __res = <optimized out>
>        pd = 0x7f63e179e700
>        now = <optimized out>
>        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140066961352448,
>1212869169857371576, 0, 8392704, 0, 140066961352448,
>-1291626103561052744, -1291619793368703560}, mask_was_saved = 0}},
>priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
>            prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>        not_first_call = <optimized out>
>        pagesize_m1 = <optimized out>
>        sp = <optimized out>
>        freesize = <optimized out>
>#3  0x00007f63e22b4b3d in clone () at
>../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>No locals.
>(gdb) thread 3
>[Switching to thread 3 (Thread 0x7f63e1f9f700 (LWP 8808))]
>#0  0x00007f63e2591bfd in recvmsg () at
>../sysdeps/unix/syscall-template.S:81
>81      T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
>(gdb) bt full
>#0  0x00007f63e2591bfd in recvmsg () at
>../sysdeps/unix/syscall-template.S:81
>No locals.
>#1  0x000000000052194e in read_msg (s=0x7f63e1f9d3b0, m=0x7f63e1f9d5a0)
>    at /root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:258
>        msglen = <optimized out>
>        control =
>"\000\000\000\000\000\000\000\000\336~\f\343c\177\000\000\005", '\000'
><repeats 23 times>, "\360\371\033\342c\177\000"
>        cmsg = <optimized out>
>        iov = {iov_base = 0x7f63e1f9d5a0, iov_len = 332}
>        msgh = {msg_name = 0x7f63e1f9d3b0, msg_namelen = 110, msg_iov =
>0x7f63e1f9d370, msg_iovlen = 1, msg_control = 0x7f63e1f9d380,
>msg_controllen = 48, msg_flags = 0}
>#2  mp_handle (arg=<optimized out>) at
>/root/src/dpdk/lib/librte_eal/common/eal_common_proc.c:346
>        msg = {type = 0, msg = {name = '\000' <repeats 63 times>,
>len_param = 0, num_fds = 0, param = '\000' <repeats 20 times>, "\002",
>'\000' <repeats 234 times>, fds = {0, 0, 0, 0, 0, 0, 0, 0}}}
>        sa = {sun_family = 55104,
>          sun_path =
>"\371\341c\177\000\000\352\372\f\343c\177\000\000\000\000\000\000\000\000\000\000\377\377\377\377\377\377\377\377\000\367\371\341c\177\000\000\030\000\000\000\000\000\000\000p\327\371\341c\177\000\000\000\367\371\341c\177\000\000\000\367\371\341c\177",
>'\000' <repeats 34 times>, "\200\037\000\000\377\377"}
>#3  0x00007f63e258add5 in start_thread (arg=0x7f63e1f9f700) at
>pthread_create.c:308
>        __res = <optimized out>
>        pd = 0x7f63e1f9f700
>        now = <optimized out>
>        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140066969745152,
>1212869169857371576, 0, 8392704, 0, 140066969745152,
>-1291625004586295880, -1291619793368703560}, mask_was_saved = 0}},
>priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {
>            prev = 0x0, cleanup = 0x0, canceltype = 0}}}
>        not_first_call = <optimized out>
>        pagesize_m1 = <optimized out>
>        sp = <optimized out>
>        freesize = <optimized out>
>#4  0x00007f63e22b4b3d in clone () at
>../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>No locals.
>
>I don't have more info for now.
Thanks for the feedback on this issue.

I don't see an obvious reason for this deadlock yet. I'll investigate
it as soon as possible (not tomorrow, but Wednesday). In the worst
case, we can revert the series if I cannot find the root cause quickly.

Olivier
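
PS: for anyone digging into this in the meantime, here is a minimal
standalone sketch of the rendezvous pattern visible in threads 1 and 2
of the backtrace: the creator and the new control thread both wait on
a two-party pthread barrier before proceeding. This only illustrates
the shape of rte_ctrl_thread_create(), it is not the DPDK code: all
demo_* names are invented, the affinity below is a hard-coded
placeholder, and error handling is trimmed. It builds with
"cc -pthread demo.c".

/* Sketch (not DPDK code): creator and control thread meeting on a
 * two-party pthread barrier, as in the backtraces above. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_params {
    void *(*start_routine)(void *); /* user entry point */
    void *arg;                      /* user argument */
    pthread_barrier_t launched;     /* creator/child rendezvous */
};

static void *demo_thread_init(void *p)
{
    struct demo_params *params = p;
    void *(*start_routine)(void *) = params->start_routine;
    void *arg = params->arg;

    /* Wait until the creator has finished configuring us. If either
     * side never reaches the barrier, the other blocks here forever,
     * which is the state the backtraces show. */
    pthread_barrier_wait(&params->launched);

    return start_routine(arg);
}

static int demo_ctrl_thread_create(pthread_t *thread,
        void *(*start_routine)(void *), void *arg)
{
    struct demo_params *params;
    cpu_set_t cpuset;
    int ret;

    params = malloc(sizeof(*params));
    if (params == NULL)
        return -1;
    params->start_routine = start_routine;
    params->arg = arg;
    /* Two parties: the creator and the new control thread. */
    pthread_barrier_init(&params->launched, NULL, 2);

    ret = pthread_create(thread, NULL, demo_thread_init, params);
    if (ret != 0) {
        free(params);
        return ret;
    }

    /* Pin the control thread. CPU 0 is a placeholder; the real API
     * derives the cpuset from the cores not used by the dataplane,
     * falling back to the master core. */
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    pthread_setaffinity_np(*thread, sizeof(cpuset), &cpuset);

    /* Release the control thread, then reclaim the shared state.
     * Destroying the barrier is only safe once every waiter has
     * returned from pthread_barrier_wait(); getting this lifetime
     * wrong is one classic way such a rendezvous hangs. */
    pthread_barrier_wait(&params->launched);
    pthread_barrier_destroy(&params->launched);
    free(params);
    return 0;
}

static void *demo_worker(void *arg)
{
    (void)arg;
    puts("control thread running");
    return NULL;
}

int main(void)
{
    pthread_t tid;

    if (demo_ctrl_thread_create(&tid, demo_worker, NULL) != 0)
        return 1;
    pthread_join(tid, NULL);
    return 0;
}

One observation from the trace: if both parties really are blocked
inside pthread_barrier_wait() on the same two-party barrier, the
barrier should have released, so the lifetime of the structure holding
it (allocated by the creator, freed after the rendezvous) seems worth
double-checking.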