On Sun, 13 Feb 2022 12:39:59 +0100 Thomas Monjalon <tho...@monjalon.net> wrote:
> 17/12/2021 19:29, Stephen Hemminger:
> > If DPDK is built with thread sanitizer it reports a race
> > in setting of multiprocess file descriptor. The fix is to
> > use atomic operations when updating mp_fd.
>
> Please could explain more the condition of the race?
> Is it between init and cleanup of the same file descriptor?
> How atomic is helping here?
>
> >
> > Simple example:
> > $ dpdk-testpmd -l 1-3 --no-huge
> > ...
> > EAL: Error - exiting with code: 1
> >   Cause: Creation of mbuf pool for socket 0 failed: Cannot allocate memory
> > ==================
> > WARNING: ThreadSanitizer: data race (pid=83054)
> >   Write of size 4 at 0x55e3b7fce450 by main thread:
> >     #0 rte_mp_channel_cleanup <null> (dpdk-testpmd+0x160d79c)
> >     #1 rte_eal_cleanup <null> (dpdk-testpmd+0x1614fb5)
> >     #2 rte_exit <null> (dpdk-testpmd+0x15ec97a)
> >     #3 mbuf_pool_create.cold <null> (dpdk-testpmd+0x242e1a)
> >     #4 main <null> (dpdk-testpmd+0x5ab05d)
> >
> >   Previous read of size 4 at 0x55e3b7fce450 by thread T2:
> >     #0 mp_handle <null> (dpdk-testpmd+0x160c979)
> >     #1 ctrl_thread_init <null> (dpdk-testpmd+0x15ff76e)
> >
> >   As if synchronized via sleep:
> >     #0 nanosleep
> >       ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:362
> >       (libtsan.so.0+0x5cd8e)
> >     #1 get_tsc_freq <null> (dpdk-testpmd+0x1622889)
> >     #2 set_tsc_freq <null> (dpdk-testpmd+0x15ffb9c)
> >     #3 rte_eal_timer_init <null> (dpdk-testpmd+0x1622a34)
> >     #4 rte_eal_init.cold <null> (dpdk-testpmd+0x26b314)
> >     #5 main <null> (dpdk-testpmd+0x5aab45)
> >
> >   Location is global 'mp_fd' of size 4 at 0x55e3b7fce450
> >     (dpdk-testpmd+0x0000027c7450)
> >
> >   Thread T2 'rte_mp_handle' (tid=83057, running) created by main thread at:
> >     #0 pthread_create
> >       ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:962
> >       (libtsan.so.0+0x58ba2)
> >     #1 rte_ctrl_thread_create <null> (dpdk-testpmd+0x15ff870)
> >     #2 rte_mp_channel_init.cold <null> (dpdk-testpmd+0x269986)
> >     #3 rte_eal_init <null> (dpdk-testpmd+0x1615b28)
> >     #4 main <null> (dpdk-testpmd+0x5aab45)
> >
>

The issue is that two threads share a global variable without barriers
or atomics. The variable mp_fd is set by the main thread in
rte_mp_channel_init()/rte_mp_channel_cleanup(), but it is read by the
control thread that handles multiprocess messages (mp_handle).

Sharing global data like this without a barrier or lock is
unsafe/undefined, and can break on weakly ordered CPUs like ARM.

I am somewhat surprised that we have not seen this bug already, since the
compiler could decide that mp_fd is invariant inside mp_handle(), stop
re-testing it, and let the thread run forever.

This bug has been present since the beginning of MP support in DPDK.
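To make the pattern concrete, here is a minimal standalone sketch of the
idea, not the actual patch. It uses C11 <stdatomic.h> and pthreads rather
than any DPDK API, and demo_fd, demo_polls, demo_handler() and demo_run()
are hypothetical names standing in for mp_fd, mp_handle() and the
init/cleanup path. Relaxed ordering is an assumption that holds here only
because no other data is published through the flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-in for mp_fd; -1 means "channel closed". */
static atomic_int demo_fd = -1;
/* Counts loop iterations so the writer knows the reader is running. */
static atomic_ulong demo_polls;

/* Reader side, modelled loosely on mp_handle(). */
static void *demo_handler(void *arg)
{
    (void)arg;
    /* The atomic load forces a fresh read on every iteration, so the
     * compiler cannot hoist the test out of the loop. */
    while (atomic_load_explicit(&demo_fd, memory_order_relaxed) >= 0)
        atomic_fetch_add_explicit(&demo_polls, 1, memory_order_relaxed);
    return NULL;
}

/* Writer side, modelled loosely on channel init followed by cleanup.
 * Returns the fd value that was retired, or -1 on thread creation error. */
int demo_run(void)
{
    pthread_t tid;

    /* 3 is a placeholder value; it is never used for I/O. */
    atomic_store_explicit(&demo_fd, 3, memory_order_relaxed);
    if (pthread_create(&tid, NULL, demo_handler, NULL) != 0)
        return -1;

    /* Wait until the handler has seen the open fd at least once. */
    while (atomic_load_explicit(&demo_polls, memory_order_relaxed) == 0)
        ;

    /* Cleanup: take the old value (a real caller would close() it) and
     * publish -1 in one step so the handler's next load ends its loop. */
    int old = atomic_exchange_explicit(&demo_fd, -1, memory_order_relaxed);

    pthread_join(tid, NULL); /* returns only once the handler saw -1 */
    assert(atomic_load(&demo_fd) == -1);
    return old;
}
```

Built with -pthread and called from main(), demo_run() returns 3 once the
handler thread has exited. With a plain int instead of atomic_int, the same
loop is a data race, and the compiler would be free to read the variable
once and never again.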