Hello Florin,

From the info below, we can see that a real race condition happens.
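(To make the timeline below easier to follow, here is a rough sketch in plain C of the free-index bookkeeping that the lockless pool_put/pool_get expand to. pool_sketch_t, sketch_put and sketch_get are illustrative names only, not the real vppinfra macros.)

#include <vppinfra/vec.h>

typedef struct
{
  u32 *free_indices;		/* vector of recycled pool indices */
} pool_sketch_t;

/* pool_put side (runs on a worker thread): push the freed index. */
static inline void
sketch_put (pool_sketch_t * p, u32 index)
{
  vec_add1 (p->free_indices, index);
}

/* pool_get side (runs on thread 0): pop the most recently freed index.
 * The expand path (empty free list) is not shown. */
static inline u32
sketch_get (pool_sketch_t * p)
{
  u32 l = vec_len (p->free_indices);	/* read current free-list length */
  u32 index;

  ASSERT (l > 0);
  index = p->free_indices[l - 1];	/* reuse the last freed index */
  _vec_len (p->free_indices) = l - 1;	/* shrink the free list */
  return index;
}

/* Neither function is atomic.  If sketch_put () interleaves with
 * sketch_get (), the length update from sketch_get () can be lost: the
 * free-list length ends up back at its old value, the index that was just
 * handed out (149 in the trace below) is still on the free list, and the
 * next get returns it again. */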
thread 2 free: 149
thread 3 free: 155
thread 0: alloc: 149
thread 0: alloc: 149

[Time 0] thread 2 free: 149
Thread 2 frees 149, and 149 now sits at the top of the pool's free list, vec_len (_pool_var (p)->free_indices);

[Time 1 & Time 2] Now thread 0 wants to get, and thread 3 wants to put at the same time. Before thread 3 puts, thread 0 reads the current vec_len and gets 149.
Thread 3 does: vec_add1 (_pool_var (p)->free_indices, _pool_var (l));
Thread 0 does: _vec_len (_pool_var (p)->free_indices) = _pool_var (l) - 1;

[Time 3] Due to the race condition, the current vec_len is the same as at Time 0, and the old element 149 is returned again.

Even though performance is important, we still need to use a lock to resolve this race condition.

From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Yu, Ping
Sent: Tuesday, August 28, 2018 5:29 PM
To: Florin Coras <fcoras.li...@gmail.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io; Yu, Ping <ping...@intel.com>
Subject: Re: [vpp-dev] TLS half open lock

Hi Florin,

Yes, you are right, and all alloc operations are performed by thread 0. An interesting thing is that when running “test echo clients nclients 300 uri tls://10.10.1.1/1111” on the client side with 4 threads, I can easily catch a case where the same index is allocated twice by thread 0.

thread 0: alloc: 145
thread 0: alloc: 69
thread 4 free: 151
thread 0: alloc: 151
thread 2 free: 149
thread 3 free: 155
thread 0: alloc: 149
thread 0: alloc: 149
thread 0: alloc: 58
thread 0: alloc: 9
thread 0: alloc: 29
thread 3 free: 146
thread 0: alloc: 146
thread 2 free: 153
thread 0: alloc: 144
thread 0: alloc: 153
thread 0: alloc: 124
thread 3 free: 25
thread 0: alloc: 25

From: Florin Coras [mailto:fcoras.li...@gmail.com]
Sent: Tuesday, August 28, 2018 10:24 AM
To: Yu, Ping <ping...@intel.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] TLS half open lock

Hi Ping,

The expectation is that all connects/listens come on the main thread (with the worker barrier held). In other words, we only need to support a one-writer, multiple-readers scenario.

Florin

On Aug 27, 2018, at 6:29 PM, Yu, Ping <ping...@intel.com> wrote:

Hi Florin,

Checking whether the pool is about to expand is also lockless. Is there any issue if two threads check the pool simultaneously and just one slot is available? One thread will do a normal get, while the other thread is expanding the pool?

Thanks
Ping

From: vpp-dev@lists.fd.io [mailto:vpp-dev@lists.fd.io] On Behalf Of Florin Coras
Sent: Tuesday, August 28, 2018 12:51 AM
To: Yu, Ping <ping...@intel.com>
Cc: Florin Coras (fcoras) <fco...@cisco.com>; vpp-dev@lists.fd.io
Subject: Re: [vpp-dev] TLS half open lock

Hi Ping,

The current implementation only locks the half-open pool if the pool is about to expand. This is done to increase speed by avoiding unnecessary locking, i.e., if the pool is not about to expand, it should be safe to get a new element from it without affecting readers.

Now the thing to figure out is why this is happening. Does the slowdown due to the “big lock” avoid some race, or is there more to it? First of all, how many workers do you have configured, and how many sessions are you allocating/connecting? Do you see failed connects?

Your tls_ctx_half_open_get line numbers don’t match my code. Did you by chance modify something else?
Thanks,
Florin

On Aug 27, 2018, at 9:22 AM, Yu, Ping <ping...@intel.com> wrote:

Hello, all

Recently I found that the TLS half open lock is not well implemented, and if multiple threads are enabled, there are chances to get the following core dump info in debug mode.

(gdb) where
#0  0x00007f7a0848e428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f7a0849002a in __GI_abort () at abort.c:89
#2  0x0000000000407f0b in os_panic () at /home/pyu4/git-home/vpp_clean/vpp/src/vpp/vnet/main.c:331
#3  0x00007f7a08867bd0 in debugger () at /home/pyu4/git-home/vpp_clean/vpp/src/vppinfra/error.c:84
#4  0x00007f7a08868008 in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7f7a0a0add78 "%s:%d (%s) assertion `%s' fails") at /home/pyu4/git-home/vpp_clean/vpp/src/vppinfra/error.c:143
#5  0x00007f7a09e10be0 in tls_ctx_half_open_get (ctx_index=48) at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tls/tls.c:126
#6  0x00007f7a09e11889 in tls_session_connected_callback (tls_app_index=0, ho_ctx_index=48, tls_session=0x7f79c9b6d1c0, is_fail=0 '\000') at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tls/tls.c:404
#7  0x00007f7a09d5ea6e in session_stream_connect_notify (tc=0x7f79c9b655fc, is_fail=0 '\000') at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/session/session.c:648
#8  0x00007f7a099cb969 in tcp46_syn_sent_inline (vm=0x7f79c8a25100, node=0x7f79c9a60500, from_frame=0x7f79c8b2a9c0, is_ip4=1) at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tcp/tcp_input.c:2306
#9  0x00007f7a099cbe00 in tcp4_syn_sent (vm=0x7f79c8a25100, node=0x7f79c9a60500, from_frame=0x7f79c8b2a9c0) at /home/pyu4/git-home/vpp_clean/vpp/src/vnet/tcp/tcp_input.c:2387
#10 0x00007f7a08fefa35 in dispatch_node (vm=0x7f79c8a25100, node=0x7f79c9a60500, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7f79c8b2a9c0, last_time_stamp=902372436923868) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:988
#11 0x00007f7a08feffee in dispatch_pending_node (vm=0x7f79c8a25100, pending_frame_index=7, last_time_stamp=902372436923868) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1138
#12 0x00007f7a08ff1bed in vlib_main_or_worker_loop (vm=0x7f79c8a25100, is_main=0) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1554
#13 0x00007f7a08ff240c in vlib_worker_loop (vm=0x7f79c8a25100) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/main.c:1634
#14 0x00007f7a09035541 in vlib_worker_thread_fn (arg=0x7f79ca4a41c0) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/threads.c:1760
#15 0x00007f7a0888aa38 in clib_calljmp () from /home/pyu4/git-home/vpp_clean/vpp/build-root/install-vpp_debug-native/vpp/lib/libvppinfra.so
#16 0x00007f7761198d70 in ?? ()
#17 0x00007f7a090300be in vlib_worker_thread_bootstrap_fn (arg=0x7f79ca4a41c0) at /home/pyu4/git-home/vpp_clean/vpp/src/vlib/threads.c:684

Seemingly the current code design may have a race condition that causes the same index to be returned twice; after one thread frees that index, the other thread will cause a core dump. I did a simple fix to add a big lock as follows, and this kind of core dump never happens again. Do you guys have a better solution for the lock?
+  clib_rwlock_writer_lock (&tm->half_open_rwlock);
   pool_get_aligned_will_expand (tm->half_open_ctx_pool, will_expand, 0);
   if (PREDICT_FALSE (will_expand && vlib_num_workers ()))
     {
-      clib_rwlock_writer_lock (&tm->half_open_rwlock);
       pool_get (tm->half_open_ctx_pool, ctx);
       memset (ctx, 0, sizeof (*ctx));
       ctx_index = ctx - tm->half_open_ctx_pool;
-      clib_rwlock_writer_unlock (&tm->half_open_rwlock);
     }
   else
     {
@@ -104,6 +103,8 @@ tls_ctx_half_open_alloc (void)
       memset (ctx, 0, sizeof (*ctx));
       ctx_index = ctx - tm->half_open_ctx_pool;
     }
+  clib_rwlock_writer_unlock (&tm->half_open_rwlock);
   return ctx_index;
 }
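For clarity, with the change above applied the allocator would end up looking roughly like the sketch below, as it would sit inside src/vnet/tls/tls.c. Everything not shown in the diff (the function signature, the local declarations, the tls_main access and the else-branch pool_get) is inferred from the hunk context and is an assumption, not a copy of the tree.

/* Rough reconstruction of tls_ctx_half_open_alloc () with the "big lock"
 * applied; declarations and the else-branch body are inferred, not copied. */
static u32
tls_ctx_half_open_alloc (void)
{
  tls_main_t *tm = &tls_main;
  u8 will_expand = 0;
  tls_ctx_t *ctx;
  u32 ctx_index;

  /* The writer lock now covers the whole allocation, not just the expand
   * path, so a pool_get on the main thread can no longer race with a
   * pool_put (free-index push) from a worker. */
  clib_rwlock_writer_lock (&tm->half_open_rwlock);
  pool_get_aligned_will_expand (tm->half_open_ctx_pool, will_expand, 0);
  if (PREDICT_FALSE (will_expand && vlib_num_workers ()))
    {
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
    }
  else
    {
      pool_get (tm->half_open_ctx_pool, ctx);
      memset (ctx, 0, sizeof (*ctx));
      ctx_index = ctx - tm->half_open_ctx_pool;
    }
  clib_rwlock_writer_unlock (&tm->half_open_rwlock);
  return ctx_index;
}

Note that once the lock is hoisted out, the two branches become identical, so the will_expand check could arguably be dropped; the structure of the diff is kept here only to stay close to the patch.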