Hi Benoit, I've tried that, but got this crash. Maybe some fixes were merged that handle this issue? (I'm using 21.01 plus additional patches from master.)
Program received signal SIGSEGV, Segmentation fault.
0x000000000042c037 in __asan::FakeStack::AddrIsInFakeStack(unsigned long, unsigned long*, unsigned long*) ()
(gdb) bt
#0  0x000000000042c037 in __asan::FakeStack::AddrIsInFakeStack(unsigned long, unsigned long*, unsigned long*) ()
#1  0x00000000004a31b4 in __asan::ThreadStackContainsAddress(__sanitizer::ThreadContextBase*, void*) ()
#2  0x00000000004b2f4a in __sanitizer::ThreadRegistry::FindThreadContextLocked(bool (*)(__sanitizer::ThreadContextBase*, void*), void*) ()
#3  0x00000000004a30eb in __asan::FindThreadByStackAddress(unsigned long) ()
#4  0x0000000000429265 in __asan::AddressDescription::AddressDescription(unsigned long, unsigned long, bool) ()
#5  0x000000000042adb3 in __asan::ErrorGeneric::ErrorGeneric(unsigned int, unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long) ()
#6  0x000000000049cd81 in __asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) ()
#7  0x000000000049d5e8 in __asan_report_load4 ()
#8  0x00007ffff4fc18e8 in clib_march_select_fn_ptr_by_name (r=0x7ffff7858740 <vnet_interface_output_node_march_fn_registration>, name=0x7ffff6fd5980 <.str> "hsw") at /home/vpp/vpp/src/vppinfra/cpu.h:94
#9  0x00007ffff4fc17a1 in vnet_interface_output_node_get (vm=0x7ffff3a2d340 <vlib_global_main>) at /home/vpp/vpp/src/vnet/interface_output.c:538
#10 0x00007ffff4f152cd in vnet_register_interface (vnm=0x7ffff7a6b320 <vnet_main>, dev_class_index=34, dev_instance=0, hw_class_index=31, hw_instance=0) at /home/vpp/vpp/src/vnet/interface.c:810
#11 0x00007ffff4ff34ca in vnet_main_init (vm=0x7ffff3a2d340 <vlib_global_main>) at /home/vpp/vpp/src/vnet/misc.c:81
#12 0x00007ffff353d476 in call_init_exit_functions_internal (vm=0x7ffff3a2d340 <vlib_global_main>, headp=0x7ffff3a2d998 <vlib_global_main+1624>, call_once=1, do_sort=1) at /home/vpp/vpp/src/vlib/init.c:350
#13 0x00007ffff353d258 in vlib_call_init_exit_functions (vm=0x7ffff3a2d340 <vlib_global_main>, headp=0x7ffff3a2d998 <vlib_global_main+1624>, call_once=1) at /home/vpp/vpp/src/vlib/init.c:364
#14 0x00007ffff353d561 in vlib_call_all_init_functions (vm=0x7ffff3a2d340 <vlib_global_main>) at /home/vpp/vpp/src/vlib/init.c:386
#15 0x00007ffff35c8711 in vlib_main (vm=0x7ffff3a2d340 <vlib_global_main>, input=0x7fff9cb4bec0) at /home/vpp/vpp/src/vlib/main.c:2213
#16 0x00007ffff376b808 in thread0 (arg=140737280922432) at /home/vpp/vpp/src/vlib/unix/main.c:670
#17 0x00007ffff269204c in clib_calljmp () at /home/vpp/vpp/src/vppinfra/longjmp.S:123
#18 0x00007fffffffc980 in ?? ()
#19 0x00007ffff376ad55 in vlib_unix_main (argc=2, argv=0x7fffffffe498) at /home/vpp/vpp/src/vlib/unix/main.c:747
#20 0x00000000004c8fa8 in main (argc=2, argv=0x7fffffffe498) at /home/vpp/vpp/src/vpp/vnet/main.c:338
(gdb)

On Mon, 18 Oct 2021 at 14:29, Benoit Ganne (bganne) <bga...@cisco.com> wrote:
> You can try running with AddressSanitizer:
> https://fd.io/docs/vpp/master/troubleshooting/sanitizer.html#id2
> That could catch the corruption earlier with more clues.
>
> Best
> ben
>
> > -----Original Message-----
> > From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Stanislav Zaikin
> > Sent: Friday, 15 October 2021 14:54
> > To: vpp-dev <vpp-dev@lists.fd.io>
> > Subject: Re: [vpp-dev] assert in pool_elt_at_index
> >
> > Any thoughts will be much appreciated. This bug is quite reproducible in
> > my environment, but I don't know what else to check.
> > On Thu, 14 Oct 2021 at 08:51, Stanislav Zaikin via lists.fd.io <zstaseg=gmail....@lists.fd.io> wrote:
> >
> > Hi Florin,
> > Hi Rajith,
> >
> > It shouldn't be the pool expansion case, I have
> > 8341f76fd1cd4351961cd8161cfed2814fc55103. Moreover, in this case _e would
> > be different from &load_balance_pool[3604]. I've found some of those
> > expansions (in other places); in those cases the pointer to the element
> > has a different address.
> >
> > On Thu, 14 Oct 2021 at 06:45, Rajith PR <raj...@rtbrick.com> wrote:
> >
> > Hi Stanislav,
> >
> > My guess is you don't have the commit below.
> >
> > commit 8341f76fd1cd4351961cd8161cfed2814fc55103
> > Author: Dave Barach <d...@barachs.net>
> > Date:   Wed Jun 3 08:05:15 2020 -0400
> >
> >     fib: add barrier sync, pool/vector expand cases
> >
> >     load_balance_alloc_i(...) is not thread safe when the
> >     load_balance_pool or combined counter vectors expand.
> >
> >     Type: fix
> >
> >     Signed-off-by: Dave Barach <d...@barachs.net>
> >     Change-Id: I7f295ed77350d1df0434d5ff461eedafe79131de
> >
> > Thanks,
> > Rajith
> >
> > On Thu, Oct 14, 2021 at 3:57 AM Florin Coras <fcoras.li...@gmail.com> wrote:
> >
> > Hi Stanislav,
> >
> > The only thing I can think of is that the main thread grows the pool, or
> > the pool's bitmap, without a worker barrier while the worker that asserts
> > is trying to access it. Is the main thread busy doing something (e.g.,
> > adding routes/interfaces) when the assert happens?
> >
> > Regards,
> > Florin
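For reference, the pattern that commit introduces is roughly the one sketched below: take the worker barrier only when the allocation is about to expand the pool, so that a worker cannot walk load_balance_pool while it is being reallocated. This is an illustrative sketch rather than the verbatim change; the function name load_balance_alloc_sketch is made up, and pool_get_aligned_will_expand() plus the barrier helpers are assumed to behave as they do on recent master.

#include <vlib/vlib.h>
#include <vnet/dpo/load_balance.h>

/* Sketch of the "barrier sync on pool expansion" pattern from the commit
 * quoted above (not the verbatim code). */
static load_balance_t *
load_balance_alloc_sketch (void)
{
  load_balance_t *lb;
  u8 need_barrier_sync = 0;
  vlib_main_t *vm = vlib_get_main ();

  /* Allocations are expected to happen on the main thread only. */
  ASSERT (vm->thread_index == 0);

  /* Would pool_get () have to reallocate the pool vector or its free bitmap? */
  pool_get_aligned_will_expand (load_balance_pool, need_barrier_sync,
                                CLIB_CACHE_LINE_BYTES);

  /* If so, stop the workers first; otherwise a worker in ip6-lookup or
   * mpls-lookup may dereference a stale pool pointer and trip the
   * pool_elt_at_index () assert. */
  if (need_barrier_sync)
    vlib_worker_thread_barrier_sync (vm);

  pool_get_aligned (load_balance_pool, lb, CLIB_CACHE_LINE_BYTES);
  clib_memset (lb, 0, sizeof (*lb));

  if (need_barrier_sync)
    vlib_worker_thread_barrier_release (vm);

  return lb;
}

Per the commit message, the same treatment is needed for the combined counter vectors; the general point is that nothing workers dereference in the forwarding path may be reallocated outside the barrier.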
> > On Oct 13, 2021, at 2:52 PM, Stanislav Zaikin <zsta...@gmail.com> wrote:
> >
> > Hi Florin,
> >
> > I wasn't aware of those helper functions, thanks! But yeah, it also
> > returns 0 (sorry, but here is the trace of another crash):
> >
> > Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
> > [Switching to Thread 0x7f9cc0f6a700 (LWP 3546)]
> > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > 51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > #1  0x00007f9d61542921 in __GI_abort () at abort.c:79
> > #2  0x00007f9d624da799 in os_panic () at /home/vpp/vpp/src/vppinfra/unix-misc.c:177
> > #3  0x00007f9d62420f49 in debugger () at /home/vpp/vpp/src/vppinfra/error.c:84
> > #4  0x00007f9d62420cc7 in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7f9d644348d0 "%s:%d (%s) assertion `%s' fails") at /home/vpp/vpp/src/vppinfra/error.c:143
> > #5  0x00007f9d636695b4 in load_balance_get (lbi=4569) at /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
> > #6  0x00007f9d63668247 in mpls_lookup_node_fn_hsw (vm=0x7f9ceb0138c0, node=0x7f9ceee6f700, from_frame=0x7f9cef9c9240) at /home/vpp/vpp/src/vnet/mpls/mpls_lookup.c:229
> > #7  0x00007f9d63008076 in dispatch_node (vm=0x7f9ceb0138c0, node=0x7f9ceee6f700, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7f9cef9c9240, last_time_stamp=1837178878370487) at /home/vpp/vpp/src/vlib/main.c:1217
> > #8  0x00007f9d630089e7 in dispatch_pending_node (vm=0x7f9ceb0138c0, pending_frame_index=2, last_time_stamp=1837178878370487) at /home/vpp/vpp/src/vlib/main.c:1376
> > #9  0x00007f9d63002441 in vlib_main_or_worker_loop (vm=0x7f9ceb0138c0, is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
> > #10 0x00007f9d630012e7 in vlib_worker_loop (vm=0x7f9ceb0138c0) at /home/vpp/vpp/src/vlib/main.c:2038
> > #11 0x00007f9d6305995d in vlib_worker_thread_fn (arg=0x7f9ce1b88540) at /home/vpp/vpp/src/vlib/threads.c:1868
> > #12 0x00007f9d62445214 in clib_calljmp () at /home/vpp/vpp/src/vppinfra/longjmp.S:123
> > #13 0x00007f9cc0f69c90 in ?? ()
> > #14 0x00007f9d63051b83 in vlib_worker_thread_bootstrap_fn (arg=0x7f9ce1b88540) at /home/vpp/vpp/src/vlib/threads.c:585
> > #15 0x00007f9cda360355 in eal_thread_loop (arg=0x0) at ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
> > #16 0x00007f9d629246db in start_thread (arg=0x7f9cc0f6a700) at pthread_create.c:463
> > #17 0x00007f9d6162371f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> > (gdb) select 5
> > (gdb) print pifi( load_balance_pool, 4569 )
> > $1 = 0
> > (gdb) source ~/vpp/extras/gdb/gdbinit
> > Loading vpp functions...
> > Load vl
> > Load pe
> > Load pifi
> > Load node_name_from_index
> > Load vnet_buffer_opaque
> > Load vnet_buffer_opaque2
> > Load bitmap_get
> > Done loading vpp functions...
> > (gdb) pifi load_balance_pool 4569
> > pool_is_free_index (load_balance_pool, 4569)
> > $2 = 0
> >
> > On Wed, 13 Oct 2021 at 21:55, Florin Coras <fcoras.li...@gmail.com> wrote:
> >
> > Hi Stanislav,
> >
> > Just to make sure the gdb macro is okay, could you run from gdb:
> > pifi(pool, index)? The function is defined in gdb_funcs.c.
> >
> > Regards,
> > Florin
> >
> > On Oct 13, 2021, at 11:30 AM, Stanislav Zaikin <zsta...@gmail.com> wrote:
> >
> > Hello folks,
> >
> > I'm facing a strange issue with 2 worker threads. Sometimes I get a crash
> > either in the "ip6-lookup" or the "mpls-lookup" node. They happen with an
> > assert in the pool_elt_at_index macro, always inside the "load_balance_get"
> > function. But the load_balance dpo looks perfectly good: it still holds a
> > lock, and on regular deletion (when the load_balance dpo is deleted) it
> > should be erased properly (with dpo_reset).
> > It usually happens when the main core is executing
> > vlib_worker_thread_barrier_sync_int() and the other worker is executing
> > vlib_worker_thread_barrier_check().
> > And the strangest thing is that when I run VPP's gdb helper that checks
> > pool_is_free_index (pifi), it shows me that the index isn't free (so the
> > macro shouldn't have fired).
> >
> > Any thoughts and inputs are appreciated.
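For reference, the assert in load_balance_get() comes from pool_elt_at_index(), and the pifi gdb helper evaluates pool_is_free_index(); both end up consulting the same free-element bitmap kept in the pool header. Below is a rough, hypothetical rendering of that check, paraphrased from vppinfra/pool.h (the helper name is made up and the macro internals are not copied verbatim):

#include <vppinfra/pool.h>
#include <vnet/dpo/load_balance.h>

/* Roughly what pool_is_free_index (load_balance_pool, index), and hence the
 * pifi gdb helper, evaluates. */
static int
lb_index_is_free_sketch (u32 index)
{
  pool_header_t *h = pool_header (load_balance_pool);

  /* An out-of-range index counts as free. */
  if (index >= vec_len (load_balance_pool))
    return 1;

  /* Otherwise test the free-element bitmap stored in the pool header. */
  return clib_bitmap_get (h->free_bitmap, index);
}

pool_elt_at_index (pool, index) is essentially ASSERT (!pool_is_free_index (pool, index)) followed by returning pool + index. So if the main thread grows the pool vector or that bitmap while a worker evaluates the ASSERT without the barrier held, the worker can abort on a momentarily inconsistent view even though the same index reads as allocated (pifi == 0) once everything is stopped in gdb, which would be consistent with the theory above.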
> > Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
> > [Switching to Thread 0x7fb4f2e22700 (LWP 3244)]
> > __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > 51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> > (gdb) bt
> > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > #1  0x00007fb5933fa921 in __GI_abort () at abort.c:79
> > #2  0x00007fb594392799 in os_panic () at /home/vpp/vpp/src/vppinfra/unix-misc.c:177
> > #3  0x00007fb5942d8f49 in debugger () at /home/vpp/vpp/src/vppinfra/error.c:84
> > #4  0x00007fb5942d8cc7 in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7fb5962ec8d0 "%s:%d (%s) assertion `%s' fails") at /home/vpp/vpp/src/vppinfra/error.c:143
> > #5  0x00007fb5954bd694 in load_balance_get (lbi=3604) at /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
> > #6  0x00007fb5954bc070 in ip6_lookup_inline (vm=0x7fb51ceccd00, node=0x7fb520f6b700, frame=0x7fb52128e4c0) at /home/vpp/vpp/src/vnet/ip/ip6_forward.h:117
> > #7  0x00007fb5954bbdd5 in ip6_lookup_node_fn_hsw (vm=0x7fb51ceccd00, node=0x7fb520f6b700, frame=0x7fb52128e4c0) at /home/vpp/vpp/src/vnet/ip/ip6_forward.c:736
> > #8  0x00007fb594ec0076 in dispatch_node (vm=0x7fb51ceccd00, node=0x7fb520f6b700, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7fb52128e4c0, last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1217
> > #9  0x00007fb594ec09e7 in dispatch_pending_node (vm=0x7fb51ceccd00, pending_frame_index=5, last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1376
> > #10 0x00007fb594eba441 in vlib_main_or_worker_loop (vm=0x7fb51ceccd00, is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
> > #11 0x00007fb594eb92e7 in vlib_worker_loop (vm=0x7fb51ceccd00) at /home/vpp/vpp/src/vlib/main.c:2038
> > #12 0x00007fb594f1195d in vlib_worker_thread_fn (arg=0x7fb513a48100) at /home/vpp/vpp/src/vlib/threads.c:1868
> > #13 0x00007fb5942fd214 in clib_calljmp () at /home/vpp/vpp/src/vppinfra/longjmp.S:123
> > #14 0x00007fb4f2e21c90 in ?? ()
> > #15 0x00007fb594f09b83 in vlib_worker_thread_bootstrap_fn (arg=0x7fb513a48100) at /home/vpp/vpp/src/vlib/threads.c:585
> > #16 0x00007fb50c218355 in eal_thread_loop (arg=0x0) at ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
> > #17 0x00007fb5947dc6db in start_thread (arg=0x7fb4f2e22700) at pthread_create.c:463
> > #18 0x00007fb5934db71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> > (gdb) select 5
> > (gdb) print _e
> > $1 = (load_balance_t *) 0x7fb52651e580
> > (gdb) print load_balance_pool[3604]
> > $2 = {cacheline0 = 0x7fb52651e580 "\001", lb_n_buckets = 1, lb_n_buckets_minus_1 = 0, lb_proto = DPO_PROTO_IP6, lb_flags = LOAD_BALANCE_FLAG_NONE, lb_fib_entry_flags = (FIB_ENTRY_FLAG_CONNECTED | FIB_ENTRY_FLAG_LOCAL), lb_locks = 1, lb_map = 4294967295, lb_urpf = 4094, lb_hash_config = 31, lb_buckets = 0x0,
> >   lb_buckets_inline = {{{{dpoi_type = DPO_RECEIVE, dpoi_proto = DPO_PROTO_IP6, dpoi_next_node = 2, dpoi_index = 2094}, as_u64 = 8993661649164}}, {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}, {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}, {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}}}
> > (gdb) print &load_balance_pool[3604]
> > $3 = (load_balance_t *) 0x7fb52651e580
> > (gdb) source ~/vpp/extras/gdb/gdbinit
> > Loading vpp functions...
> > Load vl
> > Load pe
> > Load pifi
> > Load node_name_from_index
> > Load vnet_buffer_opaque
> > Load vnet_buffer_opaque2
> > Load bitmap_get
> > Done loading vpp functions...
> > (gdb) pifi load_balance_pool 3604
> > pool_is_free_index (load_balance_pool, 3604)
> > $4 = 0
> > (gdb) info threads
> >   Id   Target Id                                            Frame
> >   1    Thread 0x7fb596bd2c40 (LWP 727) "vpp_main"           0x00007fb594f1439b in clib_time_now_internal (c=0x7fb59517ccc0 <vlib_global_main>, n=1808528155236639) at /home/vpp/vpp/src/vppinfra/time.h:215
> >   2    Thread 0x7fb4f3623700 (LWP 2976) "eal-intr-thread"   0x00007fb5934dba47 in epoll_wait (epfd=17, events=0x7fb4f3622d80, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> > * 3    Thread 0x7fb4f2e22700 (LWP 3244) "vpp_wk_0"          __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> >   4    Thread 0x7fb4f2621700 (LWP 3246) "vpp_wk_1"          0x00007fb594ebf897 in vlib_worker_thread_barrier_check () at /home/vpp/vpp/src/vlib/threads.h:439
> >
> > --
> > Best regards
> > Stanislav Zaikin
> >
> > NOTICE TO RECIPIENT This e-mail message and any attachments are
> > confidential and may be privileged. If you received this e-mail in error,
> > any review, use, dissemination, distribution, or copying of this e-mail
> > is strictly prohibited. Please notify us immediately of the error by
> > return e-mail and please delete this message from your system. For more
> > information about Rtbrick, please visit us at www.rtbrick.com

--
Best regards
Stanislav Zaikin