You can try running with AddressSanitizer: 
https://fd.io/docs/vpp/master/troubleshooting/sanitizer.html#id2
That should catch the corruption closer to where it happens and give more clues.
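
If memory serves, enabling it is just a cmake option at build time, something
like:

    make rebuild VPP_EXTRA_CMAKE_ARGS=-DVPP_ENABLE_SANITIZE_ADDR=ON

but please double-check the exact invocation on the page above.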

Best
ben

> -----Original Message-----
> From: vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> On Behalf Of Stanislav
> Zaikin
> Sent: Friday, 15 October 2021 14:54
> To: vpp-dev <vpp-dev@lists.fd.io>
> Subject: Re: [vpp-dev] assert in pool_elt_at_index
> 
> Any thoughts will be much appreciated. This bug is quite reproducible in
> my environment, but I don't know what else to check.
> 
> 
> On Thu, 14 Oct 2021 at 08:51, Stanislav Zaikin via lists.fd.io
> <zstaseg=gmail....@lists.fd.io> wrote:
> 
> 
>       Hi Florin,
>       Hi Rajith,
> 
>       It shouldn't be the pool expansion case; I already have
> 8341f76fd1cd4351961cd8161cfed2814fc55103.
>       Moreover, in that case _e would differ from &load_balance_pool[3604].
> I have caught some of those expansions (in other places), and in those cases
> the pointer to the element does have a different address.
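> 
>       By the way, just to illustrate what I mean by "a different address":
> a toy example (plain libc, nothing to do with vppinfra) of why a pointer
> taken before a vector expansion would not match &pool[i] taken after it:
> 
>           #include <stdint.h>
>           #include <stdio.h>
>           #include <stdlib.h>
> 
>           int
>           main (void)
>           {
>             size_t n = 4096;
>             int *pool = calloc (n, sizeof (int));
> 
>             /* like _e in pool_elt_at_index */
>             uintptr_t e_before = (uintptr_t) &pool[3604];
> 
>             /* "expansion": realloc may move the whole vector */
>             pool = realloc (pool, 2 * n * sizeof (int));
> 
>             printf ("moved: %s\n",
>                     e_before == (uintptr_t) &pool[3604] ? "no" : "yes");
>             return 0;
>           }
> 
>       In this crash _e and &load_balance_pool[3604] are identical (see the
> gdb output further down), so the element I'm looking at was not moved under
> me.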
> 
> 
> 
> 
>       On Thu, 14 Oct 2021 at 06:45, Rajith PR <raj...@rtbrick.com> wrote:
> 
> 
>               Hi Stanislav,
> 
>               My guess is you don't have the commit below.
> 
>               commit 8341f76fd1cd4351961cd8161cfed2814fc55103
>               Author: Dave Barach <d...@barachs.net>
>               Date:   Wed Jun 3 08:05:15 2020 -0400
> 
>                   fib: add barrier sync, pool/vector expand cases
> 
>                   load_balance_alloc_i(...) is not thread safe when the
>                   load_balance_pool or combined counter vectors expand.
> 
>                   Type: fix
> 
>                   Signed-off-by: Dave Barach <d...@barachs.net>
>                   Change-Id: I7f295ed77350d1df0434d5ff461eedafe79131de
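> 
>               Roughly, the fix makes load_balance_alloc_i() (main thread
> only) take the worker barrier whenever the pool or the counter vectors are
> about to expand, along these lines (paraphrased from memory -- the helper
> names may differ slightly, please check the commit itself):
> 
>                   load_balance_t *lb;
>                   u8 need_barrier_sync = 0;
>                   vlib_main_t *vm = vlib_get_main ();
> 
>                   /* will pool_get() have to reallocate the vector/bitmap? */
>                   pool_get_aligned_will_expand (load_balance_pool,
>                                                 need_barrier_sync,
>                                                 CLIB_CACHE_LINE_BYTES);
>                   if (need_barrier_sync)
>                     vlib_worker_thread_barrier_sync (vm); /* park workers */
> 
>                   pool_get_aligned (load_balance_pool, lb,
>                                     CLIB_CACHE_LINE_BYTES);
> 
>                   if (need_barrier_sync)
>                     vlib_worker_thread_barrier_release (vm);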
> 
> 
>               Thanks,
>               Rajith
> 
>               On Thu, Oct 14, 2021 at 3:57 AM Florin Coras
> <fcoras.li...@gmail.com> wrote:
> 
> 
>                       Hi Stanislav,
> 
>                       The only thing I can think of is that the main thread
> grows the pool, or the pool's bitmap, without a worker barrier while the
> worker that asserts is trying to access it. Is the main thread busy doing
> something (e.g., adding routes/interfaces) when the assert happens?
> 
>                       Regards,
>                       Florin
> 
> 
> 
>                               On Oct 13, 2021, at 2:52 PM, Stanislav Zaikin
> <zsta...@gmail.com> wrote:
> 
>                               Hi Florin,
> 
>                               I wasn't aware of those helper functions,
> thanks! But yes, it also returns 0 (sorry, the trace below is from a
> different crash).
> 
>                               Thread 3 "vpp_wk_0" received signal SIGABRT,
> Aborted.
>                               [Switching to Thread 0x7f9cc0f6a700 (LWP 3546)]
>                               __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:51
>                               51 ../sysdeps/unix/sysv/linux/raise.c: No such
> file or directory.
>                               (gdb) bt
>                               #0  __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:51
>                               #1  0x00007f9d61542921 in __GI_abort () at
> abort.c:79
>                               #2  0x00007f9d624da799 in os_panic () at
> /home/vpp/vpp/src/vppinfra/unix-misc.c:177
>                               #3  0x00007f9d62420f49 in debugger () at
> /home/vpp/vpp/src/vppinfra/error.c:84
>                               #4  0x00007f9d62420cc7 in _clib_error
> (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7f9d644348d0 "%s:%d
> (%s) assertion `%s' fails") at /home/vpp/vpp/src/vppinfra/error.c:143
>                               #5  0x00007f9d636695b4 in load_balance_get
> (lbi=4569) at /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
>                               #6  0x00007f9d63668247 in 
> mpls_lookup_node_fn_hsw
> (vm=0x7f9ceb0138c0, node=0x7f9ceee6f700, from_frame=0x7f9cef9c9240) at
> /home/vpp/vpp/src/vnet/mpls/mpls_lookup.c:229
>                               #7  0x00007f9d63008076 in dispatch_node
> (vm=0x7f9ceb0138c0, node=0x7f9ceee6f700, type=VLIB_NODE_TYPE_INTERNAL,
> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7f9cef9c9240,
> last_time_stamp=1837178878370487) at /home/vpp/vpp/src/vlib/main.c:1217
>                               #8  0x00007f9d630089e7 in dispatch_pending_node
> (vm=0x7f9ceb0138c0, pending_frame_index=2,
> last_time_stamp=1837178878370487) at /home/vpp/vpp/src/vlib/main.c:1376
>                               #9  0x00007f9d63002441 in 
> vlib_main_or_worker_loop
> (vm=0x7f9ceb0138c0, is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
>                               #10 0x00007f9d630012e7 in vlib_worker_loop
> (vm=0x7f9ceb0138c0) at /home/vpp/vpp/src/vlib/main.c:2038
>                               #11 0x00007f9d6305995d in vlib_worker_thread_fn
> (arg=0x7f9ce1b88540) at /home/vpp/vpp/src/vlib/threads.c:1868
>                               #12 0x00007f9d62445214 in clib_calljmp () at
> /home/vpp/vpp/src/vppinfra/longjmp.S:123
>                               #13 0x00007f9cc0f69c90 in ?? ()
>                               #14 0x00007f9d63051b83 in
> vlib_worker_thread_bootstrap_fn (arg=0x7f9ce1b88540) at
> /home/vpp/vpp/src/vlib/threads.c:585
>                               #15 0x00007f9cda360355 in eal_thread_loop
> (arg=0x0) at ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
>                               #16 0x00007f9d629246db in start_thread
> (arg=0x7f9cc0f6a700) at pthread_create.c:463
>                               #17 0x00007f9d6162371f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>                               (gdb) select 5
>                               (gdb) print pifi( load_balance_pool, 4569 )
>                               $1 = 0
>                               (gdb) source ~/vpp/extras/gdb/gdbinit
>                               Loading vpp functions...
>                               Load vl
>                               Load pe
>                               Load pifi
>                               Load node_name_from_index
>                               Load vnet_buffer_opaque
>                               Load vnet_buffer_opaque2
>                               Load bitmap_get
>                               Done loading vpp functions...
>                               (gdb) pifi load_balance_pool 4569
>                               pool_is_free_index (load_balance_pool, 4569)$2 = 0
> 
> 
>                               On Wed, 13 Oct 2021 at 21:55, Florin Coras
> <fcoras.li...@gmail.com> wrote:
> 
> 
>                                       Hi Stanislav,
> 
>                                       Just to make sure the gdb macro is okay,
> could you run from gdb: pifi(pool, index)? The function is defined in
> gdb_funcs.c.
> 
>                                       Regards,
>                                       Florin
> 
> 
>                                       On Oct 13, 2021, at 11:30 AM, Stanislav
> Zaikin <zsta...@gmail.com> wrote:
> 
>                                       Hello folks,
> 
>                                       I'm facing a strange issue with 2 worker
> threads. Sometimes I get a crash in either the "ip6-lookup" or the
> "mpls-lookup" node. It is always an assert in the pool_elt_at_index macro,
> inside load_balance_get(). But the load_balance DPO itself looks perfectly
> good: it still holds a lock, and on a regular deletion (i.e. when the
> load_balance DPO really is deleted) it would have been erased properly with
> dpo_reset(). It usually happens while the main core is executing
> vlib_worker_thread_barrier_sync_int() and the other worker is executing
> vlib_worker_thread_barrier_check().
>                                       And the strangest thing is that when I
> run VPP's gdb helper pool_is_free_index (pifi), it shows that the index isn't
> free, in which case the macro shouldn't fire at all.
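> 
>                                       For reference, that assert is
> (simplified) just a bounds check plus a lookup in the pool's free-element
> bitmap. A toy version of what pool_elt_at_index()/load_balance_get() boil
> down to (not the real vppinfra code, just the idea):
> 
>           #include <assert.h>
>           #include <stdio.h>
>           #include <stdlib.h>
> 
>           /* toy pool: one "free" flag per element */
>           typedef struct
>           {
>             unsigned char *free_bitmap;   /* 1 = element is free */
>             size_t n_elts;
>             int *elts;
>           } toy_pool_t;
> 
>           static int
>           toy_pool_is_free_index (toy_pool_t *p, size_t i)
>           {
>             return i >= p->n_elts || p->free_bitmap[i];
>           }
> 
>           static int *
>           toy_pool_elt_at_index (toy_pool_t *p, size_t i)
>           {
>             /* the analogue of the ASSERT that fires in my backtraces */
>             assert (!toy_pool_is_free_index (p, i));
>             return p->elts + i;
>           }
> 
>           int
>           main (void)
>           {
>             toy_pool_t p = { calloc (8192, 1), 8192,
>                              calloc (8192, sizeof (int)) };
>             /* index 3604 was never freed, so this must not assert */
>             printf ("%d\n", *toy_pool_elt_at_index (&p, 3604));
>             return 0;
>           }
> 
>                                       Since both the element vector and that
> bitmap are reallocated when the pool grows, one explanation that would fit is
> that the asserting worker saw a transient view of the pool metadata while it
> was being changed, which also matches the barrier_sync / barrier_check timing
> above.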
> 
> 
>                                       Any thoughts and inputs are appreciated.
> 
> 
>                                       Thread 3 "vpp_wk_0" received signal 
> SIGABRT,
> Aborted.
>                                       [Switching to Thread 0x7fb4f2e22700 (LWP
> 3244)]
>                                       __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:51
>                                       51 ../sysdeps/unix/sysv/linux/raise.c: 
> No
> such file or directory.
>                                       (gdb) bt
>                                       #0  __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:51
>                                       #1  0x00007fb5933fa921 in __GI_abort () 
> at
> abort.c:79
>                                       #2  0x00007fb594392799 in os_panic () at
> /home/vpp/vpp/src/vppinfra/unix-misc.c:177
>                                       #3  0x00007fb5942d8f49 in debugger () at
> /home/vpp/vpp/src/vppinfra/error.c:84
>                                       #4  0x00007fb5942d8cc7 in _clib_error
> (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7fb5962ec8d0 "%s:%d
> (%s) assertion `%s' fails") at /home/vpp/vpp/src/vppinfra/error.c:143
>                                       #5  0x00007fb5954bd694 in 
> load_balance_get
> (lbi=3604) at /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
>                                       #6  0x00007fb5954bc070 in 
> ip6_lookup_inline
> (vm=0x7fb51ceccd00, node=0x7fb520f6b700, frame=0x7fb52128e4c0) at
> /home/vpp/vpp/src/vnet/ip/ip6_forward.h:117
>                                       #7  0x00007fb5954bbdd5 in
> ip6_lookup_node_fn_hsw (vm=0x7fb51ceccd00, node=0x7fb520f6b700,
> frame=0x7fb52128e4c0) at /home/vpp/vpp/src/vnet/ip/ip6_forward.c:736
>                                       #8  0x00007fb594ec0076 in dispatch_node
> (vm=0x7fb51ceccd00, node=0x7fb520f6b700, type=VLIB_NODE_TYPE_INTERNAL,
> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7fb52128e4c0,
> last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1217
>                                       #9  0x00007fb594ec09e7 in
> dispatch_pending_node (vm=0x7fb51ceccd00, pending_frame_index=5,
> last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1376
>                                       #10 0x00007fb594eba441 in
> vlib_main_or_worker_loop (vm=0x7fb51ceccd00, is_main=0) at
> /home/vpp/vpp/src/vlib/main.c:1904
>                                       #11 0x00007fb594eb92e7 in 
> vlib_worker_loop
> (vm=0x7fb51ceccd00) at /home/vpp/vpp/src/vlib/main.c:2038
>                                       #12 0x00007fb594f1195d in
> vlib_worker_thread_fn (arg=0x7fb513a48100) at
> /home/vpp/vpp/src/vlib/threads.c:1868
>                                       #13 0x00007fb5942fd214 in clib_calljmp 
> () at
> /home/vpp/vpp/src/vppinfra/longjmp.S:123
>                                       #14 0x00007fb4f2e21c90 in ?? ()
>                                       #15 0x00007fb594f09b83 in
> vlib_worker_thread_bootstrap_fn (arg=0x7fb513a48100) at
> /home/vpp/vpp/src/vlib/threads.c:585
>                                       #16 0x00007fb50c218355 in 
> eal_thread_loop
> (arg=0x0) at ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
>                                       #17 0x00007fb5947dc6db in start_thread
> (arg=0x7fb4f2e22700) at pthread_create.c:463
>                                       #18 0x00007fb5934db71f in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>                                       (gdb) select 5
>                                       (gdb) print _e
>                                       $1 = (load_balance_t *) 0x7fb52651e580
>                                       (gdb) print load_balance_pool[3604]
>                                       $2 = {cacheline0 = 0x7fb52651e580 
> "\001",
> lb_n_buckets = 1, lb_n_buckets_minus_1 = 0, lb_proto = DPO_PROTO_IP6,
> lb_flags = LOAD_BALANCE_FLAG_NONE, lb_fib_entry_flags =
> (FIB_ENTRY_FLAG_CONNECTED | FIB_ENTRY_FLAG_LOCAL), lb_locks = 1, lb_map =
> 4294967295, lb_urpf = 4094, lb_hash_config = 31, lb_buckets = 0x0,
>                                         lb_buckets_inline = {{{{dpoi_type =
> DPO_RECEIVE, dpoi_proto = DPO_PROTO_IP6, dpoi_next_node = 2, dpoi_index =
> 2094}, as_u64 = 8993661649164}}, {{{dpoi_type = DPO_FIRST, dpoi_proto =
> DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}},
> {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4,
>                                                 dpoi_next_node = 0, 
> dpoi_index =
> 0}, as_u64 = 0}}, {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4,
> dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}}}
>                                       (gdb) print &load_balance_pool[3604]
>                                       $3 = (load_balance_t *) 0x7fb52651e580
>                                       (gdb) source ~/vpp/extras/gdb/gdbinit
>                                       Loading vpp functions...
>                                       Load vl
>                                       Load pe
>                                       Load pifi
>                                       Load node_name_from_index
>                                       Load vnet_buffer_opaque
>                                       Load vnet_buffer_opaque2
>                                       Load bitmap_get
>                                       Done loading vpp functions...
>                                       (gdb) pifi load_balance_pool 3604
>                                       pool_is_free_index (load_balance_pool, 3604)$4 = 0
>                                       (gdb) info threads
>                                         Id   Target Id         Frame
>                                         1    Thread 0x7fb596bd2c40 (LWP 727)
> "vpp_main" 0x00007fb594f1439b in clib_time_now_internal (c=0x7fb59517ccc0
> <vlib_global_main>, n=1808528155236639) at
> /home/vpp/vpp/src/vppinfra/time.h:215
>                                         2    Thread 0x7fb4f3623700 (LWP 2976)
> "eal-intr-thread" 0x00007fb5934dba47 in epoll_wait (epfd=17,
> events=0x7fb4f3622d80, maxevents=1, timeout=-1) at
> ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>                                       * 3    Thread 0x7fb4f2e22700 (LWP 3244)
> "vpp_wk_0" __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:51
>                                         4    Thread 0x7fb4f2621700 (LWP 3246)
> "vpp_wk_1" 0x00007fb594ebf897 in vlib_worker_thread_barrier_check () at
> /home/vpp/vpp/src/vlib/threads.h:439
> 
> 
>                                       --
> 
>                                       Best regards
>                                       Stanislav Zaikin
> 
> 
> 
> 
> 
> 
> 
>                               --
> 
>                               Best regards
>                               Stanislav Zaikin
> 
> 
> 
> 
> 
> 
> 
> 
> 
>       --
> 
>       Best regards
>       Stanislav Zaikin
> 
> 
> 
> 
> 
> 
> --
> 
> Best regards
> Stanislav Zaikin