Any thoughts will be much appreciated. This bug is quite reproducible in my
environment, but I don't know what else to check.

On Thu, 14 Oct 2021 at 08:51, Stanislav Zaikin via lists.fd.io <zstaseg=gmail....@lists.fd.io> wrote:

> Hi Florin,
> Hi Rajith,
>
> It shouldn't be the pool expansion case; I do have
> 8341f76fd1cd4351961cd8161cfed2814fc55103.
> Moreover, if this were an expansion, _e would differ from
> &load_balance_pool[3604]. I've found some of those expansions (in other
> places), and in those cases the pointer to the element has a different
> address.
>
>
> On Thu, 14 Oct 2021 at 06:45, Rajith PR <raj...@rtbrick.com> wrote:
>
>> Hi Stanislav,
>>
>> My guess is you don't have the commit below.
>>
>> commit 8341f76fd1cd4351961cd8161cfed2814fc55103
>> Author: Dave Barach <d...@barachs.net>
>> Date:   Wed Jun 3 08:05:15 2020 -0400
>>
>>     fib: add barrier sync, pool/vector expand cases
>>
>>     load_balance_alloc_i(...) is not thread safe when the
>>     load_balance_pool or combined counter vectors expand.
>>
>>     Type: fix
>>
>>     Signed-off-by: Dave Barach <d...@barachs.net>
>>     Change-Id: I7f295ed77350d1df0434d5ff461eedafe79131de
>>
>> Thanks,
>> Rajith
>>
>> On Thu, Oct 14, 2021 at 3:57 AM Florin Coras <fcoras.li...@gmail.com>
>> wrote:
>>
>>> Hi Stanislav,
>>>
>>> The only thing I can think of is that main thread grows the pool, or the
>>> pool’s bitmap, without a worker barrier while the worker that asserts is
>>> trying to access it. Is main thread busy doing something (e.g., adding
>>> routes/interfaces) when the assert happens?
>>>
>>> Regards,
>>> Florin
>>>
>>> On Oct 13, 2021, at 2:52 PM, Stanislav Zaikin <zsta...@gmail.com> wrote:
>>>
>>> Hi Florin,
>>>
>>> I wasn't aware of those helper functions, thanks! But yeah, it also
>>> returns 0 (sorry, this is the trace of a different crash):
>>>
>>> Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
>>> [Switching to Thread 0x7f9cc0f6a700 (LWP 3546)]
>>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
>>> 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>> (gdb) bt
>>> #0  __GI_raise (sig=sig@entry=6) at
>>> ../sysdeps/unix/sysv/linux/raise.c:51
>>> #1  0x00007f9d61542921 in __GI_abort () at abort.c:79
>>> #2  0x00007f9d624da799 in os_panic () at
>>> /home/vpp/vpp/src/vppinfra/unix-misc.c:177
>>> #3  0x00007f9d62420f49 in debugger () at
>>> /home/vpp/vpp/src/vppinfra/error.c:84
>>> #4  0x00007f9d62420cc7 in _clib_error (how_to_die=2, function_name=0x0,
>>> line_number=0, fmt=0x7f9d644348d0 "%s:%d (%s) assertion `%s' fails") at
>>> /home/vpp/vpp/src/vppinfra/error.c:143
>>> #5  0x00007f9d636695b4 in load_balance_get (lbi=4569) at
>>> /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
>>> #6  0x00007f9d63668247 in mpls_lookup_node_fn_hsw (vm=0x7f9ceb0138c0,
>>> node=0x7f9ceee6f700, from_frame=0x7f9cef9c9240) at
>>> /home/vpp/vpp/src/vnet/mpls/mpls_lookup.c:229
>>> #7  0x00007f9d63008076 in dispatch_node (vm=0x7f9ceb0138c0,
>>> node=0x7f9ceee6f700, type=VLIB_NODE_TYPE_INTERNAL,
>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7f9cef9c9240,
>>> last_time_stamp=1837178878370487) at /home/vpp/vpp/src/vlib/main.c:1217
>>> #8  0x00007f9d630089e7 in dispatch_pending_node (vm=0x7f9ceb0138c0,
>>> pending_frame_index=2, last_time_stamp=1837178878370487) at
>>> /home/vpp/vpp/src/vlib/main.c:1376
>>> #9  0x00007f9d63002441 in vlib_main_or_worker_loop (vm=0x7f9ceb0138c0,
>>> is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
>>> #10 0x00007f9d630012e7 in vlib_worker_loop (vm=0x7f9ceb0138c0) at
>>> /home/vpp/vpp/src/vlib/main.c:2038
>>> #11 0x00007f9d6305995d in vlib_worker_thread_fn (arg=0x7f9ce1b88540) at
>>> /home/vpp/vpp/src/vlib/threads.c:1868
>>> #12 0x00007f9d62445214 in clib_calljmp () at
>>> /home/vpp/vpp/src/vppinfra/longjmp.S:123
>>> #13 0x00007f9cc0f69c90 in ?? ()
>>> #14 0x00007f9d63051b83 in vlib_worker_thread_bootstrap_fn
>>> (arg=0x7f9ce1b88540) at /home/vpp/vpp/src/vlib/threads.c:585
>>> #15 0x00007f9cda360355 in eal_thread_loop (arg=0x0) at
>>> ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
>>> #16 0x00007f9d629246db in start_thread (arg=0x7f9cc0f6a700) at
>>> pthread_create.c:463
>>> #17 0x00007f9d6162371f in clone () at
>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>> (gdb) select 5
>>> (gdb) print pifi (load_balance_pool, 4569)
>>> $1 = 0
>>> (gdb) source ~/vpp/extras/gdb/gdbinit
>>> Loading vpp functions...
>>> Load vl
>>> Load pe
>>> Load pifi
>>> Load node_name_from_index
>>> Load vnet_buffer_opaque
>>> Load vnet_buffer_opaque2
>>> Load bitmap_get
>>> Done loading vpp functions...
>>> (gdb) pifi load_balance_pool 4569
>>> pool_is_free_index (load_balance_pool, 4569)$2 = 0
>>>
>>> On Wed, 13 Oct 2021 at 21:55, Florin Coras <fcoras.li...@gmail.com>
>>> wrote:
>>>
>>>> Hi Stanislav,
>>>>
>>>> Just to make sure the gdb macro is okay, could you run from gdb:
>>>> pifi(pool, index)? The function is defined in gdb_funcs.c.
>>>>
>>>> Regards,
>>>> Florin
>>>>
>>>> On Oct 13, 2021, at 11:30 AM, Stanislav Zaikin <zsta...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello folks,
>>>>
>>>> I'm facing a strange issue with 2 worker threads. Sometimes I get a
>>>> crash in either the "ip6-lookup" or "mpls-lookup" node. The crash is
>>>> always an assert in the pool_elt_at_index macro, inside the
>>>> load_balance_get function. But the load_balance dpo looks perfectly
>>>> good: it still holds a lock, and on a regular deletion (in the case
>>>> where the load_balance dpo is deleted) it would have been erased
>>>> properly (with dpo_reset). It usually happens while the main core is
>>>> executing vlib_worker_thread_barrier_sync_int() and the other worker
>>>> is executing vlib_worker_thread_barrier_check().
>>>> And the strangest thing is, when I run vpp's gdb helper to check
>>>> pool_index_is_free (pifi), it shows that the index isn't free (in
>>>> which case the assert shouldn't fire).
>>>>
>>>> Any thoughts and inputs are appreciated.
>>>>
>>>> Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
>>>> [Switching to Thread 0x7fb4f2e22700 (LWP 3244)]
>>>> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
>>>> 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
>>>> (gdb) bt
>>>> #0  __GI_raise (sig=sig@entry=6) at
>>>> ../sysdeps/unix/sysv/linux/raise.c:51
>>>> #1  0x00007fb5933fa921 in __GI_abort () at abort.c:79
>>>> #2  0x00007fb594392799 in os_panic () at
>>>> /home/vpp/vpp/src/vppinfra/unix-misc.c:177
>>>> #3  0x00007fb5942d8f49 in debugger () at
>>>> /home/vpp/vpp/src/vppinfra/error.c:84
>>>> #4  0x00007fb5942d8cc7 in _clib_error (how_to_die=2, function_name=0x0,
>>>> line_number=0, fmt=0x7fb5962ec8d0 "%s:%d (%s) assertion `%s' fails") at
>>>> /home/vpp/vpp/src/vppinfra/error.c:143
>>>> #5  0x00007fb5954bd694 in load_balance_get (lbi=3604) at
>>>> /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
>>>> #6  0x00007fb5954bc070 in ip6_lookup_inline (vm=0x7fb51ceccd00,
>>>> node=0x7fb520f6b700, frame=0x7fb52128e4c0) at
>>>> /home/vpp/vpp/src/vnet/ip/ip6_forward.h:117
>>>> #7  0x00007fb5954bbdd5 in ip6_lookup_node_fn_hsw (vm=0x7fb51ceccd00,
>>>> node=0x7fb520f6b700, frame=0x7fb52128e4c0) at
>>>> /home/vpp/vpp/src/vnet/ip/ip6_forward.c:736
>>>> #8  0x00007fb594ec0076 in dispatch_node (vm=0x7fb51ceccd00,
>>>> node=0x7fb520f6b700, type=VLIB_NODE_TYPE_INTERNAL,
>>>> dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7fb52128e4c0,
>>>> last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1217
>>>> #9  0x00007fb594ec09e7 in dispatch_pending_node (vm=0x7fb51ceccd00,
>>>> pending_frame_index=5, last_time_stamp=1808528151240447) at
>>>> /home/vpp/vpp/src/vlib/main.c:1376
>>>> #10 0x00007fb594eba441 in vlib_main_or_worker_loop (vm=0x7fb51ceccd00,
>>>> is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
>>>> #11 0x00007fb594eb92e7 in vlib_worker_loop (vm=0x7fb51ceccd00) at
>>>> /home/vpp/vpp/src/vlib/main.c:2038
>>>> #12 0x00007fb594f1195d in vlib_worker_thread_fn (arg=0x7fb513a48100) at
>>>> /home/vpp/vpp/src/vlib/threads.c:1868
>>>> #13 0x00007fb5942fd214 in clib_calljmp () at
>>>> /home/vpp/vpp/src/vppinfra/longjmp.S:123
>>>> #14 0x00007fb4f2e21c90 in ?? ()
>>>> #15 0x00007fb594f09b83 in vlib_worker_thread_bootstrap_fn
>>>> (arg=0x7fb513a48100) at /home/vpp/vpp/src/vlib/threads.c:585
>>>> #16 0x00007fb50c218355 in eal_thread_loop (arg=0x0) at
>>>> ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
>>>> #17 0x00007fb5947dc6db in start_thread (arg=0x7fb4f2e22700) at
>>>> pthread_create.c:463
>>>> #18 0x00007fb5934db71f in clone () at
>>>> ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>>>> (gdb) select 5
>>>> (gdb) print _e
>>>> $1 = (load_balance_t *) 0x7fb52651e580
>>>> (gdb) print load_balance_pool[3604]
>>>> $2 = {cacheline0 = 0x7fb52651e580 "\001", lb_n_buckets = 1,
>>>> lb_n_buckets_minus_1 = 0, lb_proto = DPO_PROTO_IP6, lb_flags =
>>>> LOAD_BALANCE_FLAG_NONE, lb_fib_entry_flags = (FIB_ENTRY_FLAG_CONNECTED |
>>>> FIB_ENTRY_FLAG_LOCAL), lb_locks = 1, lb_map = 4294967295, lb_urpf = 4094,
>>>> lb_hash_config = 31, lb_buckets = 0x0,
>>>>   lb_buckets_inline = {{{{dpoi_type = DPO_RECEIVE, dpoi_proto =
>>>> DPO_PROTO_IP6, dpoi_next_node = 2, dpoi_index = 2094}, as_u64 =
>>>> 8993661649164}}, {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4,
>>>> dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}, {{{dpoi_type =
>>>> DPO_FIRST, dpoi_proto = DPO_PROTO_IP4,
>>>>           dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}},
>>>> {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0,
>>>> dpoi_index = 0}, as_u64 = 0}}}}
>>>> (gdb) print &load_balance_pool[3604]
>>>> $3 = (load_balance_t *) 0x7fb52651e580
>>>> (gdb) source ~/vpp/extras/gdb/gdbinit
>>>> Loading vpp functions...
>>>> Load vl
>>>> Load pe
>>>> Load pifi
>>>> Load node_name_from_index
>>>> Load vnet_buffer_opaque
>>>> Load vnet_buffer_opaque2
>>>> Load bitmap_get
>>>> Done loading vpp functions...
>>>> (gdb) pifi load_balance_pool 3604
>>>> pool_is_free_index (load_balance_pool, 3604)$4 = 0
>>>> (gdb) info threads
>>>>   Id   Target Id         Frame
>>>>   1    Thread 0x7fb596bd2c40 (LWP 727) "vpp_main" 0x00007fb594f1439b in
>>>> clib_time_now_internal (c=0x7fb59517ccc0 <vlib_global_main>,
>>>> n=1808528155236639) at /home/vpp/vpp/src/vppinfra/time.h:215
>>>>   2    Thread 0x7fb4f3623700 (LWP 2976) "eal-intr-thread"
>>>> 0x00007fb5934dba47 in epoll_wait (epfd=17, events=0x7fb4f3622d80,
>>>> maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
>>>> * 3    Thread 0x7fb4f2e22700 (LWP 3244) "vpp_wk_0" __GI_raise
>>>> (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
>>>>   4    Thread 0x7fb4f2621700 (LWP 3246) "vpp_wk_1" 0x00007fb594ebf897
>>>> in vlib_worker_thread_barrier_check () at
>>>> /home/vpp/vpp/src/vlib/threads.h:439
>>>>
>>>> --
>>>> Best regards
>>>> Stanislav Zaikin
>>>>
>>>
>>> --
>>> Best regards
>>> Stanislav Zaikin
>>>
>>
>
>
> --
> Best regards
> Stanislav Zaikin
>

-- 
Best regards
Stanislav Zaikin
-=-=-=-=-=-=-=-=-=-=-=-
Links: You receive all messages sent to this group.
View/Reply Online (#20334): https://lists.fd.io/g/vpp-dev/message/20334
Mute This Topic: https://lists.fd.io/mt/86295132/21656
Group Owner: vpp-dev+ow...@lists.fd.io
Unsubscribe: https://lists.fd.io/g/vpp-dev/unsub [arch...@mail-archive.com]
-=-=-=-=-=-=-=-=-=-=-=-
