Hi Neale,

I've tried a simplified version first (I know there's a _vec_resize_will_expand macro), something like:

  always_inline u8
  clib_bitmap_will_expand (uword * ai, uword i)
  {
    uword i0 = i / BITS (ai[0]);
    return vec_len (ai) < i0;
  }

I've put a check in load_balance_destroy just before pool_put.
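For context, the check sits roughly like this (a simplified sketch rather than the exact diff; pool_header() and the free_bitmap field come from vppinfra/pool.h, load_balance_pool and lb are the existing pool and the element being freed in load_balance_destroy, and clib_warning() is just a placeholder for however the condition gets reported):

  /* in load_balance_destroy (), just before the existing
     pool_put (load_balance_pool, lb): returning lb to the pool sets bit
     (lb - load_balance_pool) in the pool's free-index bitmap, so test
     whether setting that bit would force the bitmap vector to grow */
  pool_header_t *ph = pool_header (load_balance_pool);

  if (clib_bitmap_will_expand (ph->free_bitmap, lb - load_balance_pool))
    clib_warning ("pool_put (load_balance_pool) would expand the free bitmap");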
With that check in place, there's no abort anymore. So, the prize goes to you :)

In my case, there's a lot of control traffic and data traffic. Control traffic creates some routes and tunnels, and because the rate is high, not all of the control traffic reaches the control plane, and some tunnels/routes get deleted.

I can prepare the patch, but one thing concerns me: deleting an element from the pool looks almost free, while setting a worker barrier is quite expensive. Also, the assert only fires in debug mode (on the other hand, the expansion might have other side effects, I don't know). In your opinion, does it make sense to put a barrier in that case?
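To make the question concrete, what I'm picturing is roughly this (again only a sketch; it reuses the same check as above and assumes load_balance_destroy runs only on the main thread, since vlib_worker_thread_barrier_sync() must not be called from a worker):

  pool_header_t *ph = pool_header (load_balance_pool);

  if (clib_bitmap_will_expand (ph->free_bitmap, lb - load_balance_pool))
    {
      /* rare case: the free-index bitmap would be reallocated, so hold
         the workers while the pool bookkeeping moves */
      vlib_worker_thread_barrier_sync (vlib_get_main ());
      pool_put (load_balance_pool, lb);
      vlib_worker_thread_barrier_release (vlib_get_main ());
    }
  else
    {
      /* common case: no reallocation, the pool_put stays (almost) free */
      pool_put (load_balance_pool, lb);
    }

The common case would stay cheap, but every expansion would still pay for a full barrier, and that's the part I'm unsure about.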
On Fri, 22 Oct 2021 at 20:37, Neale Ranns <ne...@graphiant.com> wrote:

> Hi Stanislav,
>
> I see no smoking guns :(
>
> The only cause I can think of is that when a load-balance is returned to
> the pool, the pool’s bitmap of free indicies may expand, which would
> confuse readers/workers. But I don’t see any of your threads having just
> pool_put a load-balance. Since you have a reliable reproduction
> environment, could you cook your own pool_put_would_expand macro to test
> this theory?
>
> /neale
>
> *From: *Stanislav Zaikin <zsta...@gmail.com>
> *Date: *Friday, 22 October 2021 at 15:06
> *To: *Neale Ranns <ne...@graphiant.com>
> *Cc: *vpp-dev <vpp-dev@lists.fd.io>
> *Subject: *Re: [vpp-dev] assert in pool_elt_at_index
>
> Hi Neale,
>
> Sure, here it is:
> https://gist.github.com/zstas/c2316d4e95a84fa28f0e0be00eb6fb19
>
> Thanks in advance.
>
> On Fri, 22 Oct 2021 at 09:55, Neale Ranns <ne...@graphiant.com> wrote:
>
> Hi Stanislav,
>
> Can you do:
> thread apply all bt
> I’d like to see what the other threads are doing.
>
> /neale
>
> *From: *vpp-dev@lists.fd.io <vpp-dev@lists.fd.io> on behalf of Stanislav
> Zaikin via lists.fd.io <zstaseg=gmail....@lists.fd.io>
> *Date: *Wednesday, 13 October 2021 at 20:30
> *To: *vpp-dev <vpp-dev@lists.fd.io>
> *Subject: *[vpp-dev] assert in pool_elt_at_index
>
> Hello folks,
>
> I'm facing a strange issue with 2 worker threads. Sometimes I get a crash
> either in "ip6-lookup" or "mpls-lookup" nodes. They happen with assert in
> the *pool_elt_at_index* macro and always inside the "*load_balance_get*"
> function. But the load_balance dpo looks perfectly good, I mean it still
> has a lock and on regular deletion (in the case when the load_balance dpo
> is deleted) it should be erased properly (with dpo_reset). It happens
> usually when the main core is executing
> vlib_worker_thread_barrier_sync_int(), and the other worker is executing
> vlib_worker_thread_barrier_check().
>
> And the strangest thing is, when I run the vpp's gdb helper for checking
> "pool_index_is_free" or pifi, it shows me that the index isn't free (and
> the macro in that case shouldn't fire).
>
> Any thoughts and inputs are appreciated.
>
> Thread 3 "vpp_wk_0" received signal SIGABRT, Aborted.
> [Switching to Thread 0x7fb4f2e22700 (LWP 3244)]
> __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 51 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
> (gdb) bt
> #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1 0x00007fb5933fa921 in __GI_abort () at abort.c:79
> #2 0x00007fb594392799 in os_panic () at /home/vpp/vpp/src/vppinfra/unix-misc.c:177
> #3 0x00007fb5942d8f49 in debugger () at /home/vpp/vpp/src/vppinfra/error.c:84
> #4 0x00007fb5942d8cc7 in _clib_error (how_to_die=2, function_name=0x0, line_number=0, fmt=0x7fb5962ec8d0 "%s:%d (%s) assertion `%s' fails") at /home/vpp/vpp/src/vppinfra/error.c:143
> #5 0x00007fb5954bd694 in load_balance_get (lbi=3604) at /home/vpp/vpp/src/vnet/dpo/load_balance.h:222
> #6 0x00007fb5954bc070 in ip6_lookup_inline (vm=0x7fb51ceccd00, node=0x7fb520f6b700, frame=0x7fb52128e4c0) at /home/vpp/vpp/src/vnet/ip/ip6_forward.h:117
> #7 0x00007fb5954bbdd5 in ip6_lookup_node_fn_hsw (vm=0x7fb51ceccd00, node=0x7fb520f6b700, frame=0x7fb52128e4c0) at /home/vpp/vpp/src/vnet/ip/ip6_forward.c:736
> #8 0x00007fb594ec0076 in dispatch_node (vm=0x7fb51ceccd00, node=0x7fb520f6b700, type=VLIB_NODE_TYPE_INTERNAL, dispatch_state=VLIB_NODE_STATE_POLLING, frame=0x7fb52128e4c0, last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1217
> #9 0x00007fb594ec09e7 in dispatch_pending_node (vm=0x7fb51ceccd00, pending_frame_index=5, last_time_stamp=1808528151240447) at /home/vpp/vpp/src/vlib/main.c:1376
> #10 0x00007fb594eba441 in vlib_main_or_worker_loop (vm=0x7fb51ceccd00, is_main=0) at /home/vpp/vpp/src/vlib/main.c:1904
> #11 0x00007fb594eb92e7 in vlib_worker_loop (vm=0x7fb51ceccd00) at /home/vpp/vpp/src/vlib/main.c:2038
> #12 0x00007fb594f1195d in vlib_worker_thread_fn (arg=0x7fb513a48100) at /home/vpp/vpp/src/vlib/threads.c:1868
> #13 0x00007fb5942fd214 in clib_calljmp () at /home/vpp/vpp/src/vppinfra/longjmp.S:123
> #14 0x00007fb4f2e21c90 in ?? ()
> #15 0x00007fb594f09b83 in vlib_worker_thread_bootstrap_fn (arg=0x7fb513a48100) at /home/vpp/vpp/src/vlib/threads.c:585
> #16 0x00007fb50c218355 in eal_thread_loop (arg=0x0) at ../src-dpdk/lib/librte_eal/linux/eal_thread.c:127
> #17 0x00007fb5947dc6db in start_thread (arg=0x7fb4f2e22700) at pthread_create.c:463
> #18 0x00007fb5934db71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> (gdb) select 5
> (gdb) print _e
> $1 = (load_balance_t *) 0x7fb52651e580
> (gdb) print load_balance_pool[3604]
> $2 = {cacheline0 = 0x7fb52651e580 "\001", lb_n_buckets = 1, lb_n_buckets_minus_1 = 0, lb_proto = DPO_PROTO_IP6, lb_flags = LOAD_BALANCE_FLAG_NONE, lb_fib_entry_flags = (FIB_ENTRY_FLAG_CONNECTED | FIB_ENTRY_FLAG_LOCAL), lb_locks = 1, lb_map = 4294967295, lb_urpf = 4094, lb_hash_config = 31, lb_buckets = 0x0,
> lb_buckets_inline = {{{{dpoi_type = DPO_RECEIVE, dpoi_proto = DPO_PROTO_IP6, dpoi_next_node = 2, dpoi_index = 2094}, as_u64 = 8993661649164}},
> {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}},
> {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}},
> {{{dpoi_type = DPO_FIRST, dpoi_proto = DPO_PROTO_IP4, dpoi_next_node = 0, dpoi_index = 0}, as_u64 = 0}}}}
> (gdb) print &load_balance_pool[3604]
> $3 = (load_balance_t *) 0x7fb52651e580
> (gdb) source ~/vpp/extras/gdb/gdbinit
> Loading vpp functions...
> Load vl
> Load pe
> Load pifi
> Load node_name_from_index
> Load vnet_buffer_opaque
> Load vnet_buffer_opaque2
> Load bitmap_get
> Done loading vpp functions...
> (gdb) pifi load_balance_pool 3604
> pool_is_free_index (load_balance_pool, 3604)
> $4 = 0
> (gdb) info threads
> Id Target Id Frame
> 1 Thread 0x7fb596bd2c40 (LWP 727) "vpp_main" 0x00007fb594f1439b in clib_time_now_internal (c=0x7fb59517ccc0 <vlib_global_main>, n=1808528155236639) at /home/vpp/vpp/src/vppinfra/time.h:215
> 2 Thread 0x7fb4f3623700 (LWP 2976) "eal-intr-thread" 0x00007fb5934dba47 in epoll_wait (epfd=17, events=0x7fb4f3622d80, maxevents=1, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> * 3 Thread 0x7fb4f2e22700 (LWP 3244) "vpp_wk_0" __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> 4 Thread 0x7fb4f2621700 (LWP 3246) "vpp_wk_1" 0x00007fb594ebf897 in vlib_worker_thread_barrier_check () at /home/vpp/vpp/src/vlib/threads.h:439
>
> --
> Best regards
> Stanislav Zaikin
>
> --
> Best regards
> Stanislav Zaikin

-- 
Best regards
Stanislav Zaikin