On 26 Feb 2024, at 11:10, LIU Yulong wrote:
> Hi Eelco,
>
> Thank you for the quick response.
>
> I did not add those logs because, in order to reproduce the issue, we
> have to send lots of packets to the host, so there are too many ukeys
> being created/deleted to do logging.

Maybe use a circular buffer recording every alloc/free (+ the ukey address and a timestamp); 1 or 2 MB of memory can hold a lot. A rough sketch follows at the end of this reply.

> And can we ensure that this [1] is the only place for OVS to free the ukey?
>
> [1] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L2084

Yes, this should be the only place, and the free should always happen through an RCU delayed delete.

> For the last mail: could this issue be a concurrent read vs. update/delete?
> revalidator_sweep__ is trying to lock the ukey->mutex, while another
> thread is setting the ukey->mutex to NULL and freeing the ukey.

This should not happen, as the delete happens through the delayed RCU delete, and if the ukey is still in the cmap after the delayed delete (i.e. after all threads have passed a quiescent state), something is wrong. See the second sketch below for the contract this relies on.
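To make the trace-buffer idea concrete, here is a minimal, untested sketch (plain C11 atomics for brevity; every identifier in it, e.g. ukey_trace_record, is made up for illustration and is not an existing OVS API):

/* Untested sketch: fixed-size ring recording ukey alloc/free events.
 * Field writes are not synchronized beyond the slot claim; that is good
 * enough for post-mortem inspection of a core dump. */
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

enum ukey_trace_op { UKEY_TRACE_ALLOC, UKEY_TRACE_FREE };

struct ukey_trace_entry {
    void *ukey;                /* Address of the ukey. */
    enum ukey_trace_op op;     /* Alloc or free. */
    uint64_t ts_nsec;          /* Monotonic timestamp. */
};

#define UKEY_TRACE_SIZE (1 << 16)   /* 64K entries, ~1.5 MB. */
static struct ukey_trace_entry ukey_trace[UKEY_TRACE_SIZE];
static atomic_uint_fast64_t ukey_trace_next;

static void
ukey_trace_record(void *ukey, enum ukey_trace_op op)
{
    /* Claim a unique slot; wraps around, overwriting the oldest entry. */
    uint64_t slot = atomic_fetch_add(&ukey_trace_next, 1);
    struct ukey_trace_entry *e = &ukey_trace[slot % UKEY_TRACE_SIZE];
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    e->ukey = ukey;
    e->op = op;
    e->ts_nsec = (uint64_t) ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

The idea would be to call this with UKEY_TRACE_ALLOC where the ukey is created and with UKEY_TRACE_FREE at the top of ukey_delete__(). The ring is static, so it ends up in any core dump, and you can then check whether the last recorded event for the faulting ukey address was a free.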
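On the RCU point, just to spell out the contract the code relies on, here is a minimal sketch (not the actual ukey code; struct my_node, my_free() and my_node_remove() are illustrative, while cmap_remove(), ovsrcu_postpone() and ovsrcu_quiesce() are the real APIs from lib/cmap.h and lib/ovs-rcu.h):

#include <stdint.h>
#include <stdlib.h>
#include "cmap.h"      /* OVS concurrent hash map. */
#include "ovs-rcu.h"   /* ovsrcu_postpone(), ovsrcu_quiesce(). */

struct my_node {
    struct cmap_node cmap_node;  /* Linked into the cmap. */
    /* ... payload ... */
};

static void
my_free(struct my_node *node)
{
    free(node);
}

static void
my_node_remove(struct cmap *map, struct my_node *node, uint32_t hash)
{
    cmap_remove(map, &node->cmap_node, hash);

    /* Never free() here directly: a reader iterating the cmap without a
     * lock may still hold the pointer.  The callback only runs once every
     * thread has passed a quiescent point (ovsrcu_quiesce()), and readers
     * do not quiesce in the middle of a CMAP_FOR_EACH iteration, so a
     * pointer obtained from the cmap stays valid until then. */
    ovsrcu_postpone(my_free, node);
}

If that contract holds, the trylock in revalidator_sweep__ can only hit freed memory when a free bypassed ovsrcu_postpone(), when a thread quiesced in the middle of the iteration, or when something else trashed the memory; that is why the alloc/free trace above should tell us a lot.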
> LIU Yulong
>
> On Mon, Feb 26, 2024 at 5:41 PM Eelco Chaudron <echau...@redhat.com> wrote:
>>
>> On 26 Feb 2024, at 9:33, LIU Yulong wrote:
>>
>>> Hi,
>>>
>>> I have read the code by comparing the call stacks of the core files
>>> carefully, and found a potential race condition. Please confirm whether
>>> the following 3 threads can have a race condition. I just did some code
>>> tracing; can such a race condition happen?
>>>
>>> * PMD thread1 ===================================:
>>> -> pmd_thread_main
>>> -> dp_netdev_process_rxq_port
>>> -> dp_netdev_input
>>> -> dp_netdev_input__
>>> -> handle_packet_upcall
>>> -> dp_netdev_upcall
>>> -> upcall_cb
>>> -> ukey_install
>>> -> ukey_install__
>>> -> try_ukey_replace:
>>>
>>>        ovs_mutex_lock(&new_ukey->mutex);
>>>        <---------- the CMAP_FOR_EACH loop in revalidator_sweep__ can run
>>>        a bit earlier than the cmap_replace on the next line, so the
>>>        old_ukey can still be iterated. [1]
>>>        cmap_replace(&umap->cmap, &old_ukey->cmap_node,
>>>                     &new_ukey->cmap_node, new_ukey->hash);
>>>        ovsrcu_postpone(ukey_delete__, old_ukey);
>>>        <-------- deletes the ukey asynchronously. [2]
>>>        transition_ukey(old_ukey, UKEY_DELETED);
>>>        <-------- transitions the ukey state to UKEY_DELETED; most core
>>>        files show that the ukey's last state change was at this line. [3]
>>>        transition_ukey(new_ukey, UKEY_VISIBLE);
>>>
>>> [1] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L1892
>>> [2] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L1896
>>> [3] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L1897
>>>
>>> This function try_ukey_replace was called many times, because
>>> `counter_upcall_ukey_replace` is not zero. For instance:
>>>
>>> {
>>>     name = 0x55ba9755206b "upcall_ukey_replace",
>>>     count = 0x55ba971c7610 <upcall_ukey_replace_count>,
>>>     total = 2287997,
>>>     last_total = 2287997,
>>>     min = {221, 247, 444, 278, 324, 570, 379, 464, 283, 280, 0, 427},
>>>     hr = {3300, 4378, 3557, 4554, 3748, 3710, 4340, 3559, 4296, 3759,
>>>           3522, 4136, 3660, 4428, 3802, 3652, 3880, 3375, 4806, 4221,
>>>           4158, 3816, 3750, 3846, 3761, 3653, 4293, 3816, 3723, 3691,
>>>           4033, 468, 4117, 3659, 4007, 3536, 3439, 4440, 3388, 4079,
>>>           3876, 3865, 4339, 3757, 3481, 4027, 3989, 3633, 3737, 3564,
>>>           3403, 3992, 3793, 4390, 4124, 4354, 4164, 4383, 4237, 3667}
>>> }
>>>
>>> * RCU thread2 ===================================:
>>> -> ovsrcu_postpone_thread
>>> -> ovsrcu_call_postponed
>>> -> ukey_delete__   <------------ This function is not thread safe IMO;
>>>    it is marked OVS_NO_THREAD_SAFETY_ANALYSIS. [4]
>>>
>>>        recirc_refs_unref(&ukey->recircs);
>>>        xlate_cache_delete(ukey->xcache);
>>>        ofpbuf_delete(ovsrcu_get(struct ofpbuf *, &ukey->actions));
>>>        ovs_mutex_destroy(&ukey->mutex);   <-------------- This just sets
>>>        the ukey mutex to NULL. [5][6][7]
>>>        free(ukey);
>>>
>>> [4] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L2074
>>> [5] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L2083
>>> [6] https://github.com/openvswitch/ovs/blob/v2.17.8/lib/ovs-thread.c#L131
>>> [7] https://github.com/openvswitch/ovs/blob/v2.17.8/lib/ovs-thread.c#L124
>>>
>>> * revalidator thread3 ===================================:
>>>
>>> -> udpif_revalidator
>>> -> revalidator_sweep
>>> -> revalidator_sweep__
>>>
>>>        CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
>>>            enum ukey_state ukey_state;
>>>
>>>            if (ovs_mutex_trylock(&ukey->mutex)) {   <-------------- Cores
>>>            here, because of the NULL pointer. [8]
>>>                continue;
>>>            }
>>>
>>> [8] https://github.com/openvswitch/ovs/blob/v2.17.8/ofproto/ofproto-dpif-upcall.c#L2900
>>>
>>> CMIIW: if this race condition can happen, IMO it is mostly because the
>>> umap is not locked during the sweep CMAP_FOR_EACH loop, or because some
>>> RCU protection did not work properly.
>>
>> The code looks fine to me. We take the lock when we update the cmap,
>> which is a requirement; however, iterating over the cmap as done above
>> does not require a lock.
>>
>> If the RCU protection were broken, we would have seen a lot more errors.
>> Did you have any luck adding the traces I suggested earlier, to see if
>> it’s a use after free?
>>
>> Cheers,
>>
>> Eelco
>>
>>> On Wed, Feb 21, 2024 at 6:40 PM Eelco Chaudron <echau...@redhat.com> wrote:
>>>>
>>>> On 21 Feb 2024, at 4:26, LIU Yulong wrote:
>>>>
>>>>> Thank you very much for your reply.
>>>>>
>>>>> The problem is not easy to reproduce; we have to wait a random, long
>>>>> time to see if the issue happens again. It can take more than a day.
>>>>> OVS 2.17 with DPDK 20.11 had dumped core before, so it's hard to say
>>>>> whether it is related to DPDK. I'm running OVS without offload these
>>>>> days to see if the issue still happens.
>>>>>
>>>>> And again, TL;DR, I paste more thread call stacks below. Most of the
>>>>> threads are in the state of sched_yield, nanosleep, epoll_wait, and
>>>>> poll.
>>>>
>>>> If this looks like a memory-trashing issue, it might be hard to figure
>>>> out. Does the ukey show any kind of pattern, i.e. does the trashed data
>>>> look like anything known?
>>>> Maybe it’s a use after free, so you could add some debugging code
>>>> logging/recording all free and xmalloc calls for the ukey structure, to
>>>> see whether the ukey was actually still allocated when it crashes?
>>>>
>>>> Hope this helps you get started.
>>>>
>>>> //Eelco
>>>>
>>>>> The following threads are in a working state, so I hope these can
>>>>> provide clues for the investigation.
>>>>>
>>>>> Thread 14 (Thread 0x7fd34002b700 (LWP 91928)):
>>>>> #0  0x00007fd344487b6d in recvmsg () at ../sysdeps/unix/syscall-template.S:81
>>>>> #1  0x0000562773cb8d03 in mp_handle ()
>>>>> #2  0x00007fd344480e65 in start_thread (arg=0x7fd34002b700) at pthread_create.c:307
>>>>> #3  0x00007fd34260988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>>>
>>>>> Thread 13 (Thread 0x7fd3359d7700 (LWP 91929)):
>>>>> #0  0x00007fd34448799d in accept () at ../sysdeps/unix/syscall-template.S:81
>>>>> #1  0x0000562773cd8f3c in socket_listener ()
>>>>> #2  0x00007fd344480e65 in start_thread (arg=0x7fd3359d7700) at pthread_create.c:307
>>>>> #3  0x00007fd34260988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>>>
>>>>> Thread 6 (Thread 0x7fd304663700 (LWP 91965)):
>>>>> #0  0x00007fd34448771d in read () at ../sysdeps/unix/syscall-template.S:81
>>>>> #1  0x00007fd343b42bfb in _mlx5dv_devx_get_event () from /lib64/libmlx5.so.1
>>>>> #2  0x0000562773936d86 in mlx5_vdpa_event_handle ()
>>>>> #3  0x00007fd344480e65 in start_thread (arg=0x7fd304663700) at pthread_create.c:307
>>>>> #4  0x00007fd34260988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>>>
>>>>> Thread 2 (Thread 0x7fd305730700 (LWP 91943)):
>>>>> #0  ccmap_find_slot_protected (count=<synthetic pointer>, hash=hash@entry=1669671676, b=b@entry=0x7fd2f8012a80) at lib/ccmap.c:278
>>>>> #1  ccmap_inc_bucket_existing (b=b@entry=0x7fd2f8012a80, hash=hash@entry=1669671676, inc=inc@entry=1) at lib/ccmap.c:281
>>>>> #2  0x0000562773d4b015 in ccmap_try_inc (impl=impl@entry=0x7fd2f8012a40, hash=hash@entry=1669671676, inc=inc@entry=1) at lib/ccmap.c:464
>>>>> #3  0x0000562773d4b224 in ccmap_inc (ccmap=ccmap@entry=0x7fd2f802a7e8, hash=1669671676) at lib/ccmap.c:485
>>>>> #4  0x0000562773d4975a in classifier_replace (cls=<optimized out>, rule=rule@entry=0x7fd2fac70e28, version=<optimized out>, conjs=<optimized out>, n_conjs=<optimized out>) at lib/classifier.c:579
>>>>> #5  0x0000562773d49e99 in classifier_insert (cls=<optimized out>, rule=rule@entry=0x7fd2fac70e28, version=<optimized out>, conj=<optimized out>, n_conj=<optimized out>) at lib/classifier.c:694
>>>>> #6  0x0000562773d00fc8 in replace_rule_start (ofproto=ofproto@entry=0x5627778cc420, ofm=ofm@entry=0x7fd3057235f0, old_rule=<optimized out>, new_rule=new_rule@entry=0x7fd2fac70e20) at ofproto/ofproto.c:5645
>>>>> #7  0x0000562773d010e4 in add_flow_start (ofproto=0x5627778cc420, ofm=0x7fd3057235f0) at ofproto/ofproto.c:5256
>>>>> #8  0x0000562773d0122d in modify_flows_start__ (ofproto=ofproto@entry=0x5627778cc420, ofm=ofm@entry=0x7fd3057235f0) at ofproto/ofproto.c:5824
>>>>> #9  0x0000562773d01eac in modify_flow_start_strict (ofm=0x7fd3057235f0, ofproto=0x5627778cc420) at ofproto/ofproto.c:5953
>>>>> #10 ofproto_flow_mod_start (ofproto=0x5627778cc420, ofm=ofm@entry=0x7fd3057235f0) at ofproto/ofproto.c:8112
>>>>> #11 0x0000562773d0225a in ofproto_flow_mod_learn_start (ofm=ofm@entry=0x7fd3057235f0) at ofproto/ofproto.c:5491
>>>>> #12 0x0000562773d040ad in ofproto_flow_mod_learn (ofm=ofm@entry=0x7fd3057235f0, keep_ref=<optimized out>, limit=<optimized out>, below_limitp=below_limitp@entry=0x7fd305723510) at ofproto/ofproto.c:5576
>>>>> #13 0x0000562773d2641e in xlate_learn_action (ctx=ctx@entry=0x7fd305729a60, learn=learn@entry=0x562777db4618) at ofproto/ofproto-dpif-xlate.c:5547
>>>>> #14 0x0000562773d2aafb in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7232
>>>>> #15 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777db4470, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #16 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #17 0x0000562773d2ab8d in xlate_ofpact_resubmit (resubmit=0x56277781db28, resubmit=0x56277781db28, resubmit=0x56277781db28, is_last_action=true, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4823
>>>>> #18 do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7107
>>>>> #19 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777ab9220, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #20 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #21 0x0000562773d2ab8d in xlate_ofpact_resubmit (resubmit=0x562777b9f3c8, resubmit=0x562777b9f3c8, resubmit=0x562777b9f3c8, is_last_action=true, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4823
>>>>> #22 do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7107
>>>>> #23 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777d6dc90, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #24 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #25 0x0000562773d2ab8d in xlate_ofpact_resubmit (resubmit=0x562777c13c58, resubmit=0x562777c13c58, resubmit=0x562777c13c58, is_last_action=true, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4823
>>>>> #26 do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7107
>>>>> #27 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562778046000, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #28 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #29 0x0000562773d2ac1c in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #30 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777ad0ac0, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #31 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #32 0x0000562773d2ac1c in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #33 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x56277785c0d0, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #34 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #35 0x0000562773d2ac1c in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #36 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777a86ce0, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #37 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #38 0x0000562773d2ac1c in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #39 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x56277781b710, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #40 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #41 0x0000562773d2ac1c in do_xlate_actions (ofpacts=ofpacts@entry=0x562777833a38, ofpacts_len=ofpacts_len@entry=32, ctx=ctx@entry=0x7fd305729a60, is_last_action=is_last_action@entry=true, group_bucket_action=group_bucket_action@entry=false) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #42 0x0000562773d30f68 in clone_xlate_actions (actions=0x562777833a38, actions_len=32, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:5809
>>>>> #43 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d30d70 <clone_xlate_actions>, is_last_action=true, deepens=true, rule=0x562777ad5640, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #44 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d30d70 <clone_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #45 0x0000562773d2cf20 in patch_port_output (ctx=ctx@entry=0x7fd305729a60, out_dev=0x562777ebeec0, is_last_action=is_last_action@entry=true, in_dev=0x562777f43ee0, in_dev=0x562777f43ee0) at ofproto/ofproto-dpif-xlate.c:3890
>>>>> #46 0x0000562773d2d2f7 in compose_output_action__ (ctx=ctx@entry=0x7fd305729a60, ofp_port=2, xr=xr@entry=0x0, check_stp=check_stp@entry=true, is_last_action=is_last_action@entry=true, truncate=truncate@entry=false) at ofproto/ofproto-dpif-xlate.c:4205
>>>>> #47 0x0000562773d2fdd0 in compose_output_action (truncate=false, is_last_action=true, xr=0x0, ofp_port=<optimized out>, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4360
>>>>> #48 xlate_output_action (ctx=ctx@entry=0x7fd305729a60, port=<optimized out>, controller_len=<optimized out>, may_packet_in=may_packet_in@entry=true, is_last_action=is_last_action@entry=true, truncate=truncate@entry=false, group_bucket_action=group_bucket_action@entry=false) at ofproto/ofproto-dpif-xlate.c:5305
>>>>> #49 0x0000562773d2972f in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:6960
>>>>> #50 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777ad5430, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #51 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #52 0x0000562773d2ac1c in do_xlate_actions (ofpacts=<optimized out>, ofpacts_len=<optimized out>, ctx=0x7fd305729a60, is_last_action=<optimized out>, group_bucket_action=<optimized out>) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #53 0x0000562773d26c85 in xlate_recursively (actions_xlator=0x562773d29490 <do_xlate_actions>, is_last_action=true, deepens=false, rule=0x562777ac2fb0, ctx=0x7fd305729a60) at ofproto/ofproto-dpif-xlate.c:4383
>>>>> #54 xlate_table_action (ctx=0x7fd305729a60, in_port=<optimized out>, table_id=<optimized out>, may_packet_in=<optimized out>, honor_table_miss=<optimized out>, with_ct_orig=<optimized out>, is_last_action=true, xlator=0x562773d29490 <do_xlate_actions>) at ofproto/ofproto-dpif-xlate.c:4512
>>>>> #55 0x0000562773d2ac1c in do_xlate_actions (ofpacts=ofpacts@entry=0x56277797b318, ofpacts_len=ofpacts_len@entry=8, ctx=ctx@entry=0x7fd305729a60, is_last_action=is_last_action@entry=true, group_bucket_action=group_bucket_action@entry=false) at ofproto/ofproto-dpif-xlate.c:7110
>>>>> #56 0x0000562773d330d6 in xlate_actions (xin=xin@entry=0x7fd30572a920, xout=xout@entry=0x7fd30572ad38) at ofproto/ofproto-dpif-xlate.c:7924
>>>>> #57 0x0000562773d2241b in upcall_xlate (wc=0x7fd30572bfe0, odp_actions=0x7fd30572b7b0, upcall=0x7fd30572acd0, udpif=0x562777850cf0) at ofproto/ofproto-dpif-upcall.c:1340
>>>>> #58 process_upcall (udpif=udpif@entry=0x562777850cf0, upcall=upcall@entry=0x7fd30572acd0, odp_actions=odp_actions@entry=0x7fd30572b7b0, wc=wc@entry=0x7fd30572bfe0) at ofproto/ofproto-dpif-upcall.c:1602
>>>>> #59 0x0000562773d22c39 in upcall_cb (packet=<optimized out>, flow=0x7fd30572bd40, ufid=<optimized out>, pmd_id=<optimized out>, type=<optimized out>, userdata=<optimized out>, actions=0x7fd30572b7b0, wc=0x7fd30572bfe0, put_actions=0x7fd30572b7f0, aux=0x562777850cf0) at ofproto/ofproto-dpif-upcall.c:1461
>>>>> #60 0x0000562773d52198 in dp_netdev_upcall (pmd=pmd@entry=0x7fd305731010, packet_=packet_@entry=0x21152c180, flow=flow@entry=0x7fd30572bd40, wc=wc@entry=0x7fd30572bfe0, ufid=ufid@entry=0x7fd30572b790, type=type@entry=DPIF_UC_MISS, userdata=userdata@entry=0x0, actions=actions@entry=0x7fd30572b7b0, put_actions=put_actions@entry=0x7fd30572b7f0) at lib/dpif-netdev.c:9141
>>>>> #61 0x0000562773d66f13 in handle_packet_upcall (put_actions=0x7fd30572b7f0, actions=0x7fd30572b7b0, key=0x7fd30572cc40, packet=0x21152c180, pmd=0x7fd305731010) at lib/dpif-netdev.c:11303
>>>>> #62 fast_path_processing (pmd=pmd@entry=0x7fd305731010, packets_=packets_@entry=0x7fd30572d0c0, keys=keys@entry=0x7fd30572cc30, flow_map=flow_map@entry=0x7fd30572cae0, index_map=index_map@entry=0x7fd30572cad0 "", in_port=<optimized out>) at lib/dpif-netdev.c:11426
>>>>> #63 0x0000562773d680c1 in dp_netdev_input__ (pmd=<optimized out>, packets=<optimized out>, md_is_valid=md_is_valid@entry=false, port_no=<optimized out>) at lib/dpif-netdev.c:11520
>>>>> #64 0x0000562773d6a15d in dp_netdev_input (pmd=<optimized out>, packets=<optimized out>, port_no=<optimized out>) at lib/dpif-netdev.c:11558
>>>>> #65 0x0000562773d6a2cf in dp_netdev_process_rxq_port (pmd=pmd@entry=0x7fd305731010, rxq=0x56277796a5e0, port_no=3) at lib/dpif-netdev.c:6660
>>>>> #66 0x0000562773d6a759 in pmd_thread_main (f_=<optimized out>) at lib/dpif-netdev.c:8267
>>>>> #67 0x0000562773df805f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
>>>>> #68 0x00007fd344480e65 in start_thread (arg=0x7fd305730700) at pthread_create.c:307
>>>>> #69 0x00007fd34260988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>>>
>>>>> Thread 1 (Thread 0x7fd334307700 (LWP 91942)):
>>>>> #0  0x00007fd342541337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
>>>>> #1  0x00007fd342542a28 in __GI_abort () at abort.c:90
>>>>> #2  0x0000562773e2c4ee in ovs_abort_valist (err_no=<optimized out>, format=<optimized out>, args=args@entry=0x7fd334302340) at lib/util.c:499
>>>>> #3  0x0000562773e2c584 in ovs_abort (err_no=err_no@entry=0, format=format@entry=0x5627740cad18 "%s: %s() passed uninitialized ovs_mutex") at lib/util.c:491
>>>>> #4  0x0000562773df72e1 in ovs_mutex_lock_at (l_=l_@entry=0x7fd2f907df68, where=where@entry=0x5627740a7fc0 "ofproto/ofproto-dpif-upcall.c:2214") at lib/ovs-thread.c:75
>>>>> #5  0x0000562773d1e98d in ukey_delete (umap=umap@entry=0x562777853400, ukey=ukey@entry=0x7fd2f907df20) at ofproto/ofproto-dpif-upcall.c:2214
>>>>> #6  0x0000562773d202da in revalidator_sweep__ (revalidator=revalidator@entry=0x562777897b00, purge=purge@entry=false) at ofproto/ofproto-dpif-upcall.c:3048
>>>>> #7  0x0000562773d241a6 in revalidator_sweep (revalidator=0x562777897b00) at ofproto/ofproto-dpif-upcall.c:3072
>>>>> #8  udpif_revalidator (arg=0x562777897b00) at ofproto/ofproto-dpif-upcall.c:1086
>>>>> #9  0x0000562773df805f in ovsthread_wrapper (aux_=<optimized out>) at lib/ovs-thread.c:422
>>>>> #10 0x00007fd344480e65 in start_thread (arg=0x7fd334307700) at pthread_create.c:307
>>>>> #11 0x00007fd34260988d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>>>>>
>>>>> Thanks
>>>>> LIU Yulong
>>>>>
>>>>> On Mon, Feb 19, 2024 at 8:12 PM Eelco Chaudron <echau...@redhat.com> wrote:
>>>>>>
>>>>>> On 19 Feb 2024, at 13:09, Ilya Maximets wrote:
>>>>>>
>>>>>>> On 2/19/24 11:14, Eelco Chaudron wrote:
>>>>>>>>
>>>>>>>> On 19 Feb 2024, at 10:34, LIU Yulong wrote:
>>>>>>>>
>>>>>>>>> Hi OVS experts,
>>>>>>>>>
>>>>>>>>> Our ovs-vswitchd dumps core at the ovs_mutex_trylock(&ukey->mutex)
>>>>>>>>> in the function revalidator_sweep__.
>>>>>>>>>
>>>>>>>>> I have sent this mail before but got no response.
>>>>>>>>> https://mail.openvswitch.org/pipermail/ovs-discuss/2023-August/052604.html
>>>>>>>>>
>>>>>>>>> So I'm trying to send this mail again. And I may apologize in
>>>>>>>>> advance, because I would like to post as much useful information
>>>>>>>>> as possible to help identify potential issues. So this mail will
>>>>>>>>> have a really long text.
>>>>>>>>>
>>>>>>>>> Compared to the mail 2023-August/052604.html, we upgraded OVS to
>>>>>>>>> 2.17.8 and DPDK to 22.11 to pray for good luck, hoping that the
>>>>>>>>> community had potential fixes for this issue. But unfortunately,
>>>>>>>>> ovs-vswitchd still dumps core.
>>>>>>>>
>>>>>>>> As you mentioned, it looks like some memory corruption, which I
>>>>>>>> have not seen before.
>>>>>>>>
>>>>>>>> Have you tried this without rte offload? This is the only feature I
>>>>>>>> never used. There is a 2.17.9 with DPDK 22.11.6 you could try.
>>>>>>>
>>>>>>> OVS 2.17 is not supposed to work with DPDK 22.11; it's supposed to
>>>>>>> work with 21.11.
>>>>>>> See the compatibility table here:
>>>>>>> https://docs.openvswitch.org/en/latest/faq/releases/
>>>>>>>
>>>>>>> Though it's hard to tell if the DPDK version is in any way related
>>>>>>> to the issue.
>>>>>>
>>>>>> My mistake, I was supposed to type 21.11.6 :( But yes, if they are
>>>>>> using 22.11, that could also be the problem. I would suggest using
>>>>>> the supported version and seeing if the problem goes away.
>>>>>>
>>>>>> //Eelco
>>>>>>
>>>>>>> Best regards, Ilya Maximets.

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss