Hi Adrián,

Are there any updates on this issue? We have done some analysis of our own, but we would greatly appreciate an authoritative explanation from the community.
Thanks very much!

From: Adrián Moreno
Sent: 2025-05-06 16:56
To: chenyongch...@cestc.cn
Cc: ovs-discuss
Subject: Re: [ovs-discuss] Adjusting the bond-rebalance-interval configuration triggers the USERSPACE_INVALID_PORT_DROP error during PMD packet transmission

On Tue, Apr 29, 2025 at 01:48:42PM +0800, chenyongchang--- via discuss wrote:
>
> Hello,
> In a high-traffic scenario, when modifying the bond-rebalance-interval
> configuration of an OVS-DPDK bond interface, we observed that OVS-DPDK
> generated USERSPACE_INVALID_PORT_DROP errors.
>
> After analysis, we found that executing the command
>
>   ovs-vsctl set port dpdk_tun_port other_config:bond-rebalance-interval=1000
>
> triggers the following sequence, which ultimately leads to the
> USERSPACE_INVALID_PORT_DROP errors:
>
> 1. memset(bond->hash, 0, hash_len); is executed.
> Call stack:
> #0 bond_entry_reset (bond=0x4c64bc0) at ofproto/bond.c:1852
> #1 0x0000000001a2a238 in bond_reconfigure (bond=0x4c64bc0, s=0x7fff6d1dec10) at ofproto/bond.c:514
> #2 0x0000000001a4e253 in bundle_set (ofproto_=0x4c21110, aux=0x4c39d90, s=0x7fff6d1deb90) at ofproto/ofproto-dpif.c:3484
> #3 0x0000000001a31b27 in ofproto_bundle_register (ofproto=0x4c21110, aux=0x4c39d90, s=0x7fff6d1deb90) at ofproto/ofproto.c:1430
> #4 0x0000000001a1c80e in port_configure (port=0x4c39d90) at vswitchd/bridge.c:1384
> #5 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x4bb37c0) at vswitchd/bridge.c:1005
> #6 0x0000000001a223e7 in bridge_run () at vswitchd/bridge.c:3423
> #7 0x0000000001a27b9e in main (argc=11, argv=0x7fff6d1def38) at vswitchd/ovs-vswitchd.c:129
>
> 2. member_map[i] = OFPP_NONE is executed.
> Call stack:
> #0 bond_add_lb_output_buckets (bond=0x37220f0) at ofproto/bond.c:2135
> #1 0x0000000001a29b4f in update_recirc_rules__ (bond=0x37220f0) at ofproto/bond.c:356
> #2 0x0000000001a29ebe in update_recirc_rules (bond=0x37220f0) at ofproto/bond.c:426
> #3 0x0000000001a2a262 in bond_reconfigure (bond=0x37220f0, s=0x7fffffffe230) at ofproto/bond.c:520
> #4 0x0000000001a4e292 in bundle_set (ofproto_=0x366afa0, aux=0x3713290, s=0x7fffffffe1b0) at ofproto/ofproto-dpif.c:3484
> #5 0x0000000001a31b66 in ofproto_bundle_register (ofproto=0x366afa0, aux=0x3713290, s=0x7fffffffe1b0) at ofproto/ofproto.c:1430
> #6 0x0000000001a1c80e in port_configure (port=0x3713290) at vswitchd/bridge.c:1384
> #7 0x0000000001a1b7b3 in bridge_reconfigure (ovs_cfg=0x3660180) at vswitchd/bridge.c:1005
> #8 0x0000000001a223b7 in bridge_run () at vswitchd/bridge.c:3422
> #9 0x0000000001a27b92 in main (argc=1, argv=0x7fffffffe558)
>
> 3. The PMD thread, when sending packets, finds port_no=0xffffffff.
> Call stack:
> #0 dp_execute_output_action (pmd=0x7fff68731010, packets_=0x7fff53ff8f50, should_steal=true, port_no=4294967295) at lib/dpif-netdev.c:9273
> #1 0x0000000001acaf6d in dp_execute_lb_output_action (pmd=0x7fff68731010, packets_=0x7fff53ff9ca0, should_steal=true, bond=1) at lib/dpif-netdev.c:9350
> #2 0x0000000001acb0b6 in dp_execute_cb (aux_=0x7fff53ff9b30, packets_=0x7fff53ff9ca0, a=0x7fff4800f074, should_steal=true) at lib/dpif-netdev.c:9379
> #3 0x0000000001b526b5 in odp_execute_actions (dp=0x7fff53ff9b30, batch=0x7fff53ff9ca0, steal=true, actions=0x7fff4800f074, actions_len=8, dp_execute_action=0x1acafc0 <dp_execute_cb>) at lib/odp-execute.c:1016
> #4 0x0000000001acbc8e in dp_netdev_execute_actions (pmd=0x7fff68731010, packets=0x7fff53ff9ca0, should_steal=true, flow=0x7fff4800ea70, actions=0x7fff4800f074, actions_len=8) at lib/dpif-netdev.c:9698
> #5 0x0000000001ac8133 in packet_batch_per_flow_execute (batch=0x7fff53ff9c90, pmd=0x7fff68731010) at lib/dpif-netdev.c:8338
> #6 0x0000000001aca3ad in dp_netdev_input__ (pmd=0x7fff68731010, packets=0x7fff53ffbdf0, md_is_valid=false, port_no=4) at lib/dpif-netdev.c:9055
> #7 0x0000000001aca3ff in dp_netdev_input (pmd=0x7fff68731010, packets=0x7fff53ffbdf0, port_no=4) at lib/dpif-netdev.c:9064
> #8 0x0000000001ac0da2 in dp_netdev_process_rxq_port (pmd=0x7fff68731010, rxq=0x3720220, port_no=4) at lib/dpif-netdev.c:5690
> #9 0x0000000001ac566a in pmd_thread_main (f_=0x7fff68731010) at lib/dpif-netdev.c:7334
> #10 0x0000000001bc4b1b in ovsthread_wrapper (aux_=0x3711920) at lib/ovs-thread.c:422
> #11 0x00007ffff76f4802 in start_thread () from /lib64/libc.so.6
> #12 0x00007ffff7694314 in clone () from /lib64/libc.so.6
>
> The main issue is a race between the main thread and the PMD thread
> when operating on pmd->tx_bonds, which causes the PMD to temporarily
> resolve the egress interface to 0xffffffff (an invalid value).
> What solutions does the community propose to address this problem?

Reconfiguring the bonding flows for a simple change in the
rebalance_interval seems like overkill. It was added so that users could
disable rebalancing, but merely increasing or decreasing the interval
(with neither the old nor the new value being zero) should not trigger a
bond reset.

Also, if bond_reconfigure resets the bond hashes, we should probably not
wait until bond_run() calls bond_update_post_recirc_rules__() to
initialize them. Even for recirc-driven bonds, this results in an initial
update of the post-recirc rules with all hashes zeroed.

I'll take a deeper look probably next week.

>
> Our OVS version is 2.17.5 LTS.

Note 2.17.5 is not EOL.

Thanks.
Adrián
_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss