Hi,

in a setup where I use native XDP to redirect packets to a bonding interface that's backed by two ixgbe slaves, I noticed that the ixgbe driver constantly resets the NIC with the following kernel output:
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
  Tx Queue             <4>
  TDH, TDT             <17e>, <17e>
  next_to_use          <181>
  next_to_clean        <17e>
tx_buffer_info[next_to_clean]
  time_stamp           <0>
  jiffies              <10025c380>
ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter

This only occurs in combination with a bonding interface and XDP, so I don't know whether this is an issue with ixgbe or with the bonding driver. I first discovered it on Linux 6.8.0-57, but kernels 6.14.0 and 6.15.0-rc1 show the same issue.

I managed to reproduce this bug in a lab environment. Here are some details about my setup and the steps to reproduce the bug:

NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
CPU: Ampere(R) Altra(R) Processor Q80-30 CPU @ 3.0GHz
Also reproduced on:
- Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
- Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Kernel: 6.15.0-rc1 (built from mainline)

# ethtool -i ixgbe-x520-1
driver: ixgbe
version: 6.15.0-rc1
firmware-version: 0x00012b2c, 1.3429.0
expansion-rom-version:
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

The two ports of the NIC (named "ixgbe-x520-1" and "ixgbe-x520-2") are directly connected to each other with a DAC cable. Both ports are configured as slaves of a bond in mode balance-rr. Neither the direct connection of the two ports nor the round-robin bonding mode is required to reproduce the issue; this setup just makes the bug easier to reproduce in an isolated environment. The issue is also visible with a regular 802.3ad link aggregation with a switch on the other side.
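As an aside: to correlate the resets with my traffic runs I count the hang messages per Tx queue with a small helper. This is just a sketch that matches the "tx hang ... detected on queue N" message format shown above against a saved copy of the kernel log; it is not part of the reproduction steps:

```python
import re
from collections import Counter

# Sample of the messages from my kernel log (format as emitted by ixgbe)
LOG = """\
ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
"""

def hangs_per_queue(log: str) -> Counter:
    """Count 'tx hang ... detected on queue N' messages per queue number."""
    return Counter(re.findall(r"tx hang \d+ detected on queue (\d+)", log))

print(hangs_per_queue(LOG))  # Counter({'4': 1})
```

In my runs the hangs are spread over several XDP Tx queues, not pinned to one.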
# modprobe bonding
# ip link set dev ixgbe-x520-1 down
# ip link set dev ixgbe-x520-2 down
# ip link add bond0 type bond mode balance-rr
# ip link set dev ixgbe-x520-1 master bond0
# ip link set dev ixgbe-x520-2 master bond0
# ip link set dev ixgbe-x520-1 up
# ip link set dev ixgbe-x520-2 up
# ip link set dev bond0 up

# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v6.15.0-rc1

Bonding Mode: load balancing (round-robin)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

Slave Interface: ixgbe-x520-1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:b3:11:08:5c:3c
Slave queue ID: 0

Slave Interface: ixgbe-x520-2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 6c:b3:11:08:5c:3e
Slave queue ID: 0

# ethtool -l ixgbe-x520-1
Channel parameters for ixgbe-x520-1:
Pre-set maximums:
RX:             n/a
TX:             n/a
Other:          1
Combined:       63
Current hardware settings:
RX:             n/a
TX:             n/a
Other:          1
Combined:       63

(same for ixgbe-x520-2)

In the following, the xdp-tools from https://github.com/xdp-project/xdp-tools/ are used.

Enable XDP on the bonding interface and make sure all received packets are dropped:

# xdp-tools/xdp-bench/xdp-bench drop -e -i 1 bond0

Redirect a batch of packets to the bonding interface:

# xdp-tools/xdp-trafficgen/xdp-trafficgen udp --dst-mac <mac of bond0> --src-port 5000 --dst-port 6000 --threads 16 --num-packets 1000000 bond0

Shortly after that (3-4 seconds), one or more "Detected Tx Unit Hang" errors (see above) show up in the kernel log. Neither the high packet count nor the thread count (--threads 16) is required to trigger the issue, but both greatly increase the probability.

Do you have any ideas what may be causing this issue, or what I can do to diagnose it further? Please let me know if I should provide any more information.

Thanks!
Marcus