On Thu, Apr 10, 2025 at 04:54:35PM +0200, Marcus Wichelmann wrote:
> On 10.04.25 at 16:30, Michal Kubiak wrote:
> > On Wed, Apr 09, 2025 at 05:17:49PM +0200, Marcus Wichelmann wrote:
> >> Hi,
> >>
> >> in a setup where I use native XDP to redirect packets to a bonding
> >> interface that's backed by two ixgbe slaves, I noticed that the ixgbe
> >> driver constantly resets the NIC with the following kernel output:
> >>
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: Detected Tx Unit Hang (XDP)
> >>     Tx Queue             <4>
> >>     TDH, TDT             <17e>, <17e>
> >>     next_to_use          <181>
> >>     next_to_clean        <17e>
> >>   tx_buffer_info[next_to_clean]
> >>     time_stamp           <0>
> >>     jiffies              <10025c380>
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: tx hang 19 detected on queue 4, resetting adapter
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: initiating reset due to tx timeout
> >>   ixgbe 0000:01:00.1 ixgbe-x520-2: Reset adapter
> >>
> >> This only occurs in combination with a bonding interface and XDP, so I
> >> don't know if this is an issue with ixgbe or the bonding driver.
> >> I first discovered this with Linux 6.8.0-57, but kernel 6.14.0 and
> >> 6.15.0-rc1 show the same issue.
> >>
> >>
> >> I managed to reproduce this bug in a lab environment. Here are some details
> >> about my setup and the steps to reproduce the bug:
> >>
> >> [...]
> >>
> >> Do you have any ideas what may be causing this issue or what I can do to
> >> diagnose this further?
> >>
> >> Please let me know when I should provide any more information.
> >>
> >>
> >> Thanks!
> >> Marcus
> >>
> >
> > Hi Marcus,
>
> Hi Michal,
>
> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)
>
> > I have just successfully reproduced the problem on our lab machine. What
> > is interesting is that I do not seem to have to use a bonding interface
> > to get the "Tx timeout" that causes the adapter to reset.
>
> Interesting. I just tried again but had no luck yet with reproducing it
> without a bonding interface. May I ask how your setup looks like?
>
> > I will try to debug the problem more closely and let you know of any
> > updates.
> >
> > Thanks,
> > Michal
>
> Great!
>
> Marcus
>

Hi Marcus,

> thank you for looking into it. And not even 24 hours after my report, I'm
> very impressed! ;)

Thanks! :-)

> Interesting. I just tried again but had no luck yet with reproducing it
> without a bonding interface. May I ask how your setup looks like?

For now, I've just grabbed the first available system with the HW
controlled by the "ixgbe" driver. In my case it was:

  Ethernet controller: Intel Corporation Ethernet Controller X550

Also, for my first attempt, I didn't use the upstream kernel - I just tried
the kernel installed on that system. It was the Fedora kernel:

  6.12.8-200.fc41.x86_64

I think that may be the "beauty" of timing issues - sometimes you can change
just one piece in your system and get a completely different replication
ratio. Anyway, the higher the repro probability, the easier it is to debug
the timing problem. :-)

Thanks,
Michal