On Sun, Apr 8, 2018, at 7:46 PM, Jacob S. Moroni wrote:
> Hello Madalin,
>
> I've been experiencing some issues with the DPAA Ethernet driver,
> specifically related to frame transmission. Hopefully you can point
> me in the right direction.
>
> TLDR: Attempting to transmit faster than a few frames per second causes
> the TX FQ CGR to enter the congested state and remain there forever,
> even after transmission stops.
>
> The hardware is a T2080RDB, running from the tip of net-next, using
> the standard t2080rdb device tree and corenet64_smp_defconfig kernel
> config. No changes were made to any of the files. The issue occurs
> with 4.16.1 stable as well. In fact, the only time I've been able
> to achieve reliable frame transmission was with the SDK 4.1 kernel.
>
> For my tests, I'm running iperf3 both with and without the -R
> option (send/receive). When using a USB Ethernet adapter, there
> are no issues.
>
> The TX frame queues seem to get "stuck" when transmitting at rates
> greater than a few frames per second. Ping works fine, but anything
> that could cause multiple TX frames to be enqueued causes issues.
>
> If I run iperf3 in reverse mode (with the T2080RDB receiving), then
> I can achieve ~940 Mbps, but this is also somewhat unreliable.
>
> If I run it with the T2080RDB transmitting, the test never
> completes. Sometimes it starts transmitting for a few seconds then stops,
> and other times it never even starts. This also seems to force the
> interface into a bad state.
>
> The ethtool stats show that the interface has entered congestion
> a few times, and that it's currently congested. The fact that it's
> still congested even after transmission has stopped indicates that
> the FQ somehow stopped being drained. I've also noticed that whenever
> this issue occurs, the TX confirmation counters are always less than
> the TX packet counters.
>
> When it gets into this state, I can see memory usage climbing until
> it reaches roughly the CGR threshold (about 100 MB).
>
> Any idea what could prevent the TX FQ from being drained? My first
> guess was flow control, but it's completely disabled.
>
> I tried messing with the egress congestion threshold, workqueue
> assignments, etc., but nothing seemed to have any effect.
>
> If you need any more information or want me to run any tests,
> please let me know.
>
> Thanks,
> --
> Jacob S. Moroni
> m...@jakemoroni.com
It turns out that irqbalance was causing all of the issues. After disabling it and rebooting, the interfaces worked perfectly.

Perhaps there's an issue with how the qman/bman portals are defined as per-CPU variables. During portal probe, the CPUs are assigned one by one, and the per-CPU portal is then passed to request_irq() as its argument. However, if the IRQ affinity later changes, the ISR could be handed a reference to a per-CPU variable belonging to another CPU.

At least I know where to look now.

- Jake
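[Editor's note: below is a minimal, hypothetical C sketch of the pattern described above, not the actual dpaa/qman portal code. The structure name, its fields, and portal_probe_one() are made up for illustration; it only shows how a per-CPU pointer captured at request_irq() time keeps referring to the probing CPU's data even if irqbalance later moves the interrupt to another CPU.]

/*
 * Illustrative sketch only -- not the real dpaa/qman driver.
 * "struct portal", its fields, and portal_probe_one() are hypothetical.
 */
#include <linux/interrupt.h>
#include <linux/percpu.h>
#include <linux/smp.h>

struct portal {
	int owner_cpu;		/* CPU this portal was probed for */
	/* ... portal state: rings, congestion notifications, etc. ... */
};

static DEFINE_PER_CPU(struct portal, portals);

static irqreturn_t portal_isr(int irq, void *dev_id)
{
	struct portal *p = dev_id;	/* captured once, at probe time */

	/*
	 * If irqbalance has since moved this IRQ to a different CPU,
	 * smp_processor_id() no longer matches p->owner_cpu: the handler
	 * then services another CPU's portal state, and queues the owning
	 * CPU expected to drain (e.g. TX confirmations) can appear stuck.
	 */
	WARN_ON_ONCE(p->owner_cpu != smp_processor_id());

	/* ... acknowledge the portal interrupt and process entries ... */
	return IRQ_HANDLED;
}

static int portal_probe_one(unsigned int cpu, unsigned int irq)
{
	struct portal *p = per_cpu_ptr(&portals, cpu);

	p->owner_cpu = cpu;
	/* The per-CPU pointer is baked into the IRQ registration here. */
	return request_irq(irq, portal_isr, 0, "qbman-portal", p);
}

Under that assumption, disabling irqbalance (or pinning each portal IRQ's affinity to the CPU it was probed on) keeps the handler running on the CPU whose per-CPU portal was registered, which matches the behaviour reported above.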