Hi everyone, I am sorry for my late reply to this thread. My email provider flagged it as spam, so I only saw the conversation now. It seems that you have reached a conclusion on how to proceed, but I thought I should share my notes and observations on this issue anyway, in case they are useful.
My employer has a large number of Mediatek-based (mt7620 and mt7621) routers in production. Most routers have a minimum of two internet connections: one fixed and one using mobile broadband. Some time in 2017 we started receiving reports from a few customers that the switch would stop working. The link was up, but no data would go through. Looking at the logs, we could always see the "TX timeout" error message, and we started to look for a cause. We quickly ruled out any kind of crash, as the LTE was still up and wifi worked fine.
After getting a few of these reports, we started looking for things that were common between the different installations. We struggled to find any; there were all sorts of devices connected to the different ports on the routers. The only thing the different cases had in common was that the problem disappeared when whatever was connected to the WAN port was disconnected. However, again, the equipment that provided the fixed connection came from all sorts of vendors.
After scratching our heads for a while and not getting anywhere, I asked here on the mailing list and was told that restarting networking should at least make the switch work fine again. We added a watchdog that does exactly that whenever the TX timeout message appears. Restarting networking improved the situation considerably, but the switch would still sometimes get stuck and never recover.
This triggered us to make a second attempt at recreating the problem. Our test was the same as what Rene described. We assumed the problem had something to do with sending large amounts of traffic at high speed, so we used iperf3 as a traffic generator and sent traffic between different machines connected to the switch. One of these machines was quite unstable and prone to crashing, and we noticed that whenever that machine crashed, the TX timeout issue would trigger and no traffic would pass through the switch.
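For anyone who wants to try the same stop-gap, the watchdog idea can be sketched roughly as below. This is not our production script, just a minimal illustration. It assumes that the log line contains the text "TX timeout" (the exact wording may differ between driver versions) and that networking is restarted with the usual OpenWrt init script; the function names and the rate limit are made up for the example.

```python
import subprocess
import time

# Assumption: the driver logs a line containing this text; the full kernel
# message may be worded differently depending on the driver version.
TIMEOUT_MARKER = "TX timeout"

# Assumption: the standard OpenWrt init script is used to restart networking.
RESTART_CMD = ["/etc/init.d/network", "restart"]


def restart_network():
    """Restart networking via the (assumed) OpenWrt init script."""
    subprocess.run(RESTART_CMD, check=False)


def watch(lines, restart=restart_network, min_interval=60.0,
          clock=time.monotonic):
    """Call restart() when a log line contains TIMEOUT_MARKER.

    Restarts are limited to one per min_interval seconds, so a burst of
    timeout messages does not turn into a restart loop.
    Returns the number of restarts that were triggered.
    """
    last_restart = None
    restarts = 0
    for line in lines:
        if TIMEOUT_MARKER not in line:
            continue
        now = clock()
        if last_restart is not None and now - last_restart < min_interval:
            continue  # restarted recently, ignore this message
        restart()
        last_restart = now
        restarts += 1
    return restarts
```

The `lines` argument can be any iterable of log lines, for example the output of `logread -f` read through a pipe. As described above, a watchdog like this only reduces the impact; it does not help in the cases where the switch gets stuck and never recovers.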
A normal packet capture didn't reveal anything interesting, but connecting a network tap did. Looking at the packets captured from the tap, we could see a flood of pause frames from the crashed machine. When this flood occurred, the switch stopped transmitting packets on all the ports and not just the one that the crashed machine was connected to. This caught us by surprise, but after doing some research it seems to be a common behavior among "normal" switches. Also, even when we waited a long time, the switch never recovered.
After discovering that pause frames seemed to be at least one trigger for TX timeout, we added support to the driver for enabling/disabling flow control on each of the ports, plus an init script that does the disabling. Since we deployed this change on our routers, we have not had a single report about switches that stop working. We do sometimes still see the "TX timeout" error, but it is no longer critical. We never tried to disable flow control on the CPU port only, which seems like a more elegant approach than disabling each port individually.
I do agree that disabling pause frames is more a work-around than a solution, but it has at least eradicated the problem for us. I never got around to submitting our patch, but if anyone would find it useful I can do it quite soon.
BR,
Kristian
_______________________________________________
openwrt-devel mailing list
openwrt-devel@lists.openwrt.org
https://lists.openwrt.org/mailman/listinfo/openwrt-devel