On 8/10/2023 4:23 PM, Josh Lange wrote:
First of all, thanks to everyone involved in the NuttX project. We
really appreciate all the work that has gone into keeping this
operating system maintained and functional on a wide variety of hardware.
We have several different NuttX-based projects that are using both
LPC1769 and LPC4078 processors with an Ethernet interface for
communication. These projects are fairly mature, having been
developed and used for several years now. We've seen occasional
glitches on Ethernet before, but we've more or less been able to
tolerate them so far. This is no longer the case in our current
application of the system, and we'd really like to try to eliminate as
many issues as we can from the software side of things. We are
primarily using Modbus TCP, which is a fairly simple request/response
protocol.
We have seen the issue manifest itself in several ways:
* Assertion failures in lpc17_40_ethernet.c:
** DEBUGASSERT((priv->lp_inten & ETH_INT_TXDONE) != 0) in
lpc17_40_response
** DEBUGASSERT(lpc17_40_txdesc(priv) == OK) in lpc17_40_txdone_work
* Incorrect TCP sequence numbers in messages coming back from the
embedded device.
Typically we will be able to run for many hundreds or thousands of
packets before we hit one of these cases, but it does seem to depend
to an extent on external factors such as which switch the device is
connected to, the amount of broadcast traffic on the network, etc.
The nature of the failures makes me think that there may be a race
condition of some kind that we're hitting, but I don't have much
other evidence to base that on.
In an attempt to narrow down the cause of these issues, I pulled out a
few dev boards and tried to run some of the stock NuttX example apps
(TCP echo server, TCP blaster server, uIP web server) on them with
settings as close to defaults as possible, using a freshly-checked-out
copy of NuttX and the NuttX apps.
* On the STM32H743 Nucleo-144 board, all the network examples I tried
appear to work flawlessly. This matches my general experience running
NuttX on these parts; we have used them on several projects and have
been very pleased with their performance overall.
* On the SAM E54 Xplained Pro board, I had mixed results. I am not
using this chip for any current projects, but I had the board handy
and it is supported by NuttX, so I gave it a try in an attempt to
collect more data. The TCP echo server and web server work as
expected. Using the TCP blaster example, only a fraction of the
packets seem to make the round trip to the PC client application.
Watching in Wireshark, I see some runs of clean traffic interspersed
with bursts of duplicate TCP packets and packets with invalid sequence
numbers.
* On the LPC4088 Quickstart board, only the TCP echo server works
reliably. The web server will accept the initial connection and
return a status code, but then hangs. Looking at the exchange with
Wireshark, I see the embedded board returns a fragment of the HTML
content from the middle of the page, then a bunch of TCP packets with
incorrect sequence numbers. Using the TCP blaster example, I can see
some traffic generated, again with a lot of invalid sequence numbers,
but the PC client application does not report any successfully
received packets. I tried changing a number of networking- and
Ethernet-related settings in menuconfig and was only ever able to make
it less functional than this, never more.
* On the LPC1769 LPCXpresso board, I see identical results to the
LPC4088 board. This is not surprising as the two chips use the same
Ethernet peripheral, but I figured it was worth checking for
completeness.
Since the STM32H743 seems to work correctly, I don't believe the issue
is in the NuttX TCP/IP stack itself; more likely it is in the drivers
for the Ethernet peripherals on the affected chips.
In my own application, I can't rule out the possibility of my
code causing problems, but I certainly would expect to be able to use
the provided NuttX apps such as the web server on any platform with a
network interface. The fact that at least one of the problems I'm
seeing in my application matches a problem that I'm seeing with the
example apps (missing/incorrect TCP sequence numbers) leads me to
believe that I'm probably triggering the same issue, but I know that's
not necessarily true.
I've been looking at this for a while now, and I'm more or less out of
ideas on how to proceed. I'll be the first to admit that I don't
fully understand how the network drivers and the OS are supposed to
interact. Unless I'm missing something, the fact that so many network
operations are deferred using worker threads really appears to make
this area of the system difficult to debug. I've done a lot of
testing with network warning/error/info messages turned on, and found
the signal/noise ratio to be pretty poor. If anyone with more
experience or familiarity with the NuttX TCP stack and/or Ethernet
drivers could provide any comments, tips, or insight on this issue or
how best to debug this type of problem, I would really appreciate it.
Thanks,
--Josh
The LPC17xx and LPC40xx Ethernet worked well in the distant past. These
were once popular parts and there were several network solutions based
on them using NuttX. But these parts are not often used in active
development these days. My suspicion is that some incompatible change
may have crept in. The only thing that is unique about these parts is
that they have a rather small dedicated memory for Ethernet DMA and all
packet DMAs must go through this dedicated multi-port memory.
If I were you, I would go back and find some older NuttX versions and
locate a working version. Assuming there is one (and there should be),
I would then use 'git bisect' to find the exact change that introduced
this problem.
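
As a rough sketch of that bisect workflow, the helper below wraps 'git bisect run' so that a pass/fail check drives the search. The function name bisect_regression and the check command are illustrative assumptions, not part of NuttX; the check is assumed to exit 0 when Ethernet behaves, 1 when it misbehaves, and 125 when a revision cannot be tested.

```shell
#!/bin/sh
# Sketch: find the first bad commit between a known-good and a known-bad
# revision by letting "git bisect run" call a pass/fail check.
#
# Assumption: the check command exits 0 when Ethernet behaves, 1 when it
# misbehaves, and 125 when the revision cannot be tested (for example,
# the build fails), which tells git bisect to skip that commit.
#
# usage: bisect_regression <good-rev> <bad-rev> <check-command> [args...]
bisect_regression() {
    good=$1
    bad=$2
    shift 2

    # Bisect chatter goes to stderr so stdout carries only the result.
    git bisect start "$bad" "$good" >&2 || return 1

    # Abort cleanly if the run cannot narrow the range to one commit.
    git bisect run "$@" >&2 || { git bisect reset >&2; return 1; }

    # After a completed run, refs/bisect/bad names the first bad commit.
    first_bad=$(git rev-parse --verify refs/bisect/bad)
    status=$?

    # Return the working tree to where it was before bisecting.
    git bisect reset >&2
    [ "$status" -eq 0 ] && printf '%s\n' "$first_bad"
}
```

It could be called as `bisect_regression <known-good-rev> <known-bad-rev> ./check_ethernet.sh`, where check_ethernet.sh is a hypothetical script you would write yourself to build, flash the board, and run the host-side TCP client. If the check cannot be scripted, the same search can be done by hand: run 'git bisect start', then mark each revision git checks out with 'git bisect good' or 'git bisect bad' after testing it on the board.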