Hi, Reto,
I am working for Hexagon Mining in Baar (some of our people were at the
last NuttX summit). Our project involves NuttX and runs on an
STM32H743ZI. Our NuttX version was recently updated (mid-January).
We recently noticed that the Ethernet on our platform dies after some
time. After some investigation, it looks like the Ethernet DMA ends up
in a deadlock, since no further descriptors are available and none are
freed. It all starts with a packet that is not processed right away but
only in the next interrupt handling cycle, after the next ETH_DMACSR_RI
appears. It then either collapses or slows down massively. In case of a
collapse everything stops (including ping) since the descriptor chain
around the DMA is stalled.
So far it has only happened during longer data transfers. The device
works fine for days just sitting there and responding to broadcasts on
the network.
To verify the problem, we also ran the TCPblaster example with a minimal
code base on our side. There, it only happens if multiple threads are
used; a single thread seems to be handled fine. TCPblaster worked fine
on the STM32F7.
I am wondering if anyone has had this or a similar problem before. If
you need more information, please let me know.
I don't know anything about your specific deadlock, but errors like
these have occurred and been fixed in the past. The deadlock normally
occurs like this:
1. Some network activity is started and runs on the low priority work
queue. Most networking occurs FIFO on the low priority worker thread.
2. That network task takes the network lock (net_lock()), giving it
exclusive access to the network.
3. Then it waits for some event or resource with the network locked.
4. The task that will provide the event or resource also requires the
network lock --> Deadlock (see the sketch below)
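Reduced to a minimal sketch, that first pattern looks roughly like this
(net_lock()/net_unlock(), the nxsem_*() calls, and the LPWORK queue are
the real NuttX interfaces; the worker and provider functions and the
semaphore are hypothetical stand-ins for whatever event or resource is
involved):

    #include <nuttx/config.h>
    #include <nuttx/net/net.h>     /* net_lock(), net_unlock() */
    #include <nuttx/semaphore.h>   /* sem_t, nxsem_init(), nxsem_wait(), ... */
    #include <nuttx/wqueue.h>      /* struct work_s, work_queue(), LPWORK */

    static sem_t g_event;          /* The awaited event/resource */
    static struct work_s g_work;

    /* (1) Network job queued on the low priority work queue */

    static void net_consumer_work(FAR void *arg)
    {
      net_lock();                  /* (2) Exclusive access to the network */
      nxsem_wait(&g_event);        /* (3) Wait for the event with the
                                    *     network still locked */
      net_unlock();
    }

    /* (4) The provider of that event also needs the network lock, so it
     * hangs here forever --> deadlock.
     */

    static void net_provider(void)
    {
      net_lock();                  /* Never granted: it is held by the
                                    * blocked net_consumer_work() above */
      nxsem_post(&g_event);
      net_unlock();
    }

    void start_example(void)
    {
      nxsem_init(&g_event, 0, 0);
      work_queue(LPWORK, &g_work, net_consumer_work, NULL, 0);
    }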
With IOBs, there are other related kinds of deadlocks that are possible.
1. Some network activity is started and runs on the low priority work
queue.
2. The network task needs IOBs, but we are out of IOBs, so the network
task unlocks the network (allowing network activity) but blocks in
the low priority work queue waiting for a free IOB.
3. The task that will release the IOB is also queued for execution on
the low priority work queue. But since the queue is blocked because
the network task is waiting on the work queue, the IOB cannot be
released --> Deadlock (sketched below)
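A minimal sketch of that IOB variant (again hedged: the worker names are
hypothetical and the exact iob_alloc()/iob_free() signatures differ a
bit between NuttX versions; the point is only that both jobs share the
single low priority worker thread):

    #include <nuttx/config.h>
    #include <nuttx/mm/iob.h>      /* struct iob_s, iob_alloc(), iob_free() */
    #include <nuttx/net/net.h>     /* net_lock(), net_unlock() */
    #include <nuttx/wqueue.h>      /* struct work_s, work_queue(), LPWORK */

    static struct work_s g_rxwork;
    static struct work_s g_txwork;

    /* (1)/(2) Network job that needs an IOB while the pool is exhausted.
     * It releases the network lock, but it ties up the (single) low
     * priority worker thread while it waits for a free IOB.
     */

    static void iob_consumer_work(FAR void *arg)
    {
      FAR struct iob_s *iob;

      net_lock();
      net_unlock();                /* The network is unlocked, but ... */
      iob = iob_alloc(false);      /* ... the LP worker blocks right here */
      (void)iob;                   /* ... the IOB would be used here */
    }

    /* (3) The job that would free an IOB is queued behind the consumer on
     * the same low priority work queue.  With only one LP worker thread
     * it never runs, so the IOB is never released --> deadlock.
     */

    static void iob_releaser_work(FAR void *arg)
    {
      FAR struct iob_s *iob = (FAR struct iob_s *)arg;

      if (iob != NULL)
        {
          iob_free(iob);           /* Would let iob_consumer_work() resume */
        }
    }

    void queue_both_jobs(void)
    {
      work_queue(LPWORK, &g_rxwork, iob_consumer_work, NULL, 0);
      work_queue(LPWORK, &g_txwork, iob_releaser_work, NULL, 0);
    }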
There are a couple of solutions to this latter IOB case: First, you can
analyze the deadlock and find the culprit, then modify the design so
that the deadlock cannot occur.
If this is the situation, a really simple fix is to increase the number
of low priority worker threads. By default, there is only one, so the
FIFO nature of the single work queue tends to deadlock. But if you
increase the number of threads to two, there is much less likelihood of
deadlocking in this way.
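For reference, the knob for that is the thread count of the low priority
work queue in the board configuration; a defconfig fragment like the
following (the value 2 is just the smallest count that breaks the strict
FIFO chain) would be the whole change:

    CONFIG_SCHED_LPWORK=y
    CONFIG_SCHED_LPNTHREADS=2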
Greg