Hi, Reto,
I am working for Hexagon Mining in Baar (some of our people were on the last 
nuttx summit). Our project involves nuttx and runs on a stm32h743zi. Our nuttx 
version was recently (mid January) updated.

We recently noticed that the ethernet of our platform dies after some time. 
After some investigation it looks like the DMA of the ethernet ends up in 
a deadlock since no further descriptors are available and none are freed. It 
all starts with a packet which is not processed right away but in the next 
interrupt handling cycle, after the next ETH_DMACSR_RI has appeared. It then 
either collapses or slows down massively. In case of a collapse everything 
stops (incl. ping) since the chain is stopped around the DMA.

So far it only happened during a longer data transfer. The device works fine 
over days just sitting there and responding to the broadcasts on the network. 
To verify the problems we also used the TCPblaster example with a minimum code 
base from our side. There, it only happens if multiple threads are used. A 
single thread alone seems to be handled well. The TCPblaster worked fine on the 
stm32f7.

I am wondering if anyone had this or a similar problem before. If you need more 
information please let me know.

I don't know anything about your specific deadlock, but errors like these have occurred and been fixed in the past.  The deadlock normally occurs like this:

1. Some network activity is started and runs on the low priority work
   queue.  Most networking occurs FIFO on the low priority worker thread.
2. That network task takes the network lock (net_lock()), giving it
   exclusive access to the network.
3. Then it waits for some event or resource with the network locked.
4. The task that will provide the event or resource also requires the
   network lock --> Deadlock
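
The four steps above form a cycle of waiters, which can be sketched as a tiny wait-for check. This is only a toy model and an assumption on my part: has_deadlock(), waits_on[] and NOBODY are hypothetical names, not NuttX APIs. Task 0 stands for the network task (holds the network lock, waits for the event from task 1) and task 1 stands for the provider (waits for the network lock held by task 0).

```c
#include <stdbool.h>

#define NOBODY (-1)

/* waits_on[t] is the task that task t is blocked on, or NOBODY if task t
 * is free to run.  A deadlock exists if following the "blocked on" links
 * from some task leads back to that same task.
 */

bool has_deadlock(const int waits_on[], int ntasks)
{
  for (int start = 0; start < ntasks; start++)
    {
      int t = start;

      /* A cycle through 'start' needs at most ntasks hops. */

      for (int hops = 0; hops < ntasks && t != NOBODY; hops++)
        {
          t = waits_on[t];
          if (t == start)
            {
              return true;
            }
        }
    }

  return false;
}
```

With waits_on = { 1, 0 } (each task blocked on the other) the check reports a deadlock.  With waits_on = { 1, NOBODY } (the provider does not need anything held by the network task) it does not.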

With IOBs, there are other related kinds of deadlocks that are possible.

1. Some network activity is started and runs on the low priority work
   queue.
2. The network task needs IOBs but we are out of IOBs so that network
   task unlocks the network (allowing network activity) but blocks in
   the low priority work queue waiting for a free IOB.
3. The task that will release the IOB is also queued for execution on
   the low priority work queue.  But since the queue is blocked because
   the network task is waiting on the work queue, the IOB cannot be
   released --> Deadlock
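
The IOB case can be sketched with a small deterministic model of a FIFO work queue served by a fixed number of workers.  This is a toy simulation under my own assumptions, not the NuttX work queue: lpwork_model_completes(), job_kind and the other names are hypothetical.  A JOB_NEEDS_IOB job blocks its worker until an IOB is free, and a JOB_FREES_IOB job returns one.

```c
#include <stdbool.h>
#include <stddef.h>

enum job_kind { JOB_NEEDS_IOB, JOB_FREES_IOB };

#define MAX_WORKERS 4
#define WORKER_IDLE (-1)

/* Returns true if every queued job completes, false if the queue stalls
 * with a worker blocked and no other worker able to make progress.
 */

bool lpwork_model_completes(const enum job_kind *jobs, size_t njobs,
                            int nworkers, int free_iobs)
{
  int running[MAX_WORKERS];  /* Job index held by each worker, or idle */
  size_t next = 0;           /* Head of the FIFO */
  size_t done = 0;
  bool progress = true;

  if (nworkers < 1 || nworkers > MAX_WORKERS)
    {
      return false;
    }

  for (int w = 0; w < nworkers; w++)
    {
      running[w] = WORKER_IDLE;
    }

  while (done < njobs && progress)
    {
      progress = false;
      for (int w = 0; w < nworkers; w++)
        {
          /* An idle worker takes the next job in FIFO order. */

          if (running[w] == WORKER_IDLE && next < njobs)
            {
              running[w] = (int)next++;
              progress = true;
            }

          if (running[w] == WORKER_IDLE)
            {
              continue;
            }

          if (jobs[running[w]] == JOB_FREES_IOB)
            {
              free_iobs++;               /* Releasing never blocks */
              running[w] = WORKER_IDLE;
              done++;
              progress = true;
            }
          else if (free_iobs > 0)
            {
              free_iobs--;               /* Got an IOB, job finishes */
              running[w] = WORKER_IDLE;
              done++;
              progress = true;
            }

          /* Otherwise the worker stays blocked waiting for an IOB. */
        }
    }

  return done == njobs;
}
```

With the queue { JOB_NEEDS_IOB, JOB_FREES_IOB } and no free IOBs, one worker never completes because the releasing job sits behind the blocked one.  Two workers complete because the second worker runs the releasing job.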

There are a couple of solutions to this latter IOB case:  First you can analyze the deadlock and find the culprit.  Then modify the design so that the deadlock cannot occur.

If this is the situation, then a really simple fix is to increase the number of low priority worker threads.  By default, there is only one, so the FIFO nature of the single work queue tends to deadlock.  But if you increase the number of threads to two, there is much less likelihood of deadlocking in this way.
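
As a sketch, the change would look something like this in the board's defconfig.  I am assuming the low priority work queue is already enabled and that the option names below match your NuttX version, so please check them in menuconfig.

```
CONFIG_SCHED_LPWORK=y
CONFIG_SCHED_LPNWORKTHREADS=2
```

Each additional worker thread needs its own stack, so this costs some RAM.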

Greg

