Hi Andrew,

On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <and...@lunn.ch> wrote:
>
> >
> > So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> > I didn't do any PCIe bandwidth measurements.
>
> Sometimes it is actually cache operations which take all the
> time. This needs to invalidate the cache, so that when the memory is
> then accessed, it get fetched from RAM. On SMP machines, cache
> invalidation can be expensive, due to all the cross CPU operations.
> I've actually got better performance by building a UP kernel on some
> low core count ARM CPUs.
>
> There are some tricks which can be played. Do you actually need all
> 9K? Does the descriptor tell you actually how much is used? You can
> get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
> than the full 9K.
>
Thank you for the suggestion! The original driver developer chose 9K,
presumably because that's the largest frame size supported by the chip.

Yes, I believe the chip tells us via the descriptor how much it has
written, but I would have to double-check.

I was already looking for a "trick" to transfer only the required number
of bytes, but I was led to believe that the sizes passed to
dma_map_single() and dma_unmap_single() always need to match. So:

dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.

How can we get around that?
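
To make the constraint concrete, here is a small userspace model of the matching rule as I understand it. This is not kernel code: the mock_* names, the table size and the return conventions are all made up for illustration. It only mimics the kind of size check that I believe CONFIG_DMA_API_DEBUG performs when a mapping is released.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only: none of the mock_* names exist in the kernel. */
#define MOCK_MAX_MAPPINGS 16

struct mock_mapping {
	uintptr_t handle;	/* stands in for a dma_addr_t */
	size_t size;		/* size that was passed at map time */
	int in_use;
};

static struct mock_mapping mock_table[MOCK_MAX_MAPPINGS];

/* Record a mapping. Returns a fake handle, or 0 if the table is full. */
uintptr_t mock_dma_map_single(void *buf, size_t size)
{
	assert(buf != NULL && size > 0);

	for (int i = 0; i < MOCK_MAX_MAPPINGS; i++) {
		if (!mock_table[i].in_use) {
			mock_table[i].handle = (uintptr_t)buf;
			mock_table[i].size = size;
			mock_table[i].in_use = 1;
			return mock_table[i].handle;
		}
	}
	return 0;
}

/*
 * Release a mapping. Returns 0 on success, or -1 if the handle is
 * unknown or the size differs from the one recorded at map time.
 * On a size mismatch the mapping is left in place.
 */
int mock_dma_unmap_single(uintptr_t handle, size_t size)
{
	for (int i = 0; i < MOCK_MAX_MAPPINGS; i++) {
		if (mock_table[i].in_use && mock_table[i].handle == handle) {
			if (mock_table[i].size != size)
				return -1;	/* map/unmap size mismatch */
			mock_table[i].in_use = 0;
			return 0;
		}
	}
	return -1;	/* no such mapping */
}
```

In this model, mapping a 9K buffer and then calling mock_dma_unmap_single() with 1500 returns -1, while unmapping with the original size returns 0. That is the behaviour I am trying to find a legitimate way around.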