On Tue, 8 Dec 2020 18:02:30 -0500 Sven Van Asbroeck wrote:
> On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <and...@lunn.ch> wrote:
> > > So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> > > I didn't do any PCIe bandwidth measurements.
> >
> > Sometimes it is actually cache operations which take all the
> > time. This needs to invalidate the cache, so that when the memory is
> > then accessed, it get fetched from RAM. On SMP machines, cache
> > invalidation can be expensive, due to all the cross CPU operations.
> > I've actually got better performance by building a UP kernel on some
> > low core count ARM CPUs.
> >
> > There are some tricks which can be played. Do you actually need all
> > 9K? Does the descriptor tell you actually how much is used? You can
> > get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
> > than the full 9K.

Good point!

> Thank you for the suggestion! The original driver developer chose 9K because
> presumably that's the largest frame size supported by the chip.
>
> Yes, I believe the chip will tell us via the descriptor how much it has
> written, I would have to double-check. I was already looking for a
> "trick" to transfer only the required number of bytes, but I was led to
> believe that dma_map_single() and dma_unmap_single() always needed to match.
>
> So:
> dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
> dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.
>
> How can we get around that?

You can set DMA_ATTR_SKIP_CPU_SYNC and then sync only the part of the
buffer that got written.
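
Roughly along these lines. This is an untested sketch, not code from the
driver: RX_BUF_SIZE, rx_buf_complete() and used_len are made-up names, and
it assumes the descriptor really does report how many bytes the chip wrote.
The buffer is still mapped with a plain dma_map_single() of the full 9K; only
the completion path changes.

```c
#include <linux/dma-mapping.h>

#define RX_BUF_SIZE	9216	/* hypothetical: size passed to dma_map_single() */

/*
 * Hypothetical RX completion helper.
 * used_len: byte count the chip reported in the descriptor (assumed).
 */
static void rx_buf_complete(struct device *dev, dma_addr_t addr,
			    size_t used_len)
{
	/* Make only the bytes the device actually wrote visible to the CPU. */
	dma_sync_single_for_cpu(dev, addr, used_len, DMA_FROM_DEVICE);

	/*
	 * Unmap with the same size that was mapped, but ask the DMA layer
	 * not to sync the whole 9K again.
	 */
	dma_unmap_single_attrs(dev, addr, RX_BUF_SIZE, DMA_FROM_DEVICE,
			       DMA_ATTR_SKIP_CPU_SYNC);
}
```

The map and unmap sizes still match, so the rule you quoted is respected;
the saving comes from the sync covering used_len bytes instead of the whole
buffer.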