On 12/8/20 3:02 PM, Sven Van Asbroeck wrote:
> Hi Andrew,
>
> On Tue, Dec 8, 2020 at 5:51 PM Andrew Lunn <and...@lunn.ch> wrote:
>>
>>>
>>> So I assumed that it's a PCIe DMA bandwidth issue, but I could be wrong -
>>> I didn't do any PCIe bandwidth measurements.
>>
>> Sometimes it is actually cache operations which take all the time.
>> This needs to invalidate the cache, so that when the memory is then
>> accessed, it gets fetched from RAM. On SMP machines, cache
>> invalidation can be expensive, due to all the cross-CPU operations.
>> I've actually gotten better performance by building a UP kernel on
>> some low-core-count ARM CPUs.
>>
>> There are some tricks which can be played. Do you actually need all
>> 9K? Does the descriptor actually tell you how much is used? You can
>> get a nice speedup if you just unmap 64 bytes for a TCP ACK, rather
>> than the full 9K.
>>
>
> Thank you for the suggestion! The original driver developer chose 9K
> because presumably that's the largest frame size supported by the chip.
>
> Yes, I believe the chip will tell us via the descriptor how much it has
> written; I would have to double-check. I was already looking for a
> "trick" to transfer only the required number of bytes, but I was led to
> believe that dma_map_single() and dma_unmap_single() always needed to
> match.
>
> So:
> dma_map_single(9K) followed by dma_unmap_single(9K) is correct, and
> dma_map_single(9K) followed by dma_unmap_single(1500 bytes) means trouble.
>
> How can we get around that?
dma_sync_single_for_{cpu,device} is what you would need in order to
perform a partial cache invalidation. You would still need to unmap the
same address+length pair that was used for the initial mapping,
otherwise the DMA-API debugging will rightfully complain.
--
Florian
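
To make the shape of that concrete, here is a minimal sketch of an RX
path that keeps the full-size mapping for the buffer's whole lifetime
and only syncs the bytes the chip reports as written. RX_BUF_SIZE,
struct rx_desc and its frame_len field are made-up placeholders, not
taken from the driver under discussion:

	#include <linux/dma-mapping.h>
	#include <linux/skbuff.h>

	#define RX_BUF_SIZE 9216	/* placeholder: full 9K buffer */

	struct rx_desc {
		__le16 frame_len;	/* placeholder: bytes DMA'd by the chip */
	};

	static void rx_example(struct device *dev, struct sk_buff *skb,
			       struct rx_desc *desc)
	{
		dma_addr_t dma;
		unsigned int len;

		/* Setup: map the full 9K once; the buffer then stays
		 * mapped for as long as the ring owns it.
		 */
		dma = dma_map_single(dev, skb->data, RX_BUF_SIZE,
				     DMA_FROM_DEVICE);
		if (dma_mapping_error(dev, dma))
			return;

		/* Per-packet completion: invalidate only what the chip
		 * wrote, e.g. ~64 bytes for a TCP ACK instead of 9K.
		 */
		len = le16_to_cpu(desc->frame_len);
		dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);
		/* ... consume the frame, e.g. copy it into a fresh skb ... */

		/* Hand the buffer back to the device for the next frame. */
		dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);

		/* Teardown only: unmap with the ORIGINAL address+length
		 * pair. Unmapping with a partial length is what makes
		 * CONFIG_DMA_API_DEBUG complain.
		 */
		dma_unmap_single(dev, dma, RX_BUF_SIZE, DMA_FROM_DEVICE);
	}

Note that the map and unmap lengths still match (RX_BUF_SIZE both
times); only the sync calls use the per-frame length, which is where the
savings for small packets like TCP ACKs would come from.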