> That's a good question. I used perf to create a flame graph of what
> the cpu was doing when receiving data at high speed. It showed that
> __dma_page_dev_to_cpu took up most of the cpu time. Which is triggered
> by dma_unmap_single(9K, DMA_FROM_DEVICE).
>
> So I assumed that it's a PCIe dma bandwidth issue, but I could be wrong -
> I didn't do any PCIe bandwidth measurements.

Sometimes it is actually the cache operations which take all the time.
The unmap needs to invalidate the cache, so that when the memory is
then accessed, it gets fetched from RAM. On SMP machines, cache
invalidation can be expensive, due to all the cross CPU operations.
I've actually got better performance by building a UP kernel on some
low core count ARM CPUs.

There are some tricks which can be played. Do you actually need all
9K? Does the descriptor tell you how much is actually used? You can
get a nice speed up if you just unmap 64 bytes for a TCP ACK, rather
than the full 9K.

	Andrew
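
To make the "only unmap what was used" idea concrete, here is a rough
userspace sketch of the length calculation. It is not taken from any
driver: RX_BUF_SIZE, CACHE_LINE and rx_sync_len() are made-up names,
and the 64 byte cache line is an assumption. In a real driver the
resulting length would probably be passed to dma_sync_single_for_cpu(),
with the unmap itself done via dma_unmap_single_attrs() and
DMA_ATTR_SKIP_CPU_SYNC, but whether that fits this hardware is untested.

```c
#include <stddef.h>

#define RX_BUF_SIZE 9216u /* hypothetical: full 9K receive buffer */
#define CACHE_LINE  64u   /* hypothetical: cache line size in bytes */

/*
 * Estimate how many bytes of the receive buffer need cache
 * maintenance, given the frame length reported by the descriptor.
 * The length is rounded up to whole cache lines and capped at the
 * buffer size, so a bogus descriptor length cannot exceed the mapping.
 */
size_t rx_sync_len(size_t desc_len)
{
	size_t len = (desc_len + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;

	return len > RX_BUF_SIZE ? RX_BUF_SIZE : len;
}
```

With these numbers a 66 byte TCP ACK touches two cache lines (128
bytes) instead of the 144 lines covering the full 9K buffer.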