On Fri, 29 Jan 2021 14:52:35 -0500 Sven Van Asbroeck wrote:
> From: Sven Van Asbroeck <thesve...@gmail.com>
>
> The buffers in the lan743x driver's receive ring are always 9K,
> even when the largest packet that can be received (the mtu) is
> much smaller. This performs particularly badly on cpu archs
> without dma cache snooping (such as ARM): each received packet
> results in a 9K dma_{map|unmap} operation, which is very expensive
> because cpu caches need to be invalidated.
>
> Careful measurement of the driver rx path on armv7 reveals that
> the cpu spends the majority of its time waiting for cache
> invalidation.
>
> Optimize as follows:
>
> 1. set rx ring buffer size equal to the mtu. this limits the
>    amount of cache that needs to be invalidated per dma_map().
>
> 2. when dma_unmap()ping, skip cpu sync. Sync only the packet data
>    actually received, the size of which the chip will indicate in
>    its rx ring descriptors. this limits the amount of cache that
>    needs to be invalidated per dma_unmap().
>
> These optimizations double the rx performance on armv7.
> Third parties report 3x rx speedup on armv8.
>
> Performance on dma cache snooping architectures (such as x86)
> is expected to stay the same.
>
> Tested with iperf3 on a freescale imx6qp + lan7430, both sides
> set to mtu 1500 bytes, measure rx performance:
>
> Before:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec   550 MBytes   231 Mbits/sec    0
> After:
> [ ID] Interval           Transfer     Bandwidth       Retr
> [  4]   0.00-20.00  sec  1.33 GBytes   570 Mbits/sec    0
>
> Test by Anders Roenningen (and...@ronningen.priv.no) on armv8,
> rx iperf3:
> Before 102 Mbits/sec
> After 279 Mbits/sec
>
> Signed-off-by: Sven Van Asbroeck <thesve...@gmail.com>
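
For anyone following along, the two optimizations in the quoted description can be modelled roughly as below. This is only a userspace sketch of the size arithmetic, not the driver code: the demo_* names are made up for illustration, and the real driver may add head padding or alignment that is not shown here.

```c
#include <stddef.h>

/* Standard Ethernet framing overhead around the MTU-sized payload. */
#define DEMO_ETH_HLEN     14u            /* dst MAC + src MAC + EtherType */
#define DEMO_ETH_FCS_LEN  4u             /* frame check sequence */
#define DEMO_OLD_BUF_LEN  (9u * 1024u)   /* the fixed 9K buffer being replaced */

/*
 * Optimization 1 (hypothetical helper): size each rx buffer for the
 * largest frame the current mtu allows instead of a fixed 9K.
 */
static size_t demo_rx_buf_len(size_t mtu)
{
    return mtu + DEMO_ETH_HLEN + DEMO_ETH_FCS_LEN;
}

/*
 * Optimization 2 (hypothetical helper): after the unmap skips the cpu
 * sync, sync only the bytes the descriptor reports as received, and
 * never more than the buffer actually holds.
 */
static size_t demo_rx_sync_len(size_t frame_len, size_t buf_len)
{
    return frame_len < buf_len ? frame_len : buf_len;
}
```

Under these assumptions, demo_rx_buf_len(1500) is 1518 bytes, so each map covers roughly six times less memory than DEMO_OLD_BUF_LEN, and a 64-byte frame only needs 64 bytes synced on completion.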
You may need to rebase to see this:

drivers/net/ethernet/microchip/lan743x_main.c:2123:41: warning: restricted __le32 degrades to integer
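
That kind of warning usually means a little-endian descriptor word is being used in arithmetic or a comparison without first being converted to cpu byte order (le32_to_cpu() in the kernel). A minimal userspace sketch of the pattern, assuming a made-up descriptor layout and mask (the demo_* names and DEMO_FRAME_LEN_MASK are illustrative, not taken from lan743x):

```c
#include <stdint.h>

/* Hypothetical mask for a frame-length field in the descriptor word. */
#define DEMO_FRAME_LEN_MASK 0x3FFFu

/* Hypothetical rx descriptor: one 32-bit word stored little-endian,
 * exactly as the device would write it to memory. */
struct demo_rx_desc {
    uint8_t data0_le[4];
};

/* Userspace stand-in for the kernel's le32_to_cpu(): assemble the
 * value byte by byte so the result is right on any host endianness. */
static uint32_t demo_le32_to_cpu(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/* Convert first, then mask. Masking the raw little-endian word is the
 * sort of thing the warning is pointing at. */
static uint32_t demo_frame_length(const struct demo_rx_desc *desc)
{
    return demo_le32_to_cpu(desc->data0_le) & DEMO_FRAME_LEN_MASK;
}
```

With the bytes dc 05 00 80 in the descriptor, the converted word is 0x800005dc and the masked length comes out as 1500 regardless of host byte order.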