On Fri, Jan 12, 2018 at 01:29:17PM +0000, Matan Azrad wrote:
> Hi Gaetan
>
> From: Gaëtan Rivet, Friday, January 12, 2018 12:29 PM
> > Hi Matan,
> >
> > The other commits make sense to me so no issue there.
> > I'm just surprised by this one so a quick question.
> >
> > On Tue, Dec 19, 2017 at 05:14:29PM +0000, Matan Azrad wrote:
> > > Connecting the sub-devices to each other in a cyclic linked list can
> > > help the Rx burst functions iterate over them, because there is no
> > > need to check for the sub-devices ring wraparound.
> > >
> > > Create the aforementioned linked-list and change the Rx burst
> > > functions iteration accordingly.
> >
> > I'm surprised that a linked-list iteration, with the usual dereferencing, is
> > better than doing some integer arithmetic.
>
> These memory references are the same as in the previous code, because in the
> new code the linked list elements are still in contiguous memory, so the
> addresses probably stay in the cache.
> The removed calculations and wraparound branch probably caused the
> performance gain.
>
> > Maybe the locality of the referenced data helps.
> >
>
> Sure.

This means that the sub_device definition is critical for the datapath.
It probably goes beyond a cache-line and could be optimized.

>
> > Anyway, were you able to count the cycles gained by this change? It might be
> > interesting to take a measurement with a CPU-bound bench, such as with a dummy
> > device under the fail-safe (ring or such). MLX devices use a lot of PCI
> > bandwidth, so the bottleneck could be masked in a physical setting.
> >
> > No comments otherwise, if you are sure that this is a performance gain, the
> > implementation seems ok to me.
>
> Yes, I checked it and did see a small gain.
> (I just ran the test with and without this patch and compared the statistics).

Oh I'm sure you checked, I just wanted to make sure you properly considered
the methodology.

Acked-by: Gaetan Rivet <gaetan.ri...@6wind.com>

-- 
Gaëtan Rivet
6WIND
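
For readers following the thread, the two iteration schemes discussed above
can be sketched roughly as below. This is only an illustration, not the patch
itself: the struct here is a hypothetical stand-in for the fail-safe
sub_device (just a next pointer and an id), and subs_link, next_by_index,
next_by_link and walks_agree are names made up for the example. The array
stays contiguous in both cases; only the step to the next element changes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fail-safe sub-device, not the real struct. */
struct sub_device {
	struct sub_device *next; /* cyclic link to the following sub-device */
	int id;
};

/* Link a contiguous array into a ring: the last element points to the first. */
void
subs_link(struct sub_device *subs, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		subs[i].id = (int)i;
		subs[i].next = &subs[(i + 1) % n];
	}
}

/* Index-based step: an increment plus a wraparound check. */
size_t
next_by_index(size_t i, size_t n)
{
	i++;
	if (i == n)
		i = 0;
	return i;
}

/* Linked-list step: a single load, with no compare on the ring size. */
struct sub_device *
next_by_link(const struct sub_device *sdev)
{
	return sdev->next;
}

/* Return 1 if both schemes visit the same sub-devices over `steps` steps. */
int
walks_agree(struct sub_device *subs, size_t n, size_t steps)
{
	size_t idx = 0;
	struct sub_device *sdev = &subs[0];

	for (size_t s = 0; s < steps; s++) {
		if (subs[idx].id != sdev->id)
			return 0;
		idx = next_by_index(idx, n);
		sdev = next_by_link(sdev);
	}
	return 1;
}
```

Both walks visit the same elements in the same order, so any difference
between them would have to come from the per-step cost, which is what a
CPU-bound bench would need to isolate.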