On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> +CC mempool maintainers
> 
> > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > Sent: Friday, 25 August 2023 10.23
> > 
> > On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > > Bruce,
> > > 
> > > With this patch [1], it is noted that the ring producer and consumer data
> > > should not be on adjacent cache lines, for performance reasons.
> > > 
> > > [1]: https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> > > 
> > > (It's obvious that they cannot share the same cache line, because they are
> > > accessed by two different threads.)
> > > 
> > > Intuitively, I would think that having them on different cache lines would
> > > suffice. Why does having an empty cache line between them make a difference?
> > > 
> > > And does it need to be an empty cache line? Or does it suffice having the
> > > second structure start at two cache lines after the start of the first
> > > structure (e.g. if the size of the first structure is two cache lines)?
> > > 
> > > I'm asking because the same principle might apply to other code too.
> > > 
> > Hi Morten,
> > 
> > this was something we discovered when working on the distributor library.
> > If we have cachelines per core where there is heavy access, having some
> > cachelines as a gap between the content cachelines can help performance. We
> > believe this helps due to avoiding issues with the HW prefetchers (e.g.
> > adjacent cacheline prefetcher) bringing in the second cacheline
> > speculatively when an operation is done on the first line.
> 
> I guessed that it had something to do with speculative prefetching, but
> wasn't sure. Good to get confirmation, and that it has a measureable effect
> somewhere. Very interesting!
> 
> NB: More comments in the ring lib about stuff like this would be nice.
> 
> So, for the mempool lib, what do you think about applying the same technique
> to the rte_mempool_debug_stats structure (which is an array indexed per
> lcore)... Two adjacent lcores heavily accessing their local mempool caches
> seems likely to me. But how heavy does the access need to be for this
> technique to be relevant?
> 
No idea how heavy the accesses need to be for this to have a noticeable
effect. For things like debug stats, I wonder how worthwhile such a change
would be; then again, any such change would also have very low impact in
that case.

/Bruce
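
For reference, a minimal sketch of the padding idea being discussed, assuming
a 64-byte cache line. The struct and field names below are made up for
illustration only; they are not the actual DPDK definitions of the ring or
mempool stats structures.

    /* Hypothetical per-lcore stats entry, padded so that each lcore's hot
     * cache line is followed by an unused "guard" cache line. Adjacent
     * lcores' hot lines then start two cache lines apart, so the adjacent-
     * cacheline prefetcher triggered on one lcore does not speculatively
     * pull in its neighbour's line. */
    #include <stdint.h>

    #define CACHE_LINE_SIZE 64
    #define MAX_LCORES 128

    struct per_lcore_stats {
            uint64_t put_objs;   /* heavily updated by the owning lcore */
            uint64_t get_objs;
            /* pad the hot fields out to a full cache line ... */
            uint8_t pad[CACHE_LINE_SIZE - 2 * sizeof(uint64_t)];
            /* ... then leave one whole empty cache line as a gap */
            uint8_t guard[CACHE_LINE_SIZE];
    } __attribute__((aligned(CACHE_LINE_SIZE)));

    static struct per_lcore_stats stats[MAX_LCORES];

Whether the extra 64 bytes per lcore is worth it for something like debug
stats is exactly the open question above.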