On Sun, Aug 27, 2023 at 05:40:33PM +0200, Morten Brørup wrote:
> > From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> > Sent: Sunday, 27 August 2023 15.55
> >
> > On 2023-08-27 10:34, Morten Brørup wrote:
> > > +CC Honnappa and Konstantin, Ring lib maintainers
> > > +CC Mattias, PRNG lib maintainer
> > >
> > >> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > >> Sent: Friday, 25 August 2023 11.24
> > >>
> > >> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> > >>> +CC mempool maintainers
> > >>>
> > >>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > >>>> Sent: Friday, 25 August 2023 10.23
> > >>>>
> > >>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> > >>>>> Bruce,
> > >>>>>
> > >>>>> With this patch [1], it is noted that the ring producer and
> > >>>>> consumer data should not be on adjacent cache lines, for
> > >>>>> performance reasons.
> > >>>>>
> > >>>>> [1]:
> > >>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> > >>>>>
> > >>>>> (It's obvious that they cannot share the same cache line,
> > >>>>> because they are accessed by two different threads.)
> > >>>>>
> > >>>>> Intuitively, I would think that having them on different cache
> > >>>>> lines would suffice. Why does having an empty cache line between
> > >>>>> them make a difference?
> > >>>>>
> > >>>>> And does it need to be an empty cache line? Or does it suffice
> > >>>>> having the second structure start two cache lines after the
> > >>>>> start of the first structure (e.g. if the size of the first
> > >>>>> structure is two cache lines)?
> > >>>>>
> > >>>>> I'm asking because the same principle might apply to other code
> > >>>>> too.
> > >>>>>
> > >>>> Hi Morten,
> > >>>>
> > >>>> this was something we discovered when working on the distributor
> > >>>> library. If we have cachelines per core where there is heavy
> > >>>> access, having some cachelines as a gap between the content
> > >>>> cachelines can help performance. We believe this helps due to
> > >>>> avoiding issues with the HW prefetchers (e.g. adjacent cacheline
> > >>>> prefetcher) bringing in the second cacheline speculatively when
> > >>>> an operation is done on the first line.
> > >>>
> > >>> I guessed that it had something to do with speculative
> > >>> prefetching, but wasn't sure. Good to get confirmation, and that
> > >>> it has a measurable effect somewhere. Very interesting!
> > >>>
> > >>> NB: More comments in the ring lib about stuff like this would be
> > >>> nice.
> > >>>
> > >>> So, for the mempool lib, what do you think about applying the same
> > >>> technique to the rte_mempool_debug_stats structure (which is an
> > >>> array indexed per lcore)... Two adjacent lcores heavily accessing
> > >>> their local mempool caches seems likely to me. But how heavy does
> > >>> the access need to be for this technique to be relevant?
> > >>>
> > >> No idea how heavy the accesses need to be for this to have a
> > >> noticeable effect. For things like debug stats, I wonder how
> > >> worthwhile making such a change would be, but then again, any
> > >> change would have very low impact too in that case.
> > >
> > > I just tried adding padding to some of the hot structures in our own
> > > application, and observed a significant performance improvement for
> > > those.
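(For illustration, the layout being described above looks roughly like
the sketch below: the producer-written and consumer-written fields are
separated by a full empty cache line, so the adjacent-line hardware
prefetcher on one core does not speculatively pull in the line owned by
the other core. The struct and field names are made up for this sketch;
this is not the actual rte_ring definition.)

#include <stdint.h>
#include <rte_common.h> /* RTE_CACHE_LINE_SIZE, __rte_cache_aligned */

struct shared_ring_like {
        /* written by the producer lcore only */
        volatile uint32_t prod_head __rte_cache_aligned;
        volatile uint32_t prod_tail;

        /* one whole empty cache line as a guard between the two sides */
        char guard[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;

        /* written by the consumer lcore only */
        volatile uint32_t cons_head __rte_cache_aligned;
        volatile uint32_t cons_tail;
};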
> > >
> > > So I think this technique should have higher visibility in DPDK by
> > > adding a new cache macro to rte_common.h:
> > >
> > > /**
> > >  * Empty cache line, to guard against speculative prefetching.
> > >  *
> >
> > "to guard against false sharing-like effects on systems with a
> > next-N-lines hardware prefetcher"
> >
> > >  * Use as spacing between data accessed by different lcores,
> > >  * to prevent cache thrashing on CPUs with speculative prefetching.
> > >  */
> > > #define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> > >
> > You could have a macro which specified how much guarding there needs
> > to be, ideally defined on a per-CPU basis. (These things have nothing
> > to do with the ISA, but everything to do with the implementation.)
> >
> > I'm not sure N is always 1.
> >
> > So the guard padding should be RTE_CACHE_LINE_SIZE *
> > RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> > #if RTE_CACHE_GUARD_LINES > 0
> > #endif
> >
> > ...so you can disable this (cute!) hack (on custom DPDK builds) in
> > case you have disabled hardware prefetching, which seems generally to
> > be a good idea for packet processing type applications.
> >
> > ...which leads me to another suggestion: add a note on disabling
> > hardware prefetching in the optimization guide.
> >
> > Seems like a very good idea to have this in <rte_common.h>, and
> > otherwise make this issue visible and known.
>
> Good points, Mattias!
>
> I also prefer the name-less macro you suggested below.
>
> So, this gets added to rte_common.h:
>
> /**
>  * Empty cache lines, to guard against false sharing-like effects
>  * on systems with a next-N-lines hardware prefetcher.
>  *
>  * Use as spacing between data accessed by different lcores,
>  * to prevent cache thrashing on hardware with speculative prefetching.
>  */
> #if RTE_CACHE_GUARD_LINES > 0
> #define _RTE_CACHE_GUARD_HELPER2(unique) \
>         char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
>         __rte_cache_aligned;
> #define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
> #define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
> #else
> #define RTE_CACHE_GUARD
> #endif
>
> And a line in /config/x86/meson.build for the x86 architecture:
>
>         dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
> +       dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)
>
> I don't know about various architectures and implementations, so we
> should probably use a default of 1, matching the existing guard size in
> the ring lib.
>
> @Bruce, I hope you can help with the configuration part of this.
>
This all seems like a good idea. For the config, I'm not sure what is
best because I can't see many folks wanting to change the default very
often. I'd probably tend towards a value in the rte_config.h file, but
putting a per-architecture default in meson.build is probably ok too, if
we see different archs wanting different defaults. A third alternative
is maybe just to put the #define in rte_common.h alongside the macro
definition.
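(For illustration, assuming the RTE_CACHE_GUARD macro proposed above is
added to rte_common.h, it would be dropped into a per-lcore array
roughly as sketched below. The stats structure is a simplified stand-in,
not the real rte_mempool_debug_stats layout.)

#include <stdint.h>
#include <rte_config.h> /* RTE_MAX_LCORE */
#include <rte_common.h> /* proposed RTE_CACHE_GUARD, __rte_cache_aligned */

struct lcore_debug_stats {
        uint64_t put_bulk;
        uint64_t put_objs;
        uint64_t get_success_bulk;
        uint64_t get_success_objs;
        /*
         * Guard line(s) so that heavy access to lcore N's entry does
         * not speculatively prefetch lcore N+1's entry. Note that the
         * draft macro above already expands to a member declaration
         * ending in ';'.
         */
        RTE_CACHE_GUARD
} __rte_cache_aligned;

static struct lcore_debug_stats stats[RTE_MAX_LCORE];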
I don't think we want an actual meson config option for this; I see it
being too rarely used to make it worth expanding out that list.

/Bruce
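(For reference, the rte_common.h alternative mentioned above could be as
simple as an overridable fallback next to the macro definition; a
sketch, not a committed change.)

/* in rte_common.h, next to the RTE_CACHE_GUARD definition */
#ifndef RTE_CACHE_GUARD_LINES
#define RTE_CACHE_GUARD_LINES 1 /* default matching the existing gap in rte_ring */
#endif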