> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Sunday, 27 August 2023 15.55
>
> On 2023-08-27 10:34, Morten Brørup wrote:
> > +CC Honnappa and Konstantin, Ring lib maintainers
> > +CC Mattias, PRNG lib maintainer
> >
> >> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >> Sent: Friday, 25 August 2023 11.24
> >>
> >> On Fri, Aug 25, 2023 at 11:06:01AM +0200, Morten Brørup wrote:
> >>> +CC mempool maintainers
> >>>
> >>>> From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> >>>> Sent: Friday, 25 August 2023 10.23
> >>>>
> >>>> On Fri, Aug 25, 2023 at 08:45:12AM +0200, Morten Brørup wrote:
> >>>>> Bruce,
> >>>>>
> >>>>> With this patch [1], it is noted that the ring producer and
> >>>>> consumer data should not be on adjacent cache lines, for
> >>>>> performance reasons.
> >>>>>
> >>>>> [1]:
> >>>>> https://git.dpdk.org/dpdk/commit/lib/librte_ring/rte_ring.h?id=d9f0d3a1ffd4b66e75485cc8b63b9aedfbdfe8b0
> >>>>>
> >>>>> (It's obvious that they cannot share the same cache line, because
> >>>>> they are accessed by two different threads.)
> >>>>>
> >>>>> Intuitively, I would think that having them on different cache
> >>>>> lines would suffice. Why does having an empty cache line between
> >>>>> them make a difference?
> >>>>>
> >>>>> And does it need to be an empty cache line? Or does it suffice
> >>>>> having the second structure start two cache lines after the start
> >>>>> of the first structure (e.g. if the size of the first structure
> >>>>> is two cache lines)?
> >>>>>
> >>>>> I'm asking because the same principle might apply to other code
> >>>>> too.
> >>>>>
> >>>> Hi Morten,
> >>>>
> >>>> This was something we discovered when working on the distributor
> >>>> library. If we have cachelines per core where there is heavy
> >>>> access, having some cachelines as a gap between the content
> >>>> cachelines can help performance. We believe this helps by avoiding
> >>>> issues with the HW prefetchers (e.g. the adjacent cacheline
> >>>> prefetcher) bringing in the second cacheline speculatively when an
> >>>> operation is done on the first line.
> >>>
> >>> I guessed that it had something to do with speculative prefetching,
> >>> but wasn't sure. Good to get confirmation, and that it has a
> >>> measurable effect somewhere. Very interesting!
> >>>
> >>> NB: More comments in the ring lib about stuff like this would be
> >>> nice.
> >>>
> >>> So, for the mempool lib, what do you think about applying the same
> >>> technique to the rte_mempool_debug_stats structure (which is an
> >>> array indexed per lcore)? Two adjacent lcores heavily accessing
> >>> their local mempool caches seems likely to me. But how heavy does
> >>> the access need to be for this technique to be relevant?
> >>>
> >> No idea how heavy the accesses need to be for this to have a
> >> noticeable effect. For things like debug stats, I wonder how
> >> worthwhile making such a change would be, but then again, any change
> >> would have very low impact too in that case.
> >
> > I just tried adding padding to some of the hot structures in our own
> > application, and observed a significant performance improvement for
> > those.
> >
> > So I think this technique should have higher visibility in DPDK by
> > adding a new cache macro to rte_common.h:
> >
> > /**
> >  * Empty cache line, to guard against speculative prefetching.
> >  *
>
> "to guard against false sharing-like effects on systems with a
> next-N-lines hardware prefetcher"
>
> >  * Use as spacing between data accessed by different lcores,
> >  * to prevent cache thrashing on CPUs with speculative prefetching.
> >  */
> > #define RTE_CACHE_GUARD(name) char cache_guard_##name[RTE_CACHE_LINE_SIZE] __rte_cache_aligned;
> >
>
> You could have a macro which specifies how much guarding there needs to
> be, ideally defined on a per-CPU basis. (These things have nothing to do
> with the ISA, but everything to do with the implementation.)
>
> I'm not sure N is always 1.
>
> So the guard padding should be RTE_CACHE_LINE_SIZE *
> RTE_CACHE_GUARD_LINES bytes, and wrap the whole thing in
> #if RTE_CACHE_GUARD_LINES > 0
> #endif
>
> ...so you can disable this (cute!) hack (on custom DPDK builds) in case
> you have disabled hardware prefetching, which seems generally to be a
> good idea for packet processing type applications.
>
> ...which leads me to another suggestion: add a note on disabling
> hardware prefetching in the optimization guide.
>
> Seems like a very good idea to have this in <rte_common.h>, and
> otherwise make this issue visible and known.

Good points, Mattias!

I also prefer the name-less macro you suggested below.

So, this gets added to rte_common.h:

/**
 * Empty cache lines, to guard against false sharing-like effects
 * on systems with a next-N-lines hardware prefetcher.
 *
 * Use as spacing between data accessed by different lcores,
 * to prevent cache thrashing on hardware with speculative prefetching.
 */
#if RTE_CACHE_GUARD_LINES > 0
#define _RTE_CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
        __rte_cache_aligned;
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)
#else
#define RTE_CACHE_GUARD
#endif

And a line in /config/x86/meson.build for the x86 architecture:

        dpdk_conf.set('RTE_CACHE_LINE_SIZE', 64)
+       dpdk_conf.set('RTE_CACHE_GUARD_LINES', 1)

I don't know about the various architectures and implementations, so we
should probably use a default of 1, matching the existing guard size in
the ring lib.

@Bruce, I hope you can help with the configuration part of this.

> >
> > To be used like this:
> >
> > struct rte_ring {
> >     char name[RTE_RING_NAMESIZE] __rte_cache_aligned;
> >     /**< Name of the ring. */
> >     int flags;               /**< Flags supplied at creation. */
> >     const struct rte_memzone *memzone;
> >             /**< Memzone, if any, containing the rte_ring */
> >     uint32_t size;           /**< Size of ring. */
> >     uint32_t mask;           /**< Mask (size-1) of ring. */
> >     uint32_t capacity;       /**< Usable size of ring */
> >
> > -   char pad0 __rte_cache_aligned; /**< empty cache line */
> > +   RTE_CACHE_GUARD(prod); /**< Isolate producer status. */

+       RTE_CACHE_GUARD;

Note: Commenting anonymous cache guard fields seems somewhat superfluous,
and would largely be a copy of the general RTE_CACHE_GUARD description
anyway, so I have removed the comments.

> >
> >     /** Ring producer status. */
> >     union {
> >             struct rte_ring_headtail prod;
> >             struct rte_ring_hts_headtail hts_prod;
> >             struct rte_ring_rts_headtail rts_prod;
> >     } __rte_cache_aligned;
> >
> > -   char pad1 __rte_cache_aligned; /**< empty cache line */
> > +   RTE_CACHE_GUARD(both); /**< Isolate producer from consumer. */

+       RTE_CACHE_GUARD;

> >
> >     /** Ring consumer status. */
> >     union {
> >             struct rte_ring_headtail cons;
> >             struct rte_ring_hts_headtail hts_cons;
> >             struct rte_ring_rts_headtail rts_cons;
> >     } __rte_cache_aligned;
> >
> > -   char pad2 __rte_cache_aligned; /**< empty cache line */
> > +   RTE_CACHE_GUARD(cons); /**< Isolate consumer status. */

+       RTE_CACHE_GUARD;

> > };
> >
> >
> > And for the mempool library:
> >
> > #ifdef RTE_LIBRTE_MEMPOOL_STATS
> > /**
> >  * A structure that stores the mempool statistics (per-lcore).
> >  * Note: Cache stats (put_cache_bulk/objs, get_cache_bulk/objs) are not
> >  * captured since they can be calculated from other stats.
> >  * For example: put_cache_objs = put_objs - put_common_pool_objs.
> >  */
> > struct rte_mempool_debug_stats {
> >     uint64_t put_bulk;             /**< Number of puts. */
> >     uint64_t put_objs;             /**< Number of objects successfully put. */
> >     uint64_t put_common_pool_bulk; /**< Number of bulks enqueued in common pool. */
> >     uint64_t put_common_pool_objs; /**< Number of objects enqueued in common pool. */
> >     uint64_t get_common_pool_bulk; /**< Number of bulks dequeued from common pool. */
> >     uint64_t get_common_pool_objs; /**< Number of objects dequeued from common pool. */
> >     uint64_t get_success_bulk;     /**< Successful allocation number. */
> >     uint64_t get_success_objs;     /**< Objects successfully allocated. */
> >     uint64_t get_fail_bulk;        /**< Failed allocation number. */
> >     uint64_t get_fail_objs;        /**< Objects that failed to be allocated. */
> >     uint64_t get_success_blks;     /**< Successful allocation number of contiguous blocks. */
> >     uint64_t get_fail_blks;        /**< Failed allocation number of contiguous blocks. */
> > +   RTE_CACHE_GUARD(debug_stats); /**< Isolation between lcores. */

+       RTE_CACHE_GUARD;

> > } __rte_cache_aligned;
> > #endif
> >
> > struct rte_mempool {
> >     [...]
> > #ifdef RTE_LIBRTE_MEMPOOL_STATS
> >     /** Per-lcore statistics.
> >      *
> >      * Plus one, for unregistered non-EAL threads.
> >      */
> >     struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
> > #endif
> > } __rte_cache_aligned;
> >
> >
> > It also seems relevant for the PRNG library:
> >
> > /lib/eal/common/rte_random.c:
> >
> > struct rte_rand_state {
> >     uint64_t z1;
> >     uint64_t z2;
> >     uint64_t z3;
> >     uint64_t z4;
> >     uint64_t z5;
> > +   RTE_CACHE_GUARD(z);

+       RTE_CACHE_GUARD;

> > } __rte_cache_aligned;
> >
>
> Yes.
>
> Should there be two cache guard macros? One parameter-free
> RTE_CACHE_GUARD and a RTE_CACHE_NAMED_GUARD(name) macro?
>
> Maybe it's better just to keep the single macro, but have a convention
> with some generic name (i.e., not 'z' above) for the guard field, like
> 'cache_guard' or just 'guard'. Having a unique name makes no sense,
> except in rare cases where you need multiple guard lines per struct.

There is no need to access these fields, so let's use auto-generated
unique names, and not offer an API with a name. This also makes the API
as simple as possible.

> > /* One instance each for every lcore id-equipped thread, and one
> >  * additional instance to be shared by all other threads (i.e., all
> >  * unregistered non-EAL threads).
> >  */
> > static struct rte_rand_state rand_states[RTE_MAX_LCORE + 1];
> >
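
For anyone who wants to see the intended layout effect in isolation, here is a
minimal, standalone sketch (illustration only, not part of the proposed patch).
The RTE_CACHE_LINE_SIZE and RTE_CACHE_GUARD_LINES values, the local
__rte_cache_aligned definition and the toy_ring structure are all assumptions
made for this example; it only requires a compiler with the __COUNTER__
extension (GCC/Clang/MSVC). Unlike the proposal above, the helper macro here
leaves out the trailing semicolon, so the use site supplies it:

/* guard_demo.c: standalone illustration of the RTE_CACHE_GUARD idea.
 * All names and values below are assumptions for this example only. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define RTE_CACHE_LINE_SIZE 64    /* assumed; normally set by the DPDK build config */
#define RTE_CACHE_GUARD_LINES 1   /* assumed; would come from meson, per architecture */
#define __rte_cache_aligned __attribute__((__aligned__(RTE_CACHE_LINE_SIZE)))

/* Same construction as proposed above; the "#if RTE_CACHE_GUARD_LINES > 0"
 * wrapper is omitted here for brevity. __COUNTER__ gives each guard field
 * a unique name, and the two-level helper forces it to expand. */
#define _RTE_CACHE_GUARD_HELPER2(unique) \
        char cache_guard_ ## unique[RTE_CACHE_LINE_SIZE * RTE_CACHE_GUARD_LINES] \
        __rte_cache_aligned
#define _RTE_CACHE_GUARD_HELPER1(unique) _RTE_CACHE_GUARD_HELPER2(unique)
#define RTE_CACHE_GUARD _RTE_CACHE_GUARD_HELPER1(__COUNTER__)

/* Toy producer/consumer structure (hypothetical): one section is written
 * by the producer lcore, the other by the consumer lcore, with a guard
 * line between them so that a next-line hardware prefetch triggered on
 * one side never pulls in the other side's line. */
struct toy_ring {
        uint32_t size;
        uint32_t mask;

        RTE_CACHE_GUARD;

        struct {
                volatile uint32_t head;
                volatile uint32_t tail;
        } prod __rte_cache_aligned;

        RTE_CACHE_GUARD;

        struct {
                volatile uint32_t head;
                volatile uint32_t tail;
        } cons __rte_cache_aligned;
} __rte_cache_aligned;

/* The property the guard is meant to provide: prod and cons are separated
 * by at least one full cache line that neither thread ever writes. */
_Static_assert(offsetof(struct toy_ring, cons) - offsetof(struct toy_ring, prod) >=
               (1 + RTE_CACHE_GUARD_LINES) * RTE_CACHE_LINE_SIZE,
               "missing guard line between producer and consumer");

int main(void)
{
        printf("sizeof(struct toy_ring) = %zu\n", sizeof(struct toy_ring));
        printf("prod offset = %zu, cons offset = %zu\n",
               offsetof(struct toy_ring, prod), offsetof(struct toy_ring, cons));
        return 0;
}

Building this with plain gcc and running it prints the member offsets, and the
static assertion documents the invariant we care about: with N guard lines, a
prefetch of the next N lines after the producer's line only ever brings in
untouched padding, never the consumer's line.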