> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Monday, 11 September 2023 17.04
>
> On Wed, 6 Sep 2023 19:54:32 +0200
> Morten Brørup <m...@smartsharesystems.com> wrote:
>
> > >
> > > - idx = rte_lcore_id();
> > > + seed = __atomic_load_n(&rte_rand_seed, __ATOMIC_RELAXED);
> > > + if (unlikely(seed != rand_state->seed)) {
> >
> > Please note that rte_rand_seed lives in a completely different cache
> > line than RTE_PER_LCORE(rte_rand_state), so the comparison with
> > rte_rand_seed requires reading one more cache line than the original
> > implementation, which only uses the cache line holding
> > rand_states[idx].
> >
> > This is in the hot path.
> >
> > If we could register a per-thread INIT function, the lazy
> > initialization could be avoided, and only one cache line accessed.
>
> Since rte_rand_seed rarely changes, it will get cached by each cpu.
Agreed. And the point I was trying to convey was that this will consume cache on each CPU. Cache is a scarce resource on some hardware. If reading rte_rand_seed is gated by (bool) rand_state->initialized, the __rte_rand_get_state() function is less likely to evict other data from the cache to make room for rte_rand_seed.

> The problem was before there was prefetch cache overlap causing false
> sharing.

Yes, and my patch also wastes a cache line per state, because of the cache guard.

So far, we have used this simple function for a discussion of principles and design patterns... so let me bring something new into the discussion, which is relevant on both the abstract and the practical level:

We could keep using the rte_module_data[RTE_MAX_LCORE] design pattern in general, but make an exception for data structures significantly smaller than a cache line. If we use TLS for those, we avoid the gaps in the memory allocated for them, which is even more cache efficient.

Let's assume that (unsigned int) RTE_PER_LCORE(_lcore_id) is currently the only data in TLS. If we add (struct rte_rand_state) RTE_PER_LCORE(rand_state), it will go into the same cache line as RTE_PER_LCORE(_lcore_id), and effectively only cost a fraction of a cache line.
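To make the combination of the two ideas concrete, here is a minimal sketch (not a patch against rte_random.c; the seeding helper __rte_srand_one() is a hypothetical stand-in for the real seeding routine):

/*
 * Sketch: keep the PRNG state in TLS instead of a
 * rand_states[RTE_MAX_LCORE] array, and gate the read of the global
 * rte_rand_seed behind a per-thread bool, so the hot path normally
 * touches only the thread's own TLS cache line.
 */
#include <stdbool.h>
#include <stdint.h>

#include <rte_branch_prediction.h>
#include <rte_per_lcore.h>

struct rte_rand_state {
	uint64_t z1, z2, z3, z4, z5;
	bool initialized;
};

static uint64_t rte_rand_seed;

/* TLS variable; packs next to other small per-thread data like _lcore_id. */
static RTE_DEFINE_PER_LCORE(struct rte_rand_state, rand_state);

static void
__rte_srand_one(uint64_t seed, struct rte_rand_state *state)
{
	/* Hypothetical stand-in for the real seeding routine. */
	state->z1 = state->z2 = state->z3 = state->z4 = state->z5 = seed | 1;
}

static struct rte_rand_state *
__rte_rand_get_state(void)
{
	struct rte_rand_state *state = &RTE_PER_LCORE(rand_state);

	/*
	 * Only touch the cache line holding rte_rand_seed on the first
	 * call in this thread; afterwards the check stays within TLS.
	 * (Trade-off: a later rte_srand() is not picked up lazily; that
	 * is where a registered per-thread INIT function would come in.)
	 */
	if (unlikely(!state->initialized)) {
		__rte_srand_one(__atomic_load_n(&rte_rand_seed,
				__ATOMIC_RELAXED), state);
		state->initialized = true;
	}

	return state;
}

With a state this small, the TLS variable should end up packed together with _lcore_id and other per-thread data, instead of each state occupying (and padding out) its own cache-guarded slot in a per-lcore array.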