> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Monday, 11 September 2023 17.04
> 
> On Wed, 6 Sep 2023 19:54:32 +0200
> Morten Brørup <m...@smartsharesystems.com> wrote:
> 
> > >
> > > - idx = rte_lcore_id();
> > > + seed = __atomic_load_n(&rte_rand_seed, __ATOMIC_RELAXED);
> > > + if (unlikely(seed != rand_state->seed)) {
> >
> > Please note that rte_rand_seed lives in a completely different cache
> > line than RTE_PER_LCORE(rte_rand_state), so the comparison with
> > rte_rand_seed requires reading one more cache line than the original
> > implementation, which only uses the cache line holding
> > rand_states[idx].
> >
> > This is in the hot path.
> >
> > If we could register a per-thread INIT function, the lazy
> > initialization could be avoided, and only one cache line accessed.
> 
> Since rte_rand_seed rarely changes, it will get cached by each cpu.

Agreed. And the point I was trying to convey was that this will consume cache 
on each CPU. Cache is a scarce resource on some hardware.

If reading rte_rand_seed is gated by a (bool) rand_state->initialized flag, the 
__rte_rand_get_state() function is less likely to evict other data from the 
cache just to make room for rte_rand_seed.
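
For illustration, the gating I have in mind would look something like this 
(untested sketch against rte_random.c; the "initialized" field is hypothetical 
and would have to be added to struct rte_rand_state, and handling of 
unregistered threads is omitted):

static __rte_always_inline struct rte_rand_state *
__rte_rand_get_state(void)
{
	struct rte_rand_state *rand_state = &rand_states[rte_lcore_id()];

	/* Only read the cache line holding rte_rand_seed in the
	 * unlikely case this thread's state is not yet seeded.
	 * (A later rte_srand() would then need some other means of
	 * reaching already initialized threads.)
	 */
	if (unlikely(!rand_state->initialized)) {
		__rte_srand_lcg(__atomic_load_n(&rte_rand_seed,
				__ATOMIC_RELAXED), rand_state);
		rand_state->initialized = true;
	}

	return rand_state;
}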

> The problem before was that prefetch cache overlap caused false
> sharing.

Yes, and my patch also wastes a cache line per state, because of the cache 
guard.
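
For reference, the state struct with the guard would look roughly like this 
(assuming 64-byte cache lines; the ~48 bytes of payload plus the guard round 
each entry up to two cache lines):

struct rte_rand_state {
	uint64_t z1;
	uint64_t z2;
	uint64_t z3;
	uint64_t z4;
	uint64_t z5;
	uint64_t seed;
	RTE_CACHE_GUARD;	/* keeps the next-line prefetcher off the
				 * neighboring lcore's state, at the cost
				 * of one padding cache line per entry */
} __rte_cache_aligned;

static struct rte_rand_state rand_states[RTE_MAX_LCORE];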

So far, we have used this simple function for a discussion of principles and 
design patterns... so let me bring something new into the discussion, which is 
relevant on both the abstract and the practical level:

We could keep using the rte_module_data[RTE_MAX_LCORE] design pattern 
generally, but make an exception for data structures significantly smaller 
than a cache line. If we use TLS for those, we avoid the padding gaps in the 
memory allocated for them, which is even more cache efficient.
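
To make the gap concrete: the conventional pattern pads every entry to a full 
cache line (the names here are only illustrative):

/* One entry per lcore, cache aligned to prevent false sharing. An
 * entry much smaller than RTE_CACHE_LINE_SIZE still occupies a whole
 * line, so most of each line is padding.
 */
struct small_state {
	uint32_t counter;	/* 4 bytes used, rest of the line wasted */
} __rte_cache_aligned;

static struct small_state rte_module_data[RTE_MAX_LCORE];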

Let's assume that (unsigned int) RTE_PER_LCORE(_lcore_id) is currently the only 
data in TLS. If we add (struct rte_rand_state) RTE_PER_LCORE(rand_state), it 
will go into the same cache line as RTE_PER_LCORE(_lcore_id), and effectively 
only cost a fraction of a cache line.
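
In code, the TLS variant would be something like this (sketch; where the 
toolchain actually places the variables within the TLS block is of course not 
guaranteed):

/* No per-entry cache alignment needed, so small per-thread variables
 * can pack into the same cache line as RTE_PER_LCORE(_lcore_id).
 */
RTE_DEFINE_PER_LCORE(struct rte_rand_state, rand_state);

static __rte_always_inline struct rte_rand_state *
__rte_rand_get_state(void)
{
	/* No indexing by lcore id, and no rte_lcore_id() call. */
	return &RTE_PER_LCORE(rand_state);
}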

