> +struct lcore_state {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;
> +};
> +
> +static __rte_always_inline void
> +update(struct lcore_state *state)
> +{
> +	state->sum += state->a * state->b;
> +}
> +
> +static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
> +
> +static __rte_noinline void
> +tls_update(void)
> +{
> +	update(&RTE_PER_LCORE(tls_lcore_state));

I would normally access TLS variables directly, not through a pointer, i.e.:

RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * RTE_PER_LCORE(tls_lcore_state.b);

On the other hand, then it wouldn't be 1:1 comparable to the two other test cases.

Besides, I expect the compiler to optimize away the indirect access, and produce the same output (as for the alternative implementation) anyway.

No change requested. Just noticing.

> +}
> +
> +struct __rte_cache_aligned lcore_state_aligned {
> +	uint64_t a;
> +	uint64_t b;
> +	uint64_t sum;

Please add RTE_CACHE_GUARD here, to match the common design pattern 100 %.

> +};
> +
> +static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];

> +	printf("Latencies [ns/update]\n");
> +	printf("Thread-local storage Static array Lcore variables\n");
> +	printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
> +	       sarray_latency * 1e9, lvar_latency * 1e9);

I prefer cycles over ns. Perhaps you could show both?

With RTE_CACHE_GUARD added where mentioned,

Acked-by: Morten Brørup <m...@smartsharesystems.com>
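
For illustration only, the two requests above (a cache guard at the end of the aligned struct, and latencies reported in both ns and cycles) could look roughly like the sketch below. It is plain C11 so that it builds without DPDK. Every `sketch_`/`SKETCH_` name is hypothetical, the 64-byte cache line is an assumption, and the `guard` member only approximates what RTE_CACHE_GUARD provides (trailing padding of a configurable number of cache lines). In the real test, the timer frequency would presumably come from rte_get_timer_hz().

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 64-byte cache lines (DPDK uses RTE_CACHE_LINE_SIZE). */
#define SKETCH_CACHE_LINE_SIZE 64

/* Stand-in for the patch's cache-aligned per-lcore struct. */
struct sketch_lcore_state_aligned {
	alignas(SKETCH_CACHE_LINE_SIZE) uint64_t a;
	uint64_t b;
	uint64_t sum;
	/* Approximates RTE_CACHE_GUARD: one padding line after the data,
	 * so adjacent array elements never sit on neighboring lines. */
	alignas(SKETCH_CACHE_LINE_SIZE) char guard[SKETCH_CACHE_LINE_SIZE];
};

/* The data occupies the first cache line, the guard the second. */
static_assert(sizeof(struct sketch_lcore_state_aligned) ==
	      2 * SKETCH_CACHE_LINE_SIZE, "unexpected struct size");

/* Convert a latency in seconds to timer cycles at the given frequency. */
double
sketch_latency_to_cycles(double latency_s, uint64_t timer_hz)
{
	return latency_s * (double)timer_hz;
}

/* Print one result row in both units. */
void
sketch_print_latency(const char *name, double latency_s, uint64_t timer_hz)
{
	printf("%-20s %10.1f ns/update %10.1f cycles/update\n", name,
	       latency_s * 1e9,
	       sketch_latency_to_cycles(latency_s, timer_hz));
}
```

A call such as `sketch_print_latency("Static array", sarray_latency, hz)` would then print one row per test case, with `hz` being whatever timer frequency the test measured against.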