On 2024-09-12 11:39, Morten Brørup wrote:
+struct lcore_state {
+       uint64_t a;
+       uint64_t b;
+       uint64_t sum;
+};
+
+static __rte_always_inline void
+update(struct lcore_state *state)
+{
+       state->sum += state->a * state->b;
+}
+
+static RTE_DEFINE_PER_LCORE(struct lcore_state, tls_lcore_state);
+
+static __rte_noinline void
+tls_update(void)
+{
+       update(&RTE_PER_LCORE(tls_lcore_state));

I would normally access TLS variables directly, not through a pointer, i.e.:

RTE_PER_LCORE(tls_lcore_state.sum) += RTE_PER_LCORE(tls_lcore_state.a) * 
RTE_PER_LCORE(tls_lcore_state.b);

On the other hand, it wouldn't then be 1:1 comparable to the two other test 
cases.

Besides, I expect the compiler to optimize away the indirect access, and 
produce the same output (as for the alternative implementation) anyway.

No change requested. Just noticing.

+}
+
+struct __rte_cache_aligned lcore_state_aligned {
+       uint64_t a;
+       uint64_t b;
+       uint64_t sum;

Please add RTE_CACHE_GUARD here, to match the common design pattern 100 %.


Will do.

+};
+
+static struct lcore_state_aligned sarray_lcore_state[RTE_MAX_LCORE];


+       printf("Latencies [ns/update]\n");
+       printf("Thread-local storage  Static array  Lcore variables\n");
+       printf("%20.1f %13.1f %16.1f\n", tls_latency * 1e9,
+              sarray_latency * 1e9, lvar_latency * 1e9);

I prefer cycles over ns. Perhaps you could show both?


That makes you an x86 guy. :) Only on x86 do those cycles make any sense.

I didn't want to use cycles since it would be a very small value on certain (e.g., old ARM) platforms.

But elsewhere in the perf tests TSC cycles are used, so maybe I should switch to TSC cycles nevertheless.


With RTE_CACHE_GUARD added where mentioned,

Acked-by: Morten Brørup <m...@smartsharesystems.com>
