> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Monday, 16 September 2024 13.13
> 
> On 2024-09-16 12:52, Mattias Rönnblom wrote:
> > Add basic micro benchmark for lcore variables, in an attempt to assure
> > that the overhead isn't significantly greater than alternative
> > approaches, in scenarios where the benefits aren't expected to show up
> > (i.e., when plenty of cache is available compared to the working set
> > size of the per-lcore data).
> >
> 
> Here are some test results for a Raptor Cove @ 3.2 GHz (GCC 11):
> 
>   + ------------------------------------------------------- +
>   + Test Suite : lcore variable perf autotest
>   + ------------------------------------------------------- +
> Latencies [TSC cycles/update]
> Modules/Variables  Static array  Thread-local Storage  Lcore variables
>                  1           3.9           5.5              3.7
>                  2           3.8           5.5              3.8
>                  4           4.9           5.5              3.7
>                  8           3.8           5.5              3.8
>                 16          11.3           5.5              3.7
>                 32          20.9           5.5              3.7
>                 64          23.5           5.5              3.7
>                128          23.2           5.5              3.7
>                256          23.5           5.5              3.7
>                512          24.1           5.5              3.7
>               1024          25.3           5.5              3.9
>   + TestCase [ 0] : test_lcore_var_access succeeded
>   + ------------------------------------------------------- +
> 
> 
> The reason for TLS being slower than lcore variables (which in turn
> rely on TLS for lcore id lookup) is the lazy initialization
> conditional that is imposed on the TLS variant. Could that be avoided
> (which is module-dependent I suppose), it beats lcore variables at
> ~3.0 cycles/update.

I think you should not assume lazy initialization of TLS in your benchmark.
Our application uses TLS, and when spinning up a new thread, we call a
per-lcore init function of each module before calling the per-lcore run
function. This design pattern is also described in Figure 1.4 [1] in the 
Programmer's Guide.

[1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
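
To illustrate the difference, here is a minimal sketch in plain C. It uses
pthreads rather than EAL threads so that it builds without DPDK, and every
name in it (mod_state, mod_lcore_init, mod_update_lazy, mod_update_eager,
run_worker) is made up for the example; none of it is taken from the
benchmark patch or from any real module. The lazy variant carries a
conditional in the update path, while the eager variant moves that work into
an init function that each thread calls once before entering its run loop.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread module state. */
struct mod_state {
	uint64_t counter;
	bool initialized; /* only consulted by the lazy variant */
};

static _Thread_local struct mod_state lazy_state;
static _Thread_local struct mod_state eager_state;

/* Lazy variant: every update pays for the "initialized?" check. */
static inline void
mod_update_lazy(void)
{
	if (!lazy_state.initialized) {
		lazy_state.counter = 0;
		lazy_state.initialized = true;
	}
	lazy_state.counter++;
}

/* Eager variant: called once per thread, before the run function. */
static void
mod_lcore_init(void)
{
	eager_state.counter = 0;
}

/* Eager variant: the update path has no conditional. */
static inline void
mod_update_eager(void)
{
	eager_state.counter++;
}

struct worker_arg {
	bool eager;
	unsigned int n_updates;
	uint64_t result;
};

/* Stand-in for a per-lcore main function: init first, then run. */
static void *
worker_main(void *p)
{
	struct worker_arg *arg = p;
	unsigned int i;

	if (arg->eager) {
		mod_lcore_init();
		for (i = 0; i < arg->n_updates; i++)
			mod_update_eager();
		arg->result = eager_state.counter;
	} else {
		for (i = 0; i < arg->n_updates; i++)
			mod_update_lazy();
		arg->result = lazy_state.counter;
	}

	return NULL;
}

/*
 * Spawn one worker thread, wait for it, and return its final counter.
 * Returns UINT64_MAX if the thread could not be created or joined.
 */
static uint64_t
run_worker(bool eager, unsigned int n_updates)
{
	struct worker_arg arg = {
		.eager = eager,
		.n_updates = n_updates,
		.result = 0
	};
	pthread_t tid;

	if (pthread_create(&tid, NULL, worker_main, &arg) != 0)
		return UINT64_MAX;
	if (pthread_join(tid, NULL) != 0)
		return UINT64_MAX;

	return arg.result;
}
```

Both variants end up with the same counter value; the only difference is
whether the initialization check sits in the per-update path or in a
one-time call at thread start.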

> 
> I must say I'm surprised to see lcore variables doing this well, at
> these very modest working set sizes. Probably, you can stay at near-zero
> L1 misses with lcore variables (and TLS), but start missing the L1 with
> static arrays.
