On 2024-09-16 13:54, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Monday, 16 September 2024 13.13
On 2024-09-16 12:52, Mattias Rönnblom wrote:
Add basic micro benchmark for lcore variables, in an attempt to assure
that the overhead isn't significantly greater than alternative
approaches, in scenarios where the benefits aren't expected to show up
(i.e., when plenty of cache is available compared to the working set
size of the per-lcore data).
Here are some test results for a Raptor Cove @ 3.2 GHz (GCC 11):
+ ------------------------------------------------------- +
+ Test Suite : lcore variable perf autotest
+ ------------------------------------------------------- +
Latencies [TSC cycles/update]
Modules/Variables  Static array  Thread-local Storage  Lcore variables
1                  3.9           5.5                   3.7
2                  3.8           5.5                   3.8
4                  4.9           5.5                   3.7
8                  3.8           5.5                   3.8
16                 11.3          5.5                   3.7
32                 20.9          5.5                   3.7
64                 23.5          5.5                   3.7
128                23.2          5.5                   3.7
256                23.5          5.5                   3.7
512                24.1          5.5                   3.7
1024               25.3          5.5                   3.9
+ TestCase [ 0] : test_lcore_var_access succeeded
+ ------------------------------------------------------- +
TLS is slower than lcore variables (which in turn rely on TLS for the
lcore id lookup) because of the lazy initialization conditional imposed
on the TLS variant. If that conditional could be avoided (which is
module-dependent, I suppose), TLS beats lcore variables at ~3.0 cycles/update.
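
To make the lazy initialization conditional concrete, here is a minimal sketch in plain C11, not the actual benchmark code. All names (mod_state, mod_update_lazy, mod_thread_init, mod_update_eager) are hypothetical, and the value 100 only stands in for real initialization work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread module state; names are illustrative only. */
struct mod_state {
    bool initialized;
    uint64_t counter;
};

static _Thread_local struct mod_state tls_state;

/* Lazy variant: every update pays for the "initialized?" branch. */
static inline uint64_t
mod_update_lazy(void)
{
    if (!tls_state.initialized) {
        tls_state.counter = 100; /* stand-in for real init work */
        tls_state.initialized = true;
    }
    return ++tls_state.counter;
}

/* Non-lazy variant: the caller must run this once on each thread... */
static inline void
mod_thread_init(void)
{
    tls_state.counter = 100;
    tls_state.initialized = true;
}

/* ...so that the update itself carries no conditional. */
static inline uint64_t
mod_update_eager(void)
{
    return ++tls_state.counter;
}
```

The only difference on the fast path is the branch in mod_update_lazy(); whether a module can drop it depends on whether it can rely on an explicit per-thread init call.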
I think you should not assume lazy initialization of TLS in your benchmark.
Our application uses TLS, and when spinning up a new thread, we call a
per-lcore init function of each module before calling the per-lcore run
function. This design pattern is also described in Figure 1.4 [1] in the
Programmer's Guide.
[1]: https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html
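
A rough sketch of that init-then-run pattern, in plain C11 without any DPDK dependency. The module table, the hook names, and worker_main are all hypothetical; the only thing borrowed from DPDK is the int (*)(void *) shape of a worker entry point, such as one passed to rte_eal_remote_launch().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical module descriptor: one init hook and one run hook. */
struct module {
    void (*lcore_init)(void);
    uint64_t (*lcore_run)(void);
};

static _Thread_local uint64_t a_count;
static _Thread_local uint64_t b_count;

static void a_init(void) { a_count = 10; }
static void b_init(void) { b_count = 20; }

/* The run hooks need no init check: init is guaranteed to have run. */
static uint64_t a_run(void) { return ++a_count; }
static uint64_t b_run(void) { return ++b_count; }

static const struct module modules[] = {
    { a_init, a_run },
    { b_init, b_run },
};

#define N_MODULES (sizeof(modules) / sizeof(modules[0]))

/* Worker entry point: init every module on this thread, then run them. */
static int
worker_main(void *arg)
{
    uint64_t *sum = arg;
    size_t i;

    for (i = 0; i < N_MODULES; i++)
        modules[i].lcore_init();

    *sum = 0;
    for (i = 0; i < N_MODULES; i++)
        *sum += modules[i].lcore_run();

    return 0;
}
```

This works when the application owns thread creation; a library that cannot insist on such a call from its users is in a different position.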
Per-lcore init functions may or may not be an option, depending on
which API you need to adhere to. But maybe I should add a non-lazy TLS
variant as well.
I should probably add some information on lcore variables in the EAL
programmer's guide as well.
Non-lazy TLS would be a more viable option if there were proper
framework support for it. Now, I'm not sure there is a better way to do
it in a DPDK library than how it's done for tracing, where there's an
explicit call per thread created. Other DPDK-internal users of
RTE_PER_LCORE seem to depend on lazy initialization.
I must say I'm surprised to see lcore variables doing this well, at
these very modest working set sizes. Probably, you can stay at near-zero
L1 misses with lcore variables (and TLS), but start missing the L1 with
static arrays.
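
As a back-of-the-envelope illustration of the footprint difference, the sketch below counts the cache lines a single lcore touches when it updates one 8-byte value per module. It assumes 64-byte cache lines, assumes the static arrays use cache-line-aligned elements (the usual way to avoid false sharing), and assumes lcore variables for one lcore are packed back to back. The names are hypothetical, and this is not a model of the actual benchmark or allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Static-array style element: padded to a full cache line. */
struct padded_state {
    _Alignas(CACHE_LINE) uint64_t value;
};

/* Each module's slot for this lcore sits on its own cache line(s). */
static size_t
lines_static_array(size_t n_modules)
{
    return n_modules * (sizeof(struct padded_state) / CACHE_LINE);
}

/* Packed style: one lcore's values for all modules are contiguous. */
static size_t
lines_packed(size_t n_modules)
{
    size_t bytes = n_modules * sizeof(uint64_t);

    return (bytes + CACHE_LINE - 1) / CACHE_LINE;
}
```

Under these assumptions, 16 modules cost 16 cache lines with padded static arrays but only 2 when packed, so the packed layout keeps far more modules within the same number of lines.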